Pavel Korytov :emacs:☮️ on Nostr:
I've tried making a full-text #RSS feed for the websites of ScienceX, the parent org for Phys.org and Tech Xplore.
The webpages are very straightforward, so the bridge (for #rssbridge) took just about 200 LoC. But! #CloudFlare is super zealous there.
Even with the following parameters:
- 3 feeds
- fetch every hour
- cache webpages for 7 days (= fetch each webpage only once, for all intents and purposes)
I already got 429'ed. I'll try fetching every 4 hours, I guess...
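For scale, the settings above can be put into numbers, along with one common way to react to a 429. This is only a sketch in Python (RSS-Bridge itself is written in PHP), and it is not the bridge from this post: `feed_requests_per_day` and `next_delay` are hypothetical helper names, the 24-hour cap is an arbitrary assumption, and only the delta-seconds form of the `Retry-After` header is handled (the HTTP-date form is ignored).

```python
from typing import Optional


def feed_requests_per_day(feeds: int, interval_hours: float) -> float:
    """Feed-index requests per day; article pages are cached and not counted."""
    return feeds * 24 / interval_hours


def next_delay(status: int, retry_after: Optional[str], base: float,
               attempt: int, cap: float = 24 * 3600) -> float:
    """Seconds to wait before the next fetch.

    status      -- HTTP status code of the last response
    retry_after -- raw Retry-After header value, or None if absent
    base        -- normal polling interval in seconds
    attempt     -- number of consecutive 429 responses seen so far
    cap         -- upper bound on the delay (assumed: one day)
    """
    if status != 429:
        return base
    # Prefer the server's hint when it is given in whole seconds,
    # but never poll faster than the normal interval.
    if retry_after is not None and retry_after.strip().isdigit():
        return min(max(float(retry_after), base), cap)
    # No usable hint: double the interval for each consecutive 429.
    return min(base * 2 ** attempt, cap)
```

With 3 feeds polled hourly, `feed_requests_per_day(3, 1)` gives 72 feed requests per day, and polling every 4 hours brings that down to 18. Starting from a one-hour interval, two consecutive 429s without a `Retry-After` hint give `next_delay(429, None, 3600, 2)`, which is 14400 seconds, i.e. the 4 hours mentioned above.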
W-why such extreme measures to prevent parsing? I'm sure #AI corps or whoever needs their data will just hire a bunch of people to solve Cloudflare's CAPTCHAs, but everyone else will be left behind.
Just give me the damn full-text RSS, I'd even pay for it... if I could sign up, that is: the signup form returns 503 for me.