zenhack / ttrss-sandstorm

Sandstorm port of Tiny Tiny RSS
GNU General Public License v3.0

Cloudflare's blog feed gives 500 server error #29

Closed (garrison closed 2 years ago)

garrison commented 2 years ago

I noticed I hadn't heard anything from the Cloudflare blog in the last month or so. It turns out that the feed began responding recently with a 500 server error.

[15:05:16/70] start
[15:05:16/70] running HOOK_FETCH_FEED handlers...
[15:05:16/70] feed data has not been modified by a plugin.
[15:05:16/70] local cache will not be used for this feed
[15:05:16/70] last unconditional update request: 2021-11-16 23:17:08
[15:05:16/70] maximum allowed interval for conditional requests exceeded, forcing refetch
[15:05:16/70] fetching [https://blog.cloudflare.com/rss/] (force_refetch: 1)...
[15:05:16/70] fetch done.
[15:05:16/70] effective URL (after redirects): https://blog.cloudflare.com/rss/ (IP: blog.cloudflare.com)
[15:05:16/70] source last modified: 
[15:05:16/70] unable to fetch: HTTP/1.1 500 Internal Server Error [500]

I managed to reproduce this in a brand new grain, but I have not tried fetching from a different Sandstorm instance.

garrison commented 2 years ago

I tried and was unable to reproduce this on a local instance of Sandstorm. I wonder if my main Sandstorm server might have landed on a denylist, somehow. If so, 500 seems like an odd error code for that case, at least to me. I'll see if I can reproduce outside of Sandstorm from that server.

ocdtrekkie commented 2 years ago

I just replicated this, if you want to reopen it. Interesting that "feed behind Cloudflare" is specifically cited as the reason it doesn't work in the UI.

ocdtrekkie commented 2 years ago

This might be an interesting case of an issue @zenhack was looking at already.

https://blog.cloudflare.com/ returns the error above. https://blog.cloudflare.com/rss/ works fine for me.

garrison commented 2 years ago

It worked for me just now locally using https://blog.cloudflare.com/rss/, which is why I closed this issue.

But when I attempt the same URL (https://blog.cloudflare.com/rss/) within a brand new TTRSS grain on my main Sandstorm server, I again get the same error as you did:

Couldn't download the specified URL: HTTP/1.1 500 Internal Server Error (feed behind Cloudflare)

ocdtrekkie commented 2 years ago

The opinion I am seeing in the TTRSS community is that this happens when someone implements some form of Cloudflare bot protection and fails to account for the legitimate use of a "bot" to access RSS feeds.

In #26, @zenhack noticed that sandstorm.io's RSS feed did not have this issue for him, but capnproto.org's did. I found that quite strange, since the actual code for the RSS feeds on those sites is nearly identical. I am curious whether @kentonv could enlighten us about the Cloudflare settings, and whether they would explain the strange behavior.

garrison commented 2 years ago

When I run wget or curl from my main Sandstorm server, I get a 503 error code, along with the "Checking your browser before accessing blog.cloudflare.com" page. That makes a lot more sense to me; I am not sure why the TTRSS UI is claiming it's a 500.

$ wget https://blog.cloudflare.com/rss/
--2021-12-30 17:21:52--  https://blog.cloudflare.com/rss/
Resolving blog.cloudflare.com (blog.cloudflare.com)... 2606:4700::6812:1a2e, 2606:4700::6812:1b2e, 104.18.26.46, ...
Connecting to blog.cloudflare.com (blog.cloudflare.com)|2606:4700::6812:1a2e|:443... connected.
HTTP request sent, awaiting response... 503 Service Temporarily Unavailable
2021-12-30 17:21:52 ERROR 503: Service Temporarily Unavailable.

So it looks like there are two different issues.

  1. Cloudflare's automatic determination that my server is a bot, resulting in denied access to the RSS feed.
  2. @ocdtrekkie's independent finding that https://blog.cloudflare.com/rss/ works if typed into TTRSS, while https://blog.cloudflare.com/ fails. I am pretty sure I noticed this myself, too, a few months ago.

ocdtrekkie commented 2 years ago

I got the "feed behind Cloudflare" error when not including rss/, so my guess is that it failed to pull the main blog page, and hence, was also unable to find the link tag to rss/.
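For readers unfamiliar with feed autodiscovery, here is a minimal Python sketch of the general technique: scan the HTML of the site's main page for a link tag that advertises a feed. This is not TTRSS's actual code; the names FeedLinkFinder and discover_feeds and the sample markup are made up for illustration. If the main page comes back as a bot-challenge page instead of the blog, there is no link tag to find, which would match the guess above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate"> tags that advertise a feed.

    Hypothetical helper for illustration; not taken from TTRSS.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None (e.g. a bare attribute), so normalize.
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attr.get("href"):
            # Resolve relative hrefs such as "/rss/" against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html, base_url):
    """Return the absolute feed URLs advertised in the given HTML."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds
```

Calling discover_feeds on a page whose head contains a link tag with rel="alternate", type="application/rss+xml" and href="/rss/" returns the absolute feed URL; calling it on a challenge page with no such tag returns an empty list.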

zenhack commented 2 years ago

Is there anything interesting in the Sandstorm system log? I'm wondering if some of the information about the error is getting lost between Sandstorm itself and the grain.

ocdtrekkie commented 2 years ago

I don't see anything interesting in the system log, grain log, or TTRSS log.

zenhack commented 2 years ago

Hm, so it seems like the underlying issue with Cloudflare is something we probably can't fix.

The 503 -> 500 change is, I think, an artifact of plumbing the response through web session, which doesn't actually support differing 5xx status codes. So this is a Sandstorm thing.
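To illustrate the effect being described, here is a tiny Python sketch of a layer that cannot represent distinct 5xx codes and so reports all of them as 500. It is purely illustrative: collapse_server_error is a made-up name, and this is not Sandstorm's code or API, only a model of the observed 503 -> 500 behavior.

```python
def collapse_server_error(status):
    """Map every 5xx status to 500; pass all other statuses through unchanged.

    Hypothetical model of a proxy layer with a single "server error" outcome.
    """
    if 500 <= status <= 599:
        return 500
    return status
```

Under this model, a 503 from Cloudflare's challenge page would surface inside the grain as a 500, while 2xx and 4xx responses would be unaffected.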

So is there anything actionable here, or should we just close this?

garrison commented 2 years ago

> So is there anything actionable here, or should we just close this?

It would be nice to understand why typing https://blog.cloudflare.com/ into the UI results in it reporting a 500 error, while https://blog.cloudflare.com/rss/ succeeds.

That's the only actionable item I can think of at this point.

ocdtrekkie commented 2 years ago

My guess is that they knew to exempt their RSS feed from bot protection, but failed to realize that people might use the main URL to find the RSS URL.

garrison commented 2 years ago

> My guess is that they knew to exempt their RSS feed from bot protection, but failed to realize that people might use the main URL to find the RSS URL.

I think you're right.

In the case of my server, Cloudflare seems to have blocked all traffic (presumably based on its IP address). But if I run curl or wget from any other machine, I get a 503 response at https://blog.cloudflare.com/ even though https://blog.cloudflare.com/rss/ returns a 200. So it seems that RSS is generally (but not always) exempt from bot detection.
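For anyone who wants to repeat this comparison without curl or wget, here is a rough Python sketch using only the standard library. The function names, the User-Agent string, and the wording of the summaries are my own assumptions, and Cloudflare may treat this client differently from curl, wget, or TTRSS.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for a GET of url, including error statuses."""
    # The User-Agent value is an arbitrary placeholder, not one TTRSS sends.
    req = urllib.request.Request(url, headers={"User-Agent": "feed-check-sketch/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still available.
        return err.code


def diagnose(home_status, feed_status):
    """Summarize which of the two URLs answered with a 200."""
    if home_status == 200 and feed_status == 200:
        return "both reachable"
    if feed_status == 200:
        return "feed reachable, home page failing"
    if home_status == 200:
        return "home page reachable, feed failing"
    return "both failing"


# Example use (performs real network requests, so it is left commented out):
# home = fetch_status("https://blog.cloudflare.com/")
# feed = fetch_status("https://blog.cloudflare.com/rss/")
# print(home, feed, diagnose(home, feed))
```

A result of "feed reachable, home page failing" corresponds to what I see from other machines, and "both failing" to what I see from my main server.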

Closing, since I can think of no other actionable item.