Open hellt opened 3 years ago
I've found something similar. I believe it's because this service is fronted by CloudFlare which, not recognising the source of the request, serves up a CAPTCHA page with a 403 instead of the resource. I guess the fix would be to manipulate the requests htmltest makes so that it looks more like a real browser, but that seems non-trivial.
I've done some testing on the URLs here, once using htmltest unchanged and once with htmltest configured with a curl user agent and without the Range header we normally add. The upstream hosts behaved the same in both cases.
URL | Status (htmltest) | Status (htmltest as curl) |
---|---|---|
https://www.php.net/manual/en/book.pcntl.php | 200 | 200 |
https://play.google.com/store/apps/details?id=com.azure.authenticator&hl=en&gl=US | 404 | 404 |
https://old.reddit.com/r/golang/comments/teu78z/118_is_released/ | 200 | 200 |
https://www.reddit.com/r/golang/comments/teu78z/118_is_released/ | 200 | 200 |
https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg | 404 | 404 |
I can provide exact examples that work fine with curl but don't succeed with htmltest. This is reliably reproducible. What kind of logs/output would help you verify?
@arranf Just a list of URLs you've found problematic. I haven't pushed the branch yet, but I've been adding these as a unit test to help track them. I'm then planning on tweaking request params (as above, trying to pretend to be curl) to identify what's causing these to be blocked.
I doubt we'll have this completely fixed for all hosts but am hoping for an improvement.
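The comparison described above (default request versus a curl-like request without the Range header) can be sketched outside htmltest. This is only an illustration, not htmltest's code: the `build_request` and `fetch_status` helpers are hypothetical, and both the curl user agent string and the `bytes=0-0` Range value are assumptions.

```python
import urllib.error
import urllib.request

# Assumed values for illustration; neither is taken from this thread.
CURL_USER_AGENT = "curl/8.0.0"
RANGE_HEADER = "bytes=0-0"


def build_request(url, user_agent=None, send_range=True):
    """Build a GET request, optionally with a custom User-Agent and a Range header."""
    req = urllib.request.Request(url, method="GET")
    if user_agent is not None:
        req.add_header("User-Agent", user_agent)
    if send_range:
        req.add_header("Range", RANGE_HEADER)
    return req


def fetch_status(req, timeout=10):
    """Return the HTTP status code, including 4xx/5xx codes that urllib raises as errors."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

Calling `fetch_status(build_request(url))` and `fetch_status(build_request(url, CURL_USER_AGENT, send_range=False))` for each problematic URL gives the two columns to compare. Note that `urlopen` follows redirects by default, so a 301 shows up as the status of the final response.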
- ^https?://(www\.)?play\.google\.com\b # Always fails with htmltest
- ^https?://(www\.)?crates\.io\b # Always fails with htmltest
- ^https?://(www\.)?lastpass\.com\b # Always fails with htmltest
- ^https?://help\.elgato\.com\b # Always fails with htmltest
- ^https?://uk\.pcpartpicker\.com\b # Always fails with htmltest
- ^https?://(www\.)?corsair\.com\b # Always fails with htmltest
- 'https://docs.github.com/en/get-started/using-git/about-git-rebase' # Not sure why this is 403ing
- ^https?://(www\.)?reddit\.com\b # Always fails with htmltest
This is a list copied from my .htmltest.yml
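To see which URLs a few of these patterns actually cover, here is a small sketch. It assumes the patterns are applied as unanchored regex searches against the full URL, and it uses Python's `re` rather than the regex engine htmltest uses, so treat it as an approximation; `is_ignored` is a hypothetical helper.

```python
import re

# Three of the patterns from the list above, copied verbatim.
IGNORE_PATTERNS = [
    r"^https?://(www\.)?play\.google\.com\b",
    r"^https?://(www\.)?reddit\.com\b",
    r"^https?://uk\.pcpartpicker\.com\b",
]


def is_ignored(url, patterns=IGNORE_PATTERNS):
    """Return True if any pattern matches somewhere in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


for url in (
    "https://www.reddit.com/r/golang/comments/teu78z/118_is_released/",
    "https://old.reddit.com/r/golang/comments/teu78z/118_is_released/",
):
    print(url, is_ignored(url))
```

Under these assumptions the `www.reddit.com` link is ignored, while the `old.reddit.com` link is not, because the pattern only allows an optional `www.` prefix.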
I found a couple more:
transunion.com returns a 403. It also does in curl, but that changes to a 301 if I give it a valid user agent.
https://bookshop.org/books/project-hail-mary/9780593135204 returns a 403 to htmltest but a 200 to curl.
And also https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32016R0679 is fine for curl, 500 for htmltest.
(but this is due to StripQueryString defaulting to true, and I would doubt that this default is the best choice here...)
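To illustrate why stripping the query string matters for a URL like this, here is a minimal sketch of the transformation. It is not htmltest's implementation; `strip_query` is a hypothetical helper that only shows what the requested URL looks like once the query is dropped.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL with its query string removed (other parts unchanged)."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query=""))


print(strip_query("https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32016R0679"))
# prints https://eur-lex.europa.eu/legal-content/EN/TXT/
```

The `uri=` parameter that identifies the document is gone, so the server is asked for a different resource than the one curl fetched. The same applies to any link whose query string selects the content, such as the `?format=jpg` image URL reported in this issue.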
Describe the bug
Checks of external links to media resources hosted on Twitter, such as https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg, report 404, although curl has no issues with that.
Here is the error from htmltest:
To Reproduce
Steps to reproduce the behaviour:
https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg
.htmltest.yml
bare config
Expected behaviour
An error is not reported since the resource is available.
Actual behaviour
404 is returned