wjdp / htmltest

:white_check_mark: Test generated HTML for problems
MIT License

Some hosts return 404/503/non-200 when links are checked #165

Open hellt opened 3 years ago

hellt commented 3 years ago

Describe the bug

Checks of external links to media resources hosted on Twitter, such as https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg, report 404, although curl has no issues with it:

❯ curl -vL https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5\?format\=jpg > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 93.184.220.70...
* TCP_NODELAY set
* Connected to pbs.twimg.com (93.184.220.70) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [227 bytes data]
* TLSv1.2 (IN), TLS handshake, Server hello (2):
{ [98 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [2937 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [333 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=Twitter, Inc.; CN=*.twimg.com
*  start date: Nov  5 00:00:00 2020 GMT
*  expire date: Nov  9 23:59:59 2021 GMT
*  subjectAltName: host "pbs.twimg.com" matched cert's "*.twimg.com"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert TLS RSA SHA256 2020 CA1
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fce8000aa00)
> GET /media/EuF4GgyXUAEZ3j5?format=jpg HTTP/2
> Host: pbs.twimg.com
> User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)
> Accept: */*
> Referer:
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 200
< accept-ranges: bytes
< access-control-allow-origin: *
< access-control-expose-headers: Content-Length
< age: 106
< cache-control: max-age=604800, must-revalidate
< content-type: image/jpeg
< date: Mon, 19 Apr 2021 09:05:11 GMT
< last-modified: Sat, 13 Feb 2021 08:04:38 GMT
< server: ECS (amb/6B85)
< strict-transport-security: max-age=631138519
< surrogate-key: media media/bucket/4 media/1360500615718326273
< timing-allow-origin: https://twitter.com, https://mobile.twitter.com
< x-cache: HIT
< x-connection-hash: 1223340481ffa7d392cf6199e5d2bd1f
< x-content-type-options: nosniff
< x-response-time: 238
< x-tw-cdn: VZ
< content-length: 79258
<
{ [16383 bytes data]
100 79258  100 79258    0     0   814k      0 --:--:-- --:--:-- --:--:--  814k
* Connection #0 to host pbs.twimg.com left intact
* Closing connection 0

Here is the error from htmltest:

Non-OK status: 404 --- 2021/transparently-redirecting-packets/frames-between-interfaces/index.html --> https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg

To Reproduce

Steps to reproduce the behaviour:

  1. Embed a link to a Twitter-hosted media resource, for example https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg

.htmltest.yml

bare config

Expected behaviour

An error is not reported since the resource is available.

Actual behaviour

404 is returned

jtopper commented 2 years ago

I've found something similar. I believe it's because this service is fronted by CloudFlare which, not recognising the source of the request, serves up a CAPTCHA page with a 403 instead of the resource. I guess the fix would be to manipulate the requests htmltest makes so that it looks more like a real browser, but that seems non-trivial.

wjdp commented 2 years ago

I've tested the URLs here with htmltest in two configurations: unchanged, and with a curl user agent set and the Range header we normally add removed. The upstream hosts behaved the same in both cases.

| URL | Status (htmltest) | Status (htmltest as curl) |
| --- | --- | --- |
| https://www.php.net/manual/en/book.pcntl.php | 200 | 200 |
| https://play.google.com/store/apps/details?id=com.azure.authenticator&hl=en&gl=US | 404 | 404 |
| https://old.reddit.com/r/golang/comments/teu78z/118_is_released/ | 200 | 200 |
| https://www.reddit.com/r/golang/comments/teu78z/118_is_released/ | 200 | 200 |
| https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg | 404 | 404 |

arranf commented 2 years ago

I can provide exact examples that work fine with curl but don't succeed with htmltest. This is reliably reproducible. What kind of logs/output would help you verify?

wjdp commented 2 years ago

@arranf Just a list of the URLs you've found problematic. I haven't pushed the branch yet, but I've been adding these to a unit test to help track them. I then plan to tweak request parameters (as above, pretending to be curl) to identify what's causing these requests to be blocked.

I doubt we'll have this completely fixed for all hosts but am hoping for an improvement.
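
As a rough illustration of the kind of request tweaking described above, here is a minimal Go sketch that builds two variants of the same check without sending them. The function name `newCheckRequest`, both User-Agent strings, and the `bytes=0-0` Range value are assumptions made for this example; they are not taken from the htmltest source.

```go
package main

import (
	"fmt"
	"net/http"
)

// newCheckRequest builds (but does not send) a GET request for a link check.
// userAgent and sendRange are hypothetical knobs for this sketch,
// not htmltest configuration options.
func newCheckRequest(rawURL, userAgent string, sendRange bool) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "*/*")
	req.Header.Set("User-Agent", userAgent)
	if sendRange {
		// Assumed value: request only the first byte of the body.
		req.Header.Set("Range", "bytes=0-0")
	}
	return req, nil
}

func main() {
	const target = "https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg"

	// Placeholder User-Agent strings; the real values may differ.
	variants := []struct {
		name      string
		userAgent string
		sendRange bool
	}{
		{"default", "htmltest/dev", true},
		{"as curl", "curl/7.64.1", false},
	}

	for _, v := range variants {
		req, err := newCheckRequest(target, v.userAgent, v.sendRange)
		if err != nil {
			fmt.Println("build failed:", err)
			continue
		}
		fmt.Printf("%s: User-Agent=%q Range=%q\n",
			v.name, req.Header.Get("User-Agent"), req.Header.Get("Range"))
	}
}
```

Sending each variant with `http.Client.Do` and comparing status codes per host would then show whether either header affects the response.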

arranf commented 2 years ago

- ^https?://(www\.)?play\.google\.com\b # Always fails with htmltest
- ^https?://(www\.)?crates\.io\b # Always fails with htmltest
- ^https?://(www\.)?lastpass\.com\b # Always fails with htmltest
- ^https?://help\.elgato\.com\b # Always fails with htmltest
- ^https?://uk\.pcpartpicker\.com\b # Always fails with htmltest
- ^https?://(www\.)?corsair\.com\b # Always fails with htmltest
- 'https://docs.github.com/en/get-started/using-git/about-git-rebase' # Not sure why this is 403ing
- ^https?://(www\.)?reddit\.com\b # Always fails with htmltest

This is a list copied from my .htmltest.yml

theory commented 2 years ago

I found a couple more:

istr commented 2 years ago

Also, https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32016R0679 is fine with curl but returns 500 with htmltest. This one, however, is caused by StripQueryString defaulting to true, and I doubt that default is the best choice here.
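
To illustrate the effect, here is a minimal Go sketch of what removing the query string does to that URL. The function `stripQuery` is a hypothetical stand-in written with `net/url`; it is not htmltest's actual implementation of StripQueryString.

```go
package main

import (
	"fmt"
	"net/url"
)

// stripQuery returns rawURL with its query string removed.
// Hypothetical helper; it only approximates what a query-stripping option does.
func stripQuery(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	u.RawQuery = ""
	u.ForceQuery = false // avoid leaving a trailing "?"
	return u.String(), nil
}

func main() {
	original := "https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32016R0679"

	stripped, err := stripQuery(original)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}

	fmt.Println("original:", original)
	fmt.Println("stripped:", stripped)
	// stripped: https://eur-lex.europa.eu/legal-content/EN/TXT/
}
```

The stripped URL no longer carries the `uri` parameter that identifies the document, so the server is asked for a different resource than the one the page links to.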