Closed frerich closed 3 years ago
Another observation: htmltest sometimes reports HTTP 403 for pages for which cURL reports HTTP 302 (a redirect) followed by HTTP 404. Here's the cURL output printing just the headers (and following redirects) for one such URL, https://www.microsoft.com/en-us/download/details.aspx?id=26607 :
➜ curl -LI https://www.microsoft.com/en-us/download/details.aspx\?id\=26607
HTTP/2 302
content-type: text/html
location: https://www.microsoft.com/en-us/download/404Error.aspx
access-control-allow-headers: Origin, X-Requested-With, Content-Type, Accept
access-control-allow-methods: GET, POST, PUT, DELETE, OPTIONS
access-control-allow-credentials: true
p3p: CP="ALL IND DSP COR ADM CONo CUR CUSo IVAo IVDo PSA PSD TAI TELo OUR SAMo CNT COM INT NAV ONL PHY PRE PUR UNI"
x-frame-options: SAMEORIGIN
expires: Tue, 18 Aug 2020 06:58:36 GMT
cache-control: max-age=0, no-cache, no-store
pragma: no-cache
date: Tue, 18 Aug 2020 06:58:36 GMT
set-cookie: MS-CV=2UsWOH1ZwkiollHO.1; domain=.microsoft.com; expires=Wed, 19-Aug-2020 06:58:33 GMT; path=/;samesite=None
tls_version: tls1.2
strict-transport-security: max-age=31536000
x-rtag: StMus
HTTP/2 404
cache-control: private
content-length: 85792
content-type: text/html
correlationvector: OFPw0nMMUUGpq+f0.1.0
access-control-allow-headers: Origin, X-Requested-With, Content-Type, Accept
access-control-allow-methods: GET, POST, PUT, DELETE, OPTIONS
access-control-allow-credentials: true
p3p: CP="ALL IND DSP COR ADM CONo CUR CUSo IVAo IVDo PSA PSD TAI TELo OUR SAMo CNT COM INT NAV ONL PHY PRE PUR UNI"
x-frame-options: SAMEORIGIN
date: Tue, 18 Aug 2020 06:58:36 GMT
set-cookie: MS-CV=OFPw0nMMUUGpq+f0.1; domain=.microsoft.com; expires=Wed, 19-Aug-2020 06:58:36 GMT; path=/;samesite=None
tls_version: tls1.2
strict-transport-security: max-age=31536000
x-rtag: StMus
The same URL, when used in an HTML page, makes the htmltest link checker report an HTTP 403.
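For anyone who wants to compare what different clients see, here is a minimal Python sketch that records the status code of every redirect hop, roughly like `curl -LI` does. It is not how htmltest works internally. The names `status_chain`, `RecordingRedirectHandler`, `RedirectToMissing`, and `demo_chain` are made up for this example, and the stub server only imitates the 302-then-404 pattern shown above so the sketch can run without touching microsoft.com.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RecordingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that remembers every (status, url) hop it follows."""

    def __init__(self):
        self.hops = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Record the redirecting response, then let urllib follow it as usual.
        self.hops.append((code, req.full_url))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def status_chain(url, user_agent, timeout=10.0, handlers=()):
    """Return [(status, url), ...] for each hop, ending with the final response."""
    recorder = RecordingRedirectHandler()
    opener = urllib.request.build_opener(recorder, *handlers)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(request, timeout=timeout) as response:
            final = (response.status, response.url)
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the code and URL are still available.
        final = (error.code, error.url)
    return recorder.hops + [final]


class RedirectToMissing(BaseHTTPRequestHandler):
    """Hypothetical stub: "/" redirects to "/missing", which answers 404."""

    def do_GET(self):
        if self.path == "/":
            self.send_response(302)
            self.send_header("Location", "/missing")
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def demo_chain():
    """Run status_chain against the local stub and return the recorded hops."""
    server = HTTPServer(("127.0.0.1", 0), RedirectToMissing)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Bypass any configured proxy so the request reaches the local stub.
        no_proxy = urllib.request.ProxyHandler({})
        url = f"http://127.0.0.1:{server.server_port}/"
        return status_chain(url, "curl/8.0", handlers=[no_proxy])
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for status, hop_url in demo_chain():
        print(status, hop_url)
```

Calling `status_chain` with a real URL and two different user agents shows whether the server answers differently depending on who is asking.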
In case anybody else is fighting with this: a workaround is to add
IgnoreExternalBrokenLinks: true
to the .htmltest.yml configuration file.
I think this is because of the user agent: a plain HTTPie request with the user agent "htmltest/123" yields a 403, indeed.
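The user-agent theory can be checked with a small script. The following is a minimal sketch, not htmltest's own logic: `status_for`, `StubHandler`, and `compare_user_agents` are names invented for this example, and the stub server is only an assumption about how such a block might behave (it rejects any user agent starting with `htmltest`). It exists so the sketch runs locally.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def status_for(url, user_agent, timeout=10.0, handlers=()):
    """Send a HEAD request with the given User-Agent and return the status code."""
    opener = urllib.request.build_opener(*handlers)
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the status code is what we are after.
        return error.code


class StubHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a server that rejects htmltest user agents."""

    def do_HEAD(self):
        blocked = self.headers.get("User-Agent", "").startswith("htmltest")
        self.send_response(403 if blocked else 200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def compare_user_agents(user_agents):
    """Probe the local stub with each user agent and return {user_agent: status}."""
    server = HTTPServer(("127.0.0.1", 0), StubHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Bypass any configured proxy so the request reaches the local stub.
        no_proxy = urllib.request.ProxyHandler({})
        url = f"http://127.0.0.1:{server.server_port}/"
        return {
            ua: status_for(url, ua, handlers=[no_proxy]) for ua in user_agents
        }
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(compare_user_agents(["htmltest/123", "Bacon/123"]))
```

Against the stub this prints `{'htmltest/123': 403, 'Bacon/123': 200}`. Pointing `status_for` at one of the affected URLs with the same two user agents is the actual experiment; what the real server returns may of course change over time.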
I found a better workaround than ignoring external links: overriding the default user agent by setting a new one in the .htmltest.yml file:
HTTPHeaders: {"User-Agent": "Bacon/123"}
Hey @frerich, thanks for the writeup! Sorry for not replying to this sooner. It seems that some sites either explicitly or by accident block the htmltest/* user agent.
I've poked a bit and found any user agent starting with htmltest is blocked on some microsoft.com pages. Setting your own UA is pretty much the only solution here!
As there's not much we can do here, I'm going to close this.
When feeding a tiny HTML page with a few links to Microsoft MSDN pages to htmltest, error messages are printed since htmltest sees HTTP 403 responses. However, feeding the exact same URLs to cURL or httpie (or my browser) gives an HTTP 200 response. I'm running htmltest on thousands of web pages (containing a lot of links) -- it is always only microsoft.com URLs which trigger this behaviour.
To Reproduce:
Store this HTML markup, which just has a few links to Microsoft pages, in test.html:
Run htmltest without any arguments on the file:
.htmltest.yml
There is no configuration file used in this case.
Expected behaviour
I expect htmltest to not print any errors
Actual behaviour
htmltest prints the following error messages:
Versions
Additional context
Issuing a 'HEAD' or 'GET' request using the 'httpie' utility (or cURL) works for any of the URLs in the document, resulting in an HTTP 200 response, e.g.: