wjdp / htmltest

:white_check_mark: Test generated HTML for problems
MIT License

htmltest's link checker reports HTTP 403 but cURL (and httpie) report 200 #150

Closed frerich closed 3 years ago

frerich commented 4 years ago

When feeding a tiny HTML page with a few links to Microsoft MSDN pages to htmltest, error messages are printed because htmltest sees HTTP 403 responses. However, feeding the exact same URLs to cURL or httpie (or my browser) gives an HTTP 200 response. I'm running htmltest on thousands of web pages (containing a lot of links) -- it is always only microsoft.com URLs which trigger this behaviour.

To Reproduce:

  1. Store this HTML markup which just has a few links to Microsoft pages in test.html:

    <!doctype html>
    <html>
    <body>
        <a href="https://www.microsoft.com/en-us/download/details.aspx?id=13255">Microsoft Access Database Engine 2010 Redistributable</a>
        <a href="https://www.microsoft.com/en-us/download/details.aspx?id=49077">Update for Windows 7 (KB2999226)</a>
        <a href="https://www.microsoft.com/en-in/download/details.aspx?id=48145">Visual C++ Redistributable for Visual Studio 2015</a>
        <a href="https://www.microsoft.com/en-us/download/details.aspx?id=14431">Microsoft Visual C++ 2005 Service Pack 1 Redistributable Package ATL Security Update</a>
    </body>
    </html>
  2. Run htmltest without any arguments on the file:

    $ ./htmltest test.html

.htmltest.yml

There is no configuration file used in this case.

Expected behaviour

I expect htmltest not to print any errors.

Actual behaviour

htmltest prints the following error messages:

htmltest started at 08:44:09 on .
========================================================================
test.html
  Non-OK status: 403 --- test.html --> https://www.microsoft.com/en-us/download/details.aspx?id=13255
  Non-OK status: 403 --- test.html --> https://www.microsoft.com/en-us/download/details.aspx?id=49077
  Non-OK status: 403 --- test.html --> https://www.microsoft.com/en-in/download/details.aspx?id=48145
  Non-OK status: 403 --- test.html --> https://www.microsoft.com/en-us/download/details.aspx?id=14431
========================================================================
✘✘✘ failed in 430.673973ms
4 errors

Versions

Additional context

Issuing a 'HEAD' or 'GET' request using the 'httpie' utility (or cURL) works for any of the URLs in the document, resulting in an HTTP 200 response, e.g.:

➜  curl -I 'https://www.microsoft.com/en-us/download/details.aspx?id=13255'
HTTP/2 200
cache-control: no-cache, no-store
...

frerich commented 4 years ago

Another observation: htmltest sometimes reports HTTP 403 for pages for which cURL reports HTTP 302 (a redirect) followed by HTTP 404. Here's the cURL output printing just the headers (and following redirects) for one such URL, https://www.microsoft.com/en-us/download/details.aspx?id=26607 :

➜  curl -LI https://www.microsoft.com/en-us/download/details.aspx\?id\=26607
HTTP/2 302
content-type: text/html
location: https://www.microsoft.com/en-us/download/404Error.aspx
access-control-allow-headers: Origin, X-Requested-With, Content-Type, Accept
access-control-allow-methods: GET, POST, PUT, DELETE, OPTIONS
access-control-allow-credentials: true
p3p: CP="ALL IND DSP COR ADM CONo CUR CUSo IVAo IVDo PSA PSD TAI TELo OUR SAMo CNT COM INT NAV ONL PHY PRE PUR UNI"
x-frame-options: SAMEORIGIN
expires: Tue, 18 Aug 2020 06:58:36 GMT
cache-control: max-age=0, no-cache, no-store
pragma: no-cache
date: Tue, 18 Aug 2020 06:58:36 GMT
set-cookie: MS-CV=2UsWOH1ZwkiollHO.1; domain=.microsoft.com; expires=Wed, 19-Aug-2020 06:58:33 GMT; path=/;samesite=None
tls_version: tls1.2
strict-transport-security: max-age=31536000
x-rtag: StMus

HTTP/2 404
cache-control: private
content-length: 85792
content-type: text/html
correlationvector: OFPw0nMMUUGpq+f0.1.0
access-control-allow-headers: Origin, X-Requested-With, Content-Type, Accept
access-control-allow-methods: GET, POST, PUT, DELETE, OPTIONS
access-control-allow-credentials: true
p3p: CP="ALL IND DSP COR ADM CONo CUR CUSo IVAo IVDo PSA PSD TAI TELo OUR SAMo CNT COM INT NAV ONL PHY PRE PUR UNI"
x-frame-options: SAMEORIGIN
date: Tue, 18 Aug 2020 06:58:36 GMT
set-cookie: MS-CV=OFPw0nMMUUGpq+f0.1; domain=.microsoft.com; expires=Wed, 19-Aug-2020 06:58:36 GMT; path=/;samesite=None
tls_version: tls1.2
strict-transport-security: max-age=31536000
x-rtag: StMus

The same URL, when used in an HTML page, makes the htmltest link checker report an HTTP 403.

frerich commented 4 years ago

In case anybody else is fighting with this: a workaround is to add

IgnoreExternalBrokenLinks: true

to the .htmltest.yml configuration file.

frerich commented 4 years ago

I think this is because of the user agent: a plain HTTPie request with the user agent "htmltest/123" does indeed yield a 403.

frerich commented 4 years ago

I found a better workaround than ignoring external links: overriding the default user agent by setting a new one in the .htmltest.yml file:

HTTPHeaders: {"User-Agent": "Bacon/123"}

wjdp commented 3 years ago

Hey @frerich, thanks for the writeup! Sorry for not replying to this sooner. It seems that some sites either explicitly or by accident block the htmltest/* user agent.

I've poked a bit and found any user agent starting with htmltest is blocked on some microsoft.com pages. Setting your own UA is pretty much the only solution here!
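
The prefix-based blocking described here can be reproduced offline with a small Python sketch. It does not contact microsoft.com and is not how htmltest itself issues requests. It assumes a hypothetical local stub server (`PrefixBlockingHandler`) that answers 403 to any user agent starting with `htmltest` and 200 otherwise, and a hypothetical helper `head_status` that sends a HEAD request with a chosen User-Agent. The two user agent strings are the ones mentioned in this thread.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class PrefixBlockingHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site that rejects one user-agent prefix."""

    BLOCKED_PREFIX = "htmltest"  # assumption: models the behaviour reported in this thread

    def do_HEAD(self):
        agent = self.headers.get("User-Agent", "")
        # Answer 403 for the blocked prefix and 200 for everything else.
        self.send_response(403 if agent.startswith(self.BLOCKED_PREFIX) else 200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def head_status(url, user_agent):
    """Send a HEAD request with the given User-Agent and return the status code."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    # An empty ProxyHandler bypasses any proxy configured in the environment,
    # so the request to the local stub server is always sent directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urllib raises on 4xx/5xx; the code is still the answer


def demo():
    """Start the stub server on a free local port and probe it with two user agents."""
    server = HTTPServer(("127.0.0.1", 0), PrefixBlockingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_address[1]
    try:
        return {
            agent: head_status(url, agent)
            for agent in ("htmltest/123", "Bacon/123")
        }
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for agent, status in demo().items():
        print(agent, "->", status)
```

Pointing `head_status` at a real URL instead of the stub is one way to check whether a site treats a particular user agent differently before changing the htmltest configuration.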

wjdp commented 3 years ago

As there's not much we can do here, I'm going to close this.