wummel / linkchecker

check links in web documents or full websites
http://wummel.github.io/linkchecker/
GNU General Public License v2.0

Linkchecker interprets URLs with double-slashes as single-slashes #632

Open gwern opened 8 years ago

gwern commented 8 years ago

A recent mailing list entry of mine included the URL http://www.gwern.net//Longevity#metformin (note the double-slash //) where I had meant to write http://www.gwern.net/Longevity#metformin. linkchecker failed to flag the broken URL during the several checks I made of the draft.

The double-slash URL is wrong and leads to a 404 error when I check in Chromium, Firefox, elinks, wget, and curl. However, linkchecker does not flag an error when it is asked to check a file with that URL linked.

Here is an example input:

$ cat /tmp/burl16370ems.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <title></title>
  <link rel="stylesheet" href="http://www.gwern.net/static/css/default.css" type="text/css" />
</head>
<body>
<ol style="list-style-type: decimal">
<li><a href="http://www.gwern.net/Longevity#metformin">metformin for life-extension 1</a></li>
<li><a href="http://www.gwern.net//Longevity#metformin">metformin for life-extension 2</a></li>
</ol>
</body>
</html>

This should yield 1 valid link and 1 error. However, linkchecker believes it yields 2 valid links and 0 errors:

$ linkchecker -r1 '/tmp/burl16370ems.html'
LinkChecker 8.6              Copyright (C) 2000-2013 Bastian Kleineidam
LinkChecker comes with ABSOLUTELY NO WARRANTY!
This is free software, and you are welcome to redistribute it
under certain conditions. Look at the file `LICENSE' within this
distribution.
Get the newest version at http://wummel.github.io/linkchecker/
Write comments and bugs to https://github.com/wummel/linkchecker/issues
Support this project at http://wummel.github.io/linkchecker/donations.html

Start checking at 2016-02-05 12:25:48-004
 1 URL active,     0 URLs queued,    2 URLs checked, runtime 1 seconds
 1 URL active,     0 URLs queued,    2 URLs checked, runtime 2 seconds
 1 URL active,     0 URLs queued,    2 URLs checked, runtime 3 seconds
 1 URL active,     0 URLs queued,    2 URLs checked, runtime 4 seconds
 1 URL active,     0 URLs queued,    2 URLs checked, runtime 5 seconds
 1 URL active,     0 URLs queued,    2 URLs checked, runtime 6 seconds
 1 URL active,     0 URLs queued,    2 URLs checked, runtime 7 seconds
 1 URL active,     0 URLs queued,    2 URLs checked, runtime 8 seconds

Statistics:
Downloaded: 41.47KB
Robots.txt cache: 0 hits, 1 miss
Number of domains: 2
Content types: 0 image, 3 text, 0 video, 0 audio, 0 application, 0 mail and 0 other.
URL lengths: min=29, max=46, avg=38.

That's it. 3 links checked. 0 warnings found. 0 errors found.
Stopped checking at 2016-02-05 12:25:56-004 (8 seconds)

More detailed output:

$ linkchecker --verbose --complete '/tmp/burl16370ems.html'
LinkChecker 8.6              Copyright (C) 2000-2013 Bastian Kleineidam
LinkChecker comes with ABSOLUTELY NO WARRANTY!
This is free software, and you are welcome to redistribute it
under certain conditions. Look at the file `LICENSE' within this
distribution.
Get the newest version at http://wummel.github.io/linkchecker/
Write comments and bugs to https://github.com/wummel/linkchecker/issues
Support this project at http://wummel.github.io/linkchecker/donations.html

Start checking at 2016-02-05 12:30:42-004

URL        `file:///tmp/burl16370ems.html'
Name       `/tmp/burl16370ems.html'
Real URL   file:///tmp/burl16370ems.html
Check time 0.004 seconds
D/L time   0.000 seconds
Size       784B
Info       3 URLs parsed.
Modified   2016-02-05 17:25:37.504484Z
Result     Valid

URL        `http://www.gwern.net/static/css/default.css'
Parent URL file:///tmp/burl16370ems.html, line 9, col 3
Real URL   file:///home/gwern/wiki/static/css/default.css
Check time 0.000 seconds
Size       11.33KB
Modified   2016-01-20 23:39:36.368695Z
Result     Valid
 1 URL active,     0 URLs queued,    2 URLs checked, runtime 1 seconds
 1 URL active,     0 URLs queued,    2 URLs checked, runtime 2 seconds

URL        `http://www.gwern.net/Longevity#metformin'
Name       `metformin for life-extension 1'
Parent URL file:///tmp/burl16370ems.html, line 13, col 5
Real URL   http://www.gwern.net/Longevity#metformin
Check time 1.513 seconds
D/L time   1.414 seconds
Size       41.38KB
Modified   2016-02-01 02:25:33.000000Z
Result     Valid: 200 OK

Statistics:
Downloaded: 41.38KB
Robots.txt cache: 0 hits, 1 miss
Number of domains: 2
Content types: 0 image, 3 text, 0 video, 0 audio, 0 application, 0 mail and 0 other.
URL lengths: min=29, max=46, avg=38.

That's it. 3 links checked. 0 warnings found. 0 errors found.
Stopped checking at 2016-02-05 12:30:45-004 (2 seconds)

My guess is that linkchecker internally rewrites double-slashes to single-slashes to get a valid URL and thus sees only 1 link to check (the valid one), even though this hides the existence of a link that is broken in every browser and tool I could check.
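
To make that guess concrete, here is a minimal Python sketch of the suspected mechanism. It is purely hypothetical and is not taken from linkchecker's source: the names collapse_slashes and unique_after_normalizing are invented for illustration, and the real code may work quite differently. It only shows how normalizing paths before deduplicating would leave a single URL to check.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url):
    """Hypothetical normalizer: collapse runs of '/' in the path to one '/'."""
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


def unique_after_normalizing(urls):
    """Deduplicate URLs by normalized form, keeping first-seen order."""
    seen = {}
    for url in urls:
        # Both spellings map to the same key, so the second one is dropped.
        seen.setdefault(collapse_slashes(url), url)
    return list(seen)


links = [
    "http://www.gwern.net/Longevity#metformin",
    "http://www.gwern.net//Longevity#metformin",
]
to_check = unique_after_normalizing(links)
print(to_check)
```

If something like this runs before the HTTP request, the double-slash URL is never requested as written, so the 404 that browsers see can never be reported.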

dpalic commented 7 years ago

Thank you for the issue report. Sadly, this project is dead, and a new team has taken over at https://github.com/linkcheck/linkchecker; for more details please see #708. Please also close this issue and, if the problem still persists, report it afresh on the new repo at https://github.com/linkcheck/linkchecker/issues