wummel / linkchecker

check links in web documents or full websites
http://wummel.github.io/linkchecker/
GNU General Public License v2.0

Checking local files [usage notes] #797

Closed: victoriastuart closed this issue 3 years ago

victoriastuart commented 3 years ago

Note: this post was edited from yesterday's original, which asked questions about linkchecker "errors". As there were no replies, and my additional testing basically resolved those questions, I have rewritten it today as usage notes. Since I could not delete this Issue, that seemed a better option than leaving an uninformed initial post.


Update 2: I noticed that this is the older linkchecker repo (u/wummel) and I am using (system installed: Arch Linux / AUR) the newer https://github.com/linkchecker/linkchecker repo ... Nonetheless, I think my comments below should largely hold.


I am a novice linkchecker user, with an online domain (BuriedTruth.com) and tons of local HTML source files that I want to scan for URL errors.

linkchecker scans of my online website failed, likely due to the PHP-based directory "non-browse" restrictions I placed there some time ago.

A quick-fix attempt, adding "User-Agent: LinkChecker" to robots.txt (per https://development.robinwinslow.uk/2013/10/03/linkchecker/), did not seem to help. Regardless, I'd prefer scanning local files, then correcting online errors.

After some trial and error, I found:

time linkchecker --check-extern -a -P 1 -r 1 --no-status --no-warnings -otext \
linkchecker-test_file.html | \
tee linkchecker_errors1.txt
  0:02.62

time linkchecker --check-extern -a -P 1 -r 1 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/index.html | \
tee linkchecker_errors2.txt
  0:01.18

## for a directory, recursion depth 1 (-r 1) is not deep enough; increase the depth one level (-r 2, below):
time linkchecker --check-extern -a -P 1 -r 1 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/docs | \
tee linkchecker_errors3.txt
  0:02.73

## use this:
time linkchecker --check-extern -a -P 1 -r 2 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/docs | \
tee linkchecker_errors4.txt
  12:29.33

time linkchecker --check-extern -a -P 5 -r 2 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/index.html.test --ignore-url='cnp_members.*' | \
tee linkchecker_errors5.txt
  0:01.18

time linkchecker --check-extern -a -P 5 -r 2 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html | \
tee linkchecker_errors5.txt
  0:01.97

time linkchecker --check-extern -a -P 5 -r 2 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file2.html | \
tee linkchecker_errors6.txt
  0:02.10
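One thing to keep in mind with the commands above: they all pipe into tee, and in a plain POSIX shell the exit status of a pipeline is that of its last command (tee), not of linkchecker. The following is only a minimal sketch of one way around that, assuming a hypothetical helper named run_and_log (my own name, not part of linkchecker). It saves the output to a file, prints it, and returns the status of the command it ran. Unlike tee, it prints the output only after the command has finished.

```shell
#!/bin/sh
# run_and_log LOGFILE COMMAND [ARGS...]
# Hypothetical helper (not part of linkchecker): runs COMMAND, saves its
# standard output to LOGFILE, prints the saved output, and then returns
# COMMAND's own exit status instead of the status of a trailing "tee".
run_and_log() {
    log=$1
    shift
    "$@" > "$log"
    rc=$?
    cat "$log"
    return "$rc"
}

# Example use, with the same options as the commands above:
#   run_and_log linkchecker_errors5.txt \
#       linkchecker --check-extern -a -P 5 -r 2 --no-status --no-warnings -otext \
#       /mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html
#   echo "linkchecker exit status: $?"
```

This could be handy in a script that should stop, or send a notification, only when a scan actually reports errors.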

Based on my initial impressions, linkchecker is a very good and useful tool. :-) Much appreciated. :-)


linkchecker-test_file.html

<!DOCTYPE html>
<html xmlns="https://www.w3.org/1999/xhtml" lang="en-US">

<HTML>
<HEAD>
  <meta charset="UTF-8" />
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <TITLE>`linkchecker` HTML Test File</TITLE>
</HEAD>

<BODY style="background-color:#f8f9f9">
<UL>

<H1>`linkchecker` HTML Test File</H1>

<li> <a href="https://abc123.xyx/">https://abc123.xyx</a>
<li> <a href="http://abc123.xyx/">http://abc123.xyx</a>
<li> <a href="ftp://abc123.xyx/">ftp://abc123.xyx</a>
<li> <a href="https://def456.jkl/">https://def456.jkl</a>
<li> <a href="http://def456.jkl/">http://def456.jkl</a>
<li> <a href="ftp://def456.jkl/">ftp://def456.jkl</a>

<p><hr size="2" color="lightgrey" width=75% align=left></p>

<li> <p><b>Working URL: <a href="https://buriedtruth.com/">https://buriedtruth.com/</a></b></p>
<li> <p><b>Non-working URL: <a href="https://buriedtruth/">https://buriedtruth/</a></b></p>

<br /><br />
</UL>
</BODY>
</HTML>

linkchecker-test_file2.html

<!DOCTYPE html>
<html xmlns="https://www.w3.org/1999/xhtml" lang="en-US">

<HTML>
<HEAD>
  <meta charset="UTF-8" />
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <TITLE>`linkchecker` HTML Test File</TITLE>
</HEAD>

<BODY style="background-color:#f8f9f9">
<UL>

<H1>`linkchecker` HTML Test File</H1>

<a href="/mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html">/mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html</a>

<p><hr size="2" color="lightgrey" width=75% align=left></p>

<li> <p><b>Working URL: <a href="https://buriedtruth.com/">https://buriedtruth.com/</a></b></p>
<li> <p><b>Non-working URL: <a href="https://buriedtruth/">https://buriedtruth/</a></b></p>

<br /><br />
</UL>
</BODY>
</HTML>

Tests

[victoria@victoria linkchecker-tests]$ time linkchecker --check-extern -a -P 5 -r 2 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html | tee linkchecker_errors5.txt

LinkChecker 9.4.0              Copyright (C) 2000-2014 Bastian Kleineidam
LinkChecker comes with ABSOLUTELY NO WARRANTY!
This is free software, and you are welcome to redistribute it
under certain conditions. Look at the file `LICENSE' within this
distribution.
Get the newest version at http://wummel.github.io/linkchecker/
Write comments and bugs to https://github.com/wummel/linkchecker/issues
Support this project at http://wummel.github.io/linkchecker/donations.html

Start checking at 2020-10-10 11:14:27-007

URL        `ftp://abc123.xyx/'
Name       `ftp://abc123.xyx'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 18, col 6
Real URL   ftp://abc123.xyx/
Check time 0.041 seconds
Result     Error: Hostname not found

URL        `ftp://def456.jkl/'
Name       `ftp://def456.jkl'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 21, col 6
Real URL   ftp://def456.jkl/
Check time 0.045 seconds
Result     Error: Hostname not found

URL        `https://abc123.xyx/'
Name       `https://abc123.xyx'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 16, col 6
Real URL   https://abc123.xyx/
Check time 0.086 seconds
Result     Error: ConnectionError: HTTPSConnectionPool(host='abc123.xyx', port=443): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f52842850d0>: Failed to establish
a new connection: [Errno -...

URL        `http://abc123.xyx/'
Name       `http://abc123.xyx'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 17, col 6
Real URL   http://abc123.xyx/
Check time 0.562 seconds
Result     Error: ConnectionError: HTTPConnectionPool(host='abc123.xyx', port=80): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f52852d4a90>: Failed to establish
a new connection: [Errno -2] ...

URL        `http://def456.jkl/'
Name       `http://def456.jkl'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 20, col 6
Real URL   http://def456.jkl/
Check time 0.563 seconds
Result     Error: ConnectionError: HTTPConnectionPool(host='def456.jkl', port=80): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f528415b4d0>: Failed to establish
a new connection: [Errno -2] ...

URL        `https://buriedtruth/'
Name       `https://buriedtruth/'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 26, col 29
Real URL   https://buriedtruth/
Check time 0.680 seconds
Result     Error: ConnectionError: HTTPSConnectionPool(host='buriedtruth', port=443): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5284285090>: Failed to establish
a new connection: [Errno ...

URL        `https://def456.jkl/'
Name       `https://def456.jkl'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 19, col 6
Real URL   https://def456.jkl/
Check time 0.686 seconds
Result     Error: ConnectionError: HTTPSConnectionPool(host='def456.jkl', port=443): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5284285c50>: Failed to establish
a new connection: [Errno -...

Statistics:
Downloaded: 949B.
Content types: 0 image, 2 text, 0 video, 0 audio, 0 application, 0 mail and 7 other.
URL lengths: min=17, max=90, avg=26.

That's it. 9 links in 9 URLs checked. 0 warnings found. 7 errors found.
Stopped checking at 2020-10-10 11:14:27-007 (0.72 seconds)
Command exited with non-zero status 1
0:01.87

[victoria@victoria linkchecker-tests]$ time linkchecker --check-extern -a -P 5 -r 2 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file2.html | tee linkchecker_errors8.txt

LinkChecker 9.4.0              Copyright (C) 2000-2014 Bastian Kleineidam
LinkChecker comes with ABSOLUTELY NO WARRANTY!
This is free software, and you are welcome to redistribute it
under certain conditions. Look at the file `LICENSE' within this
distribution.
Get the newest version at http://wummel.github.io/linkchecker/
Write comments and bugs to https://github.com/wummel/linkchecker/issues
Support this project at http://wummel.github.io/linkchecker/donations.html

Start checking at 2020-10-10 11:14:43-007

URL        `ftp://abc123.xyx/'
Name       `ftp://abc123.xyx'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 18, col 6
Real URL   ftp://abc123.xyx/
Check time 0.043 seconds
Result     Error: Hostname not found

URL        `ftp://def456.jkl/'
Name       `ftp://def456.jkl'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 21, col 6
Real URL   ftp://def456.jkl/
Check time 0.043 seconds
Result     Error: Hostname not found

URL        `https://buriedtruth/'
Name       `https://buriedtruth/'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file2.html, line 21, col 29
Real URL   https://buriedtruth/
Check time 0.087 seconds
Result     Error: ConnectionError: HTTPSConnectionPool(host='buriedtruth', port=443): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f177019ae90>: Failed to establish
a new connection: [Errno ...

URL        `https://abc123.xyx/'
Name       `https://abc123.xyx'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 16, col 6
Real URL   https://abc123.xyx/
Check time 0.086 seconds
Result     Error: ConnectionError: HTTPSConnectionPool(host='abc123.xyx', port=443): Max retries exceeded with url: / 
(Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f17721e49d0>: Failed to establish
a new connection: [Errno -...

URL        `http://abc123.xyx/'
Name       `http://abc123.xyx'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 17, col 6
Real URL   http://abc123.xyx/
Check time 0.595 seconds
Result     Error: ConnectionError: HTTPConnectionPool(host='abc123.xyx', port=80): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f17701b0690>: Failed to establish
a new connection: [Errno -2] ...

URL        `http://def456.jkl/'
Name       `http://def456.jkl'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 20, col 6
Real URL   http://def456.jkl/
Check time 0.593 seconds
Result     Error: ConnectionError: HTTPConnectionPool(host='def456.jkl', port=80): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f177014b1d0>: Failed to establish
a new connection: [Errno -2] ...

URL        `https://def456.jkl/'
Name       `https://def456.jkl'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 19, col 6
Real URL   https://def456.jkl/
Check time 0.971 seconds
Result     Error: ConnectionError: HTTPSConnectionPool(host='def456.jkl', port=443): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f17701b08d0>: Failed to establish
a new connection: [Errno -...

Statistics:
Downloaded: 1KB.
Content types: 0 image, 3 text, 0 video, 0 audio, 0 application, 0 mail and 7 other.
URL lengths: min=17, max=91, avg=33.

That's it. 10 links in 10 URLs checked. 0 warnings found. 7 errors found.
Stopped checking at 2020-10-10 11:14:44-007 (1 seconds)
Command exited with non-zero status 1
0:02.18
[victoria@victoria linkchecker-tests]$
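The text reports above get long quickly, so a condensed view of the saved files can help. The following is only a rough sketch, assuming the "Real URL" and "Result" line layout shown in the output above; summarize_report is a hypothetical helper name of my own, not something linkchecker provides. It prints one tab-separated line per reported URL, keeping only the first line of each (possibly wrapped) result message.

```shell
#!/bin/sh
# summarize_report FILE
# Hypothetical helper: reads a saved linkchecker text report (-otext) and
# prints "<real url><TAB><first line of result>" for every reported URL.
# Assumes each record has a "Real URL" line followed by a "Result" line,
# as in the output shown above.
summarize_report() {
    awk '
        /^Real URL / { sub(/^Real URL +/, ""); url = $0 }
        /^Result /   { sub(/^Result +/, ""); print url "\t" $0 }
    ' "$1"
}

# Example use on one of the files written by tee:
#   summarize_report linkchecker_errors5.txt
```

Because the output is one line per URL, it can be passed on to sort, grep, or cut, for example to list only the "Hostname not found" entries.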