Note: I edited my original post from yesterday, which asked questions about linkchecker "errors". Since there were no replies, and my additional testing has basically resolved those questions, I have rewritten it today as comments. I could not delete this Issue, so this seemed a better option than leaving an uninformed initial post.
Update 2: I noticed that this is the older linkchecker repo (wummel/linkchecker), whereas I am using the newer https://github.com/linkchecker/linkchecker repo (system installed: Arch Linux / AUR). Nonetheless, I think my comments below should largely hold.
I am a novice linkchecker user, with an online domain (BuriedTruth.com) and tons of local HTML source files that I want to scan for URL errors.
linkchecker scans of my online website failed, likely due to PHP-based directory "non-browse" restrictions that I put in place some time ago.
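After some trial and error, I found the following. Although you can chain several files / domains in a single scan, it is better to scan them individually, as this gives finer-grained control over the directories / content you wish to scan, as well as the recursion depth. E.g. (commands are \-wrapped here for readability; the elapsed time follows each command):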
time linkchecker --check-extern -a -P 1 -r 1 --no-status --no-warnings -otext \
linkchecker-test_file.html | \
tee linkchecker_errors1.txt
0:02.62
time linkchecker --check-extern -a -P 1 -r 1 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/index.html | \
tee linkchecker_errors2.txt
0:01.18
## recursion (-r 1) not deep enough; increase depth one level (-r 2):
time linkchecker --check-extern -a -P 1 -r 1 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/docs | \
tee linkchecker_errors3.txt
0:02.73
## use this:
time linkchecker --check-extern -a -P 1 -r 2 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/docs | \
tee linkchecker_errors4.txt
12:29.33
time linkchecker --check-extern -a -P 5 -r 2 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/index.html.test --ignore-url='cnp_members.*' | \
tee linkchecker_errors5.txt
0:01.18
time linkchecker --check-extern -a -P 5 -r 2 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html | \
tee linkchecker_errors5.txt
0:01.97
time linkchecker --check-extern -a -P 5 -r 2 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file2.html | \
tee linkchecker_errors6.txt
0:02.10
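The one-target-per-run approach above could be scripted roughly as follows. This is only a sketch: the flags are copied verbatim from the commands above, while the function name scan_each, the LINKCHECKER override variable and the report naming scheme are made up for illustration.

```shell
#!/bin/sh
# Sketch: scan each target in its own linkchecker run, one report per target.
# Assumptions (not from the post): the function name, the LINKCHECKER
# override variable, and the report naming scheme are illustrative only.

scan_each() {
    depth="$1"      # recursion depth, passed to -r
    shift
    n=0
    for target in "$@"; do
        n=$((n + 1))
        report="linkchecker_errors${n}.txt"
        # Flags copied verbatim from the commands above.
        "${LINKCHECKER:-linkchecker}" --check-extern -a -P 5 -r "$depth" \
            --no-status --no-warnings -otext "$target" > "$report"
        # Show the exit status of each run next to its report name.
        echo "$report: exit status $?"
    done
}
```

It would be called with the depth first and then the targets, e.g. scan_each 2 /mnt/Vancouver/domains/buriedtruth.com/docs /mnt/Vancouver/domains/buriedtruth.com/index.html. Each target still gets its own run, so a slow or failing scan does not hold up the reports for the others.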
Source files to be scanned must have a .htm or similar extension: for example, linkchecker scans of this.html passed, while scans of this.html.test failed.
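If the extension really is what decides this (an assumption based only on the this.html / this.html.test observation above, not on the linkchecker documentation), a pre-check along these lines could flag or work around such files before scanning. Both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: check that a source file name ends in .htm/.html before scanning.
# Assumption: the extension is what made the scan fail; this has not been
# verified against the linkchecker documentation.

has_html_ext() {
    case "$1" in
        *.htm|*.html|*.HTM|*.HTML) return 0 ;;
        *) return 1 ;;
    esac
}

# Print a path ending in .htm/.html for the given file: the file itself if
# it already qualifies, otherwise a copy made next to it (same directory,
# so relative links inside the page still resolve).
html_path_for() {
    if has_html_ext "$1"; then
        echo "$1"
    else
        cp -- "$1" "$1.html" && echo "$1.html"
    fi
}
```

The path printed by html_path_for can then be handed to linkchecker in place of the original file name; the copy should be deleted afterwards so that it does not end up in later scans.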
A scan of the file linkchecker-test_file.html passed, as did a scan when that file was referenced as a URL inside a different HTML page [source files pasted below].
As many other users have commented here and elsewhere, verification of Wikipedia URLs is buggy, giving the so-called "443"-type errors. The problem appears to be external to linkchecker, as it also affects other software tools, Python packages, ... Bah!
HTTPSConnectionPool(host='en.wikipedia.org', port=443): Max retries exceeded with url:
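To see which of the reported errors are of this connection-pool kind, a saved text report (one of the files written by tee above) can be filtered. A rough sketch, assuming the -otext layout shown in the test output later in this post, where each entry has a URL line followed further down by a Result line; the function name is made up.

```shell
#!/bin/sh
# Sketch: list the URLs in a saved -otext report whose Result line contains
# "Max retries exceeded" (the connection-pool errors discussed above).
# Assumption: entries look like the sample output in this post, i.e. a
# "URL ..." line followed later by a "Result ..." line.

pool_error_urls() {
    awk -v sq="'" '
        # Remember the most recent URL, stripping the surrounding quotes.
        $1 == "URL"    { url = $2; gsub("[`" sq "]", "", url) }
        # Print it when the matching Result line reports the pool error.
        $1 == "Result" && /Max retries exceeded/ { print url }
    ' "$1"
}
```

For example, pool_error_urls linkchecker_errors5.txt prints one URL per line, leaving out entries that failed for other reasons such as "Hostname not found".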
Based on my initial impressions, linkchecker is, all in all, a very good and useful tool. :-) Much appreciated. :-)
[victoria@victoria linkchecker-tests]$ time linkchecker --check-extern -a -P 5 -r 2 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html | tee linkchecker_errors5.txt
LinkChecker 9.4.0 Copyright (C) 2000-2014 Bastian Kleineidam
LinkChecker comes with ABSOLUTELY NO WARRANTY!
This is free software, and you are welcome to redistribute it
under certain conditions. Look at the file `LICENSE' within this
distribution.
Get the newest version at http://wummel.github.io/linkchecker/
Write comments and bugs to https://github.com/wummel/linkchecker/issues
Support this project at http://wummel.github.io/linkchecker/donations.html
Start checking at 2020-10-10 11:14:27-007
URL `ftp://abc123.xyx/'
Name `ftp://abc123.xyx'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 18, col 6
Real URL ftp://abc123.xyx/
Check time 0.041 seconds
Result Error: Hostname not found
URL `ftp://def456.jkl/'
Name `ftp://def456.jkl'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 21, col 6
Real URL ftp://def456.jkl/
Check time 0.045 seconds
Result Error: Hostname not found
URL `https://abc123.xyx/'
Name `https://abc123.xyx'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 16, col 6
Real URL https://abc123.xyx/
Check time 0.086 seconds
Result Error: ConnectionError: HTTPSConnectionPool(host='abc123.xyx', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f52842850d0>: Failed to establish
a new connection: [Errno -...
URL `http://abc123.xyx/'
Name `http://abc123.xyx'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 17, col 6
Real URL http://abc123.xyx/
Check time 0.562 seconds
Result Error: ConnectionError: HTTPConnectionPool(host='abc123.xyx', port=80): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f52852d4a90>: Failed to establish
a new connection: [Errno -2] ...
URL `http://def456.jkl/'
Name `http://def456.jkl'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 20, col 6
Real URL http://def456.jkl/
Check time 0.563 seconds
Result Error: ConnectionError: HTTPConnectionPool(host='def456.jkl', port=80): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f528415b4d0>: Failed to establish
a new connection: [Errno -2] ...
URL `https://buriedtruth/'
Name `https://buriedtruth/'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 26, col 29
Real URL https://buriedtruth/
Check time 0.680 seconds
Result Error: ConnectionError: HTTPSConnectionPool(host='buriedtruth', port=443): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5284285090>: Failed to establish
a new connection: [Errno ...
URL `https://def456.jkl/'
Name `https://def456.jkl'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 19, col 6
Real URL https://def456.jkl/
Check time 0.686 seconds
Result Error: ConnectionError: HTTPSConnectionPool(host='def456.jkl', port=443): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5284285c50>: Failed to establish
a new connection: [Errno -...
Statistics:
Downloaded: 949B.
Content types: 0 image, 2 text, 0 video, 0 audio, 0 application, 0 mail and 7 other.
URL lengths: min=17, max=90, avg=26.
That's it. 9 links in 9 URLs checked. 0 warnings found. 7 errors found.
Stopped checking at 2020-10-10 11:14:27-007 (0.72 seconds)
Command exited with non-zero status 1
0:01.87
[victoria@victoria linkchecker-tests]$ time linkchecker --check-extern -a -P 5 -r 2 --no-status --no-warnings -otext \
/mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file2.html | tee linkchecker_errors8.txt
LinkChecker 9.4.0 Copyright (C) 2000-2014 Bastian Kleineidam
LinkChecker comes with ABSOLUTELY NO WARRANTY!
This is free software, and you are welcome to redistribute it
under certain conditions. Look at the file `LICENSE' within this
distribution.
Get the newest version at http://wummel.github.io/linkchecker/
Write comments and bugs to https://github.com/wummel/linkchecker/issues
Support this project at http://wummel.github.io/linkchecker/donations.html
Start checking at 2020-10-10 11:14:43-007
URL `ftp://abc123.xyx/'
Name `ftp://abc123.xyx'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 18, col 6
Real URL ftp://abc123.xyx/
Check time 0.043 seconds
Result Error: Hostname not found
URL `ftp://def456.jkl/'
Name `ftp://def456.jkl'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 21, col 6
Real URL ftp://def456.jkl/
Check time 0.043 seconds
Result Error: Hostname not found
URL `https://buriedtruth/'
Name `https://buriedtruth/'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file2.html, line 21, col 29
Real URL https://buriedtruth/
Check time 0.087 seconds
Result Error: ConnectionError: HTTPSConnectionPool(host='buriedtruth', port=443): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f177019ae90>: Failed to establish
a new connection: [Errno ...
URL `https://abc123.xyx/'
Name `https://abc123.xyx'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 16, col 6
Real URL https://abc123.xyx/
Check time 0.086 seconds
Result Error: ConnectionError: HTTPSConnectionPool(host='abc123.xyx', port=443): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f17721e49d0>: Failed to establish
a new connection: [Errno -...
URL `http://abc123.xyx/'
Name `http://abc123.xyx'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 17, col 6
Real URL http://abc123.xyx/
Check time 0.595 seconds
Result Error: ConnectionError: HTTPConnectionPool(host='abc123.xyx', port=80): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f17701b0690>: Failed to establish
a new connection: [Errno -2] ...
URL `http://def456.jkl/'
Name `http://def456.jkl'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 20, col 6
Real URL http://def456.jkl/
Check time 0.593 seconds
Result Error: ConnectionError: HTTPConnectionPool(host='def456.jkl', port=80): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f177014b1d0>: Failed to establish
a new connection: [Errno -2] ...
URL `https://def456.jkl/'
Name `https://def456.jkl'
Parent URL file:///mnt/Vancouver/domains/buriedtruth.com/linkchecker-tests/linkchecker-test_file.html, line 19, col 6
Real URL https://def456.jkl/
Check time 0.971 seconds
Result Error: ConnectionError: HTTPSConnectionPool(host='def456.jkl', port=443): Max retries exceeded with url: /
(Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f17701b08d0>: Failed to establish
a new connection: [Errno -...
Statistics:
Downloaded: 1KB.
Content types: 0 image, 3 text, 0 video, 0 audio, 0 application, 0 mail and 7 other.
URL lengths: min=17, max=91, avg=33.
That's it. 10 links in 10 URLs checked. 0 warnings found. 7 errors found.
Stopped checking at 2020-10-10 11:14:44-007 (1 seconds)
Command exited with non-zero status 1
0:02.18
[victoria@victoria linkchecker-tests]$
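The summary line near the end of each report ("That's it. ... 7 errors found.") is easy to pull out of the saved files, e.g. to compare runs. A small sketch, assuming the summary wording shown in the output above; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: print the error count from the summary line of a saved report.
# Assumption: the summary reads "... N errors found." as in the output above
# ("1 error found." is accepted as well).

report_error_count() {
    sed -n 's/^That.s it\..* \([0-9][0-9]*\) errors\{0,1\} found\..*$/\1/p' "$1"
}
```

Running report_error_count linkchecker_errors5.txt on a report like the first one above would print the number taken from its summary line; a file without a summary line prints nothing.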
A quick-fix attempt for the failed scans of my online website, adding User-Agent: LinkChecker to robots.txt (per https://development.robinwinslow.uk/2013/10/03/linkchecker/), did not seem to help. Regardless, I'd prefer scanning local files, then correcting online errors.