wummel / linkchecker

check links in web documents or full websites
http://wummel.github.io/linkchecker/
GNU General Public License v2.0

LinkChecker won't recurse into site #680

Open SilkAndSlug opened 7 years ago

SilkAndSlug commented 7 years ago

I don't know why LinkChecker 9.3-4 isn't recursing into my site when it does recurse into others. LinkChecker 8.6 checked 182 links on the same site with the equivalent config. Any advice?

This works:

linkchecker -f ../linkchecker.rc http://www.google.de

This does not:

linkchecker -f ../linkchecker.rc http://admin.vikingsonline.org.uk

This is my config:

[HtmlSyntaxCheck]

This is the output (with -D all). Note that I'm also affected by Issue #601 (erroneous ascii-error).

DEBUG 2016-09-30 12:20:30,017 MainThread Python 2.7.9 (default, Jun 29 2016, 13:08:31) [GCC 4.9.2] on linux2
DEBUG 2016-09-30 12:20:30,017 MainThread reading configuration from ['../linkchecker.rc']
INFO 2016-09-30 12:20:30,028 MainThread Checking intern URLs only; use --check-extern to check extern URLs.
DEBUG 2016-09-30 12:20:30,036 MainThread configuration: [('HtmlSyntaxCheck', None), ('aborttimeout', 300), ('allowedschemes', []), ('authentication', []), ('blacklist', {}), ('checkextern', False), ('cookiefile', None), ('csv', {}), ('debugmemory', False), ('dot', {}), ('enabledplugins', ['HtmlSyntaxCheck']), ('externlinks', []), ('fileoutput', []), ('gml', {}), ('gxml', {}), ('html', {}), ('ignorewarnings', []), ('internlinks', []), ('localwebroot', None), ('logger', 'TextLogger'), ('loginextrafields', {}), ('loginpasswordfield', 'password'), ('loginurl', None), ('loginuserfield', 'login'), ('maxfilesizedownload', 5242880), ('maxfilesizeparse', 1048576), ('maxhttpredirects', 10), ('maxnumurls', None), ('maxrequestspersecond', 10), ('maxrunseconds', None), ('nntpserver', None), ('none', {}), ('output', 'text'), ('pluginfolders', []), ('proxy', {}), ('quiet', False), ('recursionlevel', -1), ('sitemap', {}), ('sql', {}), ('sslverify', True), ('status', True), ('status_wait_seconds', 5), ('text', {}), ('threads', 10), ('timeout', 60), ('trace', False), ('useragent', u'Mozilla/5.0 (compatible; LinkChecker/9.3; +http://wummel.github.io/linkchecker/)'), ('verbose', False), ('warnings', True), ('xml', {})]
DEBUG 2016-09-30 12:20:30,037 MainThread Enable content plugin HtmlSyntaxCheck
DEBUG 2016-09-30 12:20:30,038 MainThread HttpUrl handles url https://admin.vikingsonline.org.uk
DEBUG 2016-09-30 12:20:30,038 MainThread checking syntax
DEBUG 2016-09-30 12:20:30,039 MainThread Add intern pattern u'^https?://(www.|)admin.vikingsonline.org.uk'
DEBUG 2016-09-30 12:20:30,040 MainThread Link pattern u'^https?://(www.|)admin.vikingsonline.org.uk' strict=False
DEBUG 2016-09-30 12:20:30,041 MainThread queueing https://admin.vikingsonline.org.uk
LinkChecker 9.3 Copyright (C) 2000-2014 Bastian Kleineidam
LinkChecker comes with ABSOLUTELY NO WARRANTY!
This is free software, and you are welcome to redistribute it under certain conditions. Look at the file `LICENSE' within this distribution.
Get the newest version at http://wummel.github.io/linkchecker/
Write comments and bugs to https://github.com/wummel/linkchecker/issues
Support this project at http://wummel.github.io/linkchecker/donations.html

Start checking at 2016-09-30 12:20:30+001
DEBUG 2016-09-30 12:20:30,057 CheckThread-https://admin.vikingsonline.org.uk Checking https link base_url=u'https://admin.vikingsonline.org.uk' parent_url=None base_ref=None recursion_level=0 url_connection=None line=0 column=0 page=0 name=u'' anchor=u'' cache_url=https://admin.vikingsonline.org.uk
DEBUG 2016-09-30 12:20:30,060 CheckThread-https://admin.vikingsonline.org.uk checking connection
1 thread active, 0 links queued, 0 links in 0 URLs checked, runtime 1 seconds
DEBUG 2016-09-30 12:20:35,689 CheckThread-https://admin.vikingsonline.org.uk u'https://admin.vikingsonline.org.uk/robots.txt' parse lines
DEBUG 2016-09-30 12:20:35,689 CheckThread-https://admin.vikingsonline.org.uk Parsed rules:

DEBUG 2016-09-30 12:20:35,690 CheckThread-https://admin.vikingsonline.org.uk u'https://admin.vikingsonline.org.uk/robots.txt' check allowance for: user agent: u'Mozilla/5.0 (compatible; LinkChecker/9.3; +http://wummel.github.io/linkchecker/)' url: u'https://admin.vikingsonline.org.uk' ...
DEBUG 2016-09-30 12:20:35,690 CheckThread-https://admin.vikingsonline.org.uk ... agent not found, allow.
DEBUG 2016-09-30 12:20:35,690 CheckThread-https://admin.vikingsonline.org.uk Prepare request with {'url': u'https://admin.vikingsonline.org.uk', 'headers': {}, 'method': 'GET'}
DEBUG 2016-09-30 12:20:35,691 CheckThread-https://admin.vikingsonline.org.uk Send request with {'allow_redirects': False, 'stream': True, 'verify': True, 'timeout': 60}
1 thread active, 0 links queued, 0 links in 0 URLs checked, runtime 6 seconds
1 thread active, 0 links queued, 0 links in 0 URLs checked, runtime 11 seconds
1 thread active, 0 links queued, 0 links in 0 URLs checked, runtime 16 seconds
DEBUG 2016-09-30 12:20:46,530 CheckThread-https://admin.vikingsonline.org.uk Got SSL certificate {'subjectAltName': [('DNS', 'admin.vikingsonline.org.uk')], 'subject': ((('commonName', u'admin.vikingsonline.org.uk'),),), 'notAfter': 'Nov 28 00:22:00 2016 GMT'}
DEBUG 2016-09-30 12:20:46,531 CheckThread-https://admin.vikingsonline.org.uk follow all redirections
DEBUG 2016-09-30 12:20:46,531 CheckThread-https://admin.vikingsonline.org.uk Run plugin HtmlSyntaxCheck
1 thread active, 0 links queued, 0 links in 0 URLs checked, runtime 21 seconds
WARNING 2016-09-30 12:20:53,924 CheckThread-https://admin.vikingsonline.org.uk HTML syntax check plugin error: 'ascii' codec can't encode character u'\u2026' in position 1433: ordinal not in range(128)
DEBUG 2016-09-30 12:20:53,924 CheckThread-https://admin.vikingsonline.org.uk checking recursion of u'https://admin.vikingsonline.org.uk' ...
DEBUG 2016-09-30 12:20:53,924 CheckThread-https://admin.vikingsonline.org.uk meta robots finder
DEBUG 2016-09-30 12:20:53,925 CheckThread-https://admin.vikingsonline.org.uk Get content of u'https://admin.vikingsonline.org.uk'
DEBUG 2016-09-30 12:20:53,926 CheckThread-https://admin.vikingsonline.org.uk ... yes, recursion.
DEBUG 2016-09-30 12:20:53,926 CheckThread-https://admin.vikingsonline.org.uk LinkFinder tag pre attrs {}
DEBUG 2016-09-30 12:20:53,926 CheckThread-https://admin.vikingsonline.org.uk line 3 col 6 old line 3 old col 1
DEBUG 2016-09-30 12:20:53,926 CheckThread-https://admin.vikingsonline.org.uk LinkFinder finished tag pre
DEBUG 2016-09-30 12:20:53,927 CheckThread-https://admin.vikingsonline.org.uk LinkFinder tag br attrs {}
DEBUG 2016-09-30 12:20:53,927 CheckThread-https://admin.vikingsonline.org.uk line 3 col 23 old line 3 old col 17
DEBUG 2016-09-30 12:20:53,927 CheckThread-https://admin.vikingsonline.org.uk LinkFinder finished tag br
DEBUG 2016-09-30 12:20:53,927 CheckThread-https://admin.vikingsonline.org.uk LinkFinder tag br attrs {}
DEBUG 2016-09-30 12:20:53,927 CheckThread-https://admin.vikingsonline.org.uk line 43 col 13 old line 43 old col 7
DEBUG 2016-09-30 12:20:53,928 CheckThread-https://admin.vikingsonline.org.uk LinkFinder finished tag br
DEBUG 2016-09-30 12:20:53,928 CheckThread-https://admin.vikingsonline.org.uk task_done https://admin.vikingsonline.org.uk

Statistics:
Downloaded: 846B.
Content types: 0 image, 1 text, 0 video, 0 audio, 0 application, 0 mail and 0 other.
URL lengths: min=34, max=34, avg=34.

That's it. 1 link in 1 URL checked. 0 warnings found. 0 errors found.
Stopped checking at 2016-09-30 12:20:53+001 (23 seconds)
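
In the debug output above, LinkFinder reports only one `pre` tag and two `br` tags, and the statistics show just 846B downloaded. One possible reading (an assumption, not something confirmed in this thread) is that the response LinkChecker received contained no links to recurse into. A minimal Python 3 sketch for checking a saved copy of the response is below. It uses only the standard library `html.parser` module; `TagAndLinkLister`, `list_tags_and_links` and `SAMPLE` are hypothetical names, and the sample HTML is a stand-in, not the real page content.

```python
from html.parser import HTMLParser


class TagAndLinkLister(HTMLParser):
    """Record every start tag and every non-empty href/src value seen."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called for <tag ...> and, by default, for self-closing <tag ... />.
        self.tags.append(tag)
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def list_tags_and_links(html_text):
    """Return (start tags, link targets) found in an HTML string."""
    parser = TagAndLinkLister()
    parser.feed(html_text)
    parser.close()
    return parser.tags, parser.links


# Hypothetical stand-in for the 846-byte response; NOT the real page.
SAMPLE = "<pre>Some error text<br>more text\n<br></pre>"

if __name__ == "__main__":
    tags, links = list_tags_and_links(SAMPLE)
    print("tags:", tags)
    print("links:", links)
```

If the saved response yields an empty link list like the sample does, the lack of recursion would follow from the page content rather than from the recursion settings. Comparing that response with what a browser receives for the same URL would then be a reasonable next step.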

dpalic commented 6 years ago

Thank you for the issue report. Sadly, this project is dead, and a new team now maintains it at https://github.com/linkcheck/linkchecker. For more details, please see #708. Please also close this issue and report it afresh on the new repo: https://github.com/linkcheck/linkchecker/issues