wummel / linkchecker

check links in web documents or full websites
http://wummel.github.io/linkchecker/
GNU General Public License v2.0

Linkchecker internal error (apostrophe handling?) #732

Open chrishanretty opened 7 years ago

chrishanretty commented 7 years ago

I'm reporting an internal error as requested. Full output is below. The error was repeated several times with pages on this site; the common factor seems to be the presence of an apostrophe in the URL.

** Oops, I did it again. *****

You have found an internal error in LinkChecker. Please write a bug report at https://github.com/wummel/linkchecker/issues and include the following information:

When using the commandline client:

Not disclosing some of the information above due to privacy reasons is ok. I will try to help you nonetheless, but you have to give me something I can work with ;) .

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/linkcheck/director/checker.py", line 104, in check_url
    line: self.check_url_data(url_data)
    locals:
      self = <Checker(CheckThread-https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may%27s-gamble-at-the-polls-failed.aspx?nomobile=0, started 140621508146944)>
      self.check_url_data = <bound method Checker.check_url_data of <Checker(CheckThread-https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may%27s-gamble-at-the-polls-failed.aspx?nomobile=0, started 140621508146944)>>
      url_data = <https link, base_url=u'?nomobile=0', parent_url=u'https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may%27s-gamble-at-the-polls-failed.aspx', base_ref=None, recursion_level=3, url_connection=None, line=613, column=49, page=0, name=u'Mobile site view', anchor=u...
  File "/usr/lib/python2.7/dist-packages/linkcheck/director/checker.py", line 120, in check_url_data
    line: check_url(url_data, self.logger)
    locals:
      check_url = <function check_url at 0x7fe5007c7230>
      url_data = <https link, base_url=u'?nomobile=0', parent_url=u'https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may%27s-gamble-at-the-polls-failed.aspx', base_ref=None, recursion_level=3, url_connection=None, line=613, column=49, page=0, name=u'Mobile site view', anchor=u...
      self = <Checker(CheckThread-https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may%27s-gamble-at-the-polls-failed.aspx?nomobile=0, started 140621508146944)>
      self.logger = <linkcheck.director.logger.Logger object at 0x7fe501af9e50>
  File "/usr/lib/python2.7/dist-packages/linkcheck/director/checker.py", line 64, in check_url
    line: parser.parse_url(url_data)
    locals:
      parser = <module 'linkcheck.parser' from '/usr/lib/python2.7/dist-packages/linkcheck/parser/__init__.pyc'>
      parser.parse_url = <function parse_url at 0x7fe5007bf848>
      url_data = <https link, base_url=u'?nomobile=0', parent_url=u'https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may%27s-gamble-at-the-polls-failed.aspx', base_ref=None, recursion_level=3, url_connection=None, line=613, column=49, page=0, name=u'Mobile site view', anchor=u...
  File "/usr/lib/python2.7/dist-packages/linkcheck/parser/__init__.py", line 39, in parse_url
    line: globals()funcname
    locals:
      globals =
      funcname = 'parse_html', len = 10
      url_data = <https link, base_url=u'?nomobile=0', parent_url=u'https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may%27s-gamble-at-the-polls-failed.aspx', base_ref=None, recursion_level=3, url_connection=None, line=613, column=49, page=0, name=u'Mobile site view', anchor=u...
  File "/usr/lib/python2.7/dist-packages/linkcheck/parser/__init__.py", line 48, in parse_html
    line: find_links(url_data, url_data.add_url, linkparse.LinkTags)
    locals:
      find_links = <function find_links at 0x7fe5007bfc80>
      url_data = <https link, base_url=u'?nomobile=0', parent_url=u'https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may%27s-gamble-at-the-polls-failed.aspx', base_ref=None, recursion_level=3, url_connection=None, line=613, column=49, page=0, name=u'Mobile site view', anchor=u...
      url_data.add_url = <bound method HttpUrl.add_url of <https link, base_url=u'?nomobile=0', parent_url=u'https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may%27s-gamble-at-the-polls-failed.aspx', base_ref=None, recursion_level=3, url_connection=None, line=613, column=49, page=0, n...
      linkparse = <module 'linkcheck.htmlutil.linkparse' from '/usr/lib/python2.7/dist-packages/linkcheck/htmlutil/linkparse.pyc'>
      linkparse.LinkTags = {'tr': [u'background'], 'q': [u'cite'], 'meta': [u'content', u'href'], 'isindex': [u'action'], 'track': [u'src'], 'applet': [u'archive', u'src'], 'object': [u'classid', u'data', u'archive', u'usemap', u'codebase'], None: [u'style', u'itemtype'], 'layer': [u'background', u'src'], 'html': [u'manife..., len = 35
  File "/usr/lib/python2.7/dist-packages/linkcheck/parser/__init__.py", line 126, in find_links
    line: parser.feed(url_data.get_content())
    locals:
      parser = <linkcheck.HtmlParser.htmlsax.parser object at 0x7fe4cb190418>
      parser.feed = <built-in method feed of linkcheck.HtmlParser.htmlsax.parser object at 0x7fe4cb190418>
      url_data = <https link, base_url=u'?nomobile=0', parent_url=u'https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may%27s-gamble-at-the-polls-failed.aspx', base_ref=None, recursion_level=3, url_connection=None, line=613, column=49, page=0, name=u'Mobile site view', anchor=u...
      url_data.get_content = <bound method HttpUrl.get_content of <https link, base_url=u'?nomobile=0', parent_url=u'https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may%27s-gamble-at-the-polls-failed.aspx', base_ref=None, recursion_level=3, url_connection=None, line=613, column=49, page=...
  File "/usr/lib/python2.7/dist-packages/linkcheck/htmlutil/linkparse.py", line 231, in start_element
    line: self.parse_tag(tag, attr, value, name, base)
    locals:
      self = <linkcheck.htmlutil.linkparse.LinkFinder object at 0x7fe4dcec1910>
      self.parse_tag = <bound method LinkFinder.parse_tag of <linkcheck.htmlutil.linkparse.LinkFinder object at 0x7fe4dcec1910>>
      tag = u'link'
      attr = u'href'
      value = u'/siteelements/styles/100-system.css?version=2692258?version=2692258', len = 67
      name = u''
      base = u''
  File "/usr/lib/python2.7/dist-packages/linkcheck/htmlutil/linkparse.py", line 277, in parse_tag
    line: self.found_url(value, name, base)
    locals:
      self = <linkcheck.htmlutil.linkparse.LinkFinder object at 0x7fe4dcec1910>
      self.found_url = <bound method LinkFinder.found_url of <linkcheck.htmlutil.linkparse.LinkFinder object at 0x7fe4dcec1910>>
      value = u'/siteelements/styles/100-system.css?version=2692258?version=2692258', len = 67
      name = u''
      base = u''
  File "/usr/lib/python2.7/dist-packages/linkcheck/htmlutil/linkparse.py", line 283, in found_url
    line: column=self.parser.last_column(), name=name, base=base)
    locals:
      column =
      self = <linkcheck.htmlutil.linkparse.LinkFinder object at 0x7fe4dcec1910>
      self.parser = <linkcheck.HtmlParser.htmlsax.parser object at 0x7fe4cb190418>
      self.parser.last_column = <built-in method last_column of linkcheck.HtmlParser.htmlsax.parser object at 0x7fe4cb190418>
      name = u''
      base = u''
  File "/usr/lib/python2.7/dist-packages/linkcheck/checker/urlbase.py", line 653, in add_url
    line: page=page, name=name, parent_content_type=self.content_type)
    locals:
      page = 0
      name = u''
      parent_content_type =
      self = <https link, base_url=u'?nomobile=0', parent_url=u'https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may%27s-gamble-at-the-polls-failed.aspx', base_ref=None, recursion_level=3, url_connection=None, line=613, column=49, page=0, name=u'Mobile site view', anchor=u...
      self.content_type = 'text/html', len = 9
  File "/usr/lib/python2.7/dist-packages/linkcheck/checker/__init__.py", line 125, in get_url_from
    line: line=line, column=column, page=page, name=name, extern=extern)
    locals:
      line = 8
      column = 422
      page = 0
      name = u''
      extern = None
  File "/usr/lib/python2.7/dist-packages/linkcheck/checker/urlbase.py", line 117, in init
    line: aggregate, line, column, page, name, url_encoding, extern)
    locals:
      aggregate = <linkcheck.director.aggregator.Aggregate object at 0x7fe501af9610>
      line = 8
      column = 422
      page = 0
      name = u''
      url_encoding = None
      extern = None
  File "/usr/lib/python2.7/dist-packages/linkcheck/checker/urlbase.py", line 157, in init
    line: "unquoted parent URL %r" % self.parent_url
    locals:
      self = <None link, base_url=u'/siteelements/styles/100-system.css?version=2692258?version=2692258', parent_url=u"https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may's-gamble-at-the-polls-failed.aspx?478490430", base_ref=None, recursion_level=4, url_connection=None, ...
      self.parent_url = u"https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may's-gamble-at-the-polls-failed.aspx?478490430", len = 133
AssertionError: unquoted parent URL u"https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/why-theresa-may's-gamble-at-the-polls-failed.aspx?478490430"

System info:
LinkChecker 9.3 Released on: 16.7.2014
Python 2.7.13 (default, Jan 19 2017, 14:48:08) [GCC 6.3.0 20170118] on linux2
Requests: 2.10.0
Modules: Sqlite
Local time: 2017-08-29 12:35:20+001
sys.argv: ['/usr/bin/linkchecker', 'https://www.royalholloway.ac.uk/politicsandir/home.aspx']
LANGUAGE = 'en_GB:en'
LANG = 'en_GB.UTF-8'
Default locale: ('en', 'UTF-8')
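
For anyone trying to reproduce this: the traceback shows the same page once with the apostrophe escaped (`%27`) and once with a literal `'` in the parent URL that trips the assertion. The snippet below is only a rough sketch of what a "quoted URL" check could look like. It uses Python 3's `urllib.parse` rather than the Python 2.7 in the report, and `looks_quoted` and `SAFE` are hypothetical names made up for illustration, not LinkChecker's actual code.

```python
from urllib.parse import quote, unquote

# Characters left unescaped when re-quoting; chosen for this illustration only.
SAFE = "/:?=&"


def looks_quoted(url: str) -> bool:
    """Hypothetical check: decoding and re-encoding a quoted URL reproduces it."""
    return quote(unquote(url), safe=SAFE) == url


BASE = "https://www.royalholloway.ac.uk/politicsandir/research/dec/blogs/articles/"
# The two forms of the parent URL that appear in the traceback.
escaped = BASE + "why-theresa-may%27s-gamble-at-the-polls-failed.aspx"
literal = BASE + "why-theresa-may's-gamble-at-the-polls-failed.aspx?478490430"

print(looks_quoted(escaped))  # True: the apostrophe is already %27
print(looks_quoted(literal))  # False: re-quoting would turn ' into %27
```

If the real check works along these lines, any page reached under a URL with a raw apostrophe would fail it, which would fit the pattern described in the report.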

dpalic commented 6 years ago

Thank you for the issue report. Sadly, this project is dead, and a new team now maintains it at https://github.com/linkcheck/linkchecker. For more details, please see #708. Please also close this issue and report it afresh on the new repo: https://github.com/linkcheck/linkchecker/issues