wummel / linkchecker

check links in web documents or full websites
http://wummel.github.io/linkchecker/
GNU General Public License v2.0

layered problems in proxy handling #536

Open reidpr opened 10 years ago

reidpr commented 10 years ago

We have a proxy set up using environment variables.

LinkChecker crashed with errors about proxy_type. (Sorry, I didn't record the exact messages; the patch below should help clarify.) This turned out to be caused by a couple of layered typos in the code. Once those were fixed, LinkChecker instead gave errors about the proxy URL missing its method (the scheme prefix, such as "http://").

The following patch made it work for us, though as you can see it's kind of a hack:

--- LinkChecker-9.3/linkcheck/cache/robots_txt.py       2014-07-15 23:34:58.000000000 -0600
+++ LinkChecker-9.3-new/linkcheck/cache/robots_txt.py   2014-07-30 12:48:40.479590861 -0600
@@ -59,7 +59,7 @@
             self.misses += 1
         kwargs = dict(auth=url_data.auth, session=url_data.session)
         if url_data.proxy:
-            kwargs["proxies"] = {url_data.proxy_type, url_data.proxy}
+            kwargs["proxies"] = {url_data.proxytype: url_data.proxy}
         rp = robotparser2.RobotFileParser(**kwargs)
         rp.set_url(roboturl)
         rp.read()
--- LinkChecker-9.3/linkcheck/checker/proxysupport.py   2014-07-15 23:34:58.000000000 -0600
+++ LinkChecker-9.3-new/linkcheck/checker/proxysupport.py       2014-07-30 13:08:48.255590717 -0600
@@ -52,6 +52,7 @@
             self.proxy = self.proxyauth = None
             return
         log.debug(LOG_CHECK, "using proxy %r", self.proxy)
+        self.proxy = self.proxytype + '://' + self.proxy
         self.add_info(_("Using proxy `%(proxy)s'.") % dict(proxy=self.proxy))
         if self.proxyauth is not None:
             if ":" not in self.proxyauth:
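
For readers skimming the patch, here is a minimal standalone sketch of the two problems it addresses. This is not LinkChecker code: `FakeUrlData` and `build_proxies` are hypothetical names made up for illustration, and the only things carried over from the patch are the attribute names `proxytype` and `proxy` and the idea that the proxies argument maps a scheme to a full proxy URL. Unlike the patch, the sketch adds the scheme only when it is missing, which is an assumption about what a less hacky fix might do.

```python
class FakeUrlData(object):
    """Hypothetical stand-in for the url_data object used in the patch."""

    def __init__(self, proxytype, proxy):
        self.proxytype = proxytype  # e.g. "http"
        self.proxy = proxy          # e.g. "proxy:8080" (host:port, no scheme)


def build_proxies(url_data):
    """Return a scheme -> proxy URL dict, prefixing the scheme if absent."""
    proxy = url_data.proxy
    if "://" not in proxy:
        proxy = url_data.proxytype + "://" + proxy
    # Braces with a colon build a dict; with a comma they would build a set.
    return {url_data.proxytype: proxy}


url_data = FakeUrlData("http", "proxy:8080")
proxies = build_proxies(url_data)

# The unpatched line used a comma, which yields a set rather than a mapping.
buggy = {url_data.proxytype, url_data.proxy}

print(proxies)
print(type(buggy).__name__)
```

The attribute name matters as well: the object in this sketch has `proxytype`, so reading `url_data.proxy_type` would raise an AttributeError, which is the first layer of the problem.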
ArloL commented 10 years ago

We are seeing a similar issue in the latest release on windows:

DEBUG 2014-08-15 14:29:22,642 MainThread Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on win32
DEBUG 2014-08-15 14:29:22,644 MainThread reading configuration from ['d:\\software\\cygwin\\home\\pitann\\.linkchecker\\linkcheckerrc']
WARNING 2014-08-15 14:29:22,654 MainThread The python -O option prevents debugging.
INFO 2014-08-15 14:29:22,654 MainThread Checking intern URLs only; use --check-extern to check extern URLs.
DEBUG 2014-08-15 14:29:22,661 MainThread configuration: [('aborttimeout', 300),
 ('allowedschemes', []),
 ('authentication', []),
 ('blacklist', {}),
 ('checkextern', False),
 ('cookiefile', None),
 ('csv', {}),
 ('debugmemory', False),
 ('dot', {}),
 ('enabledplugins', []),
 ('externlinks', []),
 ('fileoutput', []),
 ('gml', {}),
 ('gxml', {}),
 ('html', {}),
 ('ignorewarnings', []),
 ('internlinks', []),
 ('localwebroot', None),
 ('logger', 'TextLogger'),
 ('loginextrafields', {}),
 ('loginpasswordfield', 'password'),
 ('loginurl', None),
 ('loginuserfield', 'login'),
 ('maxfilesizedownload', 5242880),
 ('maxfilesizeparse', 1048576),
 ('maxhttpredirects', 10),
 ('maxnumurls', None),
 ('maxrequestspersecond', 10),
 ('maxrunseconds', None),
 ('nntpserver', None),
 ('none', {}),
 ('output', 'text'),
 ('pluginfolders', []),
 ('proxy',
  {'all': '...',
   'ftp': '...',
   'http': 'http://proxy:8080',
   'https': '...',
   'no': '...'}),
 ('quiet', False),
 ('recursionlevel', -1),
 ('sitemap', {}),
 ('sql', {}),
 ('sslverify',
  'C:\\Program Files (x86)\\LinkChecker\\share\\linkchecker\\cacert.pem'),
 ('status', True),
 ('status_wait_seconds', 5),
 ('text', {}),
 ('threads', 10),
 ('timeout', 60),
 ('trace', False),
 ('useragent',
  u'Mozilla/5.0 (compatible; LinkChecker/9.3; +http://wummel.github.io/linkchecker/)'),
 ('verbose', False),
 ('warnings', True),
 ('xml', {})]
DEBUG 2014-08-15 14:29:22,671 MainThread HttpUrl handles url http://www.herbrand.de
DEBUG 2014-08-15 14:29:22,673 MainThread checking syntax
DEBUG 2014-08-15 14:29:22,674 MainThread Add intern pattern u'^https?://(www\\.|)herbrand\\.de'
DEBUG 2014-08-15 14:29:22,674 MainThread Link pattern u'^https?://(www\\.|)herbrand\\.de' strict=False
DEBUG 2014-08-15 14:29:22,674 MainThread queueing http://www.herbrand.de
DEBUG 2014-08-15 14:29:22,700 CheckThread-http://www.herbrand.de Checking http link
base_url=u'http://www.herbrand.de'
parent_url=None
base_ref=None
recursion_level=0
url_connection=None
line=0
column=0
page=0
name=u''
anchor=u''
cache_url=http://www.herbrand.de
DEBUG 2014-08-15 14:29:22,700 CheckThread-http://www.herbrand.de checking connection
DEBUG 2014-08-15 14:29:22,710 CheckThread-http://www.herbrand.de using proxy 'kev-iwss02.kevelaer.herbrand.de:3128'
DEBUG 2014-08-15 14:29:22,710 CheckThread-http://www.herbrand.de task_done http://www.herbrand.de

********** Oops. *************

You have found an internal error in LinkChecker. Please write
a bug report at https://github.com/wummel/linkchecker/issues
with the following information:
- the URL or file you are checking
- the system information below.

When using the command-line program:
- your command-line arguments and/or your configuration.
- the output of a debug run with option "-Dall"

If you leave out information for privacy reasons, that is fine.
I will still try to help you. However, you have to give me
something I can work with ;).

Traceback (most recent call last):
  File "linkcheck\director\checker.pyo", line 104, in check_url
    -- couldn't find file, trying this instead: C:\Program Files (x86)\LinkChecker\library.zip\linkcheck\director\checker.py
    -- code not available --
  File "linkcheck\director\checker.pyo", line 120, in check_url_data
    -- couldn't find file, trying this instead: C:\Program Files (x86)\LinkChecker\library.zip\linkcheck\director\checker.py
    -- code not available --
  File "linkcheck\director\checker.pyo", line 52, in check_url
    -- couldn't find file, trying this instead: C:\Program Files (x86)\LinkChecker\library.zip\linkcheck\director\checker.py
    -- code not available --
  File "linkcheck\checker\urlbase.pyo", line 424, in check
    -- couldn't find file, trying this instead: C:\Program Files (x86)\LinkChecker\library.zip\linkcheck\checker\urlbase.py
    -- code not available --
  File "linkcheck\checker\urlbase.pyo", line 442, in local_check
    -- couldn't find file, trying this instead: C:\Program Files (x86)\LinkChecker\library.zip\linkcheck\checker\urlbase.py
    -- code not available --
  File "linkcheck\checker\httpurl.pyo", line 128, in check_connection
    -- couldn't find file, trying this instead: C:\Program Files (x86)\LinkChecker\library.zip\linkcheck\checker\httpurl.py
    -- code not available --
  File "linkcheck\checker\httpurl.pyo", line 66, in allows_robots
    -- couldn't find file, trying this instead: C:\Program Files (x86)\LinkChecker\library.zip\linkcheck\checker\httpurl.py
    -- code not available --
  File "linkcheck\cache\robots_txt.pyo", line 49, in allows_url
    -- couldn't find file, trying this instead: C:\Program Files (x86)\LinkChecker\library.zip\linkcheck\cache\robots_txt.py
    -- code not available --
  File "linkcheck\cache\robots_txt.pyo", line 62, in _allows_url
    -- couldn't find file, trying this instead: C:\Program Files (x86)\LinkChecker\library.zip\linkcheck\cache\robots_txt.py
    -- code not available --
AttributeError: 'HttpUrl' object has no attribute 'proxy_type'
System information:
LinkChecker 9.3
Released on: 16.7.2014
Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on win32
Requests: 2.2.1
Qt: 4.8.6 / PyQt: 4.11
Modules: QScintilla, Sqlite
Local time: 2014-08-15 14:29:22+001
sys.argv: ['C:\\Program Files (x86)\\LinkChecker\\linkchecker.exe', '-Dall', 'www.herbrand.de']
http_proxy = 'http://proxy:8080'
ftp_proxy = '...'
no_proxy = '...'
LANG = 'de_DE.UTF-8'
Default locale: ('de', 'cp1252')

 ******** LinkChecker internal error, over and out ********
WARNING 2014-08-15 14:29:22,799 CheckThread-http://www.herbrand.de internal error occurred
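
For anyone trying to reproduce this, the following is a small sketch for checking which proxy settings Python picks up from environment variables. It uses the standard library helper `getproxies_environment` and is not necessarily how LinkChecker reads its proxy configuration, which is an assumption on my part. The value `http://proxy:8080` is the one shown in the report above and is set here only so the sketch runs on its own.

```python
import os

try:
    from urllib.request import getproxies_environment  # Python 3
    from urllib.parse import urlparse
except ImportError:
    from urllib import getproxies_environment  # Python 2
    from urlparse import urlparse

# Set here only so the sketch is self-contained; normally the shell exports it.
os.environ["http_proxy"] = "http://proxy:8080"

# Maps a scheme name ("http", "ftp", ...) to the value of <scheme>_proxy.
env_proxies = getproxies_environment()
http_proxy = env_proxies.get("http")

# A proxy value that already carries its scheme parses into scheme + host:port.
parts = urlparse(http_proxy)
print(http_proxy)
print(parts.scheme)
print(parts.netloc)
```

Comparing this output with the "using proxy" debug line shows whether the scheme was dropped somewhere between the environment and the request.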
wummel commented 9 years ago

We believe that the issue you reported is fixed in the source repository of linkchecker which can be found under: https://github.com/wummel/linkchecker

Changelog entry:

Thank you for reporting the issue. It is now marked as fixed. If you believe that the issue is not fixed appropriately just add a comment to this issue.