Closed valerio65xz closed 7 years ago
Enable the HttpHeaderInfo option in your configuration file, and set 'prefixes=Content-Type'
(I just needed to do this myself: seems to work ok but also adds proxy information to the field for some reason I haven't worked out yet)
I tried it. In the linkcheckerrc file, I set this:
# Print HTTP header info
#[HttpHeaderInfo]
# Comma separated list of header prefixes to print.
# The names are case insensitive.
# The default list is empty, so it should be non-empty when activating
# this plugin.
prefixes=Content-Type
and then I call from the CLI:
linkchecker http://www.google.com -v
But it always shows this:
URL 'https://www.google.com/sitemap.xml'
Parent URL http://www.google.com, line 255
Real URL https://www.google.com/sitemap.xml
Check time 0.756 seconds
D/L time 0.001 seconds
Size 1KB
Result Valid: 200 OK
There is no info about content type...
It works for me from both command line and GUI. Make sure you haven't left the line '[HttpHeaderInfo]' commented out as in the example you pasted.
URL 'https://www.google.com/sitemap.xml'
Parent URL https://www.google.com/sitemap.xml, line 255
Real URL https://www.google.com/sitemap.xml
Check time 0.741 seconds
D/L time 0.000 seconds
Size 1KB
Info HTTP headers Content-type=text/xml
Result Valid: 200 OK
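To see why a commented-out section header matters, here is a minimal sketch using Python's standard `configparser`. It is not how LinkChecker itself loads its configuration; the `[output]` section and the helper name `plugin_prefixes` are only placeholders for illustration. With `#[HttpHeaderInfo]` the section does not exist, so `prefixes` silently ends up under whatever section precedes it.

```python
import configparser

# Section header still commented out, as in the config pasted above.
# "[output]" is a placeholder for whichever section happens to precede it.
COMMENTED = """\
[output]
#[HttpHeaderInfo]
prefixes=Content-Type
"""

# Section header uncommented.
ENABLED = """\
[output]
[HttpHeaderInfo]
prefixes=Content-Type
"""


def plugin_prefixes(text):
    """Return the HttpHeaderInfo prefixes, or None if the section is absent."""
    parser = configparser.RawConfigParser()
    parser.read_string(text)
    if not parser.has_section("HttpHeaderInfo"):
        return None
    value = parser.get("HttpHeaderInfo", "prefixes")
    # Split the comma separated list and normalise to lower case.
    return [p.strip().lower() for p in value.split(",")]


print(plugin_prefixes(COMMENTED))  # None: the section was never declared
print(plugin_prefixes(ENABLED))    # ['content-type']
```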
Now this is my linkcheckerrc file:
[HttpHeaderInfo]
# Comma separated list of header prefixes to print.
# The names are case insensitive.
# The default list is empty, so it should be non-empty when activating
# this plugin.
prefixes=Content-Type
But it's still not working:
linkchecker http://www.google.com/ -v
URL 'https://www.google.com/sitemap.xml'
Parent URL http://www.google.com/, line 255
Real URL https://www.google.com/sitemap.xml
Check time 0.486 seconds
D/L time 0.000 seconds
Size 1KB
Result Valid: 200 OK
Hi Seamang, I tried to enable HttpHeaderInfo the same way Valerio did, and I don't see the Content-Type header either. Could you share your config? Maybe you are working on Linux or Mac? Thanks in advance. Regards.
gHena: my config is the one that comes with the distribution with the only change being the [HttpHeaderInfo] exactly as valerio65xz shows above.
I'm using a slightly modified version of the github source running on Linux. I don't think my small changes have anything to do with this problem. Sorry I can't help more. Please support @anarcat in his efforts to contact @wummel and get this project moving again if you can! See #686
Ok, maybe this is an issue with Windows systems... thanks!
The problem is caused by a change in the behaviour of CaseInsensitiveDict from the requests library. In some versions it returns lower-cased keys, in others mixed-case keys. A simple solution is to force the keys to lower case in plugins/httpheaderinfo.check() in lines 39 and 40. If you can't modify the source you're stuck till someone can do a new release, I'm afraid.
What are the instructions to force the keys to lower case? Could you give an example, please?
def check(self, url_data):
    """Check content for invalid anchors."""
    headers = []
    for name, value in url_data.headers.items():
        if name.lower().startswith(self.prefixes):
            headers.append(name.lower())
    if headers:
        items = [u"%s=%s" % (name.capitalize(), url_data.headers[name]) for name in headers]
        info = u"HTTP headers %s" % u", ".join(items)
        url_data.add_info(info)
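The effect of the lower-casing can be tried outside LinkChecker with a small standalone sketch. It uses plain dicts to stand in for the two key casings described above; `matching_headers` is a hypothetical helper for illustration, not part of LinkChecker or requests, and the `Server` value is made up.

```python
def matching_headers(headers, prefixes):
    """Return 'Name=value' strings for headers whose name starts with a prefix.

    Both the header names and the prefixes are lower-cased before comparing,
    so the result does not depend on how the keys happen to be cased.
    """
    wanted = tuple(p.lower() for p in prefixes)
    items = []
    for name, value in headers.items():
        if name.lower().startswith(wanted):
            items.append("%s=%s" % (name.capitalize(), value))
    return items


# The same response headers, once with mixed-case keys and once lower-cased.
mixed = {"Content-Type": "text/xml", "Server": "example"}
lower = {"content-type": "text/xml", "server": "example"}

print(matching_headers(mixed, ["content-type"]))  # ['Content-type=text/xml']
print(matching_headers(lower, ["content-type"]))  # ['Content-type=text/xml']
```

Without the `name.lower()` call, only the second dict would match a lower-case prefix, which fits the symptom reported in this thread.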
OK, perfect, now it works!! Thank you so much :)
@seamang could you send your changes as a pull request? even though there's no activity here, at least we can keep track of it when / if we fork..
Hello! I really need to check the content type of the urls detected (from the command line link checker). I have to filter just the HTML/text urls. Anyone know how to do it? Thanks.
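One possible approach, once HttpHeaderInfo prints the header as discussed above, is to post-process the verbose text output. The sketch below assumes the output has the shape shown earlier in this thread ("Real URL ..." followed by "Info HTTP headers Content-type=..."); the second record in `SAMPLE` and the helper name `urls_with_content_type` are made up for illustration.

```python
import re

# Trimmed sample in the shape of the verbose output shown in this thread.
# The second record is invented for the example.
SAMPLE = """\
URL 'https://www.google.com/sitemap.xml'
Real URL https://www.google.com/sitemap.xml
Info HTTP headers Content-type=text/xml
Result Valid: 200 OK

URL 'https://www.example.com/'
Real URL https://www.example.com/
Info HTTP headers Content-type=text/html; charset=utf-8
Result Valid: 200 OK
"""


def urls_with_content_type(report, wanted="text/html"):
    """Return the real URLs whose reported Content-type equals `wanted`."""
    matches = []
    real_url = None
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("Real URL"):
            # Remember the URL of the record currently being read.
            real_url = line[len("Real URL"):].strip()
            continue
        # Take the media type only, ignoring parameters such as charset.
        found = re.search(r"Content-type=([^;,\s]+)", line, re.IGNORECASE)
        if found and real_url and found.group(1).lower() == wanted:
            matches.append(real_url)
    return matches


print(urls_with_content_type(SAMPLE))  # ['https://www.example.com/']
```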