wummel / linkchecker

check links in web documents or full websites
http://wummel.github.io/linkchecker/
GNU General Public License v2.0

How to check the content type #690

Closed valerio65xz closed 7 years ago

valerio65xz commented 7 years ago

Hello! I really need to check the content type of the urls detected (from the command line link checker). I have to filter just the HTML/text urls. Anyone know how to do it? Thanks.
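For context, filtering for HTML/text URLs comes down to parsing the Content-Type header value once you have it. A minimal sketch, independent of linkchecker (the helper name here is made up for illustration):

```python
def is_text_or_html(content_type):
    """Return True if a Content-Type header value denotes a text/* or HTML-like resource.

    `content_type` is the raw header value, e.g. "text/html; charset=utf-8".
    """
    # Drop parameters such as charset, then normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime.startswith("text/") or mime == "application/xhtml+xml"


print(is_text_or_html("text/html; charset=utf-8"))  # True
print(is_text_or_html("image/png"))                 # False
```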

ghost commented 7 years ago

Enable the HttpHeaderInfo option in your configuration file, and set 'prefixes=Content-Type'

(I just needed to do this myself: seems to work ok but also adds proxy information to the field for some reason I haven't worked out yet)
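A minimal sketch of the relevant linkcheckerrc section, assuming the stock config layout (the section header itself must be uncommented, not just the key):

```
[HttpHeaderInfo]
prefixes=Content-Type
```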

valerio65xz commented 7 years ago

I tried it. In the linkcheckerrc file, I set this:

    # Print HTTP header info
    #[HttpHeaderInfo]
    # Comma separated list of header prefixes to print.
    # The names are case insensitive.
    # The default list is empty, so it should be non-empty when activating
    # this plugin.
    prefixes=Content-Type

and then I call from the CLI:

linkchecker http://www.google.com -v

But it always shows this:

    URL        'https://www.google.com/sitemap.xml'
    Parent URL http://www.google.com, line 255
    Real URL   https://www.google.com/sitemap.xml
    Check time 0.756 seconds
    D/L time   0.001 seconds
    Size       1KB
    Result     Valid: 200 OK

There is no info about content type...

ghost commented 7 years ago

It works for me from both command line and GUI. Make sure you haven't left the line '[HttpHeaderInfo]' commented out as in the example you pasted.

    URL        'https://www.google.com/sitemap.xml'
    Parent URL https://www.google.com/sitemap.xml, line 255
    Real URL   https://www.google.com/sitemap.xml
    Check time 0.741 seconds
    D/L time   0.000 seconds
    Size       1KB
    Info       HTTP headers Content-type=text/xml
    Result     Valid: 200 OK

valerio65xz commented 7 years ago

That's my linkcheckerrc file now:

    [HttpHeaderInfo]
    # Comma separated list of header prefixes to print.
    # The names are case insensitive.
    # The default list is empty, so it should be non-empty when activating
    # this plugin.
    prefixes=Content-Type

But it's still not working:

linkchecker http://www.google.com/ -v

    URL        'https://www.google.com/sitemap.xml'
    Parent URL http://www.google.com/, line 255
    Real URL   https://www.google.com/sitemap.xml
    Check time 0.486 seconds
    D/L time   0.000 seconds
    Size       1KB
    Result     Valid: 200 OK

ghena commented 7 years ago

Hi Seamang, I tried to enable HttpHeaderInfo in the same way as Valerio, and I don't see the Content-Type header either. Could you share your config? Maybe you are working on Linux or Mac? Thanks in advance. Regards.

ghost commented 7 years ago

gHena: my config is the one that comes with the distribution with the only change being the [HttpHeaderInfo] exactly as valerio65xz shows above.

I'm using a slightly modified version of the github source running on Linux. I don't think my small changes have anything to do with this problem. Sorry I can't help more. Please support @anarcat in his efforts to contact @wummel and get this project moving again if you can! See #686

valerio65xz commented 7 years ago

Ok, maybe this is an issue with Windows systems... thanks!

ghost commented 7 years ago

The problem is caused by a change in the behaviour of CaseInsensitiveDict from the requests library. In some versions this returns lower-cased keys, in others mixed case. A simple solution is to force the keys to lower case in plugins/httpheaderinfo.check(), at lines 39 and 40. If you can't modify the source, you're stuck until someone can do a new release, I'm afraid.
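A sketch of the mismatch, assuming requests' CaseInsensitiveDict (lookups are case-insensitive, but iteration yields keys in whatever case they were stored, so a case-sensitive prefix match can fail) and assuming the configured prefixes are kept lower-cased:

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict()
headers["content-type"] = "text/xml"  # some requests versions report lower-cased keys

prefixes = ("Content-Type",)  # as written in linkcheckerrc

# Case-sensitive matching misses the lower-cased key...
print(any(name.startswith(prefixes) for name in headers))  # False

# ...while lower-casing both sides, as in the patched check(), matches.
lower_prefixes = tuple(p.lower() for p in prefixes)
print(any(name.lower().startswith(lower_prefixes) for name in headers))  # True

# Lookups, by contrast, are case-insensitive either way.
print(headers["Content-Type"])  # text/xml
```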

valerio65xz commented 7 years ago

What are the instructions to force the keys to lower case? Can you give an example, please?

ghost commented 7 years ago

    def check(self, url_data):
        """Print HTTP headers that match the configured prefixes."""
        headers = []
        for name, value in url_data.headers.items():
            # Lower-case the header name so prefix matching works no matter
            # which case this requests version reports.
            if name.lower().startswith(self.prefixes):
                headers.append(name.lower())
        if headers:
            # CaseInsensitiveDict lookups still work with the lower-cased names.
            items = [u"%s=%s" % (name.capitalize(), url_data.headers[name]) for name in headers]
            info = u"HTTP headers %s" % u", ".join(items)
            url_data.add_info(info)

valerio65xz commented 7 years ago

Ok, perfect, now it works!! Thank you so much :)

anarcat commented 7 years ago

@seamang could you send your changes as a pull request? Even though there's no activity here, at least we can keep track of it when/if we fork.