wummel / linkchecker

check links in web documents or full websites
http://wummel.github.io/linkchecker/
GNU General Public License v2.0

Can't index a (secured) SharePoint site #127

Closed: wummel closed this issue 11 years ago

wummel commented 11 years ago

Converted from SourceForge issue 1323649, submitted by javahollic

I'd like to use LinkChecker against a SharePoint-based website, but it currently gives me a 'Warning, access denied by robots.txt'. Setting the same user/password in LinkChecker that is required to access the site over HTTP doesn't change anything (in fact, it doesn't matter whether the user/password is wrong). I've also tried enabling cookies, to no effect. With debug command-line output enabled I can see the correct user and password listed...

I can run LinkChecker through one SharePoint server I found: http://sharepoint.bilsimser.com/pages/templates.aspx

But the one I want to test has authentication enabled for the index page. SSL is not enabled.

Does LinkChecker need to masquerade as a browser? Can this warning be ignored somehow?

Is this a bug, or am I using it incorrectly?

wummel commented 11 years ago

Submitted by calvin


LinkChecker is a web robot and thus follows the robots.txt exclusion standard (see [1]). If a site denies access to such robots, as the warning you got points out, then LinkChecker does not access it.

[1] http://www.robotstxt.org/wc/exclusion.html
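
For illustration only, here is roughly the kind of check a robots.txt-aware robot performs before fetching a page. This is a minimal sketch using Python's standard `urllib.robotparser` module, not LinkChecker's actual code, and the SharePoint host name is hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical SharePoint host, used for illustration only.
base = "http://sharepoint.example.com"

# Fetch and parse the site's robots.txt.
rp = RobotFileParser(base + "/robots.txt")
rp.read()

# A well-behaved robot checks its own user-agent token before crawling.
# If this returns False, the crawler reports something like
# "access denied by robots.txt" and skips the page instead of fetching it.
if rp.can_fetch("LinkChecker", base + "/pages/templates.aspx"):
    print("allowed to crawl")
else:
    print("denied by robots.txt")
```

Note that this check happens before any HTTP authentication takes place, which is why supplying a correct or incorrect user/password makes no difference to the warning.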

It would be possible to make LinkChecker ignore the robots.txt standard, but I will not do that, since it would get LinkChecker added to blacklists for bad behaviour :)

So, to your problem: you cannot use LinkChecker to check sites whose robots.txt denies access. All you can do is ask the site administrator to add LinkChecker to the allowed web robots for the site.
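
As a sketch of what that could look like, the administrator would add an entry for LinkChecker to the site's robots.txt. The user-agent token "LinkChecker" is an assumption here; an empty Disallow line means nothing is disallowed for that robot:

```
# Hypothetical robots.txt entry; assumes the robot identifies
# itself with a user-agent token matching "LinkChecker".
User-agent: LinkChecker
Disallow:

# All other robots remain blocked from the site.
User-agent: *
Disallow: /
```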