Closed: tangdou1 closed this issue 5 years ago.
I tested it. It does not work.
I know this is an old issue, but I also have this problem; instead of opening a new issue for the same thing, I decided to comment here.
My issue is the same as @tangdou1's: I cannot crawl some websites because of their robots.txt. Even after unchecking "Obey html-robots-noindex:" I still cannot crawl them. I tried another search engine (Open Semantic Search) on the same domain and it was able to crawl it, because it does not obey robots.txt. Is it possible to force robots.txt to be ignored globally in YaCy?
Is it even ethical to ignore robots.txt? Site owners may have good reasons to exclude portions of their web sites from crawlers. In some cases it may be for your own good, such as when volatile pages with infinite recursion are excluded. As a sysadmin, I'm telling you that if you violate my robots.txt, then I'm likely to block your crawler entirely based on your IP address, your User-Agent, or both. Not respecting robots.txt can get your entire country blocked, and what are you going to do then?
Just contact the site owners and convince them to allow you to crawl what you want to crawl.
I thought the whole point of YaCy was to give users freedom and bypass the censorship that other search engines like Google and Bing enforce on users. We should have the choice to decide whether or not we want to obey robots.txt files.
Also, as I said before, other search engines like Open Semantic Search let us ignore robots.txt files. I don't think there is a website owner foolish enough to block a whole country because one user did not follow the robots.txt on their site. You might get IP-blocked or hardware-blocked, and that would be fine; in that case it would be my problem, as the one indexing your website, to manage how many requests I make or find other ways to avoid getting blocked. It should not be enforced by the search engine itself.
Glad you're reopening that discussion; I vote to keep the checkbox. Your exaggeration is also completely wrong. The robots.txt is not censorship. The information is still freely accessible; the creator just decided that you should not save it on your computer. Whenever you can't override someone's ownership of their information, it's instantly censorship to you? In that case, I'd like to have your admin page URL. I think this would be information that you would consider worth protecting. Since you do not allow others to protect their information, I think you should set a good example and publish yours.
@ZeroCool940711, robots.txt is not for censorship, so there is no need to circumvent it.
Please be nice to web sites that ask you not to crawl certain parts of the site. You are unlikely to harvest much useful information, but will quite probably crawl some dynamic portions of the site, its API interface, etc. With dynamic site content that was excluded by robots.txt, your index may be contaminated with 404 URLs that link nowhere, and that is just one example.
Maintaining a good-quality index often requires you to exclude things from crawling rather than add more. For example, some WordPress sites don't exclude the /wp-json/ area in robots.txt, and when it is accidentally crawled it pollutes the index with unusable, human-unreadable links that have to be manually thrown away.
Maybe you have a legitimate case where something valuable is behind the robots.txt shield, but letting the site owners know about the problem is better than bypassing the exclusion.
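To illustrate how such exclusions behave in practice, here is a small sketch using Python's standard urllib.robotparser. The rules and the example.com domain are made up for demonstration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Parse example robots.txt rules directly. A real crawler would
# instead fetch them with rp.set_url("https://example.com/robots.txt")
# followed by rp.read().
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /wp-json/",
    "Disallow: /admin/",
])

# Ordinary content pages are allowed; dynamic/API paths are excluded.
print(rp.can_fetch("yacybot", "https://example.com/blog/post"))      # True
print(rp.can_fetch("yacybot", "https://example.com/wp-json/wp/v2"))  # False
```

A polite crawler calls can_fetch() for every candidate URL before requesting it, which is exactly the check you would be disabling by ignoring robots.txt.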
Sites like boorus, imageboards, and coding sites like GitHub have a global robots.txt that does not let you index anything from their websites, and not because any sensitive information is there. If it were a path like an /admin page or user information, I would agree that those things should not be indexed, but things like imageboards or code should be allowed to be indexed. That's when being able to bypass robots.txt is useful.
The problem can be solved quite simply. We are an Open Source project on GitHub. Why don't you become a supporter of our project and take it up with GitHub? The explanation is in the file https://github.com/robots.txt:
If you would like to crawl GitHub contact us at support@github.com. We also provide an extensive API: https://developer.github.com/
Unchecking the "Obey html-robots-noindex:" box on the Advanced Crawler page should do it.
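For what it's worth, judging by its name that checkbox governs the per-page HTML robots meta tag, which is separate from robots.txt. Here is a minimal sketch (my own, not YaCy's actual code) of how a crawler might detect that tag using Python's standard html.parser:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Detect <meta name="robots" content="...noindex..."> in a page."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            name = (d.get("name") or "").lower()
            content = (d.get("content") or "").lower()
            if name == "robots" and "noindex" in content:
                self.noindex = True

p = RobotsMetaParser()
p.feed('<html><head><meta name="robots" content="noindex,nofollow">'
       '</head></html>')
print(p.noindex)  # True
```

So even with robots.txt obeyed, a page can only opt out of indexing via this meta tag if the crawler checks for it; that is the behavior the checkbox toggles.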