yacy / yacy_search_server

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance
http://yacy.net

Some sites do not let me crawl because of robots.txt. How to ignore robots.txt and force-crawl that site? #275

Closed. tangdou1 closed this issue 5 years ago.

gangarer commented 5 years ago

Unchecking the "Obey html-robots-noindex:" box on the Advanced Crawler page should do it.
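As the option name suggests, "Obey html-robots-noindex" appears to govern the per-page meta robots tag, which is a different mechanism from the site-wide robots.txt file, so unchecking it would not be expected to affect robots.txt blocking. A minimal Python sketch of the distinction (not YaCy code; the domain, the sample HTML, and the crude substring check are purely illustrative):

```python
# Two separate opt-out mechanisms a crawler may honour:
# 1) a per-page <meta name="robots" content="noindex"> tag inside the HTML,
# 2) a site-wide robots.txt file served at the site root.
from urllib.robotparser import RobotFileParser

# 1) html-robots-noindex: the page itself asks not to be indexed.
page_html = '<html><head><meta name="robots" content="noindex"></head><body>...</body></html>'
page_says_noindex = 'name="robots"' in page_html and 'noindex' in page_html  # crude check, illustration only

# 2) robots.txt: the site asks crawlers to stay out of whole path prefixes.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]
rp = RobotFileParser()
rp.parse(robots_lines)

print("meta tag asks not to index this page:", page_says_noindex)                          # True
print("robots.txt allows /private/:", rp.can_fetch("*", "https://example.com/private/x"))  # False
print("robots.txt allows /public/:", rp.can_fetch("*", "https://example.com/public/x"))    # True
```

If the two mechanisms really are independent in YaCy, that would explain the "it does not work" result reported below.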

tangdou1 commented 5 years ago

I tested it. It does not work.

ZeroCool940711 commented 4 years ago

I know this is an old issue, but I also have this problem; instead of opening a new issue for the same thing, I decided to comment here.

My issue is the same as @tangdou1's: I cannot crawl some websites because of robots.txt. Even after unchecking "Obey html-robots-noindex:" I still cannot crawl the website. I tried another search engine (Open Semantic Search) on the same domain and it was able to crawl it, since it does not obey robots.txt. Is it possible to force robots.txt to be ignored globally in YaCy?

onlyjob commented 4 years ago

Is it even ethical to ignore robots.txt? Site owners may have good reasons to exclude portions of their websites from crawlers. In some cases it may even be for your own good, for example when volatile pages with infinite recursion are excluded. As a sysadmin, I am telling you that if you violate my robots.txt, I am likely to block your crawler entirely based on your IP address, your User-Agent, or both. Not respecting robots.txt can get your entire country blocked, and what are you going to do then? Just contact the site owners and convince them to allow you to crawl what you want to crawl.

ZeroCool940711 commented 4 years ago

I thought the whole point of YaCy was to give users freedom and bypass the censorship that other search engines like Google and Bing enforce on users. We should have the choice to decide whether or not we want to follow and obey robots.txt files.

Also, as I said before, other search engines like Open Semantic Search let us ignore robots.txt files, and I do not think there is a website owner stupid enough to block a whole country because one user does not follow the robots.txt file on their site. You can get IP-blocked or hardware-blocked, and that would be okay; in that case it should be my problem, as the one indexing your website, to take care of how many requests I make to your website or find other ways to avoid getting blocked. It should not be enforced by the search engine itself.

frankenstein91 commented 4 years ago

Glad you are reopening that discussion; I vote to remove the checkbox. Your exaggeration is also completely wrong. robots.txt is not censorship. The information is still freely accessible; the creator has simply decided that you should not save it on your computer. Whenever you cannot stand above information ownership rights, is it instant censorship for you? In that case, I would like to have your admin page URL. I think that is information you would consider worth protecting. Since you do not allow others to protect their information, I think you should set a good example and publish yours.

onlyjob commented 4 years ago

@ZeroCool940711, robots.txt is not for censorship, so there is no need to circumvent it.

Please be nice to websites that ask you not to crawl certain parts of the site. You are unlikely to harvest much useful information, but you will quite probably crawl some dynamic portions of the site or its API interface, etc. If you crawl dynamic content that robots.txt excludes, your index may be contaminated with 404 URLs that lead nowhere, and that is just one example.

Maintaining a good-quality index often requires you to exclude things from crawling rather than add more. For example, some WordPress sites do not exclude the /wp-json/ area in robots.txt, and when it is accidentally crawled it pollutes the index with unusable links, unreadable by humans, that have to be thrown away manually.

Maybe you have a legitimate case where something valuable sits behind the robots.txt shield, but letting the site owners know about the problem is better than bypassing the exclusion.
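To make the /wp-json/ example concrete, here is a small sketch (Python standard library only; the blog domain and URLs are hypothetical, and "yacybot" is used only as an example user agent string) of what such an exclusion looks like and of the allow/deny decision a robots.txt-respecting crawler makes before fetching each URL:

```python
# The index-hygiene argument in code: a robots.txt rule that keeps a conforming
# crawler out of a machine-readable API area so it never enters the search index.
from urllib.robotparser import RobotFileParser

robots_lines = [
    "User-agent: *",
    "Disallow: /wp-json/",   # WordPress REST API responses: useless as search results
]
rp = RobotFileParser()
rp.parse(robots_lines)

candidate_urls = [
    "https://blog.example/2020/05/some-article/",
    "https://blog.example/wp-json/wp/v2/posts?page=3",
]
for url in candidate_urls:
    verdict = "crawl" if rp.can_fetch("yacybot", url) else "skip (robots.txt)"
    print(verdict, url)
```

The article URL is crawled, while everything under /wp-json/ is skipped, which is exactly the filtering that is lost when the exclusion is ignored.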

ZeroCool940711 commented 4 years ago

Boorus, imageboards, coding sites like GitHub, and other sites have a global robots.txt that does not let you index anything from their websites, and not because sensitive information is there. If it were a path like an /admin page or user information, I would agree that those things should not be indexed, but things like imageboards or code should be allowed to be indexed. That is when being able to bypass robots.txt is useful.

frankenstein91 commented 4 years ago

The problem can be solved quite simply. We are an open-source project on GitHub. Why don't you become a supporter of our project and let us approach GitHub together? The explanation is in the file https://github.com/robots.txt:

If you would like to crawl GitHub contact us at support@github.com. We also provide an extensive API: https://developer.github.com/