yacy / yacy_search_server

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance
http://yacy.net

Scraper cannot load URL: 403 Forbidden #298

Open vasyugan opened 5 years ago

vasyugan commented 5 years ago

I tried to index www.democracynow.org and it reproducibly fails with the message: Crawling of "https://www.democracynow.org" failed. Reason: scraper cannot load URL: REJECTED EMPTY RESPONSE BODY 'HTTP/1.1 403 Forbidden' for URL 'https://www.democracynow.org/'$/

okybaca commented 11 months ago

I suspect that the CDN or bot protection is cutting off crawlers, as discussed in the forum. Most probably this is not an error in YaCy but a strict crawler policy on the sites themselves. Sometimes it helps to change the crawler's user agent. Adding more user-agent options to YaCy (reflecting the user agents that other robots actually use) might help.
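To check whether the 403 really depends on the user agent, something like the following could be run outside YaCy. It is only a rough sketch in Python using the standard `urllib` module, not anything taken from YaCy's code; the function names `status_for_user_agent` and `compare_user_agents` are made up for this illustration.

```python
import urllib.error
import urllib.request


def status_for_user_agent(url, user_agent, timeout=10.0):
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still
        # exactly the information we are after.
        return err.code


def compare_user_agents(url, user_agents, fetch=status_for_user_agent):
    """Map each candidate User-Agent string to the status code it receives.

    `fetch` is injectable so the comparison can be exercised without
    touching the network.
    """
    return {agent: fetch(url, agent) for agent in user_agents}
```

Calling `compare_user_agents("https://www.democracynow.org/", [...])` with a crawler-style string and a browser-style string would show whether the site answers them differently. If both get 403, the block is probably based on something other than the user agent (IP reputation, for example), and changing it in YaCy would not help.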