Open vasyugan opened 5 years ago
I suspect the CDN or robot protection is cutting off the crawler, as discussed in the forum. Most probably this is not an error in YaCy but the strict crawler policy of the sites themselves. Sometimes it helps to change the crawler's user agent. Maybe it would help if YaCy offered more user-agent options to choose from, reflecting the user agents that other robots actually send.
I tried to index www.democracynow.org and it reproducibly fails with the message: Crawling of "https://www.democracynow.org" failed. Reason: scraper cannot load URL: REJECTED EMPTY RESPONSE BODY 'HTTP/1.1 403 Forbidden' for URL 'https://www.democracynow.org/'$/
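To check whether the 403 depends on the user agent, independent of YaCy, one could probe the URL with a few different `User-Agent` headers and compare the status codes. The following is a minimal sketch using only the Python standard library. The agent strings, the function names (`build_request`, `probe`, `looks_blocked`) and the "blocked" heuristic are assumptions made for illustration; they are not YaCy's defaults or its internal logic.

```python
import urllib.error
import urllib.request

# Hypothetical user-agent strings for illustration only; these are not
# the values YaCy ships with.
CANDIDATE_AGENTS = {
    "generic-bot": "ExampleBot/1.0 (+https://example.org/bot)",
    "browser-like": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) "
                    "Gecko/20100101 Firefox/115.0",
}


def build_request(url, user_agent):
    """Create a GET request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe(url, user_agent, timeout=10):
    """Fetch url and return (status_code, body_length).

    HTTP error statuses such as 403 are returned rather than raised.
    Network-level failures (DNS, timeout) still raise URLError.
    """
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, len(response.read())
    except urllib.error.HTTPError as err:
        return err.code, len(err.read())


def looks_blocked(status, body_length):
    """Assumed heuristic: 403/429, or a 200 with an empty body."""
    return status in (403, 429) or (status == 200 and body_length == 0)


def report(url, agents=CANDIDATE_AGENTS):
    """Probe url once per user agent and map each label to (status, blocked)."""
    results = {}
    for label, agent in agents.items():
        status, size = probe(url, agent)
        results[label] = (status, looks_blocked(status, size))
    return results
```

Calling `report("https://www.democracynow.org/")` returns one entry per agent. If the browser-like agent gets a 200 while the bot-like one gets a 403, the rejection most likely comes from the site's or its CDN's bot filtering rather than from the crawler itself. If every agent is rejected, the block is probably based on something else, such as the IP address or a JavaScript challenge.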