yacy / yacy_search_server

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance
http://yacy.net
Other
3.38k stars 427 forks

Detect http return code 429 and slow down Crawler for host and Domain #524

Open pr0vieh opened 1 year ago

pr0vieh commented 1 year ago

If we get a TEMPORARY_NETWORK_FAILURE with no response body (HTTP return code 429, https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429), then automatically slow down the crawler for this host and retry until this return code is gone, then keep working with this crawl setting from then on. Settings to discover: parallel connections based on IP and domain, and the delay between connections.
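To make the idea concrete, here is a minimal sketch of such a per-host slowdown. It is not YaCy code: the class `HostThrottle`, its method names and the default values (0.5 s base delay, doubling on each 429, 300 s cap) are assumptions made up for illustration. It also assumes that a `Retry-After` header, if the server sends one, has already been parsed into seconds.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Hypothetical per-host delay: grows on HTTP 429, shrinks on success."""

    def __init__(self, base_delay=0.5, max_delay=300.0, factor=2.0):
        self.base_delay = base_delay  # assumed normal delay in seconds
        self.max_delay = max_delay    # assumed upper bound for the delay
        self.factor = factor          # multiplier applied on each 429
        self.delay = {}               # host -> current delay in seconds
        self.next_ok = {}             # host -> earliest time of next request

    def wait_time(self, url, now=None):
        """Seconds the crawler should still wait before hitting this host."""
        now = time.monotonic() if now is None else now
        host = urlsplit(url).hostname
        return max(0.0, self.next_ok.get(host, 0.0) - now)

    def record(self, url, status, retry_after=None, now=None):
        """Update the delay for the URL's host from the response status."""
        now = time.monotonic() if now is None else now
        host = urlsplit(url).hostname
        current = self.delay.get(host, self.base_delay)
        if status == 429:
            # Back off, and never go below what the server asked for.
            current = min(self.max_delay, current * self.factor)
            if retry_after is not None:
                current = min(self.max_delay, max(current, float(retry_after)))
        else:
            # Recover slowly towards the base delay once 429 is gone.
            current = max(self.base_delay, current / self.factor)
        self.delay[host] = current
        self.next_ok[host] = now + current
        return current
```

The crawler would call `wait_time(url)` before each request and `record(url, status)` after it, so the slowdown only affects the host that answered with 429.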

smokingwheels commented 1 year ago

Are you Running Windows or Linux? Which Version are you using from the Status page?

Pause the crawler and wait until ppm is 0, then have a look at the Loader Queue. If there are .iso and similar files there, you need to upgrade to fix it. I currently do an export to an XML file, then start the new copy and import into the new server. Other ways may not be safe for your yacy server.

Here is output before it was fixed.

Initiator | Depth | Status | URL
-- | -- | -- | --
agent-gbs | 2 | loading | http://instantrails.rubyforge.org/svn/trunk/InstantRails-win/InstantRails/mysql/bin/libmySQL.dll</a>
agent-gbs | 2 | loading | http://instantrails.rubyforge.org/svn/trunk/InstantRails-win/InstantRails/mysql/bin/libmySQL.dll"

Initiator | Depth | Status | URL
-- | -- | -- | --
agent-gbs | 3 | loading | https://download.oracle.com/java/19/latest/jdk-19_linux-aarch64_bin.rpm
agent-gbs | 3 | loading | https://www.connecticallc.com/wp-content/themes/connectica/vid/digital-marketing-agency.webm
agent-gbs | 2 | loading | https://cdn01.foxitsoftware.com/pub/foxit/datasheet/reader/en_us/Foxit-PDF-Reader.pdf
agent-gbs | 3 | loading | https://download.oracle.com/java/19/latest/jdk-19_windows-x64_bin.msi
agent-gbs | 3 | loading | https://download.oracle.com/java/19/latest/jdk-19_linux-x64_bin.rpm
agent-gbs | 3 | loading | https://www.connecticallc.com/wp-content/themes/connectica/vid/digital-marketing-agency.mp4

https://twitter.com/smokingwheels/status/1579592233032253443

pr0vieh commented 1 year ago

This is not a bug report, this is a feature request! Yacy needs to become more "Internet friendly" and prevent crawling only one page 10k times! (This leads to network blocks, HTTP code 429 and, in the worst case, abuse reports from providers.) It can't be a good Internet index when all of today's relevant big players stop massive crawling with Cloudflare and co.!

Crawl delays from robots.txt can only be respected if we allow 1 request per IP/domain for each local crawler! This massively slows down crawling on the local crawler, for sure, but the old speed can be regained with remote crawlers! More distribution is needed overall!
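For illustration, reading a Crawl-delay from robots.txt could look like the sketch below. It uses Python's standard `urllib.robotparser`, not YaCy's own parser. The helper name `crawl_delay_for`, the 0.5 s fallback and the user-agent string `yacybot` are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_for(robots_txt, user_agent, default=0.5):
    """Return the Crawl-delay (seconds) that applies to user_agent.

    Falls back to `default` (an assumed value) when robots.txt
    does not declare a delay for this agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


# Example robots.txt content, made up for this sketch.
robots = "User-agent: yacybot\nCrawl-delay: 10\n\nUser-agent: *\nCrawl-delay: 2\n"
print(crawl_delay_for(robots, "yacybot"))
```

With a delay like this per host, a single local crawler can only keep one request per host in flight, which is why spreading the work over many hosts or peers matters.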

My local crawler is crawling 3 domains with 10k pages, and this is sad and slow! Why not crawl 10k domains with 1 page each, all at once, distributed via remote crawl?
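A rough sketch of what I mean by spreading requests over domains: take the pending URLs and reorder them round-robin by host, so that consecutive requests go to different hosts. The function name `interleave_by_host` is made up and this is not how YaCy's queues are implemented, it only shows the ordering.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def interleave_by_host(urls):
    """Reorder URLs so consecutive entries hit different hosts where possible."""
    queues = OrderedDict()  # host -> queue of its URLs, in first-seen order
    for url in urls:
        queues.setdefault(urlsplit(url).hostname, deque()).append(url)

    ordered = []
    while queues:
        # One pass takes at most one URL from every host that still has work.
        for host in list(queues):
            ordered.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return ordered


pending = [
    "http://a.example/1", "http://a.example/2", "http://a.example/3",
    "http://b.example/1", "http://c.example/1", "http://c.example/2",
]
for url in interleave_by_host(pending):
    print(url)
```

The more distinct hosts there are in the queue, the longer the natural gap between two requests to the same host, without the crawler as a whole slowing down.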

When a domain is crawled via remote crawl, why not automatically pass the URLs found there on to remote crawl for the next step? This opens the "Limiter Crawl Queue (remote Crawl Distributor)" to more than only the "last crawl depth".

pr0vieh commented 1 year ago

I found a reference from Googlebot for exactly this behavior: https://developers.google.com/search/docs/crawling-indexing/reduce-crawl-rate#let-google-reduce-crawl

smokingwheels commented 1 year ago

Are you Running on Windows or Linux?

Do you run a Pi-hole? https://docs.pi-hole.net/main/basic-install/ My blocklists are here: https://github.com/smokingwheels/smokingwheels.github.io From the tests I have done, it improves crawling, with less noise in results.

Yacy is limited to crawling at 120 ppm per domain or IP. Can someone provide a reference that explains this in English, please?

Initiator | Depth | Status | URL
-- | -- | -- | --
agent-gosogir-w-46 | 5 | loading | https://www.cnbc.com/quotes/NSC