yacy / yacy_search_server

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance
http://yacy.net

How to remove crawl speed limitation? #648

Closed McXD closed 1 month ago

McXD commented 4 months ago

I am using YaCy to index files stored on my own sites, primarily company filings downloaded from EDGAR. When I started the crawl, I noticed that the speed is capped at 240 pages per minute (PPM). The 'Load Web Page, Crawl' page states:

No more than four pages are loaded from the same host in one second (not more than 120 documents per minute) to limit the load on the target server.

Since I am crawling my own server, throttling the load is not a concern. How can I remove or adjust this limit to increase the crawl speed?

Any help is appreciated!

Orbiter commented 1 month ago

The limit is not 240 PPM overall but 240 PPM per host. It is there to protect the target host and to avoid complaints from host owners. It still allows loading 14400 pages from one host in one hour, which is usually far more than the host has to offer.
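To illustrate what a per-host limit means in practice, here is a minimal sketch of a throttle that spaces out requests to the same host while leaving other hosts unaffected. It is not YaCy's actual implementation; the class name `PerHostThrottle`, the method `delay_for`, and the example URLs are hypothetical, and only the 240 PPM figure comes from this thread.

```python
from urllib.parse import urlsplit


class PerHostThrottle:
    """Hypothetical per-host throttle; an illustration, not YaCy's code."""

    def __init__(self, pages_per_minute=240):
        # Minimum spacing in seconds between two loads from the same host.
        self.min_interval = 60.0 / pages_per_minute
        # Maps host name -> earliest time at which the next load may start.
        self.next_allowed = {}

    def delay_for(self, url, now):
        """Return how many seconds to wait before loading url, reserving that slot."""
        host = urlsplit(url).hostname
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_interval
        return start - now


throttle = PerHostThrottle()
first = throttle.delay_for("http://a.example/1", now=0.0)   # first load from host a
second = throttle.delay_for("http://a.example/2", now=0.0)  # same host, must wait
other = throttle.delay_for("http://b.example/1", now=0.0)   # different host, no wait
```

At 240 PPM the spacing is 0.25 seconds per host, so the second request to the same host is delayed while the request to a different host goes out immediately.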

If you run a wide crawl (e.g. 100 hosts at the same time), the combined limit is 24000 pages per minute. That should be enough.
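The figures above follow directly from the per-host cap. A small sketch of the arithmetic, assuming the cap scales linearly with the number of hosts crawled in parallel (the function name `crawl_ceiling` is hypothetical):

```python
PER_HOST_PPM = 240  # per-host cap stated in this thread


def crawl_ceiling(hosts, minutes=1):
    """Upper bound on pages loaded from `hosts` parallel hosts in `minutes` minutes."""
    return PER_HOST_PPM * hosts * minutes


print(crawl_ceiling(1, minutes=60))  # one host for one hour: 14400
print(crawl_ceiling(100))            # 100 hosts for one minute: 24000
```

This is only an upper bound; real throughput also depends on the crawler's hardware, the network, and how fast each host responds.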

The limit does NOT apply if you are running YaCy in an intranet. In that case you own the hosts and can put as much load on them as you want.

Without the limit, YaCy would be a DoS tool. We do not want YaCy to be used for that. Therefore the per-host limit should stay.