yacy / yacy_search_server

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance
http://yacy.net

Crawling does not start (has workaround) #558

Open ArneBab opened 1 year ago

ArneBab commented 1 year ago

I built YaCy from source and started it with Java 8. Then I added crawls for two websites (https://www.draketo.de and https://www.1w6.org). But when I look at http://localhost:8090/Crawler_p.html, I see no crawling activity.

One hint I found was a set of log messages about high system load (which is accurate: my system load is high):

I 2023/02/08 20:06:00 SWITCHBOARD * postprocessing deactivated: field process_sxt is not enabled
I 2023/02/08 20:06:00 SWITCHBOARD * postprocessing deactivated: too high load (13.65) > 2.5, to force change field postprocessing.maximum_load
I 2023/02/08 20:06:00 SWITCHBOARD * postprocessing deactivated: constraints violated

In http://localhost:8090/Crawler_p.html I see

idle    00:00   
pending:    collection=0    webgraph=0   
Traffic (Crawler)   0.11 MB      
Load    16,45    

Is there something I need to do to get YaCy to crawl my pages?

I waited for more than an hour (the load was consistently high during that time).

okybaca commented 1 year ago

YaCy has resource-limiting mechanisms that restrict some functions once a certain load level is reached. A load of 16.45 would be insane on a typical machine, but load is relative to the number of CPUs: on an 18-CPU computer, a load of 16.45 is still below capacity (a load of 18 would mean that all processors are busy and no processes are waiting). These limits can be set in yacy.conf. The one mentioned in your log is postprocessing.maximum_load=2.5; you can check 50_localcrawl_loadprereq=8.0 as well. Does changing them help?
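To make the arithmetic above concrete, here is a minimal Python sketch. It assumes both settings act as upper bounds on the total system load average, which is what the log line suggests but is not confirmed for 50_localcrawl_loadprereq. The function names are made up for illustration and are not part of YaCy.

```python
# Values quoted in this thread. How YaCy applies them internally
# is an assumption: here each one is treated as a cap on total load.
LIMITS = {
    "postprocessing.maximum_load": 2.5,
    "50_localcrawl_loadprereq": 8.0,
}


def blocked_settings(load, limits=LIMITS):
    """Return the names of the settings whose limit the load exceeds."""
    return sorted(name for name, limit in limits.items() if load > limit)


def load_per_cpu(load, cpu_count):
    """Normalize a system load average by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load / cpu_count


if __name__ == "__main__":
    # Load shown on the crawler page in the original report.
    print(blocked_settings(16.45))
    # The same load spread over the 18 CPUs of the example machine.
    print(round(load_per_cpu(16.45, 18), 2))
```

With the reported load of 16.45, both limits are exceeded, even though the load per CPU on an 18-CPU machine stays below 1.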

ArneBab commented 1 year ago

Yes, this got Yacy working for me!

Can this be changed to use the load per core instead of the total load? At least on desktop that’s the more meaningful metric.
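For illustration, a per-core check could look roughly like the Python sketch below. This is not how YaCy works today, and the helper names are hypothetical. Until something like it exists, the second helper shows how a per-core limit could be converted into the total-load value that the existing settings appear to expect.

```python
import os


def current_load_per_core():
    """Return the 1-minute load average divided by the CPU count.

    os.getloadavg() is only available on Unix-like systems.
    """
    one_minute, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    return one_minute / cpus


def scaled_limit(per_core_limit, cpu_count):
    """Convert a per-core load limit into a total-load limit."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return per_core_limit * cpu_count


if __name__ == "__main__":
    if hasattr(os, "getloadavg"):
        print("load per core:", round(current_load_per_core(), 2))
    # A per-core limit of 2.5 on an 18-CPU machine, expressed as total load.
    print("total-load limit:", scaled_limit(2.5, 18))
```

On the 18-CPU machine from the example, a per-core limit of 2.5 corresponds to a total load of 45.0.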

okybaca commented 1 year ago

I have no idea. I'd guess not. Do you know, @Orbiter?

okybaca commented 1 year ago

The limits should definitely be mentioned in the documentation.