yacy / yacy_search_server

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance
http://yacy.net

Crawling very slow #542

Closed smokingwheels closed 1 year ago

smokingwheels commented 2 years ago

Ubuntu 16.04, Java 11, YaCy on an HDD. YaCy has... let's see if this works... crawling sometimes resumes.

| Queue | Size |
| -- | -- |
| Local Crawler | 6,735,930 |
| Limit Crawler | 0 |
| Remote Crawler | 0 |
| No-Load Crawler | 0 |
| Loader (1,000) | 1,000 |

Progress

| Queue | Size |
| -- | -- |
| [Local Crawler](http://gsw.undo.it:8090/IndexCreateQueues_p.html?stack=LOCAL) | 6,735,930 |
| Limit Crawler | 0 |
| [Remote Crawler](http://gsw.undo.it:8090/IndexCreateQueues_p.html?stack=REMOTE) | 0 |
| [No-Load Crawler](http://gsw.undo.it:8090/IndexCreateQueues_p.html?stack=NOLOAD) | 0 |
| [Loader](http://gsw.undo.it:8090/IndexCreateLoaderQueue_p.html) ([1,000](http://gsw.undo.it:8090/PerformanceQueues_p.html#ThreadPoolSettings)) | 1,000 |

Index Size

| Database | Entries | Segments |
| -- | -- | -- |
| Documents [solr search api](http://gsw.undo.it:8090/solr/select?core=collection1&q=*:*&start=0&rows=3) | 9,147,614 | 33 |
| Webgraph Edges [solr search api](http://gsw.undo.it:8090/solr/select?core=webgraph&q=*:*&start=0&rows=3) | 0 | 0 |
| Citations (reverse link index) | 0 | 0 |
| RWIs (P2P Chunks) | 6,232,924 | 12 |

Progress

| Indicator | Level |
| -- | -- |
| Speed / PPM (Pages Per Minute) | 30000 PPM, 0.5 LF, 20 MH ([min](http://gsw.undo.it:8090/Crawler_p.html?crawlingPerformance=minimum)/[max](http://gsw.undo.it:8090/Crawler_p.html?crawlingPerformance=maximum)) |
| Crawler PPM | 0 |
| Postprocessing Progress | idle, 00:00, pending: collection=0 webgraph=0 |
| Traffic (Crawler) | 4149.99 MB |
| Load | |

Log:

    I 2022/11/23 10:57:58 WorkTables * executing url: http://localhost:8090/Load_RSS_p.html?url=https://techcommunity.microsoft.com/gxcuf89792/rss/Community?interaction.style=review&indexAllItemContent=&apicall_pk=000000000665
    W 2022/11/23 10:57:58 ConcurrentLog * java.net.SocketTimeoutException: Read timed out
    java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
        at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
        at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
        at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
        at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
        at net.yacy.cora.protocol.http.HTTPClient.GETbytes(HTTPClient.java:456)
        at net.yacy.cora.protocol.http.HTTPClient.GETbytes(HTTPClient.java:403)
        at net.yacy.data.WorkTables.execAPICalls(WorkTables.java:358)
        at net.yacy.search.Switchboard.schedulerJob(Switchboard.java:2467)
        at net.yacy.search.Switchboard$8.jobImpl(Switchboard.java:1125)
        at net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:64)
        at net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:215)
    I 2022/11/23 10:57:58 HostBalancer * (re-)initialized the round-robin queue; 23 hosts.

Thread dump:

    THREADS WITH STATES: BLOCKED

    Thread= CrawlQueues.Loader(https://static-data2.manualslib.com/pdf3/69/6806/680543-cisco/images/dmc250_1_bg.jpg) daemon id=8689 BLOCKED
        at net.yacy.kelondro.blob.MapHeap.insert(MapHeap.java:176) [synchronized (this) {]
        at net.yacy.crawler.data.Cache.store(Cache.java:289)
        at net.yacy.repository.LoaderDispatcher.loadInternal(LoaderDispatcher.java:274)
        at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:181)
        at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:152)
        at net.yacy.crawler.data.CrawlQueues$Loader.run(CrawlQueues.java:756)

    Thread= CrawlQueues.Loader(https://www.manualslib.com/manual/2419741/Silvercrest-Sfw-350-C1.html) daemon id=3931 BLOCKED
        at java.util.concurrent.locks.ReentrantLock.tryLock(ReentrantLock.java:442)
        at net.yacy.kelondro.blob.Compressor.insert(Compressor.java:318)
        at net.yacy.crawler.data.Cache.store(Cache.java:277)
        at net.yacy.repository.LoaderDispatcher.loadInternal(LoaderDispatcher.java:274)
        at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:181)
        at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:152)
        at net.yacy.crawler.data.CrawlQueues$Loader.run(CrawlQueues.java:756)

    Thread= Switchboard.addToIndex:https://forum.purseblog.com/threads/chanel-18s-emerald-green-advice-needed.1050884/ id=7941 BLOCKED
        at net.yacy.kelondro.blob.MapHeap.get(MapHeap.java:325) [synchronized (this) {]
        at net.yacy.kelondro.blob.MapHeap.get(MapHeap.java:263)
        at net.yacy.crawler.data.Cache.getResponseHeader(Cache.java:341)
        at net.yacy.repository.LoaderDispatcher.loadFromCache(LoaderDispatcher.java:301)
        at net.yacy.repository.LoaderDispatcher.loadInternal(LoaderDispatcher.java:218)
        at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:181)
        at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:152)
        at net.yacy.search.Switchboard$21.run(Switchboard.java:3744)

    Thread= EmbeddedSolrConnector.SolrQueryResponse2SolrDocumentList: 9382507 daemon id=11227 BLOCKED
        at org.apache.solr.search.LRUCache.computeIfAbsent(LRUCache.java:267)
        at org.apache.solr.search.SolrDocumentFetcher.doc(SolrDocumentFetcher.java:225)
        at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:650)
        at org.apache.solr.util.SolrPluginUtils.optimizePreFetchDocs(SolrPluginUtils.java:259)
        at org.apache.solr.handler.component.QueryComponent.doPrefetch(QueryComponent.java:520)
        at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1523)
        at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:390)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:369)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:216)
        at net.yacy.cora.federate.solr.connector.EmbeddedSolrConnector.query(EmbeddedSolrConnector.java:205)
        at net.yacy.cora.federate.solr.connector.EmbeddedSolrConnector.getDocumentListByParams(EmbeddedSolrConnector.java:357)
        at net.yacy.cora.federate.solr.connector.MirrorSolrConnector.getDocumentListByParams(MirrorSolrConnector.java:316)
        at net.yacy.cora.federate.solr.connector.AbstractSolrConnector.getURL(AbstractSolrConnector.java:483)
        at net.yacy.search.index.Fulltext.getURL(Fulltext.java:562)
        at net.yacy.search.Switchboard.getURL(Switchboard.java:1875)
        at net.yacy.crawler.retrieval.HTTPLoader.createRequestheader(HTTPLoader.java:313)
        at net.yacy.crawler.retrieval.HTTPLoader.load(HTTPLoader.java:365)
        at net.yacy.crawler.retrieval.HTTPLoader.load(HTTPLoader.java:85)
        at net.yacy.repository.LoaderDispatcher.loadInternal(LoaderDispatcher.java:243)
        at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:181)
        at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:152)
        at net.yacy.crawler.data.CrawlQueues$Loader.run(CrawlQueues.java:756)

    Thread= CrawlQueues.Loader(https://ow.ly/JSCR50Lwvw9) daemon id=9733 BLOCKED
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at net.yacy.crawler.data.CrawlQueues$Loader.run(CrawlQueues.java:732)

    Thread= Switchboard.addToIndex:https://forum.snapcraft.io/t/request-for-human-review-for-auto-connect-of-interface-personal-files-for-microstack-snap/32593/20 id=9180 BLOCKED
        at net.yacy.kelondro.blob.MapHeap.insert(MapHeap.java:176) [synchronized (this) {]
        at net.yacy.crawler.data.Cache.store(Cache.java:289)
        at net.yacy.repository.LoaderDispatcher.loadInternal(LoaderDispatcher.java:274)
        at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:181)
        at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:152)
        at net.yacy.search.Switchboard$21.run(Switchboard.java:3744)

    Thread= qtp22446425-108-acceptor-1@29fdf6c8-httpd:8090@2416a51{HTTP/1.1, (http/1.1)}{0.0.0.0:8090} id=108 BLOCKED
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:232)
        at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:388)
        at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:702)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
        at java.lang.Thread.run(Thread.java:750)

    Thread= CrawlQueues.Loader(https://saveatrain.com/) daemon id=11645 BLOCKED
        at net.yacy.kelondro.blob.MapHeap.get(MapHeap.java:325) [synchronized (this) {]
        at net.yacy.kelondro.blob.MapHeap.get(MapHeap.java:263)
        at net.yacy.crawler.data.Cache.getResponseHeader(Cache.java:341)
        at net.yacy.repository.LoaderDispatcher.loadFromCache(LoaderDispatcher.java:301)
        at net.yacy.repository.LoaderDispatcher.loadInternal(LoaderDispatcher.java:218)
        at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:181)
        at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:152)
        at net.yacy.crawler.data.CrawlQueues$Loader.run(CrawlQueues.java:756)

    Thread= CrawlQueues.Loader(https://www.vases.lv/sites/default/files/Par_mums/Logo_ISO27001.jpg) daemon id=6771 BLOCKED
        at java.io.File.length(File.java:985)
        at net.yacy.kelondro.blob.ArrayStack.length(ArrayStack.java:436)
        at net.yacy.kelondro.blob.ArrayStack.executeLimits(ArrayStack.java:422)
        at net.yacy.kelondro.blob.ArrayStack.insert(ArrayStack.java:808)
        at net.yacy.kelondro.blob.Compressor.flushOne(Compressor.java:428)
        at net.yacy.kelondro.blob.Compressor.insert(Compressor.java:333)
        at net.yacy.crawler.data.Cache.store(Cache.java:277)
        at net.yacy.repository.LoaderDispatcher.loadInternal(LoaderDispatcher.java:274)
        at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:181)
        at net.yacy.repository.LoaderDispatcher.load(LoaderDispatcher.java:152)
        at net.yacy.crawler.data.CrawlQueues$Loader.run(CrawlQueues.java:756)

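
In the dump above, four of the nine BLOCKED threads wait on the `synchronized` sections of `MapHeap.insert` and `MapHeap.get`. Tallying threads by their top stack frame makes that kind of hotspot easier to spot in a longer dump. The following is a minimal Python sketch, not part of YaCy; the function name `top_blocked_frames` and the abbreviated `SAMPLE` text are illustrative assumptions, and the parser only assumes the `Thread= ... BLOCKED at ...` layout shown in this issue.

```python
import re
from collections import Counter

# Matches the first stack frame after a thread's state, e.g.
# "... id=8689 BLOCKED at net.yacy.kelondro.blob.MapHeap.insert(MapHeap.java:176)".
# \s+ lets it work whether frames are separated by spaces or newlines.
TOP_FRAME = re.compile(r"\bBLOCKED\s+at\s+([\w.$]+)\(")


def top_blocked_frames(dump: str) -> Counter:
    """Count BLOCKED threads by the method at the top of their stack."""
    counts = Counter()
    # Each thread entry starts with "Thread= "; the text before the first
    # marker (the "THREADS WITH STATES" header) is skipped.
    for entry in dump.split("Thread= ")[1:]:
        match = TOP_FRAME.search(entry)
        if match:
            counts[match.group(1)] += 1
    return counts


# Abbreviated sample: four entries from the dump in this issue, cut after
# the first one or two frames (hypothetical shortening for illustration).
SAMPLE = (
    "THREADS WITH STATES: BLOCKED "
    "Thread= CrawlQueues.Loader(https://saveatrain.com/) daemon id=11645 BLOCKED "
    "at net.yacy.kelondro.blob.MapHeap.get(MapHeap.java:325) [synchronized (this) {] "
    "at net.yacy.kelondro.blob.MapHeap.get(MapHeap.java:263) "
    "Thread= CrawlQueues.Loader(https://ow.ly/JSCR50Lwvw9) daemon id=9733 BLOCKED "
    "at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) "
    "Thread= Switchboard.addToIndex:https://forum.snapcraft.io/t/request-for-human-review-for-auto-connect-of-interface-personal-files-for-microstack-snap/32593/20 id=9180 BLOCKED "
    "at net.yacy.kelondro.blob.MapHeap.insert(MapHeap.java:176) [synchronized (this) {] "
    "at net.yacy.crawler.data.Cache.store(Cache.java:289) "
    "Thread= CrawlQueues.Loader(https://static-data2.manualslib.com/pdf3/69/6806/680543-cisco/images/dmc250_1_bg.jpg) daemon id=8689 BLOCKED "
    "at net.yacy.kelondro.blob.MapHeap.insert(MapHeap.java:176) [synchronized (this) {] "
)

if __name__ == "__main__":
    for frame, count in top_blocked_frames(SAMPLE).most_common():
        print(f"{count:3d}  {frame}")
```

Running the same function over several dumps taken a few seconds apart would show whether the same lock stays contended or whether the blocking is transient.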
frankenstein91 commented 1 year ago

As I see a lot of HTTPS requests, please monitor `cat /proc/sys/kernel/random/entropy_avail`.
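
The suggestion presumably rests on the idea that many TLS connections draw on the kernel's random pool, so a low value while the crawler stalls would be a clue. Instead of running `cat` by hand, the value can be sampled repeatedly. This is a minimal Python sketch under that assumption; the helper names `read_entropy` and `sample_entropy` are hypothetical, and no particular threshold is implied.

```python
import time
from pathlib import Path

# Kernel file named in the comment above (Linux only).
ENTROPY_PATH = Path("/proc/sys/kernel/random/entropy_avail")


def read_entropy(path: Path = ENTROPY_PATH) -> int:
    """Return the current value of the entropy counter as an integer."""
    return int(path.read_text().strip())


def sample_entropy(samples: int = 5, interval: float = 1.0,
                   path: Path = ENTROPY_PATH) -> list:
    """Read the counter `samples` times, sleeping `interval` seconds between reads."""
    readings = []
    for i in range(samples):
        readings.append(read_entropy(path))
        if i < samples - 1:
            time.sleep(interval)
    return readings


if __name__ == "__main__":
    if ENTROPY_PATH.exists():
        print("entropy_avail:", read_entropy())
    else:
        print("entropy_avail is not available on this system")
```

Calling `sample_entropy(samples=60, interval=1.0)` while the crawler is stalled and comparing the minimum against an idle baseline would show whether the pool actually drains during crawling.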

smokingwheels commented 1 year ago

The queue gets stuck with open connections that are trying to download large files, e.g. tar.gz, mp4:

Loader (1,000) | 1,000 |

System: YaCy version: 1.924/9000; Uptime: 0 days 07:10; Java version: 11.0.20.1; Processors: 8; Load: 0.3798828125; Threads: 58/21, peak: 349, total: 13328

Increasing the Loader limit helps; pausing the crawler and then restarting YaCy clears the queue.
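
One way to keep such files out of the loader in the first place would be a URL exclusion filter. The sketch below builds a regular expression for the two extensions mentioned above and checks it against sample URLs. It is an assumption that the crawler in use accepts a full-URL regular expression as a must-not-match filter; the names `build_exclusion_pattern` and `is_excluded` and the example URLs are hypothetical, and whether filtering these files resolves the stall is not confirmed in this thread.

```python
import re

# Extensions named in the comment above; extend the tuple as needed.
LARGE_FILE_EXTENSIONS = ("tar.gz", "mp4")


def build_exclusion_pattern(extensions) -> str:
    """Build a full-URL regex matching URLs that end in one of the extensions."""
    alternatives = "|".join(re.escape(ext) for ext in extensions)
    # Anything, a literal dot, one extension, then an optional query string.
    return rf".*\.(?:{alternatives})(?:\?.*)?"


PATTERN = build_exclusion_pattern(LARGE_FILE_EXTENSIONS)


def is_excluded(url: str) -> bool:
    """True if the whole URL matches the exclusion pattern."""
    return re.fullmatch(PATTERN, url) is not None


if __name__ == "__main__":
    print(PATTERN)
    for url in (
        "https://example.org/backups/site.tar.gz",
        "https://example.org/media/clip.mp4?download=1",
        "https://example.org/manual/index.html",
    ):
        print(is_excluded(url), url)
```

The pattern is case-sensitive and uses only constructs that Python's `re` and Java's `java.util.regex` both support, so it should carry over to a Java-based filter field unchanged, though that should be checked against the actual crawl settings.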

| Indicator | Level |
| -- | -- |
| Speed / PPM (Pages Per Minute) | (min/max) |
| Crawler PPM | 0 |
| Postprocessing Progress | idle, 00:00, pending: collection=0 webgraph=0 |
| Traffic (Crawler) | 4149.99 MB |
| Load | |

Loader (9,000) | 0 |