yacy / yacy_search_server

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance
http://yacy.net

Had Java Terminate while crawling 2 times. #509

Open smokingwheels opened 2 years ago

smokingwheels commented 2 years ago

YaCy on Ubuntu 18.04 with Java 11, using my Pi-hole DNS server with block lists and a hosts file.

Not sure if these links will work:
https://github.com/smokingwheels/smokingwheels.github.io/tree/master/yacy/Logs/yacy/crawler%20crash%202
https://github.com/smokingwheels/smokingwheels.github.io/tree/master/yacy/Logs/yacy/crawler%20crash

I restarted YaCy both times and it did plenty of crawling OK. It's live on YouTube: https://youtu.be/XcHEU0yeubc

smokingwheels commented 2 years ago

On a fresh install, Java unloaded after a while. Last lines of the log:

I 2022/10/02 21:10:45 LOADER * CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL https://www.buncyblawg.com/robots.txt
I 2022/10/02 21:10:45 LOADER * CRAWLER ..Redirecting request to: https://buncyblawg.com/robots.txt

smokingwheels commented 2 years ago

Java unloaded after about 2 hours of running. The log has lots of warnings:

W 2022/10/05 17:39:32 ConcurrentLog * java.io.IOException: org.apache.solr.common.SolrException: Server error writing document id 6sWtjWOJsXRw to the index
java.io.IOException: org.apache.solr.common.SolrException: Server error writing document id 6sWtjWOJsXRw to the index
    at net.yacy.cora.federate.solr.connector.SolrServerConnector.add(SolrServerConnector.java:240)
    at net.yacy.cora.federate.solr.connector.MirrorSolrConnector.add(MirrorSolrConnector.java:204)
    at net.yacy.search.index.Fulltext.putDocument(Fulltext.java:377)
    at net.yacy.search.index.Segment.putDocument(Segment.java:583)
    at net.yacy.search.index.Segment.storeDocument(Segment.java:666)
    at net.yacy.search.Switchboard.storeDocumentIndex(Switchboard.java:3505)
    at net.yacy.search.Switchboard.storeDocumentIndex(Switchboard.java:3419)
    at net.yacy.search.Switchboard.lambda$new$0(Switchboard.java:1057)
    at net.yacy.kelondro.workflow.InstantBlockingThread.job(InstantBlockingThread.java:72)
    at net.yacy.kelondro.workflow.AbstractBlockingThread.run(AbstractBlockingThread.java:82)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.solr.common.SolrException: Server error writing document id 6sWtjWOJsXRw to the index
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:246)
    at org.apache.solr.update.processor.RunUpdateProcessorFactory$RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:73)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:256)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:495)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$0(DistributedUpdateProcessor.java:336)
    at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:336)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:222)
    at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:106)
    at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:343)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:291)
    at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:338)
    at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:244)
    at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:303)
    at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:283)
    at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:196)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:131)
    at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:122)
    at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:70)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:82)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:216)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2646)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:229)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:214)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:177)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:194)
    at net.yacy.cora.federate.solr.connector.SolrServerConnector.add(SolrServerConnector.java:237)
    ... 14 more
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:877)
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:891)
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1468)
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1464)
    at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:967)
    at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:342)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:294)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:241)
    ... 44 more
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=56417869 actual=89ef5f3b (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/mnt/1tb/yacygbs/DATA/INDEX/freeworld/SEGMENTS/solr_8_8_1/collection1/data/index/_evt.cfs") [slice=_evt.fdt]))
    at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:419)
    at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:547)
    at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.checkIntegrity(CompressingStoredFieldsReader.java:731)
    at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:644)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:228)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:105)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4760)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4364)
    at org.apache.solr.update.SolrIndexWriter.merge(SolrIndexWriter.java:201)
    at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:5923)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:682)
I 2022/10/05 17:39:32 SWITCHBOARD * *Indexed 1306 words in URL https://www.twitch.tv/p/en/legal/terms-of-service/ [6sWtjWOJsXRw] Description: Twitch.tv - Terms of Service MimeType: text/html | Charset: ISO-8859-1 | Size: 44438 bytes | LinkStorageTime: 4 ms | indexStorageTime: 3 ms
I 2022/10/05 17:39:32 HostQueue * forcing crawl-delay of 135 milliseconds for fit-pc.com: minimumDelta = 25, flux = 0, host.average = 879, robots.delay = 0, ((waitig = 439) - (timeSinceLastAccess = 304)) = 135
I 2022/10/05 17:39:32 HTCACHE * storing content of url https://planet.mysql.com/fr/?tag_search=474, 33696 bytes
I 2022/10/05 17:39:32 HTCACHE * storing content of url https://jspm.org/, 14259 bytes
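When a warning like this arrives as one long line, the useful part is the last "Caused by:" entry, since the outer exceptions only wrap it. A minimal sketch for pulling that chain out of a log excerpt, assuming the text is available as a string; `cause_chain` and `root_cause` are hypothetical helper names, not part of YaCy, Solr, or Lucene:

```python
import re

# Hypothetical helpers for log triage; not part of YaCy.
# Matches "Caused by: <exception class>: <message>", where the message ends
# at the first stack frame (" at pkg.Class.method("), at the next
# "Caused by:", or at the end of the text.
CAUSE_RE = re.compile(
    r"Caused by: ([\w.$]+): (.*?)(?= at [\w.$/<>]+\(| Caused by: |$)"
)

def cause_chain(log_text):
    """Return (exception class, message) pairs in the order they appear."""
    # Collapse all whitespace so wrapped and single-line traces parse alike.
    flat = " ".join(log_text.split())
    return [(m.group(1), m.group(2).strip()) for m in CAUSE_RE.finditer(flat)]

def root_cause(log_text):
    """Return the innermost (last) cause, or None if there is no chain."""
    chain = cause_chain(log_text)
    return chain[-1] if chain else None
```

Applied to the warning above, this should report `org.apache.lucene.index.CorruptIndexException` with the "checksum failed (hardware problem?)" message for `_evt.cfs`, rather than the outer `IOException` about writing the document.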

Orbiter commented 2 years ago

Could be a duplicate of https://github.com/yacy/yacy_search_server/issues/517

smokingwheels commented 2 years ago

I will test again soon.

> could be a double of #517

This might have fixed it.

smokingwheels commented 1 year ago

Upgraded today to yacy_v1.926_202210250010_309adb814. Will close for now.

smokingwheels commented 1 year ago

Had Java terminate again while crawling. I can't see anything. This is the tail end of the log file:

I 2022/11/11 20:12:40 HTCACHE * storing content of url https://rreview.ams3.digitaloceanspaces.com/wp-content/uploads/2022/10/19085316/Rave_Review_AW22-Product_Model_0063_08_Front-copy_Small-96x144.jpg, 2907 bytes
I 2022/11/11 20:12:40 SWITCHBOARD * Not Condensed Resource 'https://rreview.ams3.digitaloceanspaces.com/wp-content/uploads/2022/10/19085316/Rave_Review_AW22-Product_Model_0063_08_Front-copy_Small-96x144.jpg': indexing of media files not wanted by crawl profile
I 2022/11/11 20:12:40 SWITCHBOARD * Excluded 5 words in URL https://about.twitter.com/pt/who-we-are/twitter-for-good
I 2022/11/11 20:12:40 Fulltext * indexing: 05OhzOAfGlY4 https://about.twitter.com/pt/who-we-are/twitter-for-good
I 2022/11/11 20:12:40 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[05OhzOAfGlY4 (1749201726186979328)]} 0 1
I 2022/11/11 20:12:40 SWITCHBOARD * *Indexed 490 words in URL https://about.twitter.com/pt/who-we-are/twitter-for-good [05OhzOAfGlY4] Description: Sobre o Twitter | Twitter for Good MimeType: text/html | Charset: UTF-8 | Size: 8089 bytes | LinkStorageTime: 2 ms | indexStorageTime: 3 ms
I 2022/11/11 20:12:40 SWITCHBOARD * Excluded 9 words in URL https://www.ebay.ch/sch/171833/i.html?_nkw=mv
I 2022/11/11 20:12:40 Fulltext * indexing: Tt9tYaAh8DDg https://www.ebay.ch/sch/171833/i.html?_nkw=mv
I 2022/11/11 20:12:40 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[Tt9tYaAh8DDg (1749201726214242304)]} 0 4
I 2022/11/11 20:12:40 SWITCHBOARD * *Indexed 1226 words in URL https://www.ebay.ch/sch/171833/i.html?_nkw=mv [Tt9tYaAh8DDg] Description: mv in Ersatzteile & Werkzeuge | eBay MimeType: text/html | Charset: UTF-8 | Size: 27253 bytes | LinkStorageTime: 5 ms | indexStorageTime: 2 ms

smokingwheels commented 1 year ago

Had Java unload for some reason. End of the log:

W 2022/11/23 14:08:20 LOADER * HTCACHE contained response header, but not content for url https://tr-ex.me/translation/english-indonesian/opening
I 2022/11/23 14:08:20 SWITCHBOARD * Excluded 0 words in URL https://www.mitsubishielectric.co.jp/fa/products/rbt/robot/pmerit/others/index.html
I 2022/11/23 14:08:20 SWITCHBOARD * CRAWL: ADDED 58 LINKS FROM https://servicemanuals.us/harman-kardon/audio/hk-630-sm8.html, STACKING TIME = 1, PARSING TIME = 5
I 2022/11/23 14:08:20 Fulltext * indexing: 1iQ4K3StEAmo https://www.mitsubishielectric.co.jp/fa/products/rbt/robot/pmerit/others/index.html
I 2022/11/23 14:08:20 org.apache.solr.update.processor.LogUpdateProcessorFactory [collection1] webapp=null path=/update params={}{add=[1iQ4K3StEAmo (1750265967827484672)]} 0 0
I 2022/11/23 14:08:20 SWITCHBOARD * *Indexed 128 words in URL https://www.mitsubishielectric.co.jp/fa/products/rbt/robot/pmerit/others/index.html [1iQ4K3StEAmo] Description: SQシリーズ(生産終了機種) 旧製品 製品特長 産業用ロボット MELFA | 三菱電機 FA MimeType: text/html | Charset: UTF-8 | Size: 1493 bytes | LinkStorageTime: 1 ms | indexStorageTime: 1 ms
I 2022/11/23 14:08:20 SWITCHBOARD * Excluded 6 words in URL https://servicemanuals.us/harman-kardon/audio/hk-630-sm8.html

smokingwheels commented 1 year ago

I have slowed the crawl speed down to 500 PPM (pages per minute) to see how it goes.