webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Crawler getting stuck on Page Crashed #391

Closed benoit74 closed 6 months ago

benoit74 commented 1 year ago

At Kiwix we have a crawl that got stuck without returning, with 0.11.1 (i.e. with #385 merged). A last log line is output, and then the process is still up but nothing more seems to be happening.

Launch command (note that I modified the userAgentSuffix):

Running browsertrix-crawler crawl: crawl --failOnFailedSeed --waitUntil load --title Plotly Documentation --depth -3 --timeout 90 --scopeType domain --behaviors autoplay,autofetch,siteSpecific --behaviorTimeout 90 --sizeLimit 4294967296 --diskUtilization 90 --timeLimit 7200 --url https://plotly.com/python/ --userAgentSuffix this_is_not_public@kiwix.org --cwd /output/.tmpq4knpe8p --statsFilename /output/crawl.json
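For anyone who wants to try reproducing this outside of zimit, a roughly equivalent standalone invocation should look like the one below (a sketch only, assuming the standard webrecorder/browsertrix-crawler Docker image and a local ./crawls mount; the --cwd and --statsFilename paths above are specific to our container and are omitted here):

docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --url https://plotly.com/python/ --scopeType domain --waitUntil load --behaviors autoplay,autofetch,siteSpecific --behaviorTimeout 90 --timeout 90 --timeLimit 7200 --sizeLimit 4294967296 --diskUtilization 90 --failOnFailedSeed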

Version log line:

{"timestamp":"2023-09-20T01:40:03.312Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 0.11.1 (with warcio.js 1.6.2 pywb 2.7.4)","details":{}}

Last log line is:

{"timestamp":"2023-09-20T03:01:59.247Z","logLevel":"error","context":"worker","message":"Page Crashed","details":{"type":"exception","message":"Page crashed!","stack":"Error: Page crashed!\n    at #onTargetCrashed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:284:28)\n    at file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:153:41\n    at file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:248\n    at Array.map (<anonymous>)\n    at Object.emit (file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:232)\n    at CDPSessionImpl.emit (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/EventEmitter.js:82:22)\n    at CDPSessionImpl._onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:425:18)\n    at Connection.onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:255:25)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NodeWebSocketTransport.js:46:32)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)","page":"https://plotly.com/python/3d-surface-plots/","workerid":0}}

Do not hesitate to ask if more info is needed.

benoit74 commented 1 year ago

youzim.it task: https://farm.youzim.it/pipeline/1d123407-0aa3-4094-8355-c59cd5a41c52

ikreymer commented 1 year ago

Thanks for the report, will try to repro. It should have been able to continue after the page crash.

ikreymer commented 1 year ago

This has hopefully been fixed in 0.11.2. It's very hard to be 100% sure, but it should not happen again.

benoit74 commented 9 months ago

Some good and some bad news on this topic.

I confirm the crawler now continues after a page crash. That's great.

However, it looks like we are running into new problems (with 0.12.3 and 0.12.4) around page crashes.

Details are present in https://github.com/openzim/zimit/issues/266 and https://github.com/openzim/zimit/issues/283

Help or any suggestion on what to test to make progress on this topic would be welcome. The most important issue for us is probably the new situation in https://github.com/openzim/zimit/issues/283, where the crawler seems to return exit code 11 even though it actually faced a critical failure, not a limit. This is a problem for us because we consider hitting a limit to be "normal", so we continue processing and create our ZIM. It is more serious than an outright crawler crash because we are not alerted to the issue. If it is easy to identify and fix whatever led the crawler to "believe" it hit a limit, that would be a great enhancement.
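For context, our wrapper roughly does the following (a simplified Python sketch, not our actual code; treating exit code 11 as "a limit was hit" is our assumption about the crawler's behavior discussed above, and create_zim is a hypothetical placeholder for our ZIM creation step):

```python
import subprocess
import sys

# Exit code we assume browsertrix-crawler returns when a size/time limit
# was reached (as opposed to a crash) -- this is the assumption above.
LIMIT_HIT_EXIT_CODE = 11

def create_zim() -> None:
    # Hypothetical placeholder: in zimit this step converts the crawl
    # output into a ZIM file.
    pass

def run_crawl(cmd: list[str]) -> None:
    result = subprocess.run(cmd)
    if result.returncode in (0, LIMIT_HIT_EXIT_CODE):
        # Success, or a limit was hit: both are "normal" for us,
        # so we continue and build the ZIM from the (partial) crawl.
        create_zim()
    else:
        # Any other exit code is treated as a real crawler failure.
        sys.exit(result.returncode)
```

So if the crawler reports exit code 11 after a critical failure, we silently produce a ZIM from a broken crawl instead of failing loudly.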

One side question: is it possible to ask the crawler to stop on the first page crash (instead of trying to continue)?

benoit74 commented 6 months ago

I confirm that crawler 1.x seems to have solved this issue.

Thank you all for the great work that has gone into the 1.x release(s)!