webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Issue crawling a web property with big PDFs #676

Closed: benoit74 closed this issue 2 weeks ago

benoit74 commented 2 weeks ago

We (Kiwix) are struggling to crawl https://www.survivorlibrary.com/index.php/main-library-index/.

The problem is linked to big files.

We have already set up include rules to only include PDFs and to exclude ZIP files, which are known to be too big, but we now realize there are also some huge PDFs.
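
As a rough sketch of what that scoping looks like (not our exact invocation; the --include / --exclude regex values are illustrative assumptions):

docker run -v $PWD/output:/output --name crawlme --rm webrecorder/browsertrix-crawler:1.3.0-beta.0 crawl --url "https://www.survivorlibrary.com/index.php/main-library-index/" --cwd /output --scopeType host --include ".*\.pdf$" --exclude ".*\.zip$"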

For instance, the problem occurs when crawling https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf, which is a ~837MB file.

Logs:

{"timestamp":"2024-09-02T14:18:25.817Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf"}}
{"timestamp":"2024-09-02T14:18:25.817Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":2332,"total":14591,"pending":1,"failed":2,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-09-02T14:18:25.673Z\",\"extraHops\":0,\"url\":\"https:\\/\\/www.survivorlibrary.com\\/library\\/corn_and_corn_improvement_1955.pdf\",\"added\":\"2024-09-02T12:33:36.057Z\",\"depth\":2}"]}}
{"timestamp":"2024-09-02T14:18:55.819Z","logLevel":"warn","context":"fetch","message":"Direct fetch capture attempt timed out","details":{"seconds":30,"page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","workerid":0}}
{"timestamp":"2024-09-02T14:18:55.820Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","workerid":0}}
{"timestamp":"2024-09-02T14:18:58.268Z","logLevel":"warn","context":"recorder","message":"Large streamed written to WARC, but not returned to browser, requires reading into memory","details":{"url":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","actualSize":893091154,"maxSize":5000000}}
{"timestamp":"2024-09-02T14:19:33.248Z","logLevel":"info","context":"writer","message":"Rollover size exceeded, creating new WARC","details":{"size":1258552021,"oldFilename":"rec-36c67859a2e1-20240902141737592-0.warc.gz","newFilename":"rec-36c67859a2e1-20240902141933248-0.warc.gz","rolloverSize":1000000000,"id":"0"}}
{"timestamp":"2024-09-02T14:20:13.553Z","logLevel":"error","context":"browser","message":"Browser disconnected (crashed?), interrupting crawl","details":{}}
{"timestamp":"2024-09-02T14:20:13.554Z","logLevel":"warn","context":"recorder","message":"Failed to load response body","details":{"url":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","networkId":"7FFA1047E29E7008029AD4F1593A6C48","type":"exception","message":"Protocol error (Fetch.getResponseBody): Target closed","stack":"TargetCloseError: Protocol error (Fetch.getResponseBody): Target closed\n    at CallbackRegistry.clear (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:69:36)\n    at CdpCDPSession._onClosed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/CDPSession.js:98:25)\n    at #onClose (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Connection.js:163:21)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/node/NodeWebSocketTransport.js:43:30)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)\n    at WebSocket.onClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:220:9)\n    at WebSocket.emit (node:events:519:28)\n    at WebSocket.emitClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/websocket.js:272:10)\n    at Socket.socketOnClose (/app/node_modules/puppeteer-core/node_modules/ws/lib/websocket.js:1341:15)\n    at Socket.emit (node:events:519:28)","page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","workerid":0}}
{"timestamp":"2024-09-02T14:20:13.554Z","logLevel":"error","context":"general","message":"Page Load Failed, skipping page","details":{"msg":"Protocol error (Page.navigate): Target closed","loadState":0,"page":"https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf","workerid":0}}
{"timestamp":"2024-09-02T14:20:13.609Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2024-09-02T14:20:13.754Z","logLevel":"info","context":"general","message":"Saving crawl state to: /output/.tmpo992w83h/collections/crawl-20240902123019192/crawls/crawl-20240902142013-36c67859a2e1.yaml","details":{}}
{"timestamp":"2024-09-02T14:20:13.760Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":2333,"total":14591,"pending":0,"failed":3,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2024-09-02T14:20:13.761Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-09-02T14:20:13.761Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: interrupted","details":{}}

It looks like the direct fetch times out, which is not a surprise since its timeout is 30s. The crawler then tries to download the file "normally", but a puppeteer disconnection occurs after some time.

Note that we have other big files where the direct download timed out but the "normal crawl" managed to complete.

I cannot reproduce the problem on my machine, so it is probably linked to a memory issue or something else specific to the machine where the bug occurs.

For the record, here is the command I used to try to reproduce the problem (the big PDF unfortunately downloads properly):

docker run -v $PWD/output:/output --name crawlme --rm  webrecorder/browsertrix-crawler:1.3.0-beta.0 crawl --url "https://www.survivorlibrary.com/index.php/Farming_Corn" --cwd /output --depth 1 --scopeType host

Even if we could simply run this crawl on another machine, I wonder if it wouldn't make more sense to allow customizing FETCH_TIMEOUT_SECS with a CLI flag, so that the direct download does not fail just due to a timeout when we know it has good reasons to take long to complete. Would that have any adverse side effect (aside from the risk that the crawl takes a long time to detect a direct fetch that has really timed out)?
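
To illustrate the idea, such a flag could sit next to the existing page --timeout; the --fetchTimeout flag below is purely hypothetical and does not exist in the crawler today:

docker run -v $PWD/output:/output --name crawlme --rm webrecorder/browsertrix-crawler:1.3.0-beta.0 crawl --url "https://www.survivorlibrary.com/index.php/main-library-index/" --cwd /output --scopeType host --fetchTimeout 600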

benoit74 commented 2 weeks ago

Edit: it looks like the PDF has in fact been downloaded, or at least we find it in the WARC/ZIM, so the crash happens right after the file is saved to the WARC.

I managed to reproduce the problem locally by increasing the "normal" timeout; my previous attempt probably just timed out after 90s, so the file was not really downloaded.

The proper command to be "sure" to reproduce the bug (note the --timeout 9999; the value is a bit silly but at least "sufficient"):

docker run -v $PWD/output:/output --name crawlme --rm  webrecorder/browsertrix-crawler:1.3.0-beta.0 crawl --url "https://www.survivorlibrary.com/index.php/Farming_Corn" --cwd /output --depth 1 --scopeType host --timeout 9999

ikreymer commented 2 weeks ago

Thanks for the consistently reproducible command! I think the simplest thing to do is to not limit fetch() to 30 seconds; that doesn't seem right. It should probably just use the same timeout as the page load, or we can try to be more clever and apply the timeout to the first byte received rather than to completion. We definitely don't need to be using the browser to load an 800MB+ PDF! But we should also figure out why it is crashing / disconnecting here, since it's not trying to render the PDF, but actually downloading it to /dev/null...
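
As a curl analogy (not the crawler's code), the difference between the two timeout semantics is roughly:

# cap the whole transfer at 30s, so a large PDF on a slow link always fails (like the current direct fetch limit)
curl --max-time 30 -o /dev/null https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf

# progress-based alternative: only abort if no data arrives (under 1 byte/s) for 30s, so a slow but steady download can still finish
curl --speed-limit 1 --speed-time 30 -o /dev/null https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf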

ikreymer commented 2 weeks ago

Have a fix in #677