webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Stalled after puppeteer error #166

Open rgaudin opened 2 years ago

rgaudin commented 2 years ago

Pretty sure this has happened in the past: Puppeteer raises an error, and in this case the process just hangs forever. Not sure what the optimal behavior would be, but even just crashing/exiting would be better for my use case.

Page Load Failed: https://www.almaany.com/ar/dict/ar-ar/%D8%A3%D9%86%D8%B5%D8%A8%D8%A9/, Reason: Error: Timeout hit: 180000
Error: Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
    at CDPSession.send (/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:285:35)
    at ExecutionContext._ExecutionContext_evaluate (/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:210:46)
    at ExecutionContext.evaluate (/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:106:113)
    at IsolatedWorld.evaluate (/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/IsolatedWorld.js:174:24)
Page Load Failed: https://www.almaany.com/ar/dict/ar-ar/%D8%AA%D9%8E%D9%86%D9%8E%D8%A7%D8%B5%D9%8F%D8%A8/, Reason: Error: Timeout hit: 180000
Error: Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
    at CDPSession.send (/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:285:35)
    at ExecutionContext._ExecutionContext_evaluate (/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:210:46)
    at ExecutionContext.evaluate (/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:106:113)
    at IsolatedWorld.evaluate (/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/IsolatedWorld.js:174:24)
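
To illustrate the "exit rather than hang" behavior asked for above, here is a minimal sketch of a watchdog around a page task. It is not browsertrix-crawler's actual code: `withTimeout`, `runOrFail`, and `crawlPage` are hypothetical names, and the 180000 ms value is only borrowed from the timeout shown in the log.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical wrapper: run one page task and turn any error or hang into an exit code,
// so the caller can stop the process instead of waiting forever.
async function runOrFail(task, ms) {
  try {
    await withTimeout(task(), ms, "page task");
    return 0; // success
  } catch (err) {
    console.error(`Fatal: ${err.message}`);
    return 1; // failure: error thrown or watchdog fired
  }
}

// Usage (crawlPage is a placeholder for whatever loads and processes a page):
// runOrFail(() => crawlPage(url), 180000).then((code) => process.exit(code));
```

The point of the design is that a task that never settles, such as a call on a closed session, still produces a non-zero exit code that a supervisor can act on.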

ikreymer commented 2 years ago

@rgaudin do you have any repro steps? How long did it take for it to happen? Which site? I assume it would take a while, so probably hard to reproduce.

ikreymer commented 2 years ago

I have been able to reproduce the 'Page Load Failed' error, but the crawler does exit after that in my test. Hopefully I will be able to track down the hang eventually.

rgaudin commented 2 years ago
url=https://www.almaany.com/
include=https://www.almaany.com/ar/dict/ar-ar/.*
sizeLimit=4294967296
timeLimit=7200

Failing URL is https://www.almaany.com/ar/dict/ar-ar/أنصبة
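
For reference, the URL above is the decoded form of the first percent-encoded URL in the log (the log version also carries a trailing slash). A quick check using the standard `decodeURI` / `encodeURI` functions:

```javascript
// URL exactly as it appears in the "Page Load Failed" log line.
const encoded =
  "https://www.almaany.com/ar/dict/ar-ar/%D8%A3%D9%86%D8%B5%D8%A8%D8%A9/";

// decodeURI turns the UTF-8 percent-escapes back into Arabic characters
// while leaving URL-reserved characters such as ":" and "/" untouched.
const decoded = decodeURI(encoded);
console.log(decoded);

// Re-encoding yields the same string that was logged.
const roundTrips = encodeURI(decoded) === encoded;
console.log(roundTrips);
```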

I don't know how long it took to get there.


== Start:     2022-09-17 22:05:29.514
== Now:       2022-09-19 08:47:42.655 (running for 1.4 days)
== Progress:  1420 / 10137 (14.01%), errors: 670 (47.18%)
== Remaining: 8.9 days (@ 0.01 pages/second)
== Sys. load: 64.3% CPU / 22.7% memory
== Workers:   1
   #0 WORK https://www.almaany.com/ar/dict/ar-ar/%D8%AA%D9%8E%D9%86%D9%8E%D8%B
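
The figures in the status output above can be cross-checked with a little arithmetic. This sketch assumes the remaining-time estimate is a simple linear extrapolation and that `timeLimit` is expressed in seconds; both are assumptions, not something confirmed in this thread. The timestamps are parsed as UTC only so that their difference is well defined.

```javascript
// Values copied from the status output and the crawl configuration.
const start = Date.parse("2022-09-17T22:05:29.514Z");
const now = Date.parse("2022-09-19T08:47:42.655Z");
const done = 1420;
const total = 10137;
const errors = 670;
const timeLimit = 7200; // assumed to be seconds

const elapsedSec = (now - start) / 1000;            // 124933.141
const elapsedDays = (elapsedSec / 86400).toFixed(1); // "1.4"
const progressPct = ((done / total) * 100).toFixed(2); // "14.01"
const errorPct = ((errors / done) * 100).toFixed(2);   // "47.18"

// Linear extrapolation: pages per second so far, applied to the pages left.
const rate = done / elapsedSec;
const remainingDays = ((total - done) / rate / 86400).toFixed(1); // "8.9"

// How far past the configured limit the run already is (under the seconds assumption).
const overLimitFactor = elapsedSec / timeLimit;

console.log({ elapsedDays, progressPct, errorPct, remainingDays });
console.log(`rate: ${rate.toFixed(2)} pages/second`);
console.log(`elapsed / timeLimit: ${overLimitFactor.toFixed(1)}x`);
```

Under those assumptions the run is far beyond the configured `timeLimit`, which is consistent with the worker being stuck rather than merely slow.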