webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Stuck with all workers idle at the end of a crawl #91

Closed: phiresky closed this issue 2 years ago

phiresky commented 3 years ago

Screenshot: [crawler output showing all workers idle while one page remains queued]

Seems that it thinks it needs to crawl another page, but there aren't actually any more pages to crawl. The output has been unchanged for multiple hours.

ikreymer commented 3 years ago

Thanks for reporting, I had not seen this before :( Is this off the 0.4.4 release or the latest 0.5.0 beta on main? If main, you should be able to interrupt with Ctrl+C and it should then stop. Of course this is hard to repro, but I will see if it can be done. The data should still be there at least, as it's in a volume, and you can convert it to a WACZ manually.

phiresky commented 3 years ago

This is with 0.5.0-beta (39ddecd35), with requirements.txt modified to add requests[socks], running through a SOCKS proxy.

Trying to cancel it results in `SIGNAL: gracefully finishing current pages...`, which also seems to hang.

This might be purely a puppeteer-cluster problem, but that library seems to be unmaintained. I've attached a Chrome debugger to the node process inside the container with `docker exec -it containerhash bash` and `kill -USR1 $(pidof node)`.

(This crawl is not too important; I'm keeping it around just in case you can get some useful info out of it.)

```
> await cluster.jobQueue.size()
0
> cluster.jobQueue.pending
{'{"url":"https://.../","seedId":0,"depth":2,"started":"2021-10-07T14:16:32.043Z"}'}
> cluster.jobQueue.drainMax
10878
> cluster.workersAvail.length
4
```

So there's one job "pending" but no work being done. I'd guess that neither of the callbacks in queue.shift() is being called [screenshot: queue.shift() source], which would cause the item never to be removed from jobQueue.pending.
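
To illustrate what I mean, the queue contract presumably looks something like this (a sketch with guessed names, based on the async-job-queue fork rather than its actual source):

```js
// Rough sketch of the queue contract as I understand it; class and method
// names are guesses, not copied from the real fork.
class AsyncJobQueue {
  constructor() {
    this.queue = [];          // serialized jobs waiting to be picked up
    this.pending = new Set(); // serialized jobs handed to a worker, not yet finished
  }

  async shift() {
    const data = this.queue.shift();
    this.pending.add(data);
    const finish = () => this.pending.delete(data);
    // Whoever consumes the job must call one of these two callbacks;
    // otherwise the entry stays in `pending` and the crawler keeps
    // waiting for work that no worker will ever complete.
    return { job: JSON.parse(data), callback: finish, errorCallback: finish };
  }

  async size() {
    return this.queue.length;
  }
}
```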

I guess that should be called from cluster.doWork(), which takes elements from the queue, runs them, and then calls the callback. There's some code in there that takes elements from the queue and then doesn't call the callbacks:

[screenshot: the doWork() code path that dequeues a job without calling its callbacks]

So maybe that's the issue, though then it should happen whenever a URL is queued multiple times.
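
In other words, the failure mode I suspect is roughly this (a simplified reconstruction of doWork() from the screenshots, not the upstream code verbatim):

```js
// Simplified reconstruction of the suspected path in Cluster.doWork();
// this is my reading of the screenshot above, not the actual source.
class ClusterSketch {
  async doWork() {
    const { job, callback, errorCallback } = await this.jobQueue.shift(); // job is now in `pending`

    if (this.options.skipDuplicateUrls && this.duplicateCheckUrls.has(job.url)) {
      // Early exit: the job was dequeued, but neither callback is invoked,
      // so the entry is never deleted from jobQueue.pending and the crawl
      // hangs with all workers idle.
      return;
    }
    this.duplicateCheckUrls.add(job.url);

    try {
      await this.task(job); // crawl the page
      callback();           // success: clears the job from pending
    } catch (err) {
      errorCallback(err);   // failure: also clears it from pending
    }
  }
}
```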

I'll leave it running for now, let me know if you want the result of some other expression.

ikreymer commented 3 years ago

Thanks for helping debug this! For 0.5.0, I am actually using a custom fork of puppeteer-cluster to support a custom job queue with async operations: https://github.com/thomasdondorf/puppeteer-cluster/compare/master...ikreymer:async-job-queue

The idea was to support running with a redis-backed state, so that multiple browsertrix-crawler instances can be added to pick up the work. The internal pending state is shared as well, though perhaps it shouldn't be; I will take another look.

But good catch on the puppeteer-cluster skipDuplicateUrls option: it looks like it is currently set to true (https://github.com/webrecorder/browsertrix-crawler/blob/main/crawler.js#L357) but should not be, as it is now redundant with the custom job state/queue.
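
Concretely, the fix on the crawler side should just be flipping that flag when launching the cluster, roughly like this (the surrounding options are illustrative, not crawler.js verbatim; skipDuplicateUrls itself is a real puppeteer-cluster option):

```js
const { Cluster } = require("puppeteer-cluster");

async function initCluster() {
  return Cluster.launch({
    concurrency: Cluster.CONCURRENCY_PAGE, // illustrative; the crawler uses its own settings
    maxConcurrency: 4,
    // Deduplication is already handled by the crawler's own (optionally
    // redis-backed) job state, so puppeteer-cluster's internal URL check
    // is redundant here and can strand jobs in `pending`.
    skipDuplicateUrls: false,
  });
}
```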

If you do still have it running, could you check whether this.duplicateCheckUrls.has(url) is true for that last pending URL? If so, that is likely the issue, and turning off that check should hopefully solve this.
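
If anyone hits this again with the inspector attached, something along these lines should confirm it (duplicateCheckUrls is a puppeteer-cluster internal, so this is just a debugger-console sketch, not a supported API):

```js
// Run in the inspector console attached via `kill -USR1 $(pidof node)`.
// `pending` appears to hold serialized jobs, per the output above.
const [stuck] = cluster.jobQueue.pending;              // the one leftover entry
cluster.duplicateCheckUrls.has(JSON.parse(stuck).url); // true => the duplicate check ate it
```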

phiresky commented 3 years ago

Looks like I killed it by now :/ Sorry!

ikreymer commented 3 years ago

Switched skipDuplicateUrls in puppeteer-cluster to false, so it should not go down that path anymore. If you get a chance, could you rerun the crawl to see if it happens again?

ikreymer commented 2 years ago

Haven't been able to repro this for a while, hopefully the change in skipDuplicateUrls fixes this!