webrecorder / browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0
Crawl stopping with Node throwing ERR_INVALID_STATE.TypeError #606

Closed ldko closed 2 months ago

ldko commented 2 months ago

I have recently run some crawls that have errored out with the message: "TypeError [ERR_INVALID_STATE]: Invalid state: Controller is already closed". This is being triggered when I try to crawl certain seeds.

The error occurred when crawling with any of the following seeds. All of them respond with redirects; if the crawl is seeded with the redirect's target location instead, it succeeds:

https://doi.org/10.4236/oalib.1104140
https://doi.org/10.1353/saf.1984.0027
https://tinyurl.com/2p967efh
https://doi.org/10.24972/ijts.2016.35.2.1
https://doi.org/10.1353/hpu.2010.0460
https://doi.org/10.2979/phileduc.1.1.02
https://www.bacb.com/wp-content/uploads/2022/01/Ethics-Code-for-Behavior-Analysts-230119-a.pdf
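The workaround described above (seeding with the redirect target instead of the redirecting URL) could be scripted roughly as follows. This is only a sketch under stated assumptions: `resolveSeed`, `LocationLookup` and `headLookup` are hypothetical names and are not part of browsertrix-crawler, and the HEAD-based lookup assumes the server answers HEAD the same way it answers GET, which is not guaranteed.

```typescript
// Hypothetical helper type: returns the Location header for `url`,
// or null when the response is not a redirect.
type LocationLookup = (url: string) => Promise<string | null>;

// Follow redirects hop by hop and return the final URL, which can then be
// passed to the crawler's --url option instead of the original seed.
async function resolveSeed(
  url: string,
  lookup: LocationLookup,
  maxHops = 10,
): Promise<string> {
  let current = url;
  for (let hop = 0; hop <= maxHops; hop++) {
    const location = await lookup(current);
    if (location === null) {
      return current; // not a redirect: this is the URL to seed
    }
    // Location may be relative, so resolve it against the current URL.
    current = new URL(location, current).href;
  }
  throw new Error(`more than ${maxHops} redirects starting from ${url}`);
}

// One possible lookup for Node 18+, using a HEAD request with redirects
// left unfollowed. Assumption: the server treats HEAD like GET.
const headLookup: LocationLookup = async (url) => {
  const resp = await fetch(url, { method: "HEAD", redirect: "manual" });
  return resp.status >= 300 && resp.status < 400
    ? resp.headers.get("location")
    : null;
};
```

For example, `resolveSeed("https://tinyurl.com/2p967efh", headLookup)` would resolve to whatever URL the short link ultimately points at. Taking the lookup as a parameter keeps the hop-following logic testable without network access.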

Reproduce with:

docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url "https://www.bacb.com/wp-content/uploads/2022/01/Ethics-Code-for-Behavior-Analysts-230119-a.pdf" --scopeType page --generateWACZ --text --collection test

The logs and error output are:

{"timestamp":"2024-06-13T21:13:54.848Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 1.1.3 (with warcio.js 2.2.1)","details":{}}
{"timestamp":"2024-06-13T21:13:54.850Z","logLevel":"info","context":"general","message":"Seeds","details":[{"url":"https://www.bacb.com/wp-content/uploads/2022/01/Ethics-Code-for-Behavior-Analysts-230119-a.pdf","scopeType":"page","include":[],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":0}]}
{"timestamp":"2024-06-13T21:13:55.443Z","logLevel":"info","context":"worker","message":"Creating 1 workers","details":{}}
{"timestamp":"2024-06-13T21:13:55.444Z","logLevel":"info","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"timestamp":"2024-06-13T21:13:55.678Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.bacb.com/wp-content/uploads/2022/01/Ethics-Code-for-Behavior-Analysts-230119-a.pdf"}}
{"timestamp":"2024-06-13T21:13:55.679Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":1,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-06-13T21:13:55.445Z\",\"extraHops\":0,\"url\":\"https:\\/\\/www.bacb.com\\/wp-content\\/uploads\\/2022\\/01\\/Ethics-Code-for-Behavior-Analysts-230119-a.pdf\",\"added\":\"2024-06-13T21:13:54.911Z\",\"depth\":0}"]}}
node:internal/webstreams/readablestream:1069
      throw new ERR_INVALID_STATE.TypeError('Controller is already closed');
            ^

TypeError [ERR_INVALID_STATE]: Invalid state: Controller is already closed
    at ReadableStreamDefaultController.enqueue (node:internal/webstreams/readablestream:1069:13)
    at fetchParams.controller.resume (node:internal/deps/undici/undici:10897:45)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  code: 'ERR_INVALID_STATE'
}

Node.js v20.11.1

ikreymer commented 2 months ago

This is already fixed in main / 1.2.0-beta.0, and the fix can be backported to the upcoming 1.1.4 release. It looks like using the undici library directly fixes the issue (perhaps it is a newer implementation than the one bundled with Node?)
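For context, the error in the trace is the one a web stream raises when a chunk is enqueued after its controller has been closed. The sketch below reproduces that class of error in isolation on Node 18+. It is not the crawler's or undici's code, and `enqueueAfterClose` is a hypothetical name used only for illustration.

```typescript
// Minimal illustration: enqueueing on a ReadableStream controller after
// close() has been called throws a TypeError.
function enqueueAfterClose(): unknown {
  let caught: unknown = null;
  new ReadableStream<string>({
    // start() runs synchronously while the stream is being constructed.
    start(controller) {
      controller.enqueue("first chunk"); // fine: the stream is still open
      controller.close(); // no further chunks may be queued
      try {
        controller.enqueue("late chunk"); // throws: controller already closed
      } catch (err) {
        caught = err;
      }
    },
  });
  return caught;
}

const closedError = enqueueAfterClose();
console.log(closedError);
```

In the crawl above the late `enqueue` happens inside Node's bundled fetch implementation rather than in user code, so it surfaces as an uncaught exception that stops the process instead of an error the caller can catch.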

ikreymer commented 2 months ago

Fixed in 1.1.4