webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0
631 stars, 82 forks

Overwrite flag results in no such files error #213

Open DriesVanbilloen opened 1 year ago

DriesVanbilloen commented 1 year ago

When adding the --overwrite flag to the command, you get the following error:

crawl --url https://ipa2-f.kbc.be/particulieren/nl.html  --limit 1 --generateWACZ --text --headless true --collection AEM --overwrite
Set netIdleWait to 2 seconds
wb-manager init failed, collection likely already exists
Clearing /crawls/collections/AEM before starting
Storing state in memory
pages/pages.jsonl creation failed [Error: ENOENT: no such file or directory, mkdir '/crawls/collections/AEM/pages'] {
  errno: -2,
  code: 'ENOENT',
  syscall: 'mkdir',
  path: '/crawls/collections/AEM/pages'
}
pages/pages.jsonl append failed TypeError: Cannot read properties of null (reading 'writeFile')
    at Crawler.writePage (/app/crawler.js:977:26)
    at Crawler.crawlPage (/app/crawler.js:392:18)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async /app/node_modules/puppeteer-cluster/dist/util.js:63:24
    at async Object.timeoutExecute (/app/node_modules/puppeteer-cluster/dist/util.js:54:20)
    at async Worker.handle (/app/node_modules/puppeteer-cluster/dist/Worker.js:48:22)
    at async Cluster.doWork (/app/node_modules/puppeteer-cluster/dist/Cluster.js:250:24)

I think it's because the collection directory is not created when the overwrite flag is added, but I'm not sure.
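
The ENOENT in the log is consistent with a non-recursive `mkdir` on a `pages` directory whose parent collection directory no longer exists. The following is a minimal sketch of that behavior, not the crawler's actual code: `tryMkdirPages` is a hypothetical helper, and a temporary directory stands in for `/crawls/collections`.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: try to create <collDir>/pages.
// Returns the error code on failure, or null on success.
function tryMkdirPages(collDir, recursive) {
  try {
    fs.mkdirSync(path.join(collDir, "pages"), { recursive });
    return null;
  } catch (err) {
    return err.code;
  }
}

// Scratch directory standing in for /crawls/collections (an assumption).
const root = fs.mkdtempSync(path.join(os.tmpdir(), "colls-"));

// The collection directory is never created, as after it has been cleared.
const missingColl = path.join(root, "AEM");

// Without `recursive`, the missing parent makes mkdir fail.
const plainResult = tryMkdirPages(missingColl, false);

// With `recursive`, the parent is created along with `pages`.
const recursiveResult = tryMkdirPages(missingColl, true);

console.log(plainResult, recursiveResult);
```

If the failed `mkdir` leaves the file handle unset, a later write through that handle would fail in the way the second error (`Cannot read properties of null`) suggests, though that link is an inference from the log rather than something confirmed in the thread.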

tw4l commented 1 year ago

Thanks @DriesVanbilloen! It looks like the issue is that, because of --overwrite, the collection is deleted after wb-manager init has run, so the collection is never re-created. I will submit a PR with a fix shortly.
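
The ordering problem described above can be sketched as follows. This is an illustration under stated assumptions, not the fix that was actually submitted: `initCollection` and `clearCollection` are hypothetical stand-ins for `wb-manager init` and the `--overwrite` cleanup, and a temporary directory replaces `/crawls/collections`.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical stand-in for `wb-manager init`: creates the collection directory.
function initCollection(collDir) {
  fs.mkdirSync(collDir, { recursive: true });
}

// Hypothetical stand-in for the --overwrite cleanup: removes the collection.
function clearCollection(collDir) {
  fs.rmSync(collDir, { recursive: true, force: true });
}

// Order described in the report: init first, then clear.
// The directory is gone by the time pages are written.
function setupBuggy(collDir) {
  initCollection(collDir);
  clearCollection(collDir);
}

// Reordered: clear first, then init, so the directory exists afterwards.
function setupFixed(collDir) {
  clearCollection(collDir);
  initCollection(collDir);
}

// Scratch directories standing in for /crawls/collections/AEM (an assumption).
const root = fs.mkdtempSync(path.join(os.tmpdir(), "crawls-"));
const buggyDir = path.join(root, "buggy", "AEM");
const fixedDir = path.join(root, "fixed", "AEM");

setupBuggy(buggyDir);
setupFixed(fixedDir);

console.log(fs.existsSync(buggyDir), fs.existsSync(fixedDir));
```

Clearing before initializing is one way to keep the collection present; re-running the init step after the clear would have the same effect.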