webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Crawl shows error and exits if option `--urlFile` is used without setting `--scope` #61

Closed: sebastian-nagel closed this issue 3 years ago

sebastian-nagel commented 3 years ago

Crawl fails if called with --urlFile but without --scope:

$> docker run -v$PWD/test-urls.txt:/test-urls.txt webrecorder/browsertrix-crawler:0.4.0-beta.1 crawl --urlFile /test-urls.txt
Exclusions Regexes:  []
Scope Regexes:  undefined
creating pages without full text
Queuing Error:  TypeError: Cannot read property 'length' of undefined
    at Crawler.shouldCrawl (/app/crawler.js:854:43)
    at Crawler.queueUrls (/app/crawler.js:758:33)
    at Crawler.extractLinks (/app/crawler.js:752:10)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async Crawler.loadPage (/app/crawler.js:735:7)
    at async Crawler.module.exports [as driver] (/app/defaultDriver.js:4:3)
    at async Crawler.crawlPage (/app/crawler.js:570:7)

Built from ae4ce97. See #55: I can confirm that 0.4.0-beta.0 (from hub.docker.com) is not affected.
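For reference, the TypeError points at shouldCrawl() reading .length of the parsed --scope regexes, which are undefined when the option is omitted. Below is a minimal sketch of a guard that would avoid the crash; the function and parameter names are assumptions for illustration, not the actual crawler.js code, and the real fix may instead derive a default scope from the seed URLs:

function shouldCrawl(url, scopeRegexes) {
  // Guard: --scope was not given, so the parsed regex list is undefined.
  const scopes = scopeRegexes || [];
  if (scopes.length === 0) {
    // No scope regexes configured; simply allow the URL here
    // (the real crawler may apply a seed-prefix default instead).
    return true;
  }
  // Otherwise require the URL to match at least one scope regex.
  return scopes.some((rx) => rx.test(url));
}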

sebastian-nagel commented 3 years ago

Actually, the crawl does not really fail; it just stops following links once the error occurs.

ikreymer commented 3 years ago

Thanks, this should be fixed in 0.4.0-beta.1 and higher. In the latest version, scope can also be specified per seed.
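For example, per-seed scoping can be expressed in a YAML config passed via --config. The exact keys below (seeds, url, scopeType, include) and the file name are a sketch based on the crawler's config format and may differ between versions:

seeds:
  - url: https://example.com/
    scopeType: prefix
  - url: https://example.org/
    include: https?://example\.org/docs/.*

docker run -v $PWD/crawl-config.yaml:/crawl-config.yaml webrecorder/browsertrix-crawler crawl --config /crawl-config.yaml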

ikreymer commented 3 years ago

Closing for now as it should be fixed. --scope has now been renamed to --scopeIncludeRx (also available as --include) for consistency.
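With the renamed option, the original invocation would look roughly like this on a sufficiently recent image (the include regex value is purely illustrative):

docker run -v $PWD/test-urls.txt:/test-urls.txt webrecorder/browsertrix-crawler crawl --urlFile /test-urls.txt --include "https?://example\.com/.*"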