Closed benoit74 closed 2 days ago
A non-200 status is not always a failure, so we added a separate flag, --failOnInvalidStatus,
which makes such responses count as failures. It's also in the docs: https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/argParser.ts#L544 You should pass both flags to get this behavior.
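For anyone scripting this, here is a minimal sketch of how a run with both flags could be assembled. It rests on assumptions not stated in this comment: that the second flag is `--failOnFailedSeed`, that the crawler is run via the `webrecorder/browsertrix-crawler` Docker image with the `crawl` command and `--url`, and the helper name `build_crawl_command` and the example URL are purely illustrative.

```python
import shlex


def build_crawl_command(seed_url, fail_on_failed_seed=True, fail_on_invalid_status=True):
    """Build an argv list for a crawler run (hypothetical helper, not part of the crawler)."""
    cmd = [
        "docker", "run", "--rm",
        "webrecorder/browsertrix-crawler",  # assumed image name
        "crawl",
        "--url", seed_url,
    ]
    if fail_on_failed_seed:
        # Assumption: this is the "other" flag the comment refers to.
        cmd.append("--failOnFailedSeed")
    if fail_on_invalid_status:
        # Named in the comment: treat 4xx/5xx responses as failures.
        cmd.append("--failOnInvalidStatus")
    return cmd


cmd = build_crawl_command("https://example.com/")
# Print a copy-pasteable shell command line.
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd)` and the return code inspected, which is the check the original report is concerned with.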
It looks like under some conditions, even if the seed page returns a 4xx or 5xx HTTP code, the crawler still exits with a normal exit code.
Repro example (the 404 looks linked to some sort of WAF protection, so the repro may only be possible from a "datacenter" public IP and perhaps not from a residential / office public IP):
Resulting WARC:
crawl-failed.warc.gz
Record for seed page:
While this is a single repro, we have seen it happen quite often recently. It might be a regression introduced in 1.3.5 or a few versions earlier.