Closed gitreich closed 5 months ago
Copying comment from #563:
I think this may be as expected. In the 1.x releases, if --failOnInvalidStatus is not set, the crawler doesn't consider 4xx/5xx responses failures, so they wouldn't trigger --failOnFailedLimit. And without --failOnFailedSeed, we allow seeds to fail or be non-existent as long as at least one of them resolves to something (even if it's a 4xx/5xx response, so long as --failOnInvalidStatus isn't set).
This is a change from 0.x but allows us to be more flexible and precise with what behavior is expected.
The change around 4xx/5xx pages is also useful for QA, as in the past we'd consider 4xx/5xx pages as failures always, but sometimes we may actually want to capture that content, e.g. a custom 404 page, or otherwise include the page in the page list in Browsertrix for a crawl with the status code it returned instead of just skipping it.
So for your use case I think --failOnFailedSeed and --failOnInvalidStatus together should work well, or --failOnInvalidStatus and --failOnFailedLimit 1 if you want the crawl to be marked as a failure when any page (not just a seed) returns a 4xx/5xx status code.
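The flag interactions described in this comment can be modelled roughly as below. This is a minimal sketch, not the crawler's actual code: `FailFlags`, `PageResult`, `isFailed` and `shouldFailCrawl` are hypothetical names, a status of 0 stands for a page that did not load at all, and the strict `>` comparison follows the "exceeds" wording of the option's help text, which may not match the real comparison.

```typescript
// Hypothetical model of the flag semantics described above.
interface FailFlags {
  failOnInvalidStatus: boolean;
  failOnFailedSeed: boolean;
  failOnFailedLimit?: number; // undefined means no limit is set
}

interface PageResult {
  isSeed: boolean;
  status: number; // HTTP status; 0 is used here for "did not load at all"
}

// A page counts as failed if it did not load, or if it returned 4xx/5xx
// while failOnInvalidStatus is set.
function isFailed(page: PageResult, flags: FailFlags): boolean {
  if (page.status === 0) return true;
  return flags.failOnInvalidStatus && page.status >= 400;
}

function shouldFailCrawl(pages: PageResult[], flags: FailFlags): boolean {
  const failed = pages.filter((p) => isFailed(p, flags));

  // With failOnFailedSeed, a single failed seed fails the crawl.
  if (flags.failOnFailedSeed && failed.some((p) => p.isSeed)) return true;

  // With failOnFailedLimit, any failed page (seed or not) counts toward the limit.
  if (flags.failOnFailedLimit !== undefined && failed.length > flags.failOnFailedLimit) {
    return true;
  }

  // Otherwise seeds may fail as long as at least one of them resolves.
  const seeds = pages.filter((p) => p.isSeed);
  return seeds.length > 0 && seeds.every((p) => isFailed(p, flags));
}
```

In this model a seed returning 404 only fails the crawl once --failOnInvalidStatus turns that response into a failure and either --failOnFailedSeed or a low enough --failOnFailedLimit is set.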
Hm, it does look like there might be an issue with the latter (--failOnInvalidStatus + --failOnFailedLimit 1) without --failOnFailedSeed.
Looks like we weren't awaiting the result of crawlState.numFailed()! Fix and test coming shortly.
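To illustrate why a missing await can silently disable such a check, here is a small sketch. `FakeCrawlState` and both helper functions are hypothetical stand-ins, not the crawler's code; the only assumption taken from the comment is that `numFailed()` returns its result asynchronously.

```typescript
// Hypothetical stand-in for a state store whose counter is read asynchronously.
class FakeCrawlState {
  private failed: number;

  constructor(failed: number) {
    this.failed = failed;
  }

  async numFailed(): Promise<number> {
    return this.failed;
  }
}

// Bug pattern: the Promise itself is compared, not the number it resolves to.
// In a numeric comparison a Promise coerces to NaN, so the check is always false.
function limitExceededWithoutAwait(state: FakeCrawlState, limit: number): boolean {
  // The cast mimics loosely typed code where the compiler does not flag the comparison.
  const pending = state.numFailed() as unknown as number;
  return pending > limit;
}

// Fixed pattern: await the count before comparing it to the limit.
async function limitExceeded(state: FakeCrawlState, limit: number): Promise<boolean> {
  const failed = await state.numFailed();
  return failed > limit;
}
```

With six failed pages and a limit of 1, the first function still reports that the limit was not exceeded, while the second resolves to true.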
Found by retesting #563. Expectation: passing --failOnFailedLimit 1 should exit the crawl (the exit code is unclear) once a seed has failed (e.g. with 2 non-existent seeds, the crawl is aborted after the 1st seed).
Test case: mixing the seeds to produce different orders of invalid states and checking whether the crawl ends after a failed seed.
Result: in no case did the crawl exit; it always continued to the last seed in the seed file, although the definition of --failOnFailedLimit is: If set, save state and exit if number of failed pages exceeds this value
Docker start was: docker run -d --name ONB_Btrix_invalid_urls_20240516105232 -e NODE_OPTIONS="--max-old-space-size=32768" -p 9786:9786 -p 18902:18902 -v /home/antares/Schreibtisch/Docker/browsertrix/crawls/:/crawls/ webrecorder/browsertrix-crawler:1.1.2 crawl --screencastPort 9786 --seedFile /crawls/config/invalid_urls_seeds.txt --scopeType prefix --depth 3 --extraHops 0 --workers 1 --healthCheckPort 18902 --headless --failOnInvalidStatus --failOnFailedLimit 1 --delay 1 --waitUntil networkidle0 --postLoadDelay 1 --saveState always --limit 7 --logging stats,info --warcInfo ONB_CRAWL_invalid_urls_Depth_3_20240516105232 --userAgentSuffix +ONB_Bot_Btrix_1.1.2, webarchiv@onb.ac.at --crawlId id_ONB_CRAWL_invalid_urls_Depth_3_20240516105232 --collection invalid_urls_20240516105232
Seed List (ServerResponse):
https://www.volksstimme.at/nichtda.php (500)
https://www.volksstimme.at/mehrnix (500)
https://thomaswaitz.eu/nixda (404)
https://thomaswaitz.eu/aaasssooo (404)
https://machwasnichtdaist.de (0)
https://nixda.tt (0)
Log File:
{"timestamp":"2024-05-16T08:52:34.514Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 1.1.2 (with warcio.js 2.2.1)","details":{}}
{"timestamp":"2024-05-16T08:52:34.516Z","logLevel":"info","context":"general","message":"Seeds","details":[{"url":"https://www.volksstimme.at/nichtda.php","scopeType":"prefix","include":["/^https?:\\/\\/www\\.volksstimme\\.at\\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://www.volksstimme.at/mehrnix","scopeType":"prefix","include":["/^https?:\\/\\/www\\.volksstimme\\.at\\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://thomaswaitz.eu/nixda","scopeType":"prefix","include":["/^https?:\\/\\/thomaswaitz\\.eu\\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://thomaswaitz.eu/aaasssooo","scopeType":"prefix","include":["/^https?:\\/\\/thomaswaitz\\.eu\\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://machwasnichtdaist.de/","scopeType":"prefix","include":["/^https?:\\/\\/machwasnichtdaist\\.de\\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3},{"url":"https://nixda.tt/","scopeType":"prefix","include":["/^https?:\\/\\/nixda\\.tt\\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":3}]}
{"timestamp":"2024-05-16T08:52:34.569Z","logLevel":"info","context":"healthcheck","message":"Healthcheck server started on 18902","details":{}}
{"timestamp":"2024-05-16T08:52:35.316Z","logLevel":"info","context":"worker","message":"Creating 1 workers","details":{}}
{"timestamp":"2024-05-16T08:52:35.318Z","logLevel":"info","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"timestamp":"2024-05-16T08:52:35.523Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.volksstimme.at/nichtda.php"}}
{"timestamp":"2024-05-16T08:52:35.526Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":6,"pending":1,"failed":0,"limit":{"max":7,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-05-16T08:52:35.321Z\",\"extraHops\":0,\"url\":\"https://www.volksstimme.at/nichtda.php\",\"added\":\"2024-05-16T08:52:34.645Z\",\"depth\":0}"]}}
{"timestamp":"2024-05-16T08:52:35.856Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.volksstimme.at/nichtda.php","workerid":0}}
{"timestamp":"2024-05-16T08:52:36.076Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://www.volksstimme.at/nichtda.php","errorText":"net::ERR_HTTP_RESPONSE_CODE_FAILURE","page":"https://www.volksstimme.at/nichtda.php","workerid":0}}
{"timestamp":"2024-05-16T08:52:37.183Z","logLevel":"error","context":"general","message":"Page Crashed on Load","details":{"status":500,"page":"https://www.volksstimme.at/nichtda.php","workerid":0}}
{"timestamp":"2024-05-16T08:52:37.219Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed","details":{"loadState":1,"page":"https://www.volksstimme.at/nichtda.php","workerid":0}}
{"timestamp":"2024-05-16T08:52:37.259Z","logLevel":"info","context":"general","message":"Saving crawl state to: /crawls/collections/invalid_urls_20240516105232/crawls/crawl-20240516085237-id_ONB_CRAWL_invalid_urls_Depth_3_20240516105232.yaml","details":{}}
{"timestamp":"2024-05-16T08:52:37.425Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.volksstimme.at/mehrnix"}}
{"timestamp":"2024-05-16T08:52:37.426Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":6,"pending":1,"failed":1,"limit":{"max":7,"hit":false},"pendingPages":["{\"seedId\":1,\"started\":\"2024-05-16T08:52:37.271Z\",\"extraHops\":0,\"url\":\"https://www.volksstimme.at/mehrnix\",\"added\":\"2024-05-16T08:52:34.646Z\",\"depth\":0}"]}}
{"timestamp":"2024-05-16T08:52:37.647Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.volksstimme.at/mehrnix","workerid":0}}
{"timestamp":"2024-05-16T08:52:37.745Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://www.volksstimme.at/mehrnix","errorText":"net::ERR_HTTP_RESPONSE_CODE_FAILURE","page":"https://www.volksstimme.at/mehrnix","workerid":0}}
{"timestamp":"2024-05-16T08:52:38.853Z","logLevel":"error","context":"general","message":"Page Crashed on Load","details":{"status":500,"page":"https://www.volksstimme.at/mehrnix","workerid":0}}
{"timestamp":"2024-05-16T08:52:38.863Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed","details":{"loadState":1,"page":"https://www.volksstimme.at/mehrnix","workerid":0}}
{"timestamp":"2024-05-16T08:52:39.038Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://thomaswaitz.eu/nixda"}}
{"timestamp":"2024-05-16T08:52:39.039Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":2,"total":6,"pending":1,"failed":2,"limit":{"max":7,"hit":false},"pendingPages":["{\"seedId\":2,\"started\":\"2024-05-16T08:52:38.890Z\",\"extraHops\":0,\"url\":\"https://thomaswaitz.eu/nixda\",\"added\":\"2024-05-16T08:52:34.647Z\",\"depth\":0}"]}}
{"timestamp":"2024-05-16T08:52:39.414Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://thomaswaitz.eu/nixda","workerid":0}}
{"timestamp":"2024-05-16T08:52:41.356Z","logLevel":"error","context":"general","message":"Page Invalid Status","details":{"status":404,"page":"https://thomaswaitz.eu/nixda","workerid":0}}
{"timestamp":"2024-05-16T08:52:41.365Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed","details":{"loadState":1,"page":"https://thomaswaitz.eu/nixda","workerid":0}}
{"timestamp":"2024-05-16T08:52:41.537Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://machwasnichtdaist.de/"}}
{"timestamp":"2024-05-16T08:52:41.538Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":3,"total":6,"pending":1,"failed":3,"limit":{"max":7,"hit":false},"pendingPages":["{\"seedId\":4,\"started\":\"2024-05-16T08:52:41.379Z\",\"extraHops\":0,\"url\":\"https://machwasnichtdaist.de/\",\"added\":\"2024-05-16T08:52:34.648Z\",\"depth\":0}"]}}
{"timestamp":"2024-05-16T08:52:41.602Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://machwasnichtdaist.de/","workerid":0}}
{"timestamp":"2024-05-16T08:52:41.635Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://machwasnichtdaist.de/","errorText":"net::ERR_NAME_NOT_RESOLVED","page":"https://machwasnichtdaist.de/","workerid":0}}
{"timestamp":"2024-05-16T08:52:41.639Z","logLevel":"error","context":"general","message":"Page Load Timeout, skipping page","details":{"msg":"net::ERR_NAME_NOT_RESOLVED at https://machwasnichtdaist.de/","page":"https://machwasnichtdaist.de/","workerid":0}}
{"timestamp":"2024-05-16T08:52:41.650Z","logLevel":"warn","context":"pageStatus","message":"Page date missing, setting to now","details":{"url":"https://machwasnichtdaist.de/","ts":"2024-05-16T08:52:41.650Z"}}
{"timestamp":"2024-05-16T08:52:41.651Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0,"page":"https://machwasnichtdaist.de/","workerid":0}}
{"timestamp":"2024-05-16T08:52:41.811Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://thomaswaitz.eu/aaasssooo"}}
{"timestamp":"2024-05-16T08:52:41.812Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":4,"total":6,"pending":1,"failed":4,"limit":{"max":7,"hit":false},"pendingPages":["{\"seedId\":3,\"started\":\"2024-05-16T08:52:41.662Z\",\"extraHops\":0,\"url\":\"https://thomaswaitz.eu/aaasssooo\",\"added\":\"2024-05-16T08:52:34.648Z\",\"depth\":0}"]}}
{"timestamp":"2024-05-16T08:52:42.184Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://thomaswaitz.eu/aaasssooo","workerid":0}}
{"timestamp":"2024-05-16T08:52:43.944Z","logLevel":"error","context":"general","message":"Page Invalid Status","details":{"status":404,"page":"https://thomaswaitz.eu/aaasssooo","workerid":0}}
{"timestamp":"2024-05-16T08:52:43.953Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed","details":{"loadState":1,"page":"https://thomaswaitz.eu/aaasssooo","workerid":0}}
{"timestamp":"2024-05-16T08:52:44.121Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://nixda.tt/"}}
{"timestamp":"2024-05-16T08:52:44.122Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":5,"total":6,"pending":1,"failed":5,"limit":{"max":7,"hit":false},"pendingPages":["{\"seedId\":5,\"started\":\"2024-05-16T08:52:43.967Z\",\"extraHops\":0,\"url\":\"https://nixda.tt/\",\"added\":\"2024-05-16T08:52:34.649Z\",\"depth\":0}"]}}
{"timestamp":"2024-05-16T08:52:44.158Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://nixda.tt/","workerid":0}}
{"timestamp":"2024-05-16T08:52:44.193Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://nixda.tt/","errorText":"net::ERR_NAME_NOT_RESOLVED","page":"https://nixda.tt/","workerid":0}}
{"timestamp":"2024-05-16T08:52:44.197Z","logLevel":"error","context":"general","message":"Page Load Timeout, skipping page","details":{"msg":"net::ERR_NAME_NOT_RESOLVED at https://nixda.tt/","page":"https://nixda.tt/","workerid":0}}
{"timestamp":"2024-05-16T08:52:44.210Z","logLevel":"warn","context":"pageStatus","message":"Page date missing, setting to now","details":{"url":"https://nixda.tt/","ts":"2024-05-16T08:52:44.210Z"}}
{"timestamp":"2024-05-16T08:52:44.210Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0,"page":"https://nixda.tt/","workerid":0}}
{"timestamp":"2024-05-16T08:52:44.220Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2024-05-16T08:52:44.333Z","logLevel":"info","context":"general","message":"Saving crawl state to: /crawls/collections/invalid_urls_20240516105232/crawls/crawl-20240516085244-id_ONB_CRAWL_invalid_urls_Depth_3_20240516105232.yaml","details":{}}
{"timestamp":"2024-05-16T08:52:44.336Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":6,"total":6,"pending":0,"failed":6,"limit":{"max":7,"hit":false},"pendingPages":[]}}
{"timestamp":"2024-05-16T08:52:44.337Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-05-16T08:52:44.338Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: done","details":{}}
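One way to pin down when the crawl was expected to stop is to scan the "Crawl statistics" entries for the first failed count above the limit. The sketch below is only an illustration: `LogEntry` and `firstOverLimit` are hypothetical names, it assumes the log is available as JSON Lines (one entry per line), and the sample entries are abbreviated copies of entries from the log above.

```typescript
// Minimal shape of the log entries this sketch reads; other fields are ignored.
interface LogEntry {
  timestamp: string;
  context: string;
  message: string;
  details: { failed?: number; crawled?: number };
}

// Returns the first crawlStatus entry whose failed count exceeds the limit, if any.
function firstOverLimit(jsonLines: string, limit: number): LogEntry | undefined {
  return jsonLines
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as LogEntry)
    .find((e) => e.context === "crawlStatus" && (e.details.failed ?? 0) > limit);
}

// Abbreviated entries from the log above (the limit and pendingPages fields are omitted).
const sample = [
  '{"timestamp":"2024-05-16T08:52:35.526Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":6,"pending":1,"failed":0}}',
  '{"timestamp":"2024-05-16T08:52:37.426Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":6,"pending":1,"failed":1}}',
  '{"timestamp":"2024-05-16T08:52:39.039Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":2,"total":6,"pending":1,"failed":2}}',
].join("\n");

const hit = firstOverLimit(sample, 1);
console.log(hit?.timestamp, hit?.details.failed);
```

With --failOnFailedLimit 1 and the "exceeds" wording read strictly, the first entry over the limit in this sample is the one reporting 2 failed pages, so under that reading the crawl would have been expected to stop before the third seed.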