webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Crawler is not failing when seed page returns an HTTP error code #719

Closed benoit74 closed 2 weeks ago

benoit74 commented 2 weeks ago

It looks like under some conditions, even if the seed page returns a 4xx or 5xx HTTP code, the crawler still exits with a normal exit code.

Repro example (the 404 appears to be linked to some sort of WAF protection, so the repro may only be possible from a "datacenter" public IP and not from a residential or office public IP):

docker run -v $PWD/output:/output --name crawlme --rm  webrecorder/browsertrix-crawler:1.3.5 crawl --failOnFailedSeed --behaviors "autoplay,autofetch,autoscroll" --url https://pt.quora.com/https://pt.quora.com/Quero-come%C3%A7ar-carreira-em-Big-Data-AI-Machine-Learning-Por-onde-devo-come%C3%A7ar --mobileDevice "Pixel 2" --cwd /output --combineWARC

Resulting WARC:

crawl-failed.warc.gz

Record for seed page:

### REC Headers ###
WARC/1.1
WARC-Page-ID: 5869a31b-f48d-4a41-886c-8f8d52a372f5
WARC-Resource-Type: document
WARC-JSON-Metadata: {"ipType":"Public","cert":{"issuer":"WR1","ctc":"0"}}
WARC-Target-URI: https://pt.quora.com/https://pt.quora.com/Quero-come%C3%A7ar-carreira-em-Big-Data-AI-Machine-Learning-Por-onde-devo-come%C3%A7ar
WARC-Date: 2024-11-11T08:54:27.021Z
WARC-Type: response
WARC-Record-ID: <urn:uuid:d49a351b-d639-44d6-b2cd-8ed1d9c4c167>
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha256:703062b66f644176dc0fb658753261739ac33be3408c771ffe269c8758294b74
WARC-Block-Digest: sha256:842bb05ca28920a42041a3548215a2b427ce1d938414e47a3690329a4cf5f4e8
Content-Length: 79985

### HTTP Headers ###
HTTP/1.1 404 Not Found
Content-Security-Policy: default-src * data: blob:;style-src * 'unsafe-inline';script-src https://*.quora.com https://*.poe.com https://*.facebook.net https://*.facebook.com https://*.googleapis.com https://*.twitter.com https://*.quoracdn.net https://*.google.com https://*.google-analytics.com https://*.gstatic.com https://*.youtube.com https://*.ytimg.com https://*.jwpcdn.com https://*.stripe.com https://*.intercom.io https://*.intercomcdn.com https://*.syndication.twimg.com https://cdnjs.cloudflare.com https://d3div1mtym39ic.cloudfront.net https://*.jwplatform.com https://*.googlesyndication.com https://*.googletagmanager.com https://*.googleadservices.com https://*.doubleclick.net https://*.googletagservices.com https://*.ampproject.org https://*.amazon-adsystem.com https://*.rubiconproject.com https://*.lijit.com https://*.openx.net https://*.criteo.com https://*.3lift.com https://*.aaxads.com https://btloader.com https://*.btloader.com https://*.ads-twitter.com https://*.awin1.com https://*.dwin1.com https://*.zenaps.com https://*.the.sciencebehindecommerce.com https://*.marketo.net https://*.licdn.com https://*.linkedin.com https://*.qualtrics.com https://*.siteintercept.qualtrics.com https://sc-static.net https://static.bytedance.com https://*.iteratehq.com https://cdn.embedly.com https://qinternal.quora.net https://*.sprig.com https://*.userleap.com https://*.doubleverify.com https://*.adsafeprotected.com https://*.flashtalking.com https://*.samplicio.us https://*.activemetering.com https://*.imrworldwide.com https://*.moatads.com https://*.sng.link https://*.apple.com https://cdn.cookielaw.org https://*.onetrust.com https://*.paypal.com https://*.giphy.com https://*.outbrain.com https://*.outbrainimg.com 'unsafe-inline' 'unsafe-eval' 127.0.0.1:*;connect-src 'self' https://*.quora.com https://*.poe.com https://quora.okta.com wss://*.quora.com https://*.quoracdn.net https://*.stripe.com https://*.intercom.io wss://*.intercom.io https://*.jwplatform.com 
https://*.jwpsrv.com https://syndication.twitter.com https://*.syndication.twimg.com https://*.googleapis.com https://*.googlesyndication.com https://*.qualtrics.com https://*.facebook.com https://*.fbcdn.net blob: https://*.mktoresp.com https://*.doubleclick.net https://accounts.google.com https://*.amazon-adsystem.com https://*.3lift.com https://*.aaxads.com https://btloader.com https://*.btloader.com https://*.rubiconproject.com https://*.casalemedia.com https://*.adnxs.com https://*.pubmatic.com https://*.openx.net https://*.criteo.com https://*.sharethrough.com https://*.snigelweb.com https://*.trustedstack.com https://*.iteratehq.com https://iteratehq.com https://*.sprig.com https://*.userleap.com https://app.adjust.com https://app.appsflyer.com https://*.onelink.me https://branchster.app.link https://control.kochava.com https://c.singular.net https://*.sng.link https://*.apple.com https://*.doubleverify.com https://*.adsafeprotected.com https://*.flashtalking.com https://*.samplicio.us https://*.activemetering.com https://*.imrworldwide.com https://*.moatads.com https://cdn.cookielaw.org https://*.onetrust.com https://*.paypal.com https://*.linkedin.com https://*.giphy.com https://*.outbrain.com https://*.outbrainimg.com https://d3div1mtym39ic.cloudfront.net ;img-src * data: blob: android-webview-video-poster:;report-uri /security_reports/content_security_policy_violation_3RD_PARTY_POST
alt-svc: h3=":443"; ma=86400
cache-control: private, no-store, max-age=0, no-cache, must-revalidate, post-check=0, pre-check=0
cf-cache-status: DYNAMIC
cf-ray: 8e0d0ec309598d5d-HEL
content-type: text/html; charset=utf-8
cross-origin-opener-policy: same-origin-allow-popups
date: Mon, 11 Nov 2024 08:54:27 GMT
expires: Fri, 01 Jan 1990 00:00:00 GMT
link: <https://qsbr.cf2.quoracdn.net/-4-ans_frontend-relay-27-75ba4e7c2ddc9740.webpack>; rel="preload"; as="script", <https://qsbr.cf2.quoracdn.net/-4-ans_frontend-relay-vendor-27-ea2465b559af7eae.webpack>; rel="preload"; as="script", <https://qsbr.cf2.quoracdn.net/-4-ans_frontend-relay-common-27-ede9b6573a4e2714.webpack>; rel="preload"; as="script", <https://qsbr.cf2.quoracdn.net/-4-ans_frontend-relay-page-StaticPages-27-37a46a689571ebd8.webpack>; rel="preload"; as="script", <https://qsbr.cf2.quoracdn.net/-4-ans_frontend-relay-common-LoggedOut-27-6f0be492f43cd678.webpack>; rel="preload"; as="script", <https://qsbr.cf2.quoracdn.net/-4-ans_frontend-relay-common-Mweb-27-16c5e8ebdf46f5f9.webpack>; rel="preload"; as="script", <https://qsbr.cf2.quoracdn.net/-4-ans_frontend-relay-main.css-28-b9ddf59f031b600b.webpack>; rel="preload"; as="style"
pragma: no-cache
server: cloudflare
strict-transport-security: max-age=63072000; includeSubDomains; preload
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-q-stat: ,2912d07a15514931ec5c74c749640445,10.0.0.157,55428,135.181.181.97,,283850989924,1,1731315267.213,0.103,,.,0,0,0.000,0.104,-,0,0,17898,245,122,10,34729,,,,,,-,
x-ua-compatible: IE=Edge, chrome=1
x-xss-protection: 1; mode=block
x-orig-content-encoding: gzip

While this is a single repro, we have seen this happen quite often recently. It might be a regression introduced in 1.3.5 or a few versions earlier.

ikreymer commented 2 weeks ago

A non-200 status is not always a failure, so we added a separate flag, --failOnInvalidStatus, which causes such responses to be treated as failures. It's also in the docs: https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/argParser.ts#L544 You should pass both flags to get this behavior.
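
For anyone scripting this, here is a minimal sketch of passing both flags and inspecting the exit code from Python. The helper names `build_crawl_command` and `run_crawl` are hypothetical, and the image tag and mount path are simply copied from the repro above. Only `--failOnFailedSeed` and `--failOnInvalidStatus` come from this thread; the specific non-zero exit code the crawler uses is not stated here, so the sketch only returns whatever Docker reports.

```python
import subprocess


def build_crawl_command(
    url: str,
    output_dir: str,
    image: str = "webrecorder/browsertrix-crawler:1.3.5",  # tag taken from the repro
) -> list[str]:
    """Build a `docker run` argv that passes both failure flags (hypothetical helper)."""
    return [
        "docker", "run", "--rm",
        "-v", f"{output_dir}:/output",
        image,
        "crawl",
        "--url", url,
        "--cwd", "/output",
        # Fail the crawl if the seed page fails to load ...
        "--failOnFailedSeed",
        # ... and also treat invalid (4xx/5xx) HTTP statuses as failures.
        "--failOnInvalidStatus",
    ]


def run_crawl(url: str, output_dir: str) -> int:
    """Run the crawl and return the container's exit code without raising."""
    result = subprocess.run(build_crawl_command(url, output_dir), check=False)
    return result.returncode
```

Calling `run_crawl(seed_url, "/path/to/output")` and checking for a non-zero return value is one way a wrapper could detect the failed seed, assuming Docker is installed and the image is available locally or pullable.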