webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

WARC record HTTP status code is 0 instead of 200 #570

Closed: benoit74 closed this issue 6 months ago

benoit74 commented 6 months ago

On some occasions, we get WARC archives in which a record's HTTP status code is 0 instead of 200.

Looking inside the WARC, we see that the record's HTTP status line is indeed HTTP/1.1 0 OK

Sample command to quickly reproduce the problem (only 53 pages are fetched):

docker run -v $PWD/output:/output --name zimit2 --rm  webrecorder/browsertrix-crawler:1.1.1 crawl --failOnFailedSeed --behaviors "autoplay,autofetch,autoscroll" --url "https://journals.openedition.org/bibnum/889" --scopeType host --mobileDevice "Pixel 2" --cwd /output --combineWARC --depth 1

Details about problematic WARC record:

### REC Headers ###
WARC/1.1
WARC-Page-ID: 5910275c-800e-4687-b4f1-04bcc965406a
WARC-Resource-Type: document
WARC-JSON-Metadata: {"ipType":"Public","cert":{"issuer":"R3","ctc":"0"}}
WARC-Target-URI: https://journals.openedition.org/bibnum/pdf/889
WARC-Date: 2024-05-14T14:39:43.291Z
WARC-Type: response
WARC-Record-ID: <urn:uuid:c484334e-3c77-43dc-b49d-a70df8f72ddf>
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha256:3318b4cb23862b565cf745ba42415824d6bc477b0645525df764f5e580e28ca7
WARC-Block-Digest: sha256:c30ff1dd4f4c7063abadef817a65a1ce7ed5d3daf389bd44ffacf59e71c50d0d
Content-Length: 620680

### HTTP Headers ###
HTTP/1.1 0 OK
Accept-Ranges: bytes
Age: 0
Connection: keep-alive
Content-Length: 620314
Content-Type: application/pdf
Content-disposition: attachment; filename="bibnum-889.pdf"
Content-transfer-encoding: binary
Date: Tue, 14 May 2024 14:39:43 GMT
Via: 1.1 varnish (Varnish/6.5)
X-Backend: journals3
X-Served-By: frontwebjournals1
X-Varnish: 687682549

Running curl against the same URL, https://journals.openedition.org/bibnum/pdf/889, does not give a 0 status code.
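
To illustrate how such records could be flagged once the HTTP header block has been extracted from a WARC record, here is a minimal Python sketch. It is not part of browsertrix-crawler; the names `parse_status` and `is_valid_status` are hypothetical, and it assumes the header block is already available as text. It relies only on the fact that HTTP status codes fall in the range 100 to 599.

```python
import re

# Status line shape: HTTP-version SP status-code SP [reason-phrase]
STATUS_LINE = re.compile(r"^HTTP/(\d\.\d) (\d+)(?: (.*))?$")


def parse_status(http_block: str) -> int:
    """Return the numeric status code from the first line of an HTTP header block."""
    lines = http_block.strip().splitlines()
    if not lines:
        raise ValueError("empty HTTP header block")
    first_line = lines[0].strip()
    match = STATUS_LINE.match(first_line)
    if match is None:
        raise ValueError(f"not an HTTP status line: {first_line!r}")
    return int(match.group(2))


def is_valid_status(code: int) -> bool:
    """HTTP status codes are three-digit integers in the range 100-599."""
    return 100 <= code <= 599


# Header block shortened from the problematic record shown above.
sample = """HTTP/1.1 0 OK
Accept-Ranges: bytes
Content-Type: application/pdf
"""

code = parse_status(sample)
print(code, is_valid_status(code))  # prints "0 False"
```

A check like this only detects the symptom in an existing archive; it says nothing about why the crawler wrote 0 in the first place.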

At first I thought this was because the URL was retrieved by direct fetch, but that is not the case: it is treated as a real page. Maybe the problem is linked to the fact that it is a PDF, and hence handled differently by the browser (usually downloaded instead of displayed)?

Is this normal / expected behavior or a bug?

benoit74 commented 6 months ago

I confirm it is working now. Thank you for the quick fix and release!