webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

"warcio check" incorrectly reporting payload digest failures for non-HTTP WARCs #156

Open acidus99 opened 1 year ago

acidus99 commented 1 year ago

I'm using WARC files with non-HTTP traffic, specifically the Gemini protocol. I'm setting the WARC-Content-Type appropriately to reflect this.

warcio check has been helpful for finding problems with WARCs, such as incorrect block digests or records with invalid content lengths. However, warcio check is incorrectly reporting payload digest failures on these records:

$ warcio check gemini.warc
gemini.warc
  offset 702 WARC-Record-ID <urn:uuid:c0f9d8fd-d27e-43dc-80be-4fbd864c128d> response
    payload digest failed sha256:20670b53ae319b676698eb1aec228b492328574d78c1425b6b68a77876763403

If warcio doesn't understand the protocol defined by a record's WARC-Content-Type field (in this case application/gemini; msgtype=response), it won't understand what constitutes the payload for that record, and thus cannot check the WARC-Payload-Digest field. To my knowledge (and from a quick check of the source code), warcio has no concept of the Gemini protocol, so I'm unclear on how it would know what the payload is, or whether the digest is valid. Section 6.3.3 of the WARC spec even says the content of a response record isn't defined for non-HTTP URI schemes.

Perhaps I misunderstand what can be in a payload digest header, but reporting payload digest failures for unknown protocols seems like a bug? At the very least it's cluttering the output.

Attached is an example WARC with a request and response records for Gemini. gemini.warc.gz

Without getting too detailed, Gemini protocol responses contain a single response line with a status code and MIME type, a single CRLF, and then the body of the response. This body is the Gemini equivalent of HTTP's entity-body per section 6.3.2. In the WARC example, the body of the response begins at offset 1338 in the uncompressed version of the WARC file (with the '#' character). The body ends at the end of the record, before the final double CRLF that signifies the end of the record. The sha256 for this body is 20670b53ae319b676698eb1aec228b492328574d78c1425b6b68a77876763403, which is used in the payload digest field so I can do deduping and generate indexes.
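As a rough illustration of the payload definition described above, here is a minimal Python sketch that splits a Gemini response block at the first CRLF and hashes everything after it. The function name gemini_payload_digest and the sample bytes are made up for illustration; this is not warcio code, and it assumes the record block holds exactly one response line followed by the body.

```python
import hashlib


def gemini_payload_digest(block: bytes) -> str:
    """Return 'sha256:<hex>' for the body of a Gemini response block.

    Hypothetical helper: assumes the block is a single response line,
    one CRLF, and then the body (the record's trailing CRLFs excluded).
    """
    # Everything before the first CRLF is the response line (status + MIME type).
    _response_line, sep, body = block.partition(b"\r\n")
    if not sep:
        raise ValueError("no CRLF found; not a complete Gemini response")
    return "sha256:" + hashlib.sha256(body).hexdigest()


# Made-up sample response, not the contents of the attached WARC.
sample = b"20 text/gemini\r\n# Hello\n"
print(gemini_payload_digest(sample))
```

A tool that understood Gemini could compare this value against the record's payload digest field; a tool that does not know where the response line ends has no way to reproduce it.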

My suggestion would be that warcio check should not check the payload digest for records whose WARC-Content-Type is an unknown protocol. This would leave room for future PRs to warcio that add support for other protocols.
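The suggestion could be sketched roughly as follows. This is a standalone Python illustration, not warcio's actual check logic: the names KNOWN_PAYLOAD_TYPES and should_check_payload_digest are hypothetical, and treating application/http as the only understood protocol is an assumption made for the example.

```python
# Hypothetical allow-list: media types whose payload boundary the tool
# knows how to locate. Assumed here to be HTTP only.
KNOWN_PAYLOAD_TYPES = {"application/http"}


def should_check_payload_digest(content_type):
    """Decide whether a payload digest can be meaningfully verified.

    content_type is the record's content type string, for example
    'application/gemini; msgtype=response', or None if it is absent.
    """
    if not content_type:
        # Assumption: with no declared type, the payload is unknown, so skip.
        return False
    # Drop parameters such as '; msgtype=response' and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in KNOWN_PAYLOAD_TYPES


for ct in ("application/http; msgtype=response",
           "application/gemini; msgtype=response"):
    print(ct, "->", should_check_payload_digest(ct))
```

Block digests would still be checked for every record under this scheme, since they cover the whole block and do not depend on understanding the protocol.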

acidus99 commented 1 year ago

FWIW @JustAnotherArchivist touched on parts of this "what is a well-defined payload" here: #93

Perhaps there is another discussion to be had here on "iipc/warc-specifications". I would suggest that the payload definition shouldn't be codified into the WARC spec. Tools should be able to work with new protocols and payloads, and shouldn't make assumptions about what constitutes a payload for protocols/URIs they don't understand.

wumpus commented 1 year ago

I 100% agree that I didn't take this into account when writing the check code! Thanks for the analysis.