webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0
387 stars 58 forks source link

warcio accepts a bare LF everywhere a CRLF is required by the spec #157

Closed acidus99 closed 1 year ago

acidus99 commented 1 year ago

warcio check (and possibly other modes for warcio) silently accepts a bare LF anywhere a CRLF is required per the WARC spec. This includes:

See the attached WARC. (Github won't let me upload a plain WARC, so to keep things simple, this is just a gzipped file, not a per-record gzipped WARC) only-lf-endings.warc.gz

Run warcio check only-lf-endings.warc and you get no warnings or errors. However look at the WARC in a hex editor.

Here we can see the bare LF between the fields in a record's header, instead of the required CRLFs:

image

Here we can see a completely invalid LF CR LF separating the warcinfo record's header from its block:

image

Finally, we can see just two LFs between the warcinfo record and the next record:

image

This is a contrived example for simplicity. I found this bug while troubleshooting a WARC which had a record with an incorrect Content-Length field. The Content-Length value was 1 larger than the actual block length, which caused warcio to include the first CR of the CR LF CR LF separator in the block bytes. This did trigger a "block digest failed" warning. Warcio then read the remaining LF CR LF and parsed that as a valid record separator, and continued parsing the rest of the records as normal. This was super confusing because other WARC parsing code/libraries were throwing an error about a wrong content length, while warcio check said everything was fine 😂 Took me a while to track down the problem!

Expected Behavior: warcio should not accept a bare LF where the spec requires a CRLF. warcio should report any missing CRLFs as errors.

acidus99 commented 1 year ago

Per the discussion with @wumpus and @ikreymer in #158, warcio check is not intended to be a WARC validator / linter. Detecting malformed or illegal characters is more appropriate for the proposed warcio test feature in progress at #66.