Closed acidus99 closed 1 year ago
Per the discussion with @wumpus and @ikreymer in #158, warcio check
is not intended to be a WARC validator / linter. Detecting malformed or illegal characters is more appropriate for the proposed warcio test
feature in progress at #66.
warcio check
(and possibly other modes forwarcio
) silently accepts a bare LF anywhere a CRLF is required per the WARC spec. This includes:See the attached WARC. (Github won't let me upload a plain WARC, so to keep things simple, this is just a gzipped file, not a per-record gzipped WARC) only-lf-endings.warc.gz
Run
warcio check only-lf-endings.warc
and you get no warnings or errors. However look at the WARC in a hex editor.Here we can see the bare LF between the fields in a record's header, instead of the required CRLFs:
Here we can see a completely invalid LF CR LF separating the
warcinfo
record's header from its block:Finally, we can see just two LFs between the
warcinfo
record and the next record:This is a contrived example for simplicity. I found this bug while troubleshooting a WARC which had a record with an incorrect Content-Length field. The Content-Length value was 1 larger than the actual block length, which caused warcio to include the first CR of the
CR LF CR LF
separator in the block bytes. This did trigger a "block digest failed" warning. Warcio then read the remainingLF CR LF
and parsed that as a valid record separator, and continued parsing the rest of the records as normal. This was super confusing because other WARC parsing code/libraries were throwing an error about a wrong content length, whilewarcio check
said everything was fine 😂 Took me a while to track down the problem!Expected Behavior: warcio should not accept a bare LF where the spec requires a CRLF. warcio should report any missing CRLFs as errors.