Open dlazesz opened 3 years ago
There are multiple possible bugs here. One possibility is that the copy is writing the wrong block digest, perhaps because it changed the block but kept the same digest. If I can see the input file, it would be helpful.
The check_digests API bug is a separate one. I'm not sure how I made that mistake, but I'll open a separate bug for it (#124).
@wumpus
> If I can see the input file, it would be helpful.
The input file is attached to the OP and the bug should be reproducible with it: https://github.com/webrecorder/warcio/files/5721385/input.warc.gz
Thank you for investigating the issue!
Thanks, I didn't notice input.warc.gz was a link.
The difference between input.warc and output.warc is that they have the same digest, but the content length is one octet shorter for output.warc. And lo and behold, right at the top, I see:
    input: HTTP/1.1 200 ^M    # trailing whitespace, no OK
    out:   HTTP/1.1 200^M     # no trailing whitespace
Which is to say, there's trailing whitespace in the http headers in input and not in out.
How... interesting! I was thinking my digest-checking code was the guilty party, but instead it could be that the copy is dropping that http header trailing whitespace while repeating the same digest. Changing the output by dropping trailing whitespace is dodgy; repeating the digest is much dodgier.
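To make the mismatch concrete, here is a minimal sketch using only the Python standard library. It assumes the digest is a SHA-1 in base32 with a `sha1:` prefix, the form commonly seen in WARC digest headers. The helper `warc_style_digest` and the two shortened blocks are hypothetical stand-ins, not warcio code and not the actual records from input.warc.gz.

```python
import base64
import hashlib


def warc_style_digest(data: bytes) -> str:
    """Return 'sha1:<base32>' for data (hypothetical helper, not warcio API)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


# Hypothetical, shortened HTTP blocks; only the status line differs.
input_block = b"HTTP/1.1 200 \r\nContent-Type: text/plain\r\n\r\nhello"
output_block = b"HTTP/1.1 200\r\nContent-Type: text/plain\r\n\r\nhello"

# The rewritten block is one octet shorter ...
length_delta = len(input_block) - len(output_block)
print(length_delta)  # prints 1

# ... so a digest computed over the whole block can no longer match.
same_digest = warc_style_digest(input_block) == warc_style_digest(output_block)
print(same_digest)  # prints False
```

A digest that covers only the HTTP body would be unaffected by this change, since the dropped octet is in the status line; a digest over the whole block is what goes stale.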
While I'm here I'll also mention that Transfer-Encoding: chunked is in both input and output headers, but it's not actually chunked. This is a common problem in warcs. warcio happily reads these warcs by falling back to non-chunked.
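A fallback of that kind could look roughly like the sketch below. This is only an illustration of the idea, not warcio's actual implementation: `decode_chunked_or_raw` is a hypothetical name, it works on a fully buffered body, and it ignores trailers.

```python
HEX_DIGITS = set(b"0123456789abcdefABCDEF")


def decode_chunked_or_raw(body: bytes) -> bytes:
    """Decode a chunked HTTP body; return it unchanged if it is not validly chunked."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol < 0:
            return body  # no chunk-size line: treat the body as not chunked
        # Drop any chunk extension after ';' and surrounding whitespace.
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        if not size_field or any(c not in HEX_DIGITS for c in size_field):
            return body  # size line is not hexadecimal: fall back to raw
        size = int(size_field, 16)
        start = eol + 2
        if size == 0:
            return bytes(out)  # last chunk reached (trailers are ignored)
        end = start + size
        if body[end:end + 2] != b"\r\n":
            return body  # chunk is truncated or malformed: fall back to raw
        out += body[start:end]
        pos = end + 2


print(decode_chunked_or_raw(b"5\r\nhello\r\n0\r\n\r\n"))  # b'hello'
print(decode_chunked_or_raw(b"<html>\r\nnot chunked</html>") == b"<html>\r\nnot chunked</html>")  # True
```

Returning the original bytes on any parse failure keeps the reader tolerant of records whose headers claim chunking that the body does not have.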
@ikreymer I see two choices, you probably have an opinion:
Option 1 is a "first, do no harm" philosophy, but it will be a little ugly to notice changes to the headers between read and write. Option 2 is a small code change.
I have a WARC archive created with a previous version of the warcio library about a year ago. Copying some records to another archive is done without error (with the current version of warcio), but the later verification fails. See the attached code and example warc (input.warc.gz):
The output is the following:
The expected behavior would be to raise an exception earlier:
The current behavior leads the user to assume that output.warc.gz is valid until it is re-read.
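One way to surface the problem at write time rather than on re-read is to recompute the digest over the exact bytes being written and compare it with the declared value. The sketch below is a generic illustration under that assumption; `write_checked` and `DigestMismatch` are hypothetical names and not part of warcio, and only `sha1:<base32>` digests are handled.

```python
import base64
import hashlib
import io


class DigestMismatch(ValueError):
    """Raised when the bytes about to be written do not match the declared digest."""


def write_checked(out, block: bytes, declared: str) -> None:
    """Write block to out only if its SHA-1 matches declared ('sha1:<base32>')."""
    algo, _, expected = declared.partition(":")
    if algo.lower() != "sha1":
        raise DigestMismatch(f"unsupported digest algorithm: {algo!r}")
    actual = base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")
    if actual != expected.upper():
        raise DigestMismatch(f"declared {expected}, computed {actual}")
    out.write(block)


# Hypothetical blocks: the digest was declared for the original bytes.
original = b"HTTP/1.1 200 \r\n\r\nhello"
rewritten = b"HTTP/1.1 200\r\n\r\nhello"
declared = "sha1:" + base64.b32encode(hashlib.sha1(original).digest()).decode("ascii")

sink = io.BytesIO()
write_checked(sink, original, declared)       # matches, so it is written
try:
    write_checked(sink, rewritten, declared)  # one octet shorter, so it is rejected
except DigestMismatch as exc:
    print("refused to write:", exc)
```

With a guard like this, the copy step fails immediately instead of producing an archive that only fails verification later.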
BTW: The behavior of check_digests=True equals check_digests=False, which is not what one would expect.