webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0
384 stars 58 forks source link

Block digest verification fails on some copied record #123

Open dlazesz opened 3 years ago

dlazesz commented 3 years ago

I have a WARC archive created with a previous version of warcio library about a year ago. Copying some records to another record is done without error (with the current version of warcio), but the later verification fails. See the attached code and example warc (input.warc.gz):

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

print('Validate input')
with open('input.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):
        pass

with open('input.warc.gz', 'rb') as stream, open('out.warc.gz', 'wb') as out_stream:
    writer = WARCWriter(out_stream, gzip=True, warc_version='WARC/1.1')
    for record in ArchiveIterator(stream, check_digests='raise'):
        # Negate the full condition for simplicity (Select the problematic record)
        if not (record.rec_type == 'response' and record.rec_headers.get_header('WARC-Target-URI') ==
                'https://www.origo.hu/hir-archivum/2019/20190119.html'):
            writer.write_record(record)

print('Validate output without problematic URL')
with open('out.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):
        pass

with open('input.warc.gz', 'rb') as stream, open('out.warc.gz', 'wb') as out_stream:
    writer = WARCWriter(out_stream, gzip=True, warc_version='WARC/1.1')
    for record in ArchiveIterator(stream, check_digests='raise'):
        # Select the problematic record
        if record.rec_type == 'response' and record.rec_headers.get_header('WARC-Target-URI') == \
         'https://www.origo.hu/hir-archivum/2019/20190119.html':
            writer.write_record(record) 

print('Validate output just the problematic URL')
with open('out.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):  # This will fail
        pass

The output is the following:

Validate input
Validate output without problematic URL
Validate output just the problematic URL
Traceback (most recent call last):
  File "test.py", line 30, in <module>
    for record in ArchiveIterator(stream, check_digests='raise'):
  File "/usr/local/lib/python3.6/dist-packages/warcio/archiveiterator.py", line 119, in _iterate_records
    self.read_to_end()
  File "/usr/local/lib/python3.6/dist-packages/warcio/archiveiterator.py", line 212, in read_to_end
    b = self.record.raw_stream.read(BUFF_SIZE)
  File "/usr/local/lib/python3.6/dist-packages/warcio/limitreader.py", line 27, in read
    return self._update(buff)
  File "/usr/local/lib/python3.6/dist-packages/warcio/digestverifyingreader.py", line 99, in _update
    self.digest_checker.problem('block digest failed: {}'.format(self.block_digest))
  File "/usr/local/lib/python3.6/dist-packages/warcio/digestverifyingreader.py", line 31, in problem
    raise ArchiveLoadFailed(value)
warcio.exceptions.ArchiveLoadFailed: block digest failed: sha1:HRYQXT3HWIGWZMDEAAGYGEJGK334YEGM

The expected behavior would be to raise exception earlier:

The current state makes the user imply that the output.warc.gz is valid until it is re-read.

BTW: The behavior of check_digests=True equals to check_digests=False which is not what one would expect.

wumpus commented 3 years ago

There are multiple possible bugs here, one possibility is that the copy is writing the wrong block digest, perhaps because it changed the block and kept the same digest. If I can see the input file, it would be helpful.

The check_digests API bug is a separate one, I'm not sure how I made that mistake, but I'll open a separate bug for it (#124)

dlazesz commented 3 years ago

@wumpus

If I can see the input file, it would be helpful.

The input file is attached to the OP and the bug should be reproducible with it: https://github.com/webrecorder/warcio/files/5721385/input.warc.gz

Thank you for investigating the issue!

wumpus commented 3 years ago

Thanks, I didn't notice input.warc.gz was a link.

The difference between input.warc and output.warc is that they have the same digest, but the content length is one octet shorter for out.warc. And lo and behold, right at the top, I see

input: HTTP/1.1 200 ^M # trailing whitespace, no OK out: HTTP/1.1 200^M # no trailing whitespace

Which is to say, there's trailing whitespace in the http headers in input and not in out.

How... interesting! I was thinking my digest-checking code was the guilty party, but instead it could be that the copy is dropping that http header trailing whitespace while repeating the same digest? Changing the output by dropping trailing whitespace is dodgy, repeating the digest is much dodgier.

wumpus commented 3 years ago

While I'm here I'll also mention that Transfer-Encoding: chunked is in both input and output headers, but it's not actually chunked. This is a common problem in warcs. warcio happily reads these warcs by falling back to non-chunked.

wumpus commented 3 years ago

@ikreymer I see two choices, you probably have an opinion:

  1. Save the verbatim header bytes while reading, and use them on writing if the user has not modified anything, keeping the digests exactly the same if it's an exact copy
  2. Keep the current header parsing, which might change the header on reading, and always recompute digests on write

Option 1 is a "first, do no harm" philosophy, but it will be a little ugly to notice changes to the headers between read and write. Option 2 is a small code change.