webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

warcio recompress adds WARC-Block-Digest fields to records without one #161

Open acidus99 opened 10 months ago

acidus99 commented 10 months ago

It appears that warcio recompress will add WARC-Block-Digest fields to records that do not already have that field.

The attached ZIP (example-warcs.zip) contains 2 WARCs.

In orig.warc, the warcinfo record at the start does not have a WARC-Block-Digest field at all. However, if you run:

warcio recompress orig.warc warcio-recompress.warc.gz
gunzip warcio-recompress.warc.gz

And look at the resulting warcio-recompress.warc, you will see that the warcinfo record now has a WARC-Block-Digest field with a SHA1 hash. (I included a copy in the ZIP as warc-recompress.warc.)
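To compare the two files without reading them by eye, something like the following could list each record's type alongside its WARC-Block-Digest value. This is only a minimal sketch using the Python standard library, not warcio's own API: it assumes an uncompressed WARC with well-formed headers and a Content-Length on every record, and the function name `list_block_digests` is made up for illustration.

```python
def list_block_digests(stream):
    """Yield (WARC-Type, WARC-Block-Digest or None) for each record.

    `stream` is an uncompressed WARC file opened in binary mode.
    Hypothetical helper for illustration; not part of warcio.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)

        # Collect the named header fields up to the first blank line.
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()

        # Skip over the record block using its declared length.
        stream.read(int(headers["content-length"]))

        yield headers.get("warc-type"), headers.get("warc-block-digest")
```

Running it over orig.warc and over the gunzipped output (each opened with `open(path, "rb")`) and printing the pairs should, if the behaviour is as described above, show `None` for the warcinfo record in the first file and a `sha1:` value in the second.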

While I suppose more digests aren't a bad thing, my suggestion would be that warcio recompress should not alter the records of the WARC it is operating on.