webrecorder / browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Config payload Digest sha1 base32 #532

Open gitreich opened 5 months ago

gitreich commented 5 months ago

I couldn't find a setting to configure the digest as sha1 base32. Our entire archive (even ARC!) contains digests in sha1 base32, but the crawler currently writes sha256 hex. This causes problems for deduplication: because our CDX uses base32 sha1, we cannot use it for deduplication against sha256 hex digests without regenerating the entire index.

For us the easiest solution would be to make it configurable as a parameter (--digest-encoding string, possibilities: base16, base32, base64, with one of them as the default; for us base32 would be great as the default). See also https://datatracker.ietf.org/doc/html/rfc3548

Version 0.12.4 used sha1 base32; version 1.0.2 now uses sha256 base16.
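
To illustrate the difference between the two formats, here is a minimal Python sketch using only the standard library. It is not the crawler's code; the function name `payload_digest` and its `algorithm` / `encoding` arguments are hypothetical and only mirror the options proposed above.

```python
import base64
import hashlib


def payload_digest(payload: bytes, algorithm: str = "sha1", encoding: str = "base32") -> str:
    """Return a digest label of the form '<algorithm>:<encoded hash>'.

    Hypothetical helper: 'encoding' mirrors the base16/base32/base64
    choices suggested for a --digest-encoding option.
    """
    raw = hashlib.new(algorithm, payload).digest()
    if encoding == "base16":
        value = raw.hex()
    elif encoding == "base32":
        value = base64.b32encode(raw).decode("ascii")
    elif encoding == "base64":
        value = base64.b64encode(raw).decode("ascii")
    else:
        raise ValueError(f"unsupported encoding: {encoding}")
    return f"{algorithm}:{value}"


if __name__ == "__main__":
    body = b""
    # Format we need for our existing CDX (sha1, base32):
    print(payload_digest(body, "sha1", "base32"))
    # → sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
    # Format written by 1.0.2 (sha256, base16):
    print(payload_digest(body, "sha256", "base16"))
```

The two strings differ in both hash algorithm and encoding, so a lookup of one in an index built from the other can never match, even for identical payloads.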

tw4l commented 4 months ago

Hi @gitreich - putting this on our sprint board to look into after IIPC WAC :)

gitreich commented 4 months ago

Hi, at WAC24 @ikreymer brought up the idea of adding a parameter for the location of the CDX index (for dedup via writing revisit entries). If this feature arrives, this issue could also be handled via that CDX parameter: read the payload digest format from the given CDX, and keep writing the newly generated WARCs with the digest format found in that index.
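
A rough Python sketch of how the format could be read from an existing digest value is shown below. This is only a heuristic under my own assumptions: it guesses from the length and alphabet of the digest string, it only knows sha1 and sha256, and `guess_digest_format` is a hypothetical name, not an existing crawler or CDX library function.

```python
import re

# (algorithm, encoding, pattern for the unpadded value)
_PATTERNS = [
    ("sha1", "base16", re.compile(r"[0-9a-fA-F]{40}")),
    ("sha256", "base16", re.compile(r"[0-9a-fA-F]{64}")),
    ("sha1", "base32", re.compile(r"[A-Z2-7]{32}")),
    ("sha256", "base32", re.compile(r"[A-Z2-7]{52}")),
    ("sha1", "base64", re.compile(r"[A-Za-z0-9+/]{27}")),
    ("sha256", "base64", re.compile(r"[A-Za-z0-9+/]{43}")),
]


def guess_digest_format(digest: str) -> tuple[str, str]:
    """Guess (algorithm, encoding) from one digest value taken from a CDX.

    An optional '<algorithm>:' prefix is dropped and trailing '=' padding
    is ignored; the guess relies only on the value's length and alphabet.
    """
    _, _, value = digest.rpartition(":")
    value = value.rstrip("=")
    for algorithm, encoding, pattern in _PATTERNS:
        if pattern.fullmatch(value):
            return algorithm, encoding
    raise ValueError(f"unrecognized digest format: {digest!r}")


if __name__ == "__main__":
    # A bare sha1 base32 value as found in many older CDX files.
    print(guess_digest_format("3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"))
```

A real implementation would probably sample several index lines rather than trust a single value, and would need a fallback when the index is empty or mixes formats.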