webrecorder / archiveweb.page

A High-Fidelity Web Archiving Extension for Chrome and Chromium-based browsers!
https://chrome.google.com/webstore/detail/webrecorder/fpeoodllldobpkbkabpblcfaogecpndd
GNU Affero General Public License v3.0

ZST File Support #68

Closed. blandes02 closed this 2 years ago

blandes02 commented 2 years ago

I tried to open youtubedislikes_20211213070444_9307757f.1638107855.megawarc.warc.zst, but the file couldn't be opened. Could you please support ZST files? That'd be great.

edsu commented 2 years ago

It does look like ZStandard compression was added to Zip in 2020?

https://pkware.cachefly.net/webdocs/APPNOTE/APPNOTE-6.3.7.TXT

@ikreymer does there need to be more clarification in the spec around what compression algorithms are allowed? Or is the assumption that anything ZIP allows should be supported?

edsu commented 2 years ago

@blandes02 what tool did you use for creating the .zst file?

blandes02 commented 2 years ago

I don't use the tools. I just want to open the file.

ikreymer commented 2 years ago

@edsu It was probably created with https://github.com/ArchiveTeam/wget-lua, which ArchiveTeam uses to create zstd WARCs.

Support for zstd for warcs has been discussed before, see:

For archiveweb.page and replayweb.page, we'd also need a JS/WASM implementation of zstd, which would need to be implemented in wabac.js and warcio.js. We'd probably start with replay, i.e. reading existing zstd WARCs. For writing, we'd need to figure out how to generate a proper dictionary.
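As a rough illustration of the very first step in reading: a reader has to tell a zstd WARC apart from a gzip or uncompressed one before it can pick a decompressor. The sketch below does that by checking magic numbers. It is written in Python only for brevity and is not code from wabac.js or warcio.js; `sniff_warc_compression` is a hypothetical helper name. The gzip and Zstandard magic numbers come from their format specifications. Treating a leading skippable frame as a possible embedded dictionary is an assumption about how tools such as wget-lua write these files.

```python
import struct

GZIP_MAGIC = b"\x1f\x8b"
ZSTD_FRAME_MAGIC = 0xFD2FB528      # regular Zstandard frame (stored little-endian)
ZSTD_SKIPPABLE_MIN = 0x184D2A50    # skippable frames use 0x184D2A50..0x184D2A5F
ZSTD_SKIPPABLE_MAX = 0x184D2A5F


def sniff_warc_compression(head: bytes) -> str:
    """Classify a WARC file from its first few bytes (hypothetical helper)."""
    if head[:2] == GZIP_MAGIC:
        return "gzip"
    if len(head) >= 4:
        # Zstandard magic numbers are 4-byte little-endian integers.
        (magic,) = struct.unpack("<I", head[:4])
        if magic == ZSTD_FRAME_MAGIC:
            return "zstd"
        if ZSTD_SKIPPABLE_MIN <= magic <= ZSTD_SKIPPABLE_MAX:
            # Assumption: a leading skippable frame may carry a dictionary.
            return "zstd-skippable"
    if head[:5] == b"WARC/":
        return "uncompressed"
    return "unknown"
```

Only detection is shown here. Actually decompressing the records would still need a zstd decoder, which is the missing JS/WASM piece described above.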

It's generally not a priority for archiveweb.page, as we're dealing with mostly smaller size archives here.

Closing this for now, as we don't have resources to focus on this at the moment. The place to start would probably be wabac.js, then replayweb.page.

ikreymer commented 2 years ago

> It does look like ZStandard compression was added to Zip in 2020? https://pkware.cachefly.net/webdocs/APPNOTE/APPNOTE-6.3.7.TXT

For WACZ, we're not changing the compression of WARCs, so this isn't specifically relevant - the WARCs are always added with 'store' compression to the WACZ.
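To make the 'store' point concrete, here is a minimal sketch using Python's standard `zipfile` module. It adds an already-compressed WARC to a ZIP container without recompressing it. This is not the actual WACZ writer: `build_wacz_like_zip` is a hypothetical helper, the `archive/` path is an assumption about the layout, and a real WACZ also needs its metadata and index files.

```python
import io
import zipfile


def build_wacz_like_zip(warc_name: str, warc_bytes: bytes) -> bytes:
    """Put a WARC into a ZIP using 'store' (no ZIP-level compression)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # ZIP_STORED copies the bytes as-is; the WARC keeps whatever
        # compression it already has (gzip here, zstd in principle).
        zf.writestr(
            f"archive/{warc_name}",  # assumed layout, for illustration only
            warc_bytes,
            compress_type=zipfile.ZIP_STORED,
        )
    return buf.getvalue()
```

Because the ZIP layer only stores the bytes, the compression method of the ZIP entry says nothing about how the WARC inside is compressed, which is why zstd support in the ZIP format does not by itself help with .warc.zst files.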

We'd need to focus on reading zstd warcs first.