webrecorder / specs

Specifications developed and maintained by the Webrecorder community.
https://specs.webrecorder.net

Alternate containers and compression methods? #15

Closed · jcahill closed this issue 4 years ago

jcahill commented 4 years ago

There are a handful of other ways to get a compressible outer container to do what the zip file is currently doing in the draft spec. This might be preferable to a largely-uncompressed (STORE mode) zip. Sysadmins and others interacting with zip-style WACZ files will always need to know not to try to optimize them further, for instance.

The other structural disadvantage of zip is that the index is in the footer, which is a likely region to be damaged.

ikreymer commented 4 years ago

The point of the zip format, and of using STORE, is that it allows random access to data stored in the ZIP. WARC files are already compressed, and the indices can be compressed cdx.gz as well. They are stored in the ZIP with STORE so that byte offsets loaded from the index can be looked up quickly inside the WARC file within the zip. This is what allows quick random access using this format.
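
For illustration, a minimal Python sketch of that lookup, assuming a STORE'd WARC member (the file name, member path, and offsets here are hypothetical; in practice the offsets would come from the CDX index):

```python
import zipfile

wacz_path = "example.wacz"
warc_member = "archive/data.warc.gz"
record_offset, record_length = 1024, 512  # hypothetical values from the index

with zipfile.ZipFile(wacz_path) as zf:
    # because the WARC member is STORE'd, seeking within it maps directly onto
    # a seek in the underlying .wacz file -- no container decompression needed
    with zf.open(warc_member) as warc:
        warc.seek(record_offset)
        gzipped_record = warc.read(record_length)
```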

ikreymer commented 4 years ago

The nice thing about this is that it allows certain files to be marked 'don't compress any more' because they're already compressed, while other types of data (that are not looked up via random access) can still be compressed by the zip. There are ways of verifying the integrity of the ZIP; I believe each file has a checksum entry, for example, so I don't think that is an issue.

And internally, WARCs, which would comprise the bulk of the data, could still use a different compression format (e.g. ZSTD) and still be bundled with STORE in the Zip, so that should be quite extensible. Do you have any specific examples of alternatives? I think the ubiquity of Zip and its support for random access make it an especially good choice here.
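
As a sketch of that mixed approach (the file names and layout are hypothetical), Python's zipfile can set the compression method per member and verify each entry's CRC-32:

```python
import zipfile

# already-compressed WARCs and indexes go in uncompressed (STORE);
# small metadata files can still be DEFLATE'd by the zip itself
with zipfile.ZipFile("example.wacz", "w") as zf:
    zf.write("data.warc.gz", "archive/data.warc.gz",
             compress_type=zipfile.ZIP_STORED)
    zf.write("index.cdx.gz", "indexes/index.cdx.gz",
             compress_type=zipfile.ZIP_STORED)
    zf.write("datapackage.json", "datapackage.json",
             compress_type=zipfile.ZIP_DEFLATED)

# every member carries a CRC-32; testzip() returns the first bad name, if any
with zipfile.ZipFile("example.wacz") as zf:
    assert zf.testzip() is None
```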

jcahill commented 4 years ago

Yes, I understand the conceit of the zip container. I'm suggesting that outer compression and random access are both possible. Probably encryption too. Archivists can prescribe data as not for further compression all day, but we should expect attempts at compression anyway. If we can head that off at the pass by building a respectable level of compression into the spec, that will help. Since any attempt to repack the container with outer compression breaks the core functionality of the wacz bundle, file archiver [tools] will break previously-unpacked wacz hierarchies by default. That describes every naive re-zipping of a worked wacz.

RE: zip versus other options: zip is well-known, but the particular feature being exploited here is obscure. 7z might be an attractive alternative. As it stands, the only generalist file archiver applications with warc support are 7z-centric. I'll want to take a stab at a proof of concept before naming any more obscure formats.

Neither zip nor 7z uses recovery records, for what it's worth, so long-term damage is a concern in both cases.

ikreymer commented 4 years ago

The goal is not to have users manually package or repackage wacz using low-level zip tools - doing so is possible but would be done at your own risk, so to speak. The intent is to have higher-level tools, like py-wacz, that will handle these operations, e.g. wacz add, wacz extract, wacz verify, etc., and that can ensure not just the integrity of the zip file but also adherence to the wacz spec. The file format is .wacz, not .zip.

For example, .docx and .pptx files are also actually zip files, but you don't generally edit them with zip tools, and doing so could easily break them. These files are generally opened and created with higher-level tools (word processors, presentation software, etc.).

It's unclear what exactly 7z would offer here, or how it would provide random access plus compression, and I think there are plenty of other issues to resolve. There is also a key requirement for this to run client-side, which requires fast random access (no binary search over HTTP, etc.). I guess WACZ is optimizing for fast read access, not necessarily smallest file size.
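
A rough sketch of that client-side access pattern (the URL and byte range are hypothetical; a real client would derive them from the ZIP central directory and the CDX index):

```python
import urllib.request

url = "https://example.org/crawl.wacz"   # hypothetical remote WACZ
start, length = 4096, 2048               # hypothetical byte range of a record

# because the WARC is STORE'd inside the container, a single Range request
# fetches exactly the bytes needed, with no decompression of surrounding data
req = urllib.request.Request(
    url, headers={"Range": f"bytes={start}-{start + length - 1}"})
with urllib.request.urlopen(req) as resp:
    assert resp.status == 206  # 206 Partial Content
    record_bytes = resp.read()
```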

ato commented 4 years ago

file archiver [tools] will break previously-unpacked wacz hierarchies by default

I agree that this could be a real source of confusion and is the same problem as WARC's use of multi-member gzip. What made it particularly confusing in the case of .warc.gz was that the file extension would cause users and tools to think it was just "default gzip". Certainly changing the file extension doesn't entirely eliminate that possibility for people who know just enough of the technical detail to realize it uses zip but who don't actually read the spec. However, I do think changing the extension goes a long way toward mitigating the problem for normal users. I also think there are significant benefits to reusing a widely-supported and well-understood format as a basis, and those benefits probably make it worthwhile in spite of this problem.

The other structural disadvantage of zip is that the index is in the footer, which is a likely region to be damaged.

It's true that if the file is truncated, which could also happen due to issues with filesystems or media and not just at the application level, the central directory could be lost. Fortunately, zip stores a second copy of the most essential metadata in local headers in front of each entry, which zip repair tools can make use of. Locating the central directory at the end of the file is also what enables extra or updated members to be added to a zip without rewriting the entire file. That's quite a useful feature to have, and we've even talked about making use of it to incrementally update collection metadata within WACZ files, as sketched below.
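
A small sketch of that incremental-update idea, assuming a hypothetical member name; Python's zipfile append mode adds the new entry and rewrites only the central directory:

```python
import zipfile

with zipfile.ZipFile("example.wacz", "a") as zf:
    # existing members and their data are left untouched; the new entry and a
    # fresh central directory are written after the current end of the file
    zf.writestr("metadata/notes.txt", "collection notes added later",
                compress_type=zipfile.ZIP_DEFLATED)
```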

7z might be an attractive alternative

I'm not that familiar with 7z, but it doesn't seem to store records uncompressed by default, nor does it seem to allow random access within a compressed record. Therefore it seems like it would suffer from the same problem of users repacking with non-aware tools. I guess people are less familiar with it and so might be less likely to recognize it and use general tools on it, but that doesn't seem like a compelling line of reasoning.

I guess the ideal-world solution to your first problem would be a standardized and widely-adopted compression format that supports random access by default. Unfortunately it's rare to even find one that offers it as an option, let alone by default.

As for the index-at-the-end issue, according to this article 7z, just like zip, keeps the index at the end of the file. However, unlike zip, 7z doesn't keep a second copy of that metadata in local headers, so if the end header is corrupted the filenames are lost entirely. So unfortunately 7z actually seems to be less recoverable than zip in the face of damage to the end of the file.

7z therefore unfortunately doesn't seem to offer a solution to either problem.

jcahill commented 4 years ago

@ato: I took a closer look at some options.

Multi Layer Archive (MLA), from ANSSI, is essentially a month of development away from being an ideal candidate.

The only blocker to a proof of concept that checks every box is https://github.com/ANSSI-FR/MLA/issues/15.

ikreymer commented 4 years ago

@jcahill This is an experimental format that's not released yet and so far is only supported in Rust. It may be promising, but I would not try to base the spec on something so experimental.

It is also unclear how well it would work fetching over HTTP.

To speed up the decompression, and to make the layer seekable, a footer is used. It saves the compressed size. Knowing the decompressed size, a seek at a cleartext position can be performed by seeking to the beginning of the correct compressed block, then decompressing the first bytes until the desired position is reached.

I think this means that a compressed block contains more than just the requested data, so a reader will need to fetch more and 'throw away' the first N bytes to get to the range it needs, which may be a perf hit. The current approach fetches exactly the bytes needed and decompresses the entire block, if necessary.
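
A generic illustration of that cost (this is not MLA's actual API, just the decompress-and-discard pattern, shown with zlib):

```python
import zlib

def read_at(compressed_block: bytes, offset: int, length: int) -> bytes:
    # decompress from the start of the block, producing only enough output to
    # cover the requested range, then discard the leading `offset` bytes
    d = zlib.decompressobj()
    out = d.decompress(compressed_block, offset + length)
    return out[offset:offset + length]

block = zlib.compress(b"example payload " * 1024)
print(read_at(block, 4096, 16))
```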

There are a lot of things still to figure out, but choosing an alternative to Zip does not seem necessary; Zip seems to be the best approach so far. Closing this for now.

commial commented 3 years ago

Hi, MLA developer here. I don't want to resurrect the thread, just to make sure anyone landing on it later has all the info. There is no additional data in compressed blocks, only at the end of the file (i.e. after the N compressed blocks). This data is used only to speed things up, and is actually optional (to support the repair capability).

I don't know if the format is suitable for your needs, but if it is still an option for you, I would be happy to discuss it.