webrecorder / specs

Specifications developed and maintained by the Webrecorder community.
https://specs.webrecorder.net

New Spec: Nested WACZ files? #129

Open ikreymer opened 2 years ago

ikreymer commented 2 years ago

Related to multi-WACZ / aggregated WACZ loading #112, a possible idea is to support nested WACZ files, e.g. ZIP files containing other WACZ files and a datapackage.json. The main use case for this would be parallel crawlers that produce multiple WACZ files, each signed individually. For packaging / distribution, it is still convenient to bundle the output into a single file. This makes sense if the reason for having multiple WACZ outputs is parallelism, and not size limits. Some questions to answer around this:

An alternative would be to simply merge the WACZ files (merging the CDXJ, page lists, etc.), which is also doable, but more work (both to implement and to run).
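To make the nested idea concrete, here is a minimal sketch of bundling several WACZ files into one outer ZIP with a datapackage.json index. Nothing here is specified yet: the function name `bundle_wacz_files`, the `wacz/` directory, and the `resources` entries (`name`, `path`, `hash`, `bytes`) are assumptions for illustration only.

```python
import hashlib
import json
import zipfile
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bundle_wacz_files(wacz_paths, out_path):
    """Copy each WACZ unchanged into an outer ZIP and index them in datapackage.json.

    The layout (wacz/ directory, resource fields) is hypothetical, not part of any spec.
    """
    resources = []
    # ZIP_STORED: the inner files are added without another compression pass
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_STORED) as outer:
        for wacz_path in map(Path, wacz_paths):
            arcname = f"wacz/{wacz_path.name}"
            # The inner WACZ bytes are not modified, so a hash or signature
            # computed over the original file still matches after extraction.
            outer.write(wacz_path, arcname=arcname)
            resources.append({
                "name": wacz_path.name,
                "path": arcname,
                "hash": "sha256:" + sha256_of(wacz_path),
                "bytes": wacz_path.stat().st_size,
            })
        outer.writestr("datapackage.json", json.dumps({"resources": resources}, indent=2))
    return resources
```

Since each inner WACZ is copied byte for byte, the individual signatures would not need to be regenerated. A viewer would still have to open each inner WACZ to reach its own index and pages.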

ikreymer commented 2 years ago

Clarifying a bit more, there are two key reasons for ending up with multiple WACZ files:

A) A parallel crawl with multiple crawlers, each producing its own WACZ file containing a subset of the pages.
B) A single crawl that reaches a certain data size limit, where adding to one file is no longer desirable (e.g. perhaps over 100GB?)

The possible solutions are as follows:

1) Combining smaller WACZ files into a single one by merging the .cdxj and creating a new WACZ file with all the WARCs.
2) Combining smaller WACZ files into a new 'nested' WACZ, as described above.
3) Creating a JSON manifest of multiple WACZ files, as discussed in #112.

Options 1) and 2) are good solutions for reason A, where multiple WACZ files exist due to parallel crawling and can be quite small. However, option 3) may be the best option for reason B, where multiple WACZ files exist because each one is already quite large. We will probably need the JSON manifest 3) and either 1) or 2) as well, unless we decide to support only the JSON manifest.
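For option 1), the CDXJ merging step could look roughly like the sketch below. It assumes each input index is already sorted, that lines have their trailing newline stripped, and that the content is ASCII so Python's string ordering matches the sort order used when the indexes were written. The name `merge_cdxj` is hypothetical, and the sketch does not handle WARC filename collisions between the source WACZ files or merging of page lists.

```python
import heapq


def merge_cdxj(*sorted_line_iters):
    """Lazily merge already-sorted CDXJ line iterables into one sorted stream.

    Exact duplicate lines (adjacent after merging) are emitted only once.
    """
    previous = None
    for line in heapq.merge(*sorted_line_iters):
        if line != previous:
            yield line
        previous = line


# Illustrative, made-up index lines from two crawler outputs
crawl_a = [
    'com,example)/ 20230101000000 {"url": "https://example.com/", "filename": "a.warc.gz"}',
    'com,example)/about 20230101000500 {"url": "https://example.com/about", "filename": "a.warc.gz"}',
]
crawl_b = [
    'com,example)/ 20230102000000 {"url": "https://example.com/", "filename": "b.warc.gz"}',
    'org,example)/ 20230102000100 {"url": "https://example.org/", "filename": "b.warc.gz"}',
]

merged = list(merge_cdxj(crawl_a, crawl_b))
```

`heapq.merge` streams its inputs, so memory use stays small even for large indexes. The expensive part of option 1) is more likely to be rewriting the WACZ with all the WARCs and re-signing the result.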

edsu commented 1 year ago

@ikreymer for nesting, would we need a new file name and extension for nested WACZ files that is distinct from WACZ? If not, won't WACZ viewers need to account for whether the WACZ is nested or not and behave accordingly?

If we want to consider nesting as part of WACZ I think this would mean updating the WACZ specification to include this nesting functionality directly, or at least pointing to the separate WACZ Aggregation specification?

edsu commented 1 year ago

In the use case above where each WACZ is individually signed, is the issue that the cert that is being used to sign each WACZ needs to be different? Or is it simply a technical convenience to get around CDXJ merging? Or are there other issues at play?