Open jcahill opened 4 years ago
While standardizing the hierarchy by itself may be interesting for other use cases, in order to achieve the two goals that motivated the creation of WACZ, the details of the encapsulation are essential. It needs to be a single file so it can be shared easily, and that single file needs to be constructed carefully, not as just any generic container format, in order to allow incremental loading without downloading or reading the entire collection.
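To make the incremental-loading point concrete: a ZIP file keeps a central directory at its end, so a reader can look up one member and read only that member's bytes. The following is a minimal sketch using Python's standard `zipfile` module. The member names are made up for illustration and are not taken from the WACZ draft.

```python
import io
import zipfile

# Build a small in-memory ZIP. Member names are hypothetical.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("archive/data.warc", b"WARC/1.1\r\n" + b"x" * 1000)
    zf.writestr("indexes/index.cdx", b"com,example)/ 2021 {}\n")


def read_member(fileobj, name):
    """Read a single member; zipfile locates it via the central directory."""
    with zipfile.ZipFile(fileobj) as zf:
        info = zf.getinfo(name)
        # header_offset is the byte position of the member's local header.
        return info.header_offset, zf.read(name)


offset, index_bytes = read_member(buf, "indexes/index.cdx")
```

Here the reader goes straight to the small index member without touching the larger WARC member before it. Over HTTP, the same lookup could presumably be served with range requests, which is where careful construction of the file matters.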
~Ah, I think I misunderstood you. You're just saying you'd like to see versioning and fixity as features and suggesting that BagIt or OCFL could be added as structural layers to provide those features. I think I was confused because you mentioned how BagIt eliminated the specification about ZIP but for WACZ the details around ZIP are actually essential to achieving its goals and so cannot be eliminated.~
Edit: I confused atomotic and jcahill as the same person. My bad!
other container formats satisfy this need like bagit or the newest ocfl.
so why not keep wacz format as simple as possible and relevant only to the web archiving domain, and organize collections of wacz inside ocfl?
My impression is that OCFL is designed specifically around the need to store multiple versions of data and their digests. But that doesn't apply to WARC files, since there's never going to be a 'v2' of the same WARC file.
I suppose using BagIt may be a better fit, but that wouldn't address the random-access requirement, for which the ZIP bundling is still necessary.
Maybe there should be a separate WAC directory layout, and the Z part for packing it up as a single ZIP file.
But are users going to open the expanded file, or just use it as a sort of black box, e.g. the way .docx files generally are?
I suppose maybe that could be useful if a collection is being actively edited, though it's not designed as an edit-in-place format.
sorry, i have a precarious connection on the train and mistakenly deleted the previous comment.
got the point, ocfl design is not useful here. bagit, instead, could be zipped uncompressed. the bagit package (golang) of https://github.com/ndlib/bendo does this, as an example.
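As a rough illustration of "zipped uncompressed", here is a minimal Python sketch that packs a directory tree into a ZIP using the stored (no compression) method. It is not bendo's implementation. The helper name `pack_stored` and the file names are assumptions made for the example, and the directory below is not a complete, valid bag (it has no manifest).

```python
import pathlib
import tempfile
import zipfile


def pack_stored(src_dir, zip_path):
    """Pack every file under src_dir into a ZIP without compression."""
    src = pathlib.Path(src_dir)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to the root, with forward slashes.
                zf.write(path, arcname=path.relative_to(src).as_posix())
    return zip_path


# Usage with a throwaway directory; the layout is illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp, "bag")
    (root / "data").mkdir(parents=True)
    (root / "data" / "capture.warc").write_bytes(b"WARC/1.1\r\n")
    (root / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    )
    out = pack_stored(root, pathlib.Path(tmp, "bag.zip"))
    with zipfile.ZipFile(out) as zf:
        names = zf.namelist()
        all_stored = all(
            info.compress_type == zipfile.ZIP_STORED for info in zf.infolist()
        )
```

Because members are stored rather than deflated, each file's bytes sit unmodified inside the container, which is what makes byte-range access into a member possible.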
One issue that seems to arise from the current draft spec is loss of separation of concerns with respect to accessing and modifying components of the collection. This speaks to both (a) the need for different parties to have differing levels of access to distinct materials and (b) the need to be confident that the underlying capture data has not changed and is being given a wide berth.
Some scenarios that come to mind:
wacz_new from a subset of wacz_orig.
Some of these issues could be solvable with some scoping of when exactly the encapsulation is expected to occur in relation to content changes. If the wacz spec is to be seen more as a sort of collection layout convention than an archive file format, compression could itself remain optional, only needing to come into play as a storage-mode consideration, i.e. when collections aren't in a state of heavy development. BagIt's evolution comes to mind. Wikipedia:
The outer zip container is effectively a glorified suitcase for the data and metadata here (wacz draft), so it stands to reason that it might not always be strictly necessary. The hierarchy's hammering down of certain conventions for pairing of web archival data files and their sidecar metadata files strikes me as much more important.
The most important question for me, then, lies in how to effectively reason about contents already in wacz hierarchies, especially for the purposes of aggregating and disaggregating them.
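One way to picture the disaggregation case, i.e. deriving wacz_new from a subset of wacz_orig, is as a filtered copy between two ZIP containers. The sketch below is only an assumption-laden illustration in Python: `copy_subset` is a hypothetical helper, the member names are invented, and it ignores the question of regenerating any index or metadata that refers to the dropped members.

```python
import io
import zipfile


def copy_subset(src_fileobj, dst_fileobj, keep):
    """Copy members for which keep(name) is true into a new stored ZIP."""
    copied = []
    with zipfile.ZipFile(src_fileobj) as src, zipfile.ZipFile(
        dst_fileobj, "w", compression=zipfile.ZIP_STORED
    ) as dst:
        for info in src.infolist():
            if not info.is_dir() and keep(info.filename):
                dst.writestr(info.filename, src.read(info))
                copied.append(info.filename)
    return copied


# Hypothetical source collection held in memory.
wacz_orig = io.BytesIO()
with zipfile.ZipFile(wacz_orig, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("archive/a.warc", b"a")
    zf.writestr("archive/b.warc", b"b")
    zf.writestr("pages/pages.jsonl", b"{}")

# Derive a new collection that leaves out one capture file.
wacz_new = io.BytesIO()
copied = copy_subset(wacz_orig, wacz_new, lambda name: name != "archive/b.warc")
```

The source container is only read, never modified, which keeps the original capture data untouched while the derived collection is built separately.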