webrecorder / specs

Specifications developed and maintained by the Webrecorder community.
https://specs.webrecorder.net

Encapsulation, access control, and fixity #14

Open jcahill opened 4 years ago

jcahill commented 4 years ago

One issue that seems to arise from the current draft spec is a loss of separation of concerns with respect to accessing and modifying components of the collection. This speaks to both (a) the need for different parties to have differing levels of access to distinct materials and (b) the need to be confident that the underlying capture data has not changed and is being left untouched.

Some scenarios that come to mind:

  1. Haphazard in-place modification of the record leads to container integrity issues.
  2. Access controls require deriving a wacz_new from a subset of wacz_orig.
  3. Certain groups are only ever interested in select subsets of the data. They need it in bulk, so they need raw download. But everything else is dead weight.
  4. Updating of wacz containers obscures a fixity issue with the records.
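
To make the fixity concern in scenarios 1 and 4 concrete, here is a minimal sketch of a digest manifest, roughly in the spirit of a BagIt manifest. The function names (`make_manifest`, `changed_paths`) and the file paths are hypothetical and not taken from the WACZ draft; SHA-256 is an assumption as well.

```python
import hashlib


def make_manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Map each path to the SHA-256 hex digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def changed_paths(files: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return paths that were added, removed, or modified since the manifest was made."""
    current = make_manifest(files)
    all_paths = set(current) | set(manifest)
    return sorted(p for p in all_paths if current.get(p) != manifest.get(p))


# Hypothetical collection contents, for illustration only.
original = {
    "archive/data.warc": b"WARC/1.1\r\nWARC-Type: response\r\n\r\n",
    "indexes/index.cdx": b"com,example)/ 20210101000000 {}\n",
}
manifest = make_manifest(original)

# Simulate a haphazard in-place edit of the capture data.
tampered = dict(original)
tampered["archive/data.warc"] += b"extra bytes"
```

If the manifest is recorded before any repackaging, `changed_paths(tampered, manifest)` flags the modified WARC, whereas a container that is simply rewritten in place would carry no trace of the change.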

Some of these issues could be solvable with some scoping of when exactly the encapsulation is expected to occur in relation to content changes. If the wacz spec is to be seen more as a sort of collection layout convention than an archive file format, compression could itself remain optional, only needing to come into play as a storage-mode consideration, i.e. when collections aren't in a state of heavy development. BagIt's evolution comes to mind. Wikipedia:

> Until version 15, the draft also described how to serialize a bag in an archive file, such as ZIP or TAR. From version 15 on, the serialization is no longer part of the specifications, but not because of technical reasons but only because of the scope and focus of the specification.

The outer zip container is effectively a glorified suitcase for the data and metadata here (wacz draft), so it stands to reason that it might not always be strictly necessary. The hierarchy's hammering down of certain conventions for pairing of web archival data files and their sidecar metadata files strikes me as much more important.

The most important question for me, then, lies in how to effectively reason about contents already in wacz hierarchies, especially for the purposes of aggregating and disaggregating them.

ato commented 4 years ago

While standardizing the hierarchy by itself may be interesting for other use cases, the details of the encapsulation are essential to achieving the two goals that motivated the creation of WACZ. It needs to be a single file so it can be shared easily, and that single file needs to be constructed carefully, not as just any generic container format, in order to allow incremental loading without downloading or reading the entire collection.
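
As a rough illustration of the incremental-loading point, the sketch below reads one member of a ZIP through a reader that counts how many bytes are actually touched. The member names and the `CountingReader` helper are hypothetical, and an in-memory buffer stands in for a remote file; this is not the WACZ reference implementation.

```python
import io
import zipfile


class CountingReader(io.BytesIO):
    """In-memory file that records how many bytes were actually read."""

    def __init__(self, data: bytes) -> None:
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_archive() -> bytes:
    """Build a small ZIP with hypothetical WACZ-like member names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("archive/data.warc", b"WARC/1.1\r\n" * 1000)  # large member
        zf.writestr("indexes/index.cdx", b"com,example)/ 20210101000000 {}\n")
        zf.writestr("datapackage.json", b'{"profile": "data-package"}')
    return buf.getvalue()


def read_member(fileobj, name: str) -> bytes:
    """Read one member; zipfile seeks to it via the central directory."""
    with zipfile.ZipFile(fileobj) as zf:
        with zf.open(name) as member:
            return member.read()
```

Reading `datapackage.json` this way touches only the end-of-archive directory records and that member's own bytes, not the large WARC member, which is the property a generic streaming container such as TAR would not offer.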

ato commented 4 years ago

~Ah, I think I misunderstood you. You're just saying you'd like to see versioning and fixity as features and suggesting that BagIt or OCFL could be added as structural layers to provide those features. I think I was confused because you mentioned how BagIt eliminated the specification about ZIP but for WACZ the details around ZIP are actually essential to achieving its goals and so cannot be eliminated.~

Edit: I confused atomotic and jcahill as the same person. My bad!

ikreymer commented 4 years ago

> other container formats satisfy this need like bagit or the newest ocfl.
>
> so why not keep wacz format as simple as possible and relative only to the webarchiving domain and organize collections of wacz inside ocfl?

My impression is that OCFL is designed specifically around the need to store multiple versions of data and their digests. But that doesn't apply to WARC files, since there's never going to be a 'v2' of the same WARC file.

I suppose using BagIt may be a better fit, but that wouldn't address the random-access requirement, for which the ZIP bundling is still necessary.

ikreymer commented 4 years ago

Maybe there should be a separate WAC directory layout, and the Z part for packing it up as a single ZIP file.

But are users going to open the expanded file, or just use it as a sort of black box, e.g. the way .docx files generally are?

I suppose maybe that could be useful if a collection is being actively edited, though it's not designed as an edit-in-place format.

atomotic commented 4 years ago

sorry, i have a precarious connection on the train and i mistakenly deleted the previous comment.

got the point, the ocfl design is not useful here. a BagIt bag, instead, could be zipped uncompressed. the bagit package (golang) in https://github.com/ndlib/bendo does this, as an example.
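
A minimal Python sketch of the "zipped uncompressed" idea, not of how bendo does it: with the ZIP "stored" method, a member's bytes sit verbatim in the container, so they can be sliced out by offset. The helper names (`pack_stored`, `slice_member`) and the file names are hypothetical.

```python
import io
import struct
import zipfile

# Fixed 30-byte ZIP local file header; the last two fields are the
# file-name length and the extra-field length.
LOCAL_HEADER = struct.Struct("<4s5H3I2H")


def pack_stored(members: dict[str, bytes]) -> bytes:
    """Write all members into a ZIP without compression (ZIP_STORED)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()


def slice_member(raw: bytes, name: str) -> bytes:
    """Return a stored member by slicing the raw ZIP bytes directly."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError(f"{name} is compressed; direct slicing does not apply")
    fields = LOCAL_HEADER.unpack_from(raw, info.header_offset)
    name_len, extra_len = fields[-2], fields[-1]
    start = info.header_offset + LOCAL_HEADER.size + name_len + extra_len
    return raw[start:start + info.file_size]
```

Because nothing is compressed, the same offsets could in principle be used for byte-range requests against a remote copy of the file, which is what makes a stored ZIP compatible with the random-access requirement discussed above.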