webrecorder / specs

Specifications developed and maintained by the Webrecorder community.
https://specs.webrecorder.net

Interfacing with WASAPI #12

Open machawk1 opened 4 years ago

machawk1 commented 4 years ago

Some of the collection-based retrieval aspects of this specification are particularly interesting, like the ability to request particular pageIDs of interest.

As you are very well aware, @ikreymer, WASAPI is an abstracted spec for WARC retrieval with a few specifications. I can imagine a WACZ layer to make WASAPI implementations a bit more usable from both a macro and collection-based querying standpoint, as it seems to provide some standard semantics.

Because you have solicited thoughts in this repo, I wondered about consideration of interfacing with WASAPI and/or potentially providing endpoints or routes that align with WACZ.

I am looking forward to further discussion.

ikreymer commented 4 years ago

Yes! WASAPI is a data transfer API, while WACZ is designed to be a storage specification, so there isn't any overlap, but they could definitely complement one another!

I think a main limitation of WASAPI is that it lets you download a bunch of WARCs in bulk, but then what do you do with them? A tool could use WASAPI to download WARCs in bulk and then assemble them into a WACZ file, which would be a stable format that is instantly usable in replayweb.page or can be added to other storage. I believe WASAPI is also missing support for any metadata, such as page/seed lists, which would probably also need to be added.
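
To make the "download, then assemble" idea concrete, here is a rough Python sketch. It is an illustration under stated assumptions, not an existing tool: the function names `warc_locations` and `build_wacz_like` are hypothetical, the `files` / `filename` / `locations` keys reflect my reading of the WASAPI response format, and the `archive/`, `pages/pages.jsonl`, and `datapackage.json` layout reflects my reading of the WACZ spec. The actual HTTP download is left out, and no CDX index is generated, so the result would likely need an indexing step before it replays.

```python
import hashlib
import io
import json
import zipfile


def warc_locations(wasapi_page: dict) -> list:
    """Return (filename, first download URL) pairs from one WASAPI result page.

    Assumes the page is a dict with a "files" list whose entries carry
    "filename" and "locations"; entries without a location are skipped.
    """
    return [
        (entry["filename"], entry["locations"][0])
        for entry in wasapi_page.get("files", [])
        if entry.get("locations")
    ]


def build_wacz_like(warcs: dict, pages: list) -> bytes:
    """Pack already-downloaded WARC bytes plus a page list into a WACZ-style ZIP.

    `warcs` maps filename -> raw bytes; `pages` is a list of dicts such as
    {"url": ..., "ts": ...}. Indexes are intentionally omitted in this sketch.
    """
    resources = []
    buf = io.BytesIO()
    # ZIP_STORED: the WARCs are usually gzipped already, so skip recompression.
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:

        def add(path: str, data: bytes) -> None:
            # Write the file and record it for the datapackage manifest.
            zf.writestr(path, data)
            resources.append({
                "name": path.rsplit("/", 1)[-1],
                "path": path,
                "hash": "sha256:" + hashlib.sha256(data).hexdigest(),
                "bytes": len(data),
            })

        for name, data in warcs.items():
            add("archive/" + name, data)

        # Assumed pages.jsonl shape: a header line, then one JSON object per page.
        header = {"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}
        lines = [json.dumps(header)] + [json.dumps(p) for p in pages]
        add("pages/pages.jsonl", ("\n".join(lines) + "\n").encode("utf-8"))

        manifest = {"profile": "data-package", "resources": resources}
        zf.writestr("datapackage.json", json.dumps(manifest, indent=2))
    return buf.getvalue()
```

The page list is passed in separately because, as noted above, WASAPI does not seem to carry page/seed metadata; in this sketch it would have to come from somewhere else.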