webrecorder / py-wacz

Canonical method for converting multiple WARC files to WACZ #33

Open jackdos opened 1 year ago

jackdos commented 1 year ago

I'm not sure if this is a feature request or just a request for clarification, but I'm looking for a canonical way to generate a WACZ file from multiple WARC files.

I am dealing with some web collections that span multiple WARCs but should be represented as a single WACZ. From the command line I can get this to work by putting all the WARC files in a single folder and running:

wacz create -o test.wacz -f warcs/*.warc

However, I have failed in multiple attempts to cleanly invoke this from within a Java wrapper. I've tried various combinations of escaping and quoting the parameters, but to no avail. Either way, I assume this relies on either OS or Python expansion of the * wildcard, and it's not clear what would and would not be expected to work in terms of wildcards, regular expressions, etc.

What I'm looking for ideally is either for the -f parameter to be repeatable (in the way that -i is in ffmpeg) so that each file can be explicitly listed, or to be able to specify a -d parameter pointing to a directory that is explicitly expected to contain multiple WARC files. The directory option would probably need to let you specify which file extensions to consider, or should clearly document what happens when non-WARC content is found in the directory.

quinn commented 1 year ago

Bump, I'm also having trouble with this.

ikreymer commented 1 year ago

Sorry, missed this earlier! The -f warcs/*.warc form relies on shell expansion to fill in the file list. The -f flag already does what you are suggesting: it expects a list of filenames (relative to the current working directory, or absolute) after the -f param, e.g. -f warcs/a.warc warcs/b.warc ... warcs/n.warc should work.

This is what we do in the crawler: we generate a list of WARC files and then pass each one as a separate param after the -f param: https://github.com/webrecorder/browsertrix-crawler/blob/main/crawler.js#L881
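
For anyone calling wacz from a wrapper rather than a shell, here is a minimal sketch of the same idea in Python: expand the wildcard yourself and pass each WARC as its own argument after a single -f. The function names build_wacz_command and create_wacz are made up for this illustration and are not part of py-wacz, and the *.warc pattern is an assumption (compressed files would need a different pattern). The same principle should apply to Java's ProcessBuilder, which also takes a list of arguments and does not go through a shell.

```python
import subprocess
from pathlib import Path


def build_wacz_command(warc_dir, output):
    """Build an argv list equivalent to `wacz create -o OUTPUT -f warcs/*.warc`.

    Hypothetical helper for illustration; not part of py-wacz.
    """
    # Expand the wildcard ourselves instead of relying on a shell.
    # sorted() keeps the file order deterministic between runs.
    warcs = sorted(str(p) for p in Path(warc_dir).glob("*.warc"))
    if not warcs:
        raise FileNotFoundError(f"no .warc files found in {warc_dir}")
    # A single -f followed by every file, each as a separate argument.
    return ["wacz", "create", "-o", str(output), "-f", *warcs]


def create_wacz(warc_dir, output):
    # Passing a list (and no shell=True) hands the arguments to wacz
    # verbatim, so no quoting or escaping is needed.
    subprocess.run(build_wacz_command(warc_dir, output), check=True)
```

Calling create_wacz("warcs", "test.wacz") would then run the same command as the shell example above, assuming the wacz executable is on the PATH. Files in the directory that do not match the pattern are simply never passed to wacz.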

quinn commented 1 year ago

Thanks, that works!

jackdos commented 1 year ago

OK, great, this was just a request for clarification then!