DUOLabs333 closed 3 years ago
Extracting a single resource is pretty fast, but indexing the WARC to determine which URLs it contains can take time. The WARCs are stored as-is, but an index is generated when they are first added to pywb.
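To illustrate why the index is the slow part and lookups are the fast part, here is a minimal sketch using only the Python standard library. It is not pywb's indexer and does not parse real WARC records: the record layout (a URL line followed by the body) and the names `write_archive`, `build_index`, and `fetch` are assumptions made up for this example. The one thing it borrows from `.warc.gz` files is the idea that each record is its own gzip member, so a record can be decompressed alone once its byte offset is known.

```python
import gzip
import zlib

GZIP_WBITS = 31  # 16 + 15: tell zlib to expect a gzip header and trailer


def write_archive(records):
    """Concatenate one gzip member per (url, body) record (toy format)."""
    return b"".join(
        gzip.compress(url.encode() + b"\n" + body) for url, body in records
    )


def build_index(data):
    """Slow step: scan the whole archive once, mapping url -> (offset, length)."""
    index = {}
    offset = 0
    while offset < len(data):
        d = zlib.decompressobj(wbits=GZIP_WBITS)
        record = d.decompress(data[offset:])
        # Bytes left in unused_data belong to the following members.
        length = len(data) - offset - len(d.unused_data)
        url = record.split(b"\n", 1)[0].decode()
        index[url] = (offset, length)
        offset += length
    return index


def fetch(data, index, url):
    """Fast step: decompress only the member that holds the requested URL."""
    offset, length = index[url]
    record = zlib.decompress(data[offset:offset + length], wbits=GZIP_WBITS)
    return record.split(b"\n", 1)[1]


if __name__ == "__main__":
    archive = write_archive([
        ("http://example.com/", b"home"),
        ("http://example.com/a", b"page a"),
    ])
    idx = build_index(archive)
    print(fetch(archive, idx, "http://example.com/a"))
```

With the index in hand, `fetch` never touches the other records, which is why replay is quick once indexing has been done.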
You may be interested in a new format we're developing, called WACZ (https://github.com/webrecorder/wacz-spec), which bundles WARCs along with indexes into a single ZIP file for very fast access. You can create WACZ files with py-wacz and then load them quickly with replayweb.page.
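As a rough sketch of the bundling idea only, the snippet below stores some archive bytes and a precomputed index side by side in one ZIP, then answers a lookup without rescanning the archive. It does not produce a valid WACZ file: the member names `archive/data.bin` and `indexes/index.json`, the JSON index layout, and the functions `make_bundle` and `lookup` are all hypothetical, chosen for this example. See the wacz-spec repository for the actual layout.

```python
import io
import json
import zipfile


def make_bundle(archive_bytes, index):
    """Pack archive data and its index into one in-memory ZIP (toy layout)."""
    buf = io.BytesIO()
    # ZIP_STORED keeps members uncompressed, so seeking inside them is cheap.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("archive/data.bin", archive_bytes)
        zf.writestr("indexes/index.json", json.dumps(index))
    return buf.getvalue()


def lookup(bundle_bytes, key):
    """Read the bundled index, then read only the requested byte range."""
    with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as zf:
        index = json.loads(zf.read("indexes/index.json"))
        offset, length = index[key]
        with zf.open("archive/data.bin") as f:
            f.seek(offset)
            return f.read(length)


if __name__ == "__main__":
    bundle = make_bundle(b"aaaBBBBcc", {"a": [0, 3], "b": [3, 4], "c": [7, 2]})
    print(lookup(bundle, "b"))
```

Because the index travels with the data, nothing has to be regenerated when the bundle is opened for the first time.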
Oh, right, I forgot that indexing is the bottleneck.
Yes, exactly, and the WACZ format addresses this by bundling the indexes and other metadata together with the WARCs. WACZ is not yet supported in pywb, but support is planned.
Is your feature request related to a problem? Please describe.
The main problem with WARCs is that every time you want to run them from a cold boot, you have to extract the file, which takes time.
Describe the solution you'd like
To have an option to run WARCs without extracting them.