webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

Run WARCs without having to put them into files #677

Closed by DUOLabs333 3 years ago

DUOLabs333 commented 3 years ago

Is your feature request related to a problem? Please describe.

The main problem with WARCs is that every time you want to run them from a cold boot, you have to extract the file, which takes time.

Describe the solution you'd like

An option to run WARCs without extracting them first.

ikreymer commented 3 years ago

Extraction of a single resource is pretty fast, but indexing the WARC to determine what URLs it contains can take time. The WARCs are stored as is, but an index is generated when first added to pywb.
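To illustrate why the index is the expensive step while a single lookup is cheap, here is a minimal sketch using only the standard library. It assumes uncompressed WARC records and keeps one entry per URL; it is not pywb's actual indexer, and the helper names (`make_record`, `build_index`, `read_record`) are made up for this example. Real WARCs are usually gzip-compressed per record, and real indexes also key on the capture timestamp.

```python
import io


def make_record(uri, body):
    """Build one toy, uncompressed WARC record (illustrative helper)."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


def build_index(stream):
    """Scan the whole stream once, mapping each target URI to (offset, length)."""
    index = {}
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        headers = {}
        while True:
            header = stream.readline()
            if header in (b"\r\n", b"\n", b""):
                break  # end of the record's header block
            name, _, value = header.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip the body without reading it; only its size matters here.
        stream.seek(int(headers["content-length"]), io.SEEK_CUR)
        uri = headers.get("warc-target-uri")
        if uri:
            index[uri] = (offset, stream.tell() - offset)
    return index


def read_record(stream, index, uri):
    """Fetch a single record by seeking directly to its offset."""
    offset, length = index[uri]
    stream.seek(offset)
    return stream.read(length)


warc = io.BytesIO(
    make_record("http://example.com/", b"hello")
    + make_record("http://example.com/about", b"about page")
)
index = build_index(warc)  # slow part: touches every record
record = read_record(warc, index, "http://example.com/about")  # fast part: one seek
```

Building the index has to walk every record in the file, whereas `read_record` does a single seek and read once the index exists. That is the asymmetry described above.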

You may be interested in a new format we're developing, called WACZ (https://github.com/webrecorder/wacz-spec), which bundles WARCs along with indexes into a single ZIP file for very fast access. You can create WACZ files with py-wacz and then load them quickly with replayweb.page.
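As a rough sketch of why bundling helps: a ZIP file has a central directory, so a reader can pull out just the index member without touching the WARC data. The example below builds a toy archive in memory with Python's `zipfile` module. The member names (`archive/data.warc`, `indexes/index.cdx`, `datapackage.json`) and the index line are assumptions based on my reading of the WACZ spec and the CDXJ format, so check the spec for the authoritative layout. The functions `make_toy_wacz` and `read_index_entries` are hypothetical helpers, not part of py-wacz or pywb.

```python
import io
import json
import zipfile

# Assumed CDXJ-style line: "<url key> <timestamp> <json>"; values are illustrative.
INDEX_LINE = (
    'com,example)/ 20200101000000 '
    '{"url": "http://example.com/", "filename": "data.warc", '
    '"offset": 0, "length": 120}'
)


def make_toy_wacz():
    """Create an in-memory ZIP that mimics the assumed WACZ layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("archive/data.warc", b"(WARC records would go here)")
        zf.writestr("indexes/index.cdx", INDEX_LINE + "\n")
        zf.writestr("datapackage.json", "{}")
    buf.seek(0)
    return buf


def read_index_entries(wacz_file):
    """Read only the index members; the WARC data is never decompressed."""
    entries = []
    with zipfile.ZipFile(wacz_file) as zf:
        for name in zf.namelist():
            if not name.startswith("indexes/"):
                continue
            for line in zf.read(name).decode("utf-8").splitlines():
                _key, _timestamp, payload = line.split(" ", 2)
                entries.append(json.loads(payload))
    return entries


entries = read_index_entries(make_toy_wacz())
```

Each entry then says which WARC inside the bundle holds the capture and at what offset, so no indexing pass is needed when the file is first opened.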

DUOLabs333 commented 3 years ago

Oh, right, I forgot that indexing is the bottleneck.

ikreymer commented 3 years ago

> Oh, right, I forgot that indexing is the bottleneck.

Yes, exactly. The WACZ format addresses this by bundling the indexes, along with other metadata, together with the WARCs. WACZ is not yet supported in pywb, but support is planned.