webrecorder / webrecorder-player

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
Apache License 2.0

Separate CDX File Support #35

Closed Saklad5 closed 5 years ago

Saklad5 commented 7 years ago

For larger WARC files, such as those produced by recursing through a website with wget for archival purposes, it is common to have tens of thousands of records. Indexing that many records on demand is extremely slow, so reading preexisting .cdx files would massively improve launch times.

This could be implemented by reading identically-named .cdx files in the same directory as the WARC file being opened, or by allowing the user to select the file manually. CDX files are explicitly designed to be machine-parsable, so it shouldn’t be difficult to implement that part.
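As a rough illustration of the first option, here is a minimal Python sketch, not code from Webrecorder Player. The function names `find_sidecar_cdx` and `read_cdx` are hypothetical, and it assumes the naming convention where `site.warc.gz` (or `site.warc`) is paired with `site.cdx`, and that the CDX file starts with a header line such as ` CDX N b a m s k r M S V g` whose first character is the field delimiter.

```python
from pathlib import Path
from typing import Dict, Iterator, Optional


def find_sidecar_cdx(warc_path: str) -> Optional[Path]:
    """Return a same-named .cdx file next to the WARC, or None if absent.

    Hypothetical helper: assumes "site.warc.gz" is paired with "site.cdx".
    """
    warc = Path(warc_path)
    name = warc.name
    # Strip the archive suffix so both "site.warc.gz" and "site.warc" map to "site".
    for suffix in (".warc.gz", ".warc"):
        if name.lower().endswith(suffix):
            name = name[: -len(suffix)]
            break
    candidate = warc.with_name(name + ".cdx")
    return candidate if candidate.is_file() else None


def read_cdx(cdx_path: Path) -> Iterator[Dict[str, str]]:
    """Yield one dict per CDX line, keyed by the field letters in the header."""
    with open(cdx_path, encoding="utf-8", errors="replace") as handle:
        header = handle.readline().rstrip("\r\n")
        # Assumption: the header's first character is the delimiter, then "CDX".
        delimiter = header[0] if header else " "
        parts = header[1:].split(delimiter)
        if not parts or parts[0] != "CDX":
            raise ValueError("missing CDX header line")
        letters = parts[1:]
        for line in handle:
            values = line.rstrip("\r\n").split(delimiter)
            # Skip lines that do not match the header's field count.
            if len(values) == len(letters):
                yield dict(zip(letters, values))
```

A caller could try `find_sidecar_cdx` first and fall back to indexing the WARC on demand when it returns `None`. Which header letters are present depends on the tool that wrote the file, so the sketch keys each record by whatever letters the header declares instead of hard-coding a layout.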

ikreymer commented 5 years ago

There is now an automated cache, created in the _warc_cache directory alongside the WARC, which stores the WR Player state, including the cdx. The cache is more than just the cdx, and allows for fast reloads of WARC files that have been loaded before.

You should be able to load previously opened WARCs pretty quickly with version 1.6.0 and up.