webrecorder / replayweb.page

Serverless replay of web archives directly in the browser
https://replayweb.page
GNU Affero General Public License v3.0

Internal page crashes while loading a large archive #162

Open kiler129 opened 1 year ago

kiler129 commented 1 year ago

I'm sorry for the non-descriptive title, but there's nothing more specific I can really say.

I attempted to load the archive from https://archive.org/download/archiveteam_liveleak_20210506071950_2a306039 and it crashes every single time after loading ~3.5GB. I tried opening DevTools, and the last message printed is "Read 93000" records. Immediately afterwards, DevTools disconnects ("DevTools was disconnected from the page. ...").

I'm running the offline version on macOS v13.2.1 on M1 Max with 32GB memory. The memory pressure is low.

ikreymer commented 1 year ago

To open a large WARC, the application has to process the entire file. We have the WACZ format to solve this problem: it precomputes the index and packages it together with the WARC. WACZ files have scaled to over 1TB, so this should definitely work. Are you able to run a command-line tool to convert WARC->WACZ? If so, we have py-wacz and now also js-wacz, both of which can do the conversion on the command line. The converted file will then open fairly quickly, and you'll be able to search through it and replay right away.
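
To illustrate why a precomputed index helps, here is a minimal Python sketch of the general idea. It is not the actual WACZ layout or the py-wacz/js-wacz API: the function names `build_archive` and `read_record` and the in-memory dict index are hypothetical, chosen only for illustration. Each record is compressed as its own gzip member, and the index stores the byte offset and length of each member, so a single record can be read without scanning the rest of the file.

```python
import gzip
import io


def build_archive(records):
    """Write each (url, body) pair as a separate gzip member.

    Returns the concatenated bytes plus an index mapping
    url -> (offset, length). The index format is a hypothetical
    stand-in for a real archive index.
    """
    buf = io.BytesIO()
    index = {}
    for url, body in records:
        offset = buf.tell()
        buf.write(gzip.compress(body))
        index[url] = (offset, buf.tell() - offset)
    return buf.getvalue(), index


def read_record(blob, index, url):
    """Decompress only the gzip member that the index points at."""
    offset, length = index[url]
    return gzip.decompress(blob[offset:offset + length])


if __name__ == "__main__":
    blob, index = build_archive([
        ("http://example.com/a", b"alpha"),
        ("http://example.com/b", b"beta"),
    ])
    # Only the bytes for /b are touched; /a is never decompressed.
    print(read_record(blob, index, "http://example.com/b"))
```

Without such an index, a reader has to walk every record from the start of the file to learn where each one lives, which is the work that loading a multi-gigabyte WARC directly requires.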