webrecorder / replayweb.page

Serverless replay of web archives directly in the browser
https://replayweb.page
GNU Affero General Public License v3.0

.warc file loading stops at 30-40% #67

Closed apataga closed 2 years ago

apataga commented 3 years ago

Hello! The Replayweb.page app (v1.4.0 to 1.5.2 on Windows 10 20H2) stops at 30-40% when loading any .warc file larger than 1 GB from this save of gamerankings.com: https://archive.fart.website/archivebot/viewer/job/9uxhl A 650 MB file from the same save, by contrast, loads completely. Your previous program (Webrecorder Player) loads any .warc file, but is much slower.

ikreymer commented 3 years ago

Thanks, will see if there's anything that can be done. It's a bit tricky, since the entire file must be read in the browser.

For much better results, the recommendation is to convert the WARC files there into a WACZ file, which can be done using the Python py-wacz tool (https://github.com/webrecorder/wacz-format/tree/main/py-wacz). The web archives could then be read quite quickly without needing to load the entire file. Perhaps that's even something that ArchiveBot Viewer could do automatically...
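To illustrate why an index avoids reading the whole file, here is a rough Python sketch. It is not how py-wacz or replayweb.page is implemented: the helper names (`make_record`, `build_index`, `read_record`) are hypothetical, the records are uncompressed and deliberately minimal (not spec-complete), and the index is a plain in-memory dict. The idea is only that one full scan records where each entry starts, after which a single record can be fetched by seeking.

```python
import io


def make_record(uri: str, payload: bytes) -> bytes:
    """Build a minimal, uncompressed WARC-style record (illustrative only)."""
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record body is followed by two CRLFs.
    return headers + payload + b"\r\n\r\n"


def build_index(stream) -> dict:
    """Scan the stream once, mapping each target URI to (offset, length)."""
    index = {}
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected data at offset {offset}")
        uri, length = None, 0
        # Read header lines up to the blank line that ends the header block.
        while True:
            line = stream.readline()
            if line in (b"\r\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            name = name.strip().lower()
            if name == "warc-target-uri":
                uri = value.strip()
            elif name == "content-length":
                length = int(value.strip())
        # Skip the payload and the trailing CRLF CRLF without reading them.
        stream.seek(length + 4, io.SEEK_CUR)
        index[uri] = (offset, stream.tell() - offset)
    return index


def read_record(stream, index: dict, uri: str) -> bytes:
    """Fetch one record by seeking straight to its offset."""
    offset, length = index[uri]
    stream.seek(offset)
    return stream.read(length)


warc = io.BytesIO(
    make_record("http://example.com/a", b"hello")
    + make_record("http://example.com/b", b"world!")
)
index = build_index(warc)
record_b = read_record(warc, index, "http://example.com/b")
```

Once the index exists and is stored alongside the data, later loads can skip the scan in `build_index` entirely and go straight to `read_record`, which is why a pre-indexed archive can open quickly regardless of its size.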

ikreymer commented 3 years ago

I was able to load one of the larger WARCs on an OSX 1.5.2 player, so it seems like it may be a bit less predictable.

apataga commented 3 years ago

Understood. Maybe you can explain how to restore gamerankings.com as an offline site? To be able to search by gaming platforms, sort by date, etc. All links on the Internet about opening .warc files lead to your programs. :)

qube1t commented 3 years ago

I am using Chrome on Windows, and it is happening with a 340 MB WARC file. Mine is from archive.org.

MeekoMechy commented 2 years ago

Hi! I'm also having trouble with the app. I'm always getting stuck at 77%. I'm using the app from here, and also Chrome. I am trying to open a 5 GB WARC file.

DUOLabs333 commented 2 years ago

Yeah, you either have to use WACZ, which adds the index to the file (don't know why it wasn't in the original spec), or you must use pywb to index and play it back on your own website. Is there a reason why you must use a separate website, rather than hosting your own?

ikreymer commented 2 years ago

We would like to support this use case with replayweb.page as well, separate from pywb, so definitely hope to fix this! I haven't been able to reproduce this issue consistently yet, and am hoping we can improve this a bit.

Would also be great to be able to offer WARC->WACZ conversion in the browser, which would require loading the WARC fully at least once.

Also, running pywb requires someone to run a web server, while replayweb.page can host a web archive from any static storage, so these are slightly different use cases.

ikreymer commented 2 years ago

For anyone having issues with WARCs getting stuck loading, can you try loading on this dev version at: https://dev.replayweb.page/

This version uses a web worker to do the loading, and also lists the actual number of records loaded alongside the percentage. I'd be curious to know if:

The percentage reflects how much of the WARC's total size has been loaded; however, it does not map uniformly to how much time loading will take. E.g., a WARC with 100 small records in 1 MB will probably take longer than a single record of 100 MB. I am curious whether the WARC is actually getting stuck, or if it's just loading very slowly, which would help determine how best to address the issues.
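A small sketch of why a byte-based percentage can look stalled. This is an illustration only, not the loader's actual code: the function name `progress` and the record sizes are made up for the example.

```python
from typing import Iterable, Iterator, Tuple


def progress(record_sizes: Iterable[int]) -> Iterator[Tuple[int, float]]:
    """Yield (records_loaded, percent_of_bytes_loaded) after each record."""
    sizes = list(record_sizes)
    total = sum(sizes)
    if total == 0:
        return  # nothing to load, so nothing to report
    loaded = 0
    for count, size in enumerate(sizes, start=1):
        loaded += size
        yield count, 100.0 * loaded / total


# Hypothetical file: one 100 MB record followed by 100 records of 10 KB each.
sizes = [100_000_000] + [10_000] * 100
steps = list(progress(sizes))
first_count, first_percent = steps[0]
last_count, last_percent = steps[-1]
```

Here the percentage is already above 99% after the first of 101 records, so if each record carries a roughly fixed processing cost, almost all of the remaining work happens while the percentage barely moves. Showing the record count next to the percentage makes that case distinguishable from a load that has genuinely stopped.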

ikreymer commented 2 years ago

This is now up on the main replayweb.page. I have not been able to detect WARCs getting completely stuck, so closing this for now. Occasionally it can be slow, however, especially on Firefox. Chrome/Chromium-based browsers seem to load much more quickly, which may be a separate issue to investigate.