webrecorder / replayweb.page

Serverless replay of web archives directly in the browser
https://replayweb.page
GNU Affero General Public License v3.0

out of memory #65

Closed: robert-1043 closed this issue 3 years ago

robert-1043 commented 3 years ago

For in-company use of web archives, I'm experimenting with transforming older Heritrix crawls to WACZ (thanks to py-wacz). One of these transforms results in a 48 GB archive that reports 325,000 pages. Using this archive with the online replayweb.page results in an out-of-memory error. Opening it in the app (Windows 10) works, but not very smoothly.

I suppose the problem lies in the number of pages and/or the text index. What are the limitations for these?
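
For reference, a conversion along these lines with py-wacz would look roughly like the command below; the paths are placeholders, and the --detect-pages and -t (text index) flags are assumptions that may differ between py-wacz versions:

wacz create -f ./heritrix-crawl/*.warc.gz -o crawl.wacz --detect-pages -t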

ikreymer commented 3 years ago

Wow, that's a lot of pages! My guess is that it's the text index; it has not yet been optimized for that size... Are these 325,000 pages, not URLs, e.g. found via page detection? Does it work if there is only one entry page and no text?

If you are able to share, I'd be curious to check out the WACZ... If you can share a download link via email, you can reach me at ilya [at] webrecorder.net.

robert-1043 commented 3 years ago

It's indeed 325k pages. I've found about 80k 'ajaxteaser' pages, which Heritrix crawled and stored separately from their containing pages. I tried to edit the pages.jsonl and update it inside the WACZ, but no luck so far. The WACZ and the separate pages.jsonl / index.cdx are coming your way.
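
One simple way to do that kind of filtering, assuming the teaser pages can be recognized by an 'ajaxteaser' substring in their URLs (a guess on my part), is to drop the matching lines from pages.jsonl before repackaging:

grep -v ajaxteaser pages/pages.jsonl > pages/pages-filtered.jsonl

The first line of pages.jsonl is a small header record and survives this filter as long as it doesn't contain the pattern.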

robert-1043 commented 3 years ago

I have recreated the WACZ based on the edited pages.jsonl. It now contains 253k pages (the pages I took out didn't contain text, so pages.jsonl is now 369 MB instead of 386 MB). No out-of-memory error with the online replayweb.page, and the app runs smoothly.
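
For reference, rebuilding a WACZ from an edited pages.jsonl with py-wacz would look roughly like the command below; the -p/--pages flag and the file paths are assumptions and may vary by py-wacz version:

wacz create -f ./archive/*.warc.gz -p pages/pages-filtered.jsonl -o crawl-reduced.wacz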

ikreymer commented 3 years ago

Thanks for sharing the WACZ! We are working on improving support for large page lists / text indexes. There is now a new mode, still in development, where in addition to the main page list you can load an 'extra pages' list on demand. The idea is that it doesn't make sense to include all pages, especially for a crawl, but only the 'seeds', while the remaining pages + text data can live in a separate file (eventually compressed).
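
Inside the WACZ container, this split corresponds roughly to the layout below (the seed list and the on-demand 'extra pages' list are separate JSONL files; names follow the WACZ format, and the annotations are only illustrative):

my-archive.wacz
  pages/pages.jsonl        <- seed / entry pages only, loaded up front
  pages/extraPages.jsonl   <- remaining pages + extracted text, loaded on demand
  archive/                 <- WARC data
  indexes/                 <- CDX index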

Using the WACZ you shared, I extracted it and then recreated it like this:

wacz create --url <url-of-starting-page> -e pages/pages.jsonl -f ./archive/*.warc.gz

This created a WACZ with just one page (the main page) as the starting page, while the rest of the page and text data is stored in the 'extra pages' list (specified via the -e flag). ReplayWeb.page then loads the text index asynchronously, though currently it is still loaded each time; we'll be looking at ways to optimize this further. Of course, if -e is removed, there is no page index at all and browsing alone should work well, but we'll be looking for ways to optimize that case too!

Thanks for sharing your results! If you'd like, you should be able to create the WACZ as shown above.

ikreymer commented 3 years ago

Closing; the issue seems to be resolved by loading from the recreated WACZ.