webrecorder / archiveweb.page

A High-Fidelity Web Archiving Extension for Chrome and Chromium-based browsers!
https://chrome.google.com/webstore/detail/webrecorder/fpeoodllldobpkbkabpblcfaogecpndd
GNU Affero General Public License v3.0

Add a smart dedupe feature #30

Closed · deltabravozulu closed this issue 3 years ago

deltabravozulu commented 3 years ago

If one is recording a site of video content (especially video content that is re-downloaded upon, say, a reload or clicking the same link again), the files become huge. Having the ability to intelligently deduplicate already-captured data would be useful.

There are certainly use cases where one might want multiple copies of the same page, but as an example, here's a short scenario where one might not, very similar to my experience:

  ACT I:

You're a diligent young archivist who has been visiting a website called videosite.com and holds an account there (it isn't necessary, but having an account lets you visit more pages). The site serves streaming educational video content loaded from .m3u8 playlists and is not easily crawled through other means.

You start recording, go to videosite.com, the homepage of the site, and log into your videosite.com account.

You click the first link on the page, which takes you to videosite.com/page/1. There are ten pages in total. You click play on a few of the videos, and once each video stream finishes buffering through, you click "Next" and go to videosite.com/page/2.

You repeat this a few times until you hit page 10, and then you click on a user profile. You notice that when you click one of his favorite videos, it takes you to videosite.com/page?id=9, which looks suspiciously like videosite.com/page/9. You then click play on all the videos on the page and let archiveweb.page take care of the rest. You click "Next" and now you are at videosite.com/page?id=10, which again looks suspiciously like videosite.com/page/10.

You've seen all these videos, so you hit the back button, and end the recording.


  ACT II: 

Oddly, the file size is pretty big. Oh well. You export a .WARC file. You decide to open it up with your internet off and see if you can view the pages correctly. You click on videosite.com/page/1. All good. You click on videosite.com/page/4... and you see... nothing except an error message!

You think, "WTF? Page 4 wasn't saved? Well, wait, maybe if I go back to page 1 and then click through it will work."

It does. But the reverse does not. You only went forward, never back. And when you go to the profile from earlier and click a different favorite video, it tries to load videosite.com/page?id=6, but that isn't there either.


  ACT III:

You decide to try to do this right. This is your favorite site to learn from. You go from page/1 to page/10. You go from page/10 to page/1. You go from page?id=1 to page?id=10 and back.

You play everything on each page just in case not playing it means it won't load later. You end the recording.


  ACT IV:

The file has now nearly quadrupled in size from the original. All the JS, the TS video segments, the CSS, etc. for each page have been copied in quadruplicate. And there's nothing you can do except try again.
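
To make the ask concrete, here's a rough Python sketch of what I mean by deduping keyed on content digest rather than URL -- all names here are made up for illustration, not ArchiveWeb.page internals:

```python
import hashlib

seen = {}  # payload digest -> URL of the first full capture

def store(url: str, payload: bytes) -> str:
    """Store a capture, writing the payload only the first time it is seen."""
    digest = hashlib.sha256(payload).hexdigest()
    if digest in seen:
        # Same bytes under a different URL (e.g. /page/9 vs /page?id=9):
        # record a lightweight pointer instead of a second full copy.
        return f'revisit -> first capture at {seen[digest]}'
    seen[digest] = url
    return 'stored in full'

print(store('https://videosite.com/page/9', b'<html>...</html>'))
print(store('https://videosite.com/page?id=9', b'<html>...</html>'))  # pointer, not a copy
```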

ikreymer commented 3 years ago

ArchiveWeb.page should already be doing this -- that's why there are the Size Loaded (non-deduped) and Size Stored (deduped) counters, which could probably be better documented. The size count on the pages is not entirely accurate for this reason as well. What you're describing is unfortunately a difficulty of archiving complex, dynamic pages, where going from one page to another in a different order can trigger different requests to be loaded.

If you can share an example of where this is happening, or the pages you were trying to archive, I can offer more suggestions / take a closer look at the issues. If the exact same data (with the same digest) is appearing multiple times in the WARCs, then it is a bug. The duplicates should appear as revisit records pointing to the first capture.
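
For example, a quick sketch along these lines using warcio (the filename is a placeholder) can check for that: it flags any payload digest that was stored as a full response record more than once, which under correct deduping should show up as revisit records instead:

```python
from collections import defaultdict
from warcio.archiveiterator import ArchiveIterator

full_copies = defaultdict(list)  # payload digest -> URLs stored in full
revisits = 0

with open('example.warc.gz', 'rb') as stream:  # placeholder filename
    for record in ArchiveIterator(stream):
        digest = record.rec_headers.get_header('WARC-Payload-Digest')
        if record.rec_type == 'response' and digest:
            full_copies[digest].append(
                record.rec_headers.get_header('WARC-Target-URI'))
        elif record.rec_type == 'revisit':
            revisits += 1

print(f'revisit records: {revisits}')
for digest, urls in full_copies.items():
    if len(urls) > 1:
        # Same payload written in full more than once -- a dedupe bug.
        print(digest, len(urls), urls[:3])
```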

deltabravozulu commented 3 years ago

Gotcha! Alright. I thought it was exhibiting different behavior from the ol' WebRecorder, which, if memory serves, had a toggle for deduping. I think this situation is heavily tied to the other bug/FR I put in.

Basically, I have, if I remember correctly, a 60 GB Size Loaded and a 30 GB Size Stored, which spits out a 21.8 GB WARC and a (broken) 21.8 GB WACZ.