ArchiveWeb.page should already be doing this -- that's why there are the Size Loaded (non-deduped) and Size Stored (deduped) counters, which probably could be better documented. The size count on the pages is not entirely accurate for this reason as well.
What you're describing is unfortunately a difficulty of archiving complex, dynamic pages, where going from one page to another in a different order can trigger different requests to be loaded.
If you can share an example where this is happening, or the pages you were trying to archive, we can offer more suggestions / take a closer look at the issue. If the same exact data (with the same digest) is appearing multiple times in the WARCs, then it is a bug. The duplicates should appear as revisit records pointing to the first capture.
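
If it helps, here's one way to check for that -- a small sketch using the warcio library (the script and its output format are just for illustration). It counts how often each payload digest shows up as a full `response` record versus as a `revisit` record:

```python
# Sketch of a duplicate-digest check with warcio (pip install warcio).
# Repeated digests among full 'response' records would indicate the
# dedup-to-revisit path did not kick in.
import sys
from collections import Counter

from warcio.archiveiterator import ArchiveIterator


def digest_report(warc_path):
    response_digests = Counter()
    revisit_count = 0
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                digest = record.rec_headers.get_header('WARC-Payload-Digest')
                if digest:
                    response_digests[digest] += 1
            elif record.rec_type == 'revisit':
                revisit_count += 1

    dupes = {d: n for d, n in response_digests.items() if n > 1}
    print('revisit records:', revisit_count)
    print('digests stored more than once as full responses:', len(dupes))
    for digest, count in sorted(dupes.items(), key=lambda kv: -kv[1])[:10]:
        print(' ', count, 'x', digest)


if __name__ == '__main__':
    digest_report(sys.argv[1])  # e.g. python digest_report.py archive.warc.gz
```

If that shows the same digest stored more than once as a full response, please paste the output here and we can dig in further.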
Gotcha! Alright. I thought it was behaving differently from the ol' Webrecorder, which had a toggle for deduping, if memory serves. I think this situation is heavily tied to the other bug/FR I put in.
Basically, I have, if I remember correctly, a 60GB Size Loaded and a 30GB Size Stored, which spits out a 21.8GB WARC and a (broken) 21.8GB WACZ.
If one is recording a site of video content (especially video that gets re-requested upon, say, a reload or clicking the link again), the files become huge. Having the ability to intelligently deduplicate already-captured data would be useful.
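
For concreteness, here's roughly what I mean -- a sketch of digest-based dedup using warcio's writer, not a claim about how ArchiveWeb.page works internally; the `seen` index and helper name are made up:

```python
# Rough idea only: keep an index of payload digests already written and
# emit a small 'revisit' record instead of re-storing the payload.
# `seen` maps digest -> (first URI, first capture date).
from warcio.warcwriter import WARCWriter


def write_deduped(writer, record, seen):
    digest = record.rec_headers.get_header('WARC-Payload-Digest')
    uri = record.rec_headers.get_header('WARC-Target-URI')

    if record.rec_type == 'response' and digest in seen:
        first_uri, first_date = seen[digest]
        # Point back at the first capture instead of duplicating the body.
        revisit = writer.create_revisit_record(
            uri, digest, first_uri, first_date,
            http_headers=record.http_headers)
        writer.write_record(revisit)
    else:
        if record.rec_type == 'response' and digest:
            seen[digest] = (uri, record.rec_headers.get_header('WARC-Date'))
        writer.write_record(record)


# Usage sketch:
# with open('deduped.warc.gz', 'wb') as out:
#     writer = WARCWriter(out, gzip=True)
#     seen = {}
#     for record in records:  # e.g. from warcio's ArchiveIterator
#         write_deduped(writer, record, seen)
```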
There are certainly use cases where one might want multiple copies of the same page, but as an example, here's a short scenario where one might not, very similar to my experience: