Is there a way to dedup WARCs after recording them?

webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives

https://pypi.python.org/pypi/pywb

GNU General Public License v3.0


Is there a way to dedup WARCs after recording them? #836

Closed YousufSSyed closed 1 year ago

YousufSSyed commented 1 year ago

I have dedup_policy: revisit in my config.yaml but I'd like to also dedup after pages have been recorded.

tw4l commented 1 year ago

Hi @YousufSSyed, de-duping WARCs after recording isn't currently supported in pywb. The extent of the dedup feature is documented in https://pywb.readthedocs.io/en/latest/manual/configuring.html#dedup-options-for-recording, which you've likely already seen.

YousufSSyed commented 1 year ago

Alright.

ssairanen commented 1 year ago

A few tools to look into if deduping WARCs after crawling is needed: https://nlnwa.github.io/warchaeology/ https://github.com/arcalex/warcrefs

YousufSSyed commented 1 year ago

@ssairanen Thanks for the suggestions, though I found one I really like and am now using it: https://github.com/tari/warcdedupe. Also, warcrefs hasn't been updated in 8 years.

petsva commented 1 year ago

None of these handle deduping across WARC files, right? I.e. deduping between several selective crawls run at different times.

YousufSSyed commented 1 year ago

@petsva pywb creates .warc.gz files. Gzip files can be concatenated in the terminal like so: cat warc1.warc.gz warc2.warc.gz > warc3.warc.gz. Then you can dedup warc3.warc.gz.
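As a minimal sketch of why the cat approach works: a gzip stream may consist of several members back to back, and a gzip reader decompresses them in sequence. The payloads below are placeholder byte strings standing in for file contents, not real WARC records.

```python
import gzip

# Placeholder payloads standing in for the contents of two .warc.gz files
# (assumption: real files would hold WARC records, not these strings).
first = gzip.compress(b"record-from-warc1\n")
second = gzip.compress(b"record-from-warc2\n")

# Byte-level concatenation, equivalent to:
#   cat warc1.warc.gz warc2.warc.gz > warc3.warc.gz
combined = first + second

# gzip.decompress reads every member of a multi-member stream in order.
restored = gzip.decompress(combined)
```

The combined stream decompresses to the two payloads one after the other, which is why the concatenated file can be read as a single archive.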

petsva commented 1 year ago

That's true, but it becomes unmanageable if you have lots of large WARC files.

YousufSSyed commented 1 year ago

@petsva I don't know exactly how off the top of my head, but you could write a shell script to automate it. Perhaps cat all the WARC files into one, delete all the other WARCs, and then run warcdedupe on the combined one. If you do so, you'd want to run wb-manager reindex afterwards, and I have this issue open about that: No such file or directory after deleting a WARC and reindexing.

ssairanen commented 1 year ago

I haven't tried any of those yet, but warchaeology seems to have a "--keep-index" parameter, which I presume saves the index to disk so it can be reused in future runs.

https://nlnwa.github.io/warchaeology/cmd/warc_dedup/

anjackson commented 1 year ago

None of these handle deduping across WARC files, right? I.e. deduping between several selective crawls run at different times.

Warchaeology's warc dedup seems to support building and referring to cached indexes. I've not checked other tools.

EDIT Darn it @ssairanen beat me by two minutes! 😭 😂

YousufSSyed commented 1 year ago

@ssairanen @anjackson I don't know much about WARCs actually. What's the benefit of this? Does it mean you wouldn't have to use the wb-manager reindex command?

ssairanen commented 1 year ago

@YousufSSyed:

Deduplication means that you take a bunch of WARCs, and if there are many WARC records that contain the same response URI and content (checked with some hash algo, like sha1), there's no use keeping them all around: you can replace the later ones with WARC revisit records that point to the earliest crawled version.

This is good if you have a recurring crawl and the crawler crawls the same site many, many times: there's no need to have 100 copies of "sitelogo.jpg". One is enough, and the revisit records of later crawls point to it.
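The idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how any of the tools mentioned in this thread are implemented: the Capture class, its fields, and the dedup function are hypothetical names, and real WARC records carry more metadata.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Capture:
    # Hypothetical, simplified stand-in for a WARC record.
    uri: str
    date: str                      # ISO date, so string order is time order
    payload: Optional[bytes]       # None for revisit records
    kind: str = "response"
    refers_to_date: Optional[str] = None


def dedup(captures: List[Capture]) -> List[Capture]:
    """Replace later identical captures with revisit entries."""
    seen = {}  # (uri, sha1 hex digest) -> date of the earliest capture
    result = []
    for cap in sorted(captures, key=lambda c: c.date):
        if cap.payload is None:
            # Already a revisit; keep it unchanged.
            result.append(cap)
            continue
        key = (cap.uri, hashlib.sha1(cap.payload).hexdigest())
        if key in seen:
            # Same URI and same content: keep only a pointer to the original.
            result.append(Capture(cap.uri, cap.date, None, "revisit", seen[key]))
        else:
            seen[key] = cap.date
            result.append(cap)
    return result


captures = [
    Capture("http://example.com/sitelogo.jpg", "2023-01-01", b"logo-v1"),
    Capture("http://example.com/sitelogo.jpg", "2023-01-02", b"logo-v1"),
    Capture("http://example.com/sitelogo.jpg", "2023-01-03", b"logo-v2"),
]
result = dedup(captures)
kinds = [c.kind for c in result]
```

Here the second capture becomes a revisit pointing at the first, while the third is kept in full because its content changed.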

petsva commented 1 year ago

My possible scenario is hundreds of daily harvest WARCs with no deduplication. So I would like to build indexes of them and then generate deduplicated WARCs for each day that point back to the earlier captures. Maybe that can be done with warchaeology; I should test it.

YousufSSyed commented 1 year ago

Deduplication means that you take a bunch of WARCs, and if there are many WARC records that contain the same response URI and content (checked with some hash algo, like sha1)

@ssairanen @anjackson No, I meant the warc dedup tool's --keep-index argument and "support building and referring to cached indexes."

How about you re-read my question:

@ssairanen @anjackson I don't know much about WARCs actually. What's the benefit of this? Does it mean you wouldn't have to use the wb-manager reindex command?

ssairanen commented 1 year ago

Deduplication means that you take a bunch of WARCs, and if there are many WARC records that contain the same response URI and content (checked with some hash algo, like sha1)

@ssairanen @anjackson No, I meant the warc dedup tool's --keep-index argument and "support building and referring to cached indexes."

How about you re-read my question:

@ssairanen @anjackson I don't know much about WARCs actually. What's the benefit of this? Does it mean you wouldn't have to use the wb-manager reindex command?

I read your question, and the short answer is: no. wb-manager reindex has nothing to do with deduplication or the deduplication tool's keep-index parameter. keep-index means storing an index of URI:HASH pairs for deduplication purposes, whereas wb-manager creates the CDX index that pywb uses to access the WARCs.

The benefit of this is that you might need only half of the disk space for your WARCs.
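To illustrate what keeping a URI:HASH index between runs buys you, here is a rough sketch, assuming a plain JSON file as the on-disk index. The file name, the JSON format, and the dedup_with_index function are hypothetical and are not warchaeology's actual storage format or API.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def dedup_with_index(captures, index_path):
    """Return URIs already present in the saved index, then update the index.

    captures is a list of (uri, payload) pairs; index_path is a JSON file
    that persists the URI:HASH keys between runs (hypothetical format).
    """
    index = json.loads(index_path.read_text()) if index_path.exists() else {}
    duplicates = []
    for uri, payload in captures:
        key = f"{uri}:{hashlib.sha1(payload).hexdigest()}"
        if key in index:
            duplicates.append(uri)   # would be written as a revisit record
        else:
            index[key] = True        # first time this URI + content is seen
    index_path.write_text(json.dumps(index))
    return duplicates


with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "dedup-index.json"
    day1 = [("http://example.com/sitelogo.jpg", b"logo-v1")]
    day2 = [
        ("http://example.com/sitelogo.jpg", b"logo-v1"),
        ("http://example.com/news.html", b"day-2 news"),
    ]
    dupes_day1 = dedup_with_index(day1, index_path)
    dupes_day2 = dedup_with_index(day2, index_path)
```

Because the index survives between the two calls, the second run recognises the unchanged logo without re-reading the first day's data, which is the case that matters when deduping across separate crawls.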