Closed YousufSSyed closed 1 year ago
Hi @YousufSSyed, de-duping WARCs after recording isn't currently supported in pywb. The extent of the dedup feature is documented in https://pywb.readthedocs.io/en/latest/manual/configuring.html#dedup-options-for-recording, which you've likely already seen.
Alright.
A few tools to look into if deduping WARCs after crawling is needed: https://nlnwa.github.io/warchaeology/ and https://github.com/arcalex/warcrefs
@ssairanen Thanks for the suggestions, though I found and am using this one I really like: https://github.com/tari/warcdedupe. And warcrefs hasn't been updated in 8 years.
None of these handles deduping between Warc files, right? I.e. deduping between several selective crawls at different occasions.
@petsva Pywb creates .warc.gz files. gzip files can be combined in the terminal like so: cat warc1.warc.gz warc2.warc.gz > warc3.warc.gz. Then you can dedup warc3.
That's true but that becomes unmanageable if you have lots of large warc files.
@petsva I don't know exactly how off the top of my head, but you could write a shell script to automate it. Perhaps cat all the WARC files into one, delete all the other WARCs, and then run warcdedupe on the combined one. Though if you do so, you'd want to run wb-manager reindex afterwards, and I have this issue opened: No such file or directory after deleting a WARC and reindexing.
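For anyone who wants to automate the concatenation step, here is a minimal sketch in Python rather than shell. It relies only on the fact that gzip streams can be joined byte for byte, which is what the cat command above does. The function name concat_warcs and the paths in the usage note are made up for illustration; it does not delete the source files and does not run any dedup tool.

```python
import shutil
from pathlib import Path


def concat_warcs(sources, target):
    """Byte-concatenate gzipped WARCs into one file, like `cat a b > c`.

    Hypothetical helper: it only copies bytes and leaves the sources in place.
    """
    target = Path(target)
    with target.open("wb") as out:
        for src in sources:
            with Path(src).open("rb") as part:
                # Stream each file so large WARCs are never held in memory.
                shutil.copyfileobj(part, out)
    return target
```

A call might look like concat_warcs(sorted(Path("archive").glob("*.warc.gz")), "combined/all.warc.gz"), where both paths are assumptions about your layout. Write the output outside the source directory so a later run does not pick up its own result.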
I haven't tried any of those yet, but warchaeology seems to have a "--keep-index" parameter, which I presume saves the index to disk so it can be used in future runs.
> None of these handles deduping between Warc files, right? I.e. deduping between several selective crawls at different occasions.
Warchaeology's warc dedup seems to support building and referring to cached indexes. I've not checked other tools.
EDIT Darn it @ssairanen beat me by two minutes! 😠😂
@ssairanen @anjackson I don't know much about WARCs actually. What's the benefit of this? Does it mean you wouldn't have to use the wb-manager reindex command?
@YousufSSyed:
Deduplication means that you take a bunch of WARCs, and if many WARC records contain the same response URI and content (checked with a hash algorithm such as SHA-1), there's no point keeping them all. You can replace the later ones with WARC revisit records that point to the earliest crawled version.
This is useful if you have a recurring crawl and the crawler visits the same site many times: there's no need for 100 copies of "sitelogo.jpg". One is enough, and the revisit records of later crawls point to it.
My possible scenario is hundreds of daily harvest WARCs with no deduplication. So I would like to build indexes of them and then generate deduplicated WARCs for each day, which point back to the earlier captures. Maybe that can be done with warchaeology; I should test.
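To make the idea concrete, here is a rough Python sketch of deduplication across daily batches. It does not read or write real WARC files: Record is a made-up stand-in for a WARC record, and dedup is a hypothetical function, not the API of warchaeology, warcdedupe, or pywb. In a real WARC the digest would normally come from the WARC-Payload-Digest header, and the revisit record would carry WARC-Refers-To-Target-URI and WARC-Refers-To-Date headers.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Record:
    """Simplified stand-in for a WARC record (illustration only)."""
    uri: str
    date: str                  # capture timestamp
    payload: bytes             # response body; empty for a revisit
    kind: str = "response"     # "response" or "revisit"
    refers_to_date: str = ""   # for revisits: date of the original capture


def dedup(records, index=None):
    """Replace repeated (URI, SHA-1) payloads with revisit records.

    `index` maps (uri, digest) to the date of the earliest capture seen.
    Passing the same dict to successive calls dedups across batches.
    """
    index = {} if index is None else index
    out = []
    for rec in records:
        if rec.kind != "response":
            out.append(rec)  # leave existing revisits untouched
            continue
        key = (rec.uri, hashlib.sha1(rec.payload).hexdigest())
        if key in index:
            # Same URI and content seen earlier: keep only a pointer to it.
            out.append(Record(rec.uri, rec.date, b"", "revisit", index[key]))
        else:
            index[key] = rec.date
            out.append(rec)
    return out, index


# Two daily harvests that both captured the same logo.
day1 = [Record("http://example.com/sitelogo.jpg", "2024-01-01T00:00:00Z", b"LOGO")]
day2 = [Record("http://example.com/sitelogo.jpg", "2024-01-02T00:00:00Z", b"LOGO")]
out1, index = dedup(day1)
out2, index = dedup(day2, index)
```

After the second call, the day-two record is a revisit that refers back to the day-one capture, which is the "points back" behaviour described above. Processing the batches in chronological order matters, because the first capture seen is the one that is kept in full.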
> Deduplication means that you take bunch of WARCs, and if there are many WARC records that contain same response URI and content (checked with some hash algo, like sha1)
@ssairanen @anjackson No, I meant the warc dedup tool's --keep-index argument and "support building and referring to cached indexes."
How about you re-read my question:
> @ssairanen @anjackson I don't know much about WARCs actually. What's the benefit of this? Does it mean you wouldn't have to use the wb-manager reindex command?
I read your question, and the short answer is: no. wb-manager reindex has nothing to do with deduplication or with the deduplication tool's --keep-index parameter. --keep-index stores an index of URI:HASH pairs for deduplication purposes, whereas wb-manager creates the CDX index that pywb uses to look up records in the WARCs.
The benefit is that you might need only half of the disk space for your WARCs.
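As an illustration of what "keeping" a URI:HASH index between runs could look like, here is a small Python sketch. The JSON file format and the names load_index, save_index, and seen_before are assumptions for this example only. This is not how warchaeology actually stores its index, and it is unrelated to the CDX index that wb-manager builds.

```python
import hashlib
import json
from pathlib import Path


def load_index(path):
    """Load a saved URI:HASH index, or start empty on the first run."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def save_index(index, path):
    """Write the index to disk so the next run can continue from it."""
    Path(path).write_text(json.dumps(index))


def seen_before(index, uri, payload, date):
    """Return the date of the earlier capture, or None if this one is new."""
    key = f"{uri} {hashlib.sha1(payload).hexdigest()}"
    if key in index:
        return index[key]
    index[key] = date  # remember the first capture of this content
    return None
```

The point of the sketch is that this index only answers "has this content been captured before?". A CDX index answers a different question, namely where a given capture sits in which WARC file, so it still has to be rebuilt after the WARCs change.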
I have dedup_policy: revisit in my config.yaml but I'd like to also dedup after pages have been recorded.