webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

wb-manager - Missing autoindex command - cannot autoindex a single collection? #619

Open jwest75674 opened 3 years ago

jwest75674 commented 3 years ago

Is your feature request related to a problem? Please describe.

I am working with filtered downloads of the Common Crawl dataset (~100TB, with plans to grow to ~200TB), so auto-indexing all collections appears unrealistic, per the experience noted in #541.

However, I am hoping to record (via proxy) while using my day-to-day machine, since I have the rest of the infrastructure set up.

I do not see a flag to auto-index only a single collection. I found a reference to wb-manager autoindex, which previously indexed a single collection and so sidestepped #541, and had hoped to use it, but it appears that this command no longer exists?

Describe the solution you'd like

Few options:
1.a) Bring back wb-manager autoindex
1.b) Allow pywb/wayback to autoindex a specific collection
2) OR Update the docs to make it more clear how to accomplish this with existing tools

Describe alternatives you've considered

Thank you for the great work, pywb is awesome!

rlskoeser commented 3 years ago

Would it work to use wb-manager reindex <collection>? I was looking at the docs and didn't see this option described, but found it in the wb-manager help output. It looks like you could also use wb-manager index and specify the arc/warc files you want indexed. Not an auto-index, but it would at least let you index the specific collections you're interested in.
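For anyone scripting this, here is a minimal Python sketch of the two commands mentioned above. It only builds and runs the command lines; the collection name `live-cache` and the WARC file names are hypothetical, and the argument order for `wb-manager index` (collection first, then files) is an assumption that is worth checking against `wb-manager index --help`.

```python
import subprocess
from typing import List, Sequence


def reindex_command(collection: str) -> List[str]:
    """Build the argv that re-indexes one collection."""
    return ["wb-manager", "reindex", collection]


def index_command(collection: str, warc_paths: Sequence[str]) -> List[str]:
    """Build the argv that indexes only the listed ARC/WARC files.

    Assumption: the collection name comes first, followed by the files.
    """
    if not warc_paths:
        raise ValueError("at least one ARC/WARC path is required")
    return ["wb-manager", "index", collection, *warc_paths]


def run(argv: Sequence[str]) -> None:
    """Run a wb-manager command; raises CalledProcessError on failure.

    Assumption: the current working directory is the one that holds
    the collections/ directory.
    """
    subprocess.run(list(argv), check=True)
```

Usage would look like `run(reindex_command("live-cache"))` for the whole collection, or `run(index_command("live-cache", ["new-capture.warc.gz"]))` for specific files. Calling the first on a timer for only the small collection would approximate a per-collection auto-index without touching the large one.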

ikreymer commented 3 years ago

Yes, @rlskoeser is correct, wb-manager reindex <collection> should work. Sorry if this is unclear in the docs. The autoindexing is not meant to be used with TBs of data! Can you explain more about the use case? You're trying to combine data from commoncrawl + data archived locally via recording into a single collection? If so, there may be a way to configure that as an aggregate (the fixed common crawl data + a quickly updating small collection) where both are searched at the same time.

jwest75674 commented 3 years ago

Great suggestions and feedback folks!

If I recall correctly from my original issue, my concern was that enabling auto-index would also index the large (200TB) collection, which I want to avoid. However, I still hope to use auto-index for a small and quickly updating collection.

The goal is to use this as a (really massive) request+response cache: always serve from the archive, and make a live request only when the archive has nothing. The hope was that the cache would grow as live requests were made, with each response indexed automatically as it is received, so that a second request in quick succession (10 seconds later, for example) would find the previous response, or fall back to the wider, static, local CommonCrawl collection.

My use case is the exploration and analysis of a really massive set of websites: I have a gargantuan list of (millions of) domain names, filtered down to a subset of interest. I want to avoid slamming the web with requests for anything already in CommonCrawl, and in future analysis of the same domains, avoid making any web requests at all, referencing the collection, again, like a cache.
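To make the lookup order concrete, here is a rough Python sketch of the cache behaviour described above. It is not pywb code: `fetch_with_fallback`, the lookup callables, and the `record` hook are all hypothetical stand-ins for "small recorded collection", "static CommonCrawl collection", "live request", and "record + index the new response".

```python
from typing import Callable, Optional, Sequence

# A lookup returns the archived body for a URL, or None if not archived.
Lookup = Callable[[str], Optional[bytes]]


def fetch_with_fallback(
    url: str,
    archives: Sequence[Lookup],
    fetch_live: Callable[[str], bytes],
    record: Callable[[str, bytes], None],
) -> bytes:
    """Serve from the first archive that has the URL, else go live.

    `archives` is searched in order, so list the small, quickly
    updating collection before the large static one.
    """
    for lookup in archives:
        body = lookup(url)
        if body is not None:
            return body
    # Nothing archived: make the live request and store the response
    # so that a repeat request is answered from the archive.
    body = fetch_live(url)
    record(url, body)
    return body
```

With plain dicts standing in for the two collections, `archives=[recorded.get, common_crawl.get]` and `record=recorded.__setitem__` reproduce the intended behaviour: a URL present in CommonCrawl never triggers a live request, and a URL fetched live once is served from the recorded collection afterwards.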