webrecorder / cdxj-indexer

CDXJ Indexing of WARC/ARCs
Apache License 2.0

Feature Requests / questions on use --> Pipe, Readme #7

Open jwest75674 opened 4 years ago

jwest75674 commented 4 years ago

A few feature requests and/or requests for help using cdxj-indexer! Also, my timing seems good based on the reply by @ikreymer in another issue; it looks like we're both coming back to our respective projects. Nothing like a global pandemic to make time for hobby projects. haha

One of the first things I tried was piping the output of a command to cdxj-indexer, but that simply does not work. What's the recommended way to feed it output from a command run in this fashion? (Forgive me if I'm missing something basic, I'm still learning.) While simple bash scripts are the most likely reason for piping to cdxj-indexer, I also have a gzip hardware accelerator (an FPGA with real-world throughput over 1 GB/s in either direction), which would work really well if I could run my-fast-funzip file.warc.gz | cdxj-indexer. I am working on a Python wrapper for my-fast-funzip, though, as this need keeps popping up.

As well, when looking at --help, I see some other flags that I'm having trouble finding documentation for, such as --compress and --lines. Is there a more complete README kicking about somewhere that I simply missed?

Lastly, multiprocessing would be a godsend. My machine's CPU threads are relatively slow, since it's an old server, but it's also a server with 48 cores / 96 threads. Generally speaking, I'm likely not the only one who will find their way here from working with CommonCrawl WARCs. I have ~40 TB of warc.gz data to work through, so the gzip FPGA plus multiprocessing would cut the time required for this step by a few orders of magnitude.

I'll likely work on a multiprocessing solution myself. In the past, I've handled multiprocessed writing to one file with the logging library. I believe the cdxj format is fine with an arbitrary line order, since I see sorting functionality here; is that correct? Unless someone volunteers to help a beginner clean up their code, I likely won't make a pull request.

TLDR:

  1. Is there a way to pipe to cdxj-indexer? If not, consider this a feature request.
  2. Multiprocessing support, for those of us with more WARC data than time to wait.

ikreymer commented 4 years ago

You can pipe to cdxj-indexer by passing - as the input filename, e.g.: cat ./my-warc.warc.gz | cdxj-indexer -
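
With the hardware decompressor you mentioned, a Python wrapper could wire the two together along these lines (untested sketch; my-fast-funzip is your hypothetical tool, and it assumes the index is written to stdout when no output file is given):

```python
import subprocess

# Sketch: "my-fast-funzip" is the hardware decompressor mentioned above;
# any command that writes plain WARC data to stdout works the same way.
unzip = subprocess.Popen(['my-fast-funzip', 'file.warc.gz'],
                         stdout=subprocess.PIPE)

# '-' tells cdxj-indexer to read the WARC stream from stdin.
with open('file.cdxj', 'w') as out:
    subprocess.run(['cdxj-indexer', '-'], stdin=unzip.stdout,
                   stdout=out, check=True)

unzip.stdout.close()
unzip.wait()
```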

The --compress and --lines options are designed to generate a compressed index, similar to the one used by CommonCrawl: it compresses every N lines and also produces an outer secondary index. Yes, the docs need a bit of updating!

If you are using CommonCrawl data, I believe the cdxj indices should also be available for download along with the WARCs, so you don't need to reindex the data.

The trick with parallel processing is finding the boundaries of the gzip records, which may not be too bad. I'll track this issue to update the README; I don't think I'll have time for parallel processing in the near future, unfortunately.
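
For reference, since WARC writers normally compress one record per gzip member, one way to find those boundaries is to walk the file member by member with zlib. This is just a sketch, not something cdxj-indexer does today:

```python
import zlib

def gzip_member_offsets(path, chunk_size=1 << 20):
    """Yield the starting byte offset of each gzip member in a multi-member
    .warc.gz. Most WARC writers put one record per member, so these offsets
    are usually also record boundaries (sketch only, not tuned for speed)."""
    with open(path, 'rb') as fh:
        offset = 0              # start of the member currently being decoded
        consumed = 0            # compressed bytes consumed by that member
        buf = b''               # leftover bytes belonging to the next member
        at_member_start = True
        decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)   # gzip framing
        while True:
            data = buf or fh.read(chunk_size)
            buf = b''
            if not data:
                return
            if at_member_start:
                yield offset
                at_member_start = False
            decomp.decompress(data)
            consumed += len(data) - len(decomp.unused_data)
            if decomp.eof:                       # finished one member
                offset += consumed
                consumed = 0
                buf = decomp.unused_data
                decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
                at_member_start = True
```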

jwest75674 commented 4 years ago

Thanks for the prompt reply and the pointer about piping; using - is obviously new to me.

With regard to CommonCrawl: while I am using that data, I am functionally only interested in website homepages. My project to date has revolved around sorting through the dataset to extract only homepages, and ideally nothing rated XXX (lol). As such, the existing indexes are of little value to me (which is why I'm here, looking to reindex).

I took a step back and realized that I am in a good position to multiprocess on a per-file basis, as my eventual files follow this format / naming convention, which I get the impression will make sense to you: CC-MAIN-2020-16_cdx-00040.warc.gz
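
Roughly what I have in mind (untested sketch; the paths and worker count are placeholders, and it assumes cdxj-indexer writes the index to stdout when no output option is given):

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def index_one(warc_path: Path) -> Path:
    """Index a single WARC into a matching .cdxj file next to it."""
    out_path = warc_path.with_name(warc_path.name.replace('.warc.gz', '.cdxj'))
    with open(out_path, 'w') as out:
        # Assumes the index goes to stdout when no output option is given.
        subprocess.run(['cdxj-indexer', str(warc_path)], stdout=out, check=True)
    return out_path

if __name__ == '__main__':
    warcs = sorted(Path('.').glob('CC-MAIN-*.warc.gz'))
    # The heavy lifting happens in the cdxj-indexer child processes, so the
    # pool just needs enough workers to keep all the cores busy.
    with ProcessPoolExecutor(max_workers=48) as pool:
        for done in pool.map(index_one, warcs):
            print('indexed', done)
```

If a single sorted index turns out to be better for playback, I assume the per-file outputs could be merged later with a plain text sort, since cdxj lines are keyed by SURT and timestamp.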

This may be off-topic for this issue; if so, I understand. At first glance, it seems most straightforward to output to matching per-file indexes, e.g. CC-MAIN-2020-16_cdx-00040.cdxj. Is this expected to negatively impact search time during playback? (Having many, many small indexes instead of a single, large, sorted index?)