webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

Faster indexing #541

Open traverseda opened 4 years ago

traverseda commented 4 years ago

Large files take a very long time to index. I'm toying with the idea of using WARC files instead of ZIM files for a kiwix-like offline Wikipedia. I can generate a full dump from a wiki XML backup in about 3 hours, but actually indexing those WARC files takes over a day; I'm not really sure how long it takes in total.

A multiprocessing approach could help here: split large files into N chunks, scan forward until you find the beginning of a WARC entry, then keep going until you hit the end of your segment.
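Something along these lines could work. This is only a rough sketch, assuming uncompressed WARCs (per-record gzipped .warc.gz files would need to scan for the gzip magic bytes instead of a WARC/ header line); the helper names are illustrative, not anything that exists in pywb.

import os
from multiprocessing import Pool

from warcio.archiveiterator import ArchiveIterator

def next_record_offset(fh, offset):
    # Find the first line at or after `offset` that starts with 'WARC/'.
    # This is a heuristic: record bodies that themselves contain lines
    # starting with 'WARC/' could fool it.
    if offset == 0:
        return 0
    fh.seek(offset - 1)  # back up one byte so a boundary exactly at
    fh.readline()        # `offset` isn't discarded with the partial line
    while True:
        pos = fh.tell()
        line = fh.readline()
        if not line:
            return None  # EOF: no record starts inside this chunk
        if line.startswith(b'WARC/'):
            return pos

def index_segment(path, start, end):
    # Index every record whose header begins inside [start, end).
    entries = []
    with open(path, 'rb') as fh:
        begin = next_record_offset(fh, start)
        if begin is None or begin >= end:
            return entries
        fh.seek(begin)
        it = ArchiveIterator(fh)
        for record in it:
            pos = begin + it.get_record_offset()  # absolute offset of record
            if pos >= end:
                break  # the next worker's segment owns this record
            if record.rec_type == 'response':
                entries.append((record.rec_headers.get_header('WARC-Target-URI'),
                                record.rec_headers.get_header('WARC-Date'),
                                pos))
    return entries

def parallel_index(path, workers=None):
    workers = workers or os.cpu_count() or 1
    size = os.path.getsize(path)
    step = -(-size // workers)  # ceil division: N roughly equal chunks
    tasks = [(path, i * step, min((i + 1) * step, size))
             for i in range(workers)]
    with Pool(workers) as pool:
        chunks = pool.starmap(index_segment, tasks)
    return [entry for chunk in chunks for entry in chunk]

if __name__ == '__main__':
    for entry in parallel_index('collections/wiki/archive/0.warc'):
        print(entry)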

I do wonder if we need a better database format though. Maybe use sqlite as our index?
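Purely as a hypothetical sketch (nothing like this exists in pywb today): a sqlite-backed CDXJ index would make inserts incremental and drop the global sort pass entirely, since the b-tree index keeps lookups at O(log n).

import json
import sqlite3

con = sqlite3.connect('index.db')
con.execute("""
    CREATE TABLE IF NOT EXISTS cdx (
        urlkey    TEXT NOT NULL,  -- SURT-canonicalized URL
        timestamp TEXT NOT NULL,  -- 14-digit capture timestamp
        info      TEXT NOT NULL   -- remaining CDXJ fields as JSON
    )
""")
# The index keeps reads sorted without ever sorting the whole dataset.
con.execute('CREATE INDEX IF NOT EXISTS cdx_key ON cdx (urlkey, timestamp)')

def add_entry(urlkey, timestamp, fields):
    con.execute('INSERT INTO cdx VALUES (?, ?, ?)',
                (urlkey, timestamp, json.dumps(fields)))

con.commit()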

traverseda commented 4 years ago

Looking at this further, write performance does not scale linearly with number of indexes. That is a huge problem for large files, and it means sufficiently large datasets will simply never finish.

Looks like it's going to need some pretty big changes...

I presume that has something to do with sorting the entries.

ikreymer commented 4 years ago

How big are the WARC files you're dealing with?

You can also try the warcio index operation, which might be faster.

The https://github.com/webrecorder/cdxj-indexer is an attempt to re-implement the indexer by extending the warcio indexing directly, and might be better.

The pywb indexer predates that and has a few other wrinkles for edge cases.

What do you mean by 'number of indexes'? Individual WARC files? It should be possible to run in parallel assuming you find the WARC boundary as you suggest.

I've also been thinking about non-plain-text-based indices. The ZipNum sharded index is one such attempt, and is what the IA Wayback Machine uses; pywb already supports this. In the past, such indices have been generated via Hadoop MapReduce jobs.

There is also OutbackCDX which is a RocksDB based index server compatible with pywb. Perhaps that will work for your use case?

traverseda commented 4 years ago

How big are the WARC files you're dealing with?

Working from enwiki-20200201-pages-articles-multistream.xml.bz2, I have 72GB of WARC files, split up into 9GB files (I have an 8-core CPU, so 8 files). I can process all of those in around an hour right now, but I'm not doing much more than reading the XML and splitting it into WARC files (no rendering HTML, etc.).

More importantly there are ~20 million pages.

What do you mean by 'number of indexes'?

Sorry, I meant "number of entries in an index". My tests start reading the file at ~150MB/s, but after 1.55GB that speed drops down to 35MB/s. I'm not sure what's causing that; presumably some kind of expensive lookup operation is keeping it from being O(n). It does not appear to be sorting-related.

ikreymer commented 4 years ago

If you want to time the indexing directly, you can run time cdx-indexer <WARC file> for the pywb indexer. The warcio indexer is, I believe, a bit faster, so there may be room for optimization.

The indexing is dependent on the number of entries, not just the total size, since an entry is created for each HTTP response.

For example, indexing 1GB of hundreds of small html pages will be slower than 1GB of a single video, which is mostly skipped over. The variability of the HTTP responses will definitely have an effect on the speed.
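As a sanity check on that, a quick standalone warcio snippet (using the file path from the commands in this thread) can count how many index entries a given WARC will produce:

from warcio.archiveiterator import ArchiveIterator

with open('collections/wiki/archive/0.warc', 'rb') as fh:
    # one index entry is created per response record
    n = sum(1 for rec in ArchiveIterator(fh) if rec.rec_type == 'response')
print(n, 'response records -> index entries')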

Hopefully you only need to do the indexing as part of a one-time conversion, right? Is the 72GB of WARCs taking hours? That doesn't seem right; it should be a few minutes.

traverseda commented 4 years ago

Huh, I used warcio to create these warc files, and I haven't gotten any errors, so I'd be surprised if the input I was feeding it was the problem.

I'm not getting anywhere near that speed though. I'm running time cdx-indexer ./0.cdxj ./collections/wiki/archive/0.warc and wb-manager index wiki collections/wiki/archive/1.warc.

I'll let you know how it goes.

traverseda commented 4 years ago

So it looks like unsorted CDX files return in a reasonable/expected amount of time.

time cdx-indexer ./0.cdxj ./collections/wiki/archive/0.warc
370.75user 7.78system 6:22.46elapsed 98%CPU (0avgtext+0avgdata 34036maxresident)k

But sorted CDX indexes take much longer

time cdx-indexer --sort ./0.cdx collections/wiki/archive/0.warc
2378.10user 13.55system 40:19.31elapsed 98%CPU (0avgtext+0avgdata 3143912maxresident)k

time wb-manager index wiki collections/wiki/archive/1.warc
2377.91user 14.92system 40:20.93elapsed 98%CPU (0avgtext+0avgdata 2318456maxresident)k

Sorting seems to take progressively longer the more items are in the index, and it doesn't take long before it becomes unusable.

Sorted, we're taking ~40 minutes to generate 2,491,023 records. Let's try doubling that by using 2 files; I'm presuming it will take much more than 80 minutes.

time cdx-indexer --sort ./0.cdx collections/wiki/archive/0.warc collections/wiki/archive/1.warc

traverseda commented 4 years ago

time cdx-indexer --sort ./0.cdx collections/wiki/archive/0.warc collections/wiki/archive/1.warc
7382.09user 33.70system 2:05:19elapsed 98%CPU (0avgtext+0avgdata 6278660maxresident)k

40 minutes for one file, 120 minutes for two files; that's much worse than linear scaling.

traverseda commented 4 years ago

If I get this fixed so it runs in linear-ish time in https://github.com/webrecorder/cdxj-indexer, is it reasonable for pywb to depend on it? I think the only thing cdxj-indexer is missing is sorting, and I know I can get that working in linear-ish time; the implementation in pywb is a bit challenging.

I'll probably also add multiprocessing to it, so it can utilize more CPU cores.

anjackson commented 4 years ago

Hmm, sorting is generally O(n log(n)) at best. So, unless we can pre-prepare the data, or unless something else is going on (e.g. this is actually arising from some I/O issue), we can't expect sorting to be linear time.

One way to see if there's much to be gained by changing the cdxj-indexer is to try using the normal UNIX sort command on the unsorted CDX files (see https://github.com/webrecorder/pywb/wiki/Pywb-Configuration#creating-cdx-from-warcs-or-arcs for details). If that's significantly faster, then we know there are gains to be had.

traverseda commented 4 years ago

I'd consider O(n log(n)) to be linear-ish ;p Python's timsort can even do better than that, on average, somewhere between O(n) and O(n log(n)). Since my input is already semi-sorted, I expect timsort to do a lot better than the worst case.

I think we're doing a lot worse than O(n log(n)) right now, though. I'm doing a very similar sort on this same dataset and it takes less than a minute. I can process the whole 19-million-item file into an ordered lmdb index and an accompanying WARC file in ~2.5 hours.

As compared to time cdx-indexer ./0.cdxj ./collections/wiki/archive/0.warc which takes 6 minutes 22 seconds, time cdx-indexer collections/wiki/archive/0.warc | sort > sorted.cdx takes 6 minutes 32 seconds.

So yeah, there's definitely some room for improvement ;p

We're currently sitting at O(n^2), I suspect? That should be a pretty simple fix in the cdxj-indexer command, since it's not even handling sorting yet. While I'm at it I might as well add some multiprocessing and tqdm for progress bars. I tried to figure out what exact part of the pywb indexer was slowing things down, but I could not. I suspect it has something to do with the use of insort on line 170 of indexer/cdxindexer.py, but I'm having a hard time untangling it. Insort has much worse expected average performance than just calling list.sort() once at the end, unless your sorting key is very expensive to calculate, since every insert into a sorted list has to shift the elements after it.
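To illustrate the difference, here is a standalone toy benchmark (not pywb code): keeping a plain Python list sorted with bisect.insort costs an O(n) element shift per insert, so n inserts are O(n^2) overall, while appending and sorting once at the end is O(n log n).

import random
import time
from bisect import insort

items = [random.random() for _ in range(100_000)]

# Strategy 1: keep the list sorted on every insert. Each insort is
# O(log n) to search but O(n) to shift, so n inserts are ~O(n^2) total.
start = time.perf_counter()
kept_sorted = []
for x in items:
    insort(kept_sorted, x)
print('insort per item:', time.perf_counter() - start)

# Strategy 2: append everything, sort once at the end: O(n log n), and
# timsort exploits any pre-existing order in the input.
start = time.perf_counter()
appended = list(items)
appended.sort()
print('sort at the end:', time.perf_counter() - start)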

I tried removing insort and it didn't seem to help, so that might not be the major problem, but the whole thing is pretty over-engineered, so it would be nice to be able to use the simpler implementation found at https://github.com/webrecorder/cdxj-indexer.

anjackson commented 4 years ago

Hah, fair enough! Definitely some room for improvement there... :-)

ikreymer commented 4 years ago

I was going to suggest just testing with unix sort, but I see you've already done that : ) Yes, the intent was for cdxj-indexer to replace the built-in indexer in pywb, but I haven't had time to fully do that, so feel free to work with it, and then maybe we can integrate it soon.

Yes, it was just using bisect.insort, but I suspect that may not be working as intended for whatever reason. Worst case, we can fall back on unix sort. Sorting was one of the things on the todo list for cdxj-indexer, so that would definitely help move things along on switching to it!

Thanks for looking at this!

traverseda commented 4 years ago

So after playing around with the much-simplified implementation in https://github.com/webrecorder/cdxj-indexer, I'm very confident that the problem exists in the underlying warcio library, not in the pywb wrapping layers.

traverseda commented 4 years ago

I don't know, I'll have to look at it when I've got more time.

traverseda commented 4 years ago

So the base warcio does not have this issue, but cdxj-indexer does, regardless of whether sorting is used.

(Attached: profiling flame graphs, warcio.svg and cdxjindexer.svg.)
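For reference, the bare-warcio half of that comparison can be reproduced with the standard-library profiler along these lines (the path is from the earlier commands; the cdxj-indexer side can be run under the profiler the same way):

import cProfile
from warcio.archiveiterator import ArchiveIterator

def bare_warcio(path):
    # consume every record the way an indexer would
    with open(path, 'rb') as fh:
        for record in ArchiveIterator(fh):
            record.content_stream().read()

cProfile.run("bare_warcio('collections/wiki/archive/0.warc')", sort='cumulative')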