traverseda opened this issue 4 years ago
Looking at this further, write performance does not scale linearly with the number of indexes. That is a huge problem for large files, and means that sufficiently large datasets will never finish indexing.
Looks like it's going to need some pretty big changes...
I presume that has something to do with sorting the entries.
How big are the WARC files you're dealing with?
You can also try the warcio index operation, which might be faster.
The https://github.com/webrecorder/cdxj-indexer is an attempt to re-implement the indexer by extending the warcio indexing directly, and might be better.
The pywb indexer predates that and has a few other wrinkles for edge cases.
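For context, here is a rough sketch of what an indexing pass built directly on warcio's ArchiveIterator looks like (the output fields and the final sort are illustrative; the real indexers also canonicalize URLs and emit full CDXJ lines):

    from warcio.archiveiterator import ArchiveIterator

    def index_warc(warc_path):
        # One pass over the WARC; only HTTP response records become index entries.
        with open(warc_path, 'rb') as stream:
            it = ArchiveIterator(stream)
            for record in it:
                if record.rec_type != 'response':
                    continue
                url = record.rec_headers.get_header('WARC-Target-URI')
                timestamp = record.rec_headers.get_header('WARC-Date')
                yield url, timestamp, it.get_record_offset()

    # Collect everything, sort once at the end, then write out.
    entries = sorted(index_warc('collections/wiki/archive/0.warc'))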
What do you mean by 'number of indexes'? Individual WARC files? It should be possible to run in parallel assuming you find the WARC boundary as you suggest.
I've also been thinking about non-plain-text based indices. The ZipNum shared index is one such attempt, and is what the IA Wayback Machine uses; pywb already supports this. In the past, such indices have been generated via Hadoop MapReduce jobs.
There is also OutbackCDX which is a RocksDB based index server compatible with pywb. Perhaps that will work for your use case?
How big are the WARC files you're dealing with?
Working from enwiki-20200201-pages-articles-multistream.xml.bz2
I have 72GB of warc files, split up into 9GB files (I have an 8-core CPU, so 8 files). I can process all those in around an hour right now, but I'm not doing much more than reading the XML and splitting it into warc files (no rendering html, etc).
More importantly there are ~20 million pages.
What do you mean by 'number of indexes'?
Sorry, I meant "number of entries in an index". My tests start reading the file at ~150MB/s, but after 1.55GB that speed drops down to 35MB/s. Not sure what's causing that; presumably some kind of expensive lookup operation is keeping it from being O(n). It does not appear to be sorting related.
If you want to time the indexing directly, you can run time cdx-indexer <WARC file> for the pywb indexer. The warcio indexer is a bit faster, I believe, so there may be room for optimization.
The indexing is dependent on the number of entries, not just the total size, since an entry is created for each HTTP response.
For example, indexing 1GB of hundreds of small html pages will be slower than 1GB of a single video, which is mostly skipped over. The variability of the HTTP responses will definitely have an effect on the speed.
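To make that concrete, a quick way to see how many index entries a given WARC will produce is a counting pass like the following (a small sketch using warcio, with the record types as I understand them):

    from warcio.archiveiterator import ArchiveIterator

    def count_index_entries(warc_path):
        # Each response (and revisit) record becomes one index entry, so this
        # count, rather than the file size, is what drives indexing time.
        count = 0
        with open(warc_path, 'rb') as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type in ('response', 'revisit'):
                    count += 1
        return count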
Hopefully you only need to do the indexing as part of a one-time conversion, right? Is the 72GB of WARCs taking hours? That doesn't seem right; it should be a few minutes.
Huh, I used warcio to create these warc files, and I haven't gotten any errors, so I'd be surprised if the input I was feeding it was the problem.
I'm not getting anywhere near that speed though. I'm running time cdx-indexer ./0.cdxj ./collections/wiki/archive/0.warc and wb-manager index wiki collections/wiki/archive/1.warc.
I'll let you know how it goes.
So it looks like unsorted CDX files are generated in a reasonable/expected amount of time:
time cdx-indexer ./0.cdxj ./collections/wiki/archive/0.warc
370.75user 7.78system 6:22.46elapsed 98%CPU (0avgtext+0avgdata 34036maxresident)k
But sorted CDX indexes take much longer:
time cdx-indexer --sort ./0.cdx collections/wiki/archive/0.warc
2378.10user 13.55system 40:19.31elapsed 98%CPU (0avgtext+0avgdata 3143912maxresident)k
time wb-manager index wiki collections/wiki/archive/1.warc
2377.91user 14.92system 40:20.93elapsed 98%CPU (0avgtext+0avgdata 2318456maxresident)k
Sorting seems to get progressively slower the more items are in the index, and it doesn't take very long before it becomes unusable.
Sorted, we're taking ~40 minutes to generate 2,491,023 records. Let's try doubling that by using 2 files; I'm presuming it will take much more than 80 minutes.
time cdx-indexer --sort ./0.cdx collections/wiki/archive/0.warc collections/wiki/archive/1.warc
7382.09user 33.70system 2:05:19elapsed 98%CPU (0avgtext+0avgdata 6278660maxresident)k
40 minutes for one file, 120 minutes for two files.
If I get this fixed so it runs in linear-ish time in https://github.com/webrecorder/cdxj-indexer, is it reasonable for pywb to depend on it? I think the only thing cdxj-indexer is missing is sorting, and I know I can get that working in linear-ish time; the implementation in pywb is a bit challenging.
I'll probably also add multiprocessing to it, so it can utilize more CPU cores.
Hmm, sorting is generally O(n log(n)) at best. So, unless we can pre-prepare the data, or unless something else is going on (e.g. this is actually arising from some I/O issue), we can't expect sorting to be linear time.
One way to see if there's much to be gained by changing the cdxj-indexer is to try using the normal UNIX sort command on the unsorted CDX files (see https://github.com/webrecorder/pywb/wiki/Pywb-Configuration#creating-cdx-from-warcs-or-arcs for details). If that's significantly faster then we know there are gains to be had.
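For what it's worth, UNIX sort scales by doing an external merge sort: it sorts chunks that fit in memory and then merges the sorted runs. If cdxj-indexer ever wants the same behaviour without shelling out, a rough Python sketch of that chunk-and-merge pattern (names and chunk size are arbitrary) could look like this:

    import heapq
    import tempfile

    def _write_sorted_chunk(chunk):
        chunk.sort()
        tmp = tempfile.NamedTemporaryFile('w', suffix='.cdx', delete=False)
        tmp.writelines(chunk)   # lines are expected to include their newlines
        tmp.close()
        return tmp.name

    def external_sort(lines, chunk_size=1_000_000):
        # Sort fixed-size chunks in memory, spill them to disk, then stream
        # the sorted runs back through heapq.merge -- memory use stays flat.
        # (Caller is responsible for cleaning up the run files afterwards.)
        run_files = []
        chunk = []
        for line in lines:
            chunk.append(line)
            if len(chunk) >= chunk_size:
                run_files.append(_write_sorted_chunk(chunk))
                chunk = []
        if chunk:
            run_files.append(_write_sorted_chunk(chunk))
        return heapq.merge(*(open(name) for name in run_files))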
I'd consider O(n log(n)) to be linear-ish ;p Python's timsort can even do better than that, on average somewhere between O(n) and O(n log(n)). Since my input is already semi-sorted, I expect timsort to do a lot better than the worst case.
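The adaptive behaviour is easy to see with a quick, purely illustrative timing script: list.sort() on an almost-sorted list is far faster than on a shuffled list of the same size.

    import random
    import time

    N = 2_500_000   # roughly the per-file record count mentioned above

    nearly_sorted = list(range(N))
    for _ in range(1000):           # sprinkle in a few out-of-order items
        i, j = random.randrange(N), random.randrange(N)
        nearly_sorted[i], nearly_sorted[j] = nearly_sorted[j], nearly_sorted[i]

    shuffled = list(range(N))
    random.shuffle(shuffled)

    for name, data in (('nearly sorted', nearly_sorted), ('shuffled', shuffled)):
        copy = list(data)
        start = time.perf_counter()
        copy.sort()
        print(f'{name}: {time.perf_counter() - start:.3f}s')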
I think we're doing a lot worse than O(n log(n)) right now though. I'm doing a very similar sort on this same data-set and it takes less than a minute. I can process the whole 19-million-item file into an ordered lmdb index and an accompanying warc file in ~2.5 hours.
As compared to time cdx-indexer ./0.cdxj ./collections/wiki/archive/0.warc, which takes 6 minutes 22 seconds, time cdx-indexer collections/wiki/archive/0.warc | sort > sorted.cdx takes 6 minutes 32 seconds.
So yeah, there's definitely some room for improvement ;p
We're currently sitting at O(n^2), I suspect? That should be a pretty simple fix in the cdxj-indexer command, since it's not even handling sort. While I'm at it I might as well add some multiprocessing and tqdm for progress bars. I tried to figure out what exact part of the pywb indexer was slowing it down, but I could not. I suspect it has something to do with the use of insort on line 170 of indexer/cdxindexer.py, but I'm having a hard time untangling it. insort has a much worse expected average performance than just calling list.sort(), unless your sorting key is very expensive to calculate.
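Regardless of whether insort is the actual culprit here, the asymptotic gap is easy to demonstrate: bisect.insort does an O(log n) search but an O(n) list insertion per item, so building a sorted list that way is O(n^2) overall, while appending everything and sorting once is O(n log n). A quick illustrative comparison:

    import bisect
    import random
    import time

    items = [random.random() for _ in range(100_000)]

    start = time.perf_counter()
    via_insort = []
    for x in items:
        bisect.insort(via_insort, x)   # O(n) memmove per insert -> O(n^2) total
    print(f'insort:      {time.perf_counter() - start:.2f}s')

    start = time.perf_counter()
    via_sort = list(items)
    via_sort.sort()                    # one O(n log n) sort at the end
    print(f'append+sort: {time.perf_counter() - start:.2f}s')

    assert via_insort == via_sort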
I tried removing insort and it didn't seem to help, so I think that might not be the major problem, but the whole thing is pretty over-engineered, so it would be nice to be able to use the simpler implementation found at https://github.com/webrecorder/cdxj-indexer
Hah, fair enough! Definitely some room for improvement there... :-)
I was going to suggest just testing with unix sort; I see you've already done that : )
Yes, the intent was for cdxj-indexer to replace the built-in indexer in pywb, but I haven't had time to fully do that, so feel free to work with that, and then maybe we can integrate it soon.
Yes, it was just using bisect.insort, but I suspect that may not be working as intended for whatever reason... worst case, we can fall back on unix sort. Sorting was one of the things on the todo list for cdxj-indexer, so that would definitely help move things along on switching to it!
Thanks for looking at this!
So after playing around with the much simplified implementation in https://github.com/webrecorder/cdxj-indexer I'm very confident that the problem exists in the underlying warcio library, not in the pywb wrapping layers.
I don't know, I'll have to look at it when I've got more time.
So the base warcio does not have this issue, but cdxj-indexer does, regardless of the use of sort.
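One way to narrow down which layer the time goes into is to profile a bare warcio pass over the same file and compare it against a profiled cdxj-indexer run on that file. A minimal sketch of the warcio side (the path is just the example used above):

    import cProfile
    import pstats

    from warcio.archiveiterator import ArchiveIterator

    def walk_records(path):
        # Bare warcio pass: iterate every record, produce no index output.
        with open(path, 'rb') as stream:
            for record in ArchiveIterator(stream):
                pass

    profiler = cProfile.Profile()
    profiler.enable()
    walk_records('collections/wiki/archive/0.warc')
    profiler.disable()
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(25)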
Large files take a very long time to index. I'm toying with the idea of using warc files instead of ZIM files for a kiwix-like offline wikipedia. I can generate a full dump from a wiki xml backup in about 3 hours, but actually indexing those warc files takes over a day; I'm not really sure how long it takes in total.
A multiprocessing approach can help here: split large files into N chunks, iterate until you find the beginning of a WARC entry, then keep going until you hit the end of your segment.
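A rough sketch of that chunking idea for uncompressed WARCs, assuming each worker seeks to its byte range, scans forward to the next record header, and stops once it passes its segment boundary (for .warc.gz you would look for gzip member boundaries instead; the offset arithmetic also assumes warcio reports record offsets relative to where iteration started):

    import os
    from multiprocessing import Pool

    from warcio.archiveiterator import ArchiveIterator

    RECORD_MARKER = b'WARC/1.0\r\n'   # start of every uncompressed WARC record

    def find_record_start(fh, offset):
        # Seek to `offset`, then scan forward to the next record boundary.
        fh.seek(offset)
        window = b''
        while True:
            chunk = fh.read(65536)
            if not chunk:
                return None
            window += chunk
            pos = window.find(RECORD_MARKER)
            if pos != -1:
                return offset + pos
            # keep a short tail in case the marker straddles a read boundary
            offset += len(window) - len(RECORD_MARKER)
            window = window[-len(RECORD_MARKER):]

    def index_segment(args):
        warc_path, start, end = args
        entries = []
        with open(warc_path, 'rb') as fh:
            begin = find_record_start(fh, start)
            if begin is None:
                return entries
            fh.seek(begin)
            it = ArchiveIterator(fh)
            for record in it:
                if begin + it.get_record_offset() >= end:   # past our segment
                    break
                if record.rec_type == 'response':
                    entries.append((record.rec_headers.get_header('WARC-Target-URI'),
                                    record.rec_headers.get_header('WARC-Date')))
        return entries

    if __name__ == '__main__':
        path = 'collections/wiki/archive/0.warc'   # example path
        size = os.path.getsize(path)
        workers = 8
        segments = [(path, i * size // workers, (i + 1) * size // workers)
                    for i in range(workers)]
        with Pool(workers) as pool:
            results = pool.map(index_segment, segments)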
I do wonder if we need a better database format though. Maybe use sqlite as our index?
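On the sqlite idea, the core would just be a table keyed the same way the CDXJ file is sorted, so lookups become indexed range scans instead of binary search over a flat sorted file, and inserts no longer require keeping the whole index sorted in memory. A minimal sketch (schema and fields are only illustrative):

    import json
    import sqlite3

    conn = sqlite3.connect('index.sqlite')
    conn.execute("""
        CREATE TABLE IF NOT EXISTS cdx (
            urlkey    TEXT NOT NULL,   -- canonicalized (SURT) URL
            timestamp TEXT NOT NULL,   -- capture timestamp
            fields    TEXT NOT NULL    -- remaining CDXJ fields as JSON
        )
    """)
    # The same (urlkey, timestamp) ordering the flat CDXJ relies on,
    # but maintained by the database instead of by sorting a text file.
    conn.execute("CREATE INDEX IF NOT EXISTS cdx_key ON cdx (urlkey, timestamp)")

    def add_entry(urlkey, timestamp, fields):
        conn.execute("INSERT INTO cdx VALUES (?, ?, ?)",
                     (urlkey, timestamp, json.dumps(fields)))

    def lookup(urlkey):
        return conn.execute(
            "SELECT timestamp, fields FROM cdx WHERE urlkey = ? ORDER BY timestamp",
            (urlkey,)).fetchall()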