Using a Bucket instead of a BTree can be nearly 50% faster on MVCC _vacuum

zodb / relstorage

A backend for ZODB that stores pickles in a relational database.

Other

53 stars 46 forks source link

(According to variations on the included microbenchmark; I'm not sure how realistic this is compared to #474, but I did see some long-ish vacuum times of 1-4s, and it tests two different extremes.)

This is because the major bottleneck is the keys() method, which is a BTree multiunion. That function is at its fastest given a Bucket or Set.

I'm not sure what unintended impacts this change might have. Probably none? Bucket has very different performance charecteristics as it gets larger, but ZODB's fsIndex claims that's not an issue with a 65K entry bucket (at least as long as data is inserted ordered --- I couldn't detect a difference with buckets of size 250). Most buckets we use are probably on the smaller end of that scale, except the initial one we restore from cache.

I'm still contemplating other data structure or algorithm improvements, this may not move forward.

So this obviously took a different direction, as I ended up moving some of the low-level data structures into C++ via Cython. This results in the cache microbenchmarks being at least 50% faster, up to 40 times faster, depending on the data. I'm now contemplating doing some bigger algorithm changes to somewhat ameliorate that data-dependency, because this is as far as this one can go.

I ran the full zodbshootout benchmark suite with all three types of concurrency (gevent, thread, process). There were no substantial differences. But those tests weren't designed to stress out the cache in this way. Probably with a lot more objects or in-process concurrency than I was usefully able to achieve there might be a difference.

zodb / relstorage

Using a Bucket instead of a BTree can be nearly 50% faster on MVCC _vacuum #479