zodb / relstorage

A backend for ZODB that stores pickles in a relational database.

Rewrite the cache using Cython on top of C++ #358

Closed jamadden closed 4 years ago

jamadden commented 4 years ago

This should fix #186: Instead of 328 bytes of bookkeeping data for each entry, we should be down to something like 88 bytes for single values and 72+ for multiple values. (This is not counting the map, but I didn't count that before.)

Other upsides:

| Benchmark | Master | This branch |
| --- | --- | --- |
| pop_eq | 1.47 sec | 498 ms: 2.95x faster (-66%) |
| pop_ne | 1.64 sec | 588 ms: 2.79x faster (-64%) |
| epop | 1.25 sec | 497 ms: 2.51x faster (-60%) |
| read | 345 ms | 182 ms: 1.90x faster (-47%) |
| mix | 1.63 sec | 599 ms: 2.72x faster (-63%) |

Downsides:

Before merging:

jamadden commented 4 years ago

I've spent more time profiling and testing this. The current approach uses the standard library's doubly-linked list to manage the LRU order.

I can save one pointer per element if I roll a doubly-linked list manually and embed the links in the nodes (similar to what the old C cache code did). That turned out to be a nightmare to get correct.
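For illustration, here is a minimal sketch of that embedded-links approach; the names (`Entry`, `LRUList`) are hypothetical, not relstorage's actual types. All the fiddly invariants live in `unlink`, which is where this style gets error-prone:

```cpp
#include <cstdint>

// Hypothetical sketch of the "links embedded in the node" design: the
// prev/next pointers live inside the cache entry itself, so the LRU
// list needs no separate node allocations (only 2 pointers of overhead).
struct Entry {
    std::uint64_t oid;
    Entry* prev = nullptr;
    Entry* next = nullptr;
};

class LRUList {
    Entry* head = nullptr;  // most recently used
    Entry* tail = nullptr;  // least recently used
public:
    void unlink(Entry* e) {
        // Every branch here is a chance to corrupt the list.
        if (e->prev) e->prev->next = e->next; else head = e->next;
        if (e->next) e->next->prev = e->prev; else tail = e->prev;
        e->prev = e->next = nullptr;
    }
    void push_front(Entry* e) {
        e->prev = nullptr;
        e->next = head;
        if (head) head->prev = e; else tail = e;
        head = e;
    }
    // A cache hit is move-to-front: unlink + push_front, both O(1).
    void on_hit(Entry* e) { unlink(e); push_front(e); }
};
```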

I could probably also save one pointer per element if I let the library containers allocate the nodes, but that would lose the ability to do the big bulk allocations.

Lastly, I can also save the equivalent of one pointer by using a deque of OIDs to hold the LRU order. That's not guaranteed O(1) (I have to do maintenance from time to time), but it should amortize out to that. We'd need an index in each node as well as the deque slot, so instead of three pointers per node it comes out to two.
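One plausible reading of that deque scheme (this is my reconstruction, not the actual code): each access appends the OID to the back and tombstones its old slot, and a periodic compaction sweep supplies the amortized O(1) behavior. A map stands in here for the per-node index field:

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

// Speculative reconstruction of the deque-of-OIDs LRU, not relstorage's
// actual code. A touch appends the OID and tombstones its old slot; a
// compaction sweep runs once half the deque is dead, so the cost
// amortizes to O(1) per touch.
class DequeLRU {
    static constexpr std::uint64_t TOMBSTONE = ~std::uint64_t(0);
    std::deque<std::uint64_t> order;  // front = least recently used
    // In the real design this index would be a field in each cache node;
    // a map keeps the sketch self-contained.
    std::unordered_map<std::uint64_t, std::size_t> slot;
    std::size_t dead = 0;

public:
    void touch(std::uint64_t oid) {
        auto it = slot.find(oid);
        if (it != slot.end()) {           // already present: kill old slot
            order[it->second] = TOMBSTONE;
            ++dead;
        }
        order.push_back(oid);
        slot[oid] = order.size() - 1;
        if (dead * 2 > order.size())      // periodic maintenance
            compact();
    }

private:
    void compact() {
        std::deque<std::uint64_t> live;
        for (std::uint64_t oid : order) {
            if (oid == TOMBSTONE) continue;
            slot[oid] = live.size();
            live.push_back(oid);
        }
        order.swap(live);
        dead = 0;
    }
};
```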

None of those had compelling advantages AFAICS. I think we're in a decent sweet spot here. And performance has only improved:

| Benchmark | Master | This branch (early) | This branch (now) |
| --- | --- | --- | --- |
| pop_eq | 1.47 sec | 498 ms: 2.95x faster (-66%) | 353 ms |
| pop_ne | 1.64 sec | 588 ms: 2.79x faster (-64%) | 551 ms |
| epop | 1.25 sec | 497 ms: 2.51x faster (-60%) | 372 ms |
| read | 345 ms | 182 ms: 1.90x faster (-47%) | 91.1 ms |
| mix | 1.63 sec | 599 ms: 2.72x faster (-63%) | 522 ms |

jamadden commented 4 years ago

After writing my own containers to get the minimal size improvements, it occurred to me that Boost might already have some... and sure enough it does: a whole "intrusive" library.

The cache has two indices of objects, one for LRU order and one for lookup. With intrusive containers that's five pointers, all of which would be embedded in the object itself. And that would be the total of the overhead.

With the std containers, the map needs three pointers per entry, plus a separate copy of the (pointer-sized) key, for a total of four pointers. The other index needs three pointers, plus the iterator object embedded in the object, which I think is another two pointers, adding up to five. In total that's nine pointers, four of which can go away, saving us 32 bytes per object (if I've done the math correctly). (And I think there's some other overhead in the unordered_map I'm not accounting for.)

Only downside I see is that we'd be switching from the hash-based lookup to a tree-based one: amortized O(1) to something logarithmic.
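A minimal sketch of what that layout looks like with Boost.Intrusive (the `Entry` type is hypothetical, not the actual relstorage class): the list hook embeds two pointers and the rb-tree set hook three, which is the five pointers of total overhead described above.

```cpp
#include <cstdint>
#include <boost/intrusive/list.hpp>
#include <boost/intrusive/set.hpp>

namespace bi = boost::intrusive;

// Hypothetical entry: both indices are hooks embedded in the object, so
// the containers never allocate nodes of their own.
struct Entry : bi::list_base_hook<>,   // LRU order: 2 pointers
               bi::set_base_hook<> {   // OID lookup (rb-tree): 3 pointers
    std::uint64_t oid;
    explicit Entry(std::uint64_t o) : oid(o) {}
    // The set is ordered by OID, hence the logarithmic lookup noted above.
    friend bool operator<(const Entry& a, const Entry& b) {
        return a.oid < b.oid;
    }
};

using LRUList = bi::list<Entry>;   // holds references, doesn't own
using Lookup  = bi::set<Entry>;

// Usage sketch: one allocation per entry, then just link it into both.
//   Entry* e = new Entry(42);
//   lru.push_front(*e);
//   index.insert(*e);
// (Entries must be unlinked from both containers before being freed.)
```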

It shouldn't be hard to try out. It seems like it's worth doing.