[tcmalloc] O(n) address-ordered best-fit over PageHeap::large_ becomes major scalability bottleneck on fragmented heap

GoogleCodeExporter commented 9 years ago

We run tcmalloc on our very high throughput MySQL master. Join buffers > 
kMaxPages are allocated by a large percentage of client threads. Over time, as 
fragmentation increases, the large_.returned list grows to contain thousands of 
Spans, which makes PageHeap::AllocLarge extremely slow and the pageheap_lock 
becomes a major source of contention, eventually slowing down MySQL 
considerably.

I have written a patch which maintains a doubly linked skiplist to calculate 
address-ordered best-fit in amortized O(log(n)) instead of O(n) when the 
combined size of the large_ normal and returned lists exceed a configurable 
threshold. We're running it in production now and according to perf top, the 
percentage of time spent in PageHeap::AllocLarge has gone from ~8% overall to 
~0.4%.

However, while the time complexity of the skiplist is favorable, its space 
complexity (each node is 168 bytes) is obviously substantially less favorable 
than something like a red/black tree. I haven't had a chance to do my own 
measurements yet, but it seems from everything I've read that allocator 
metadata size has been shown to have a substantial impact on CPU cache 
efficiency, so while this patch may lessen a scalability bottleneck, it may 
have negative performance effects. I'm up for implementing a different data 
structure for this if that has a better chance of being accepted.

I've also written a small test program that attempts to create a sizeable 
large_ list, and then times 100000 allocations and frees from the large list. 
It's far from perfect, but it's something.

Patch: https://github.com/jamesgolick/gperftools/compare/master...skiplist-trunk
Test program: https://gist.github.com/jamesgolick/cffa78146af264c191fb

Original issue reported on code.google.com by jamesgol...@gmail.com on 21 May 2013 at 11:38

GoogleCodeExporter commented 9 years ago

Second bug appears to be caused by unsignedness of i and countdown loop.

Original comment by alkondratenko on 21 Jul 2014 at 2:44

GoogleCodeExporter commented 9 years ago

Yeah, you're right. This is running consistently with no crashes now for me. 
What are the next steps?

Original comment by jamesgol...@gmail.com on 21 Jul 2014 at 3:12

GoogleCodeExporter commented 9 years ago

The code needs cleanup. I'm seeing:

* Trailing whitespace

* spurious whitespace-only changes (span.h)

* unused fields (large_lists_size, kLargeLLRBThreshold) and methods 
(InitializeLargeLLRB, LLRB::Include)

There are seemingly larger problems as well:

* leaking of rbn_right_red and friends to llrb.h (and details of RB_COMPACT)

* LLRB::GetBestFit looks wrong. Consider a case where best node is somewhere in 
the middle of the tree. With both left and right pointers non-nil. My 
understanding is that rb.h API is able to deal with "find least node that's 
larger than given value". I used it in minimalloc (although my code is based on 
rbtree from libbsd).

* it seems best if we embed LLRBNode into Span. For performance and memory 
usage and etc.

And in addition to that we really need some at least basic unit test that 
covers this new code. I'm sure it must be possible to write some unit-test that 
only deals with LLRB and doesn't require much of rest of code.

And thanks for working on this. It is really helping massively.

Original comment by alkondratenko on 22 Jul 2014 at 5:20

GoogleCodeExporter commented 9 years ago

Ah. Forgot another important downside:

* old code was searching for match among normal spans first and used returned 
spans after that. Your code is not keeping that important behavior.

Original comment by alkondratenko on 22 Jul 2014 at 5:42

GoogleCodeExporter commented 9 years ago

I don't see any trailing whitespace over here. Can you point it out?

I fixed the unused fields and functions.

I think get best fit is actually right because it has the second condition on 
the leftward movement regarding page size. I'll write some tests to try to 
prove that. There is a function in rb.h that will do something *similar* but 
not quite the same. I could add a function to rb.h to do what we want if you'd 
prefer it there.

Regarding embedding llrbnode in to span, I'm not sure. It actually seems to me 
like having the tree nodes packed together might be faster?

Regarding the old code matching normal spans first, unless I'm misreading it I 
don't think that's what it actually does? Seems like it prefers the left-most 
address over a span off of the normal list. If you want to prefer spans from 
the normal list, we can make 2 trees. Or change the algorithm a little.

Original comment by jamesgol...@gmail.com on 22 Jul 2014 at 2:02

GoogleCodeExporter commented 9 years ago

>> I don't see any trailing whitespace over here. Can you point it out?

GetBestFit has few trailing tabs on first if line

>> Regarding embedding llrbnode in to span, I'm not sure. It actually seems to 
me like having the tree nodes packed together might be faster?

Probably not. Particularly during tree traversal you have to touch span fields 
(size & addr) anyways. So you're just adding another pointer and extra memory 
or cache access latency.

>> Regarding the old code matching normal spans first, unless I'm misreading it 
I don't think that's what it actually does?

You're right due to address ordered best fit it's not aggressive enough. We 
definitely should be doing something more aggressive there because actually 
preferring normal span to returned space is the right thing to do. The simplest 
thing we can do is to tweak comparison so that it looks at size first, then 
normal/returned, then address. But I'd also consider something even stronger. 
I.e. preferring normal but slightly larger than best returned span.

And some more thoughts.

* Ideally we'd find some way to avoid exposing rb.h into headers. Or maybe we 
could expose just smallest part of it. If we don't embed LLRBNode into span we 
can actually do it right now (llrb.h don't really need full definition of this 
struct). But I think embedding is the right thing to do so we might want to 
split part of rb.h that defines structures to include into headers.

* GetBestFit is certainly not right. Consider another case. You're walking to 
the left and find that your current node's left child is smaller than size you 
need. And you stop right there, ignoring possibility of right subtree of that 
child having better fit.

* GetBestFit apparently accesses rbn_right_red incorrectly. Without taking into 
account redness bit stored in it's least bits.

* I'm slightly concerned about couple cases of pattern like this: walk the tree 
to find something, then walk the tree again to delete it. But this is probably 
not big deal.

* rb.h lacks copyright header. Should be easy to fix.

Original comment by alkondratenko on 22 Jul 2014 at 5:13

GoogleCodeExporter commented 9 years ago

Hi:

My best friend, James Golick, passed away unexpectedly nearly 2 weeks ago. He 
and I discussed this patch at length and I would like to pick up the work where 
he left off, as I know how much this meant to him to have this merged.

As such I have:

* Added a license blurb and copyright attribution to rb.h (I contacted the 
original author to verify).
* Split out structure definitions from rb.h into rb_defs.h. I used the 
copyright attribution of the original rb.h, as I didn't really do anything 
except copy code around.

You can find these changes here: 
https://github.com/ice799/gperftools/compare/googlecode-master...llrb

From reading this thread, it seems the remaining pieces are:

* Some sort of unit test for this.
* Some issues with GetBestFit.

He had removed the spurious whitespace and unused fields and methods himself.

Please let me know if my understanding of what is remaining to have this merged 
is correct.

Thanks.

Original comment by ice...@gmail.com on 9 Jan 2015 at 4:23

GoogleCodeExporter commented 9 years ago

I'm very sorry to hear that.

Indeed those two issues remain. On GetBestFit we need it to find least-address 
chunk that is larger than allocation request. It is doable efficiently in 
log(n) time.

Original comment by alkondratenko on 10 Jan 2015 at 7:27

xfun68 / gperftools

[tcmalloc] O(n) address-ordered best-fit over PageHeap::large_ becomes major scalability bottleneck on fragmented heap #532