[tcmalloc] O(n) address-ordered best-fit over PageHeap::large_ becomes major scalability bottleneck on fragmented heap

GoogleCodeExporter commented 9 years ago

We run tcmalloc on our very high throughput MySQL master. Join buffers > 
kMaxPages are allocated by a large percentage of client threads. Over time, as 
fragmentation increases, the large_.returned list grows to contain thousands of 
Spans, which makes PageHeap::AllocLarge extremely slow and the pageheap_lock 
becomes a major source of contention, eventually slowing down MySQL 
considerably.

I have written a patch which maintains a doubly linked skiplist to calculate 
address-ordered best-fit in amortized O(log(n)) instead of O(n) when the 
combined size of the large_ normal and returned lists exceed a configurable 
threshold. We're running it in production now and according to perf top, the 
percentage of time spent in PageHeap::AllocLarge has gone from ~8% overall to 
~0.4%.

However, while the time complexity of the skiplist is favorable, its space 
complexity (each node is 168 bytes) is obviously substantially less favorable 
than something like a red/black tree. I haven't had a chance to do my own 
measurements yet, but it seems from everything I've read that allocator 
metadata size has been shown to have a substantial impact on CPU cache 
efficiency, so while this patch may lessen a scalability bottleneck, it may 
have negative performance effects. I'm up for implementing a different data 
structure for this if that has a better chance of being accepted.

I've also written a small test program that attempts to create a sizeable 
large_ list, and then times 100000 allocations and frees from the large list. 
It's far from perfect, but it's something.

Patch: https://github.com/jamesgolick/gperftools/compare/master...skiplist-trunk
Test program: https://gist.github.com/jamesgolick/cffa78146af264c191fb

Original issue reported on code.google.com by jamesgol...@gmail.com on 21 May 2013 at 11:38

GoogleCodeExporter commented 9 years ago

Note that 
https://github.com/jamesgolick/gperftools/commit/1a87924c05ce33eb9b2b5ba78c38ba9
d8653838b is needed to run the test program against trunk (to display the 
length of the combined large lists).

Original comment by jamesgol...@gmail.com on 22 May 2013 at 4:49

GoogleCodeExporter commented 9 years ago

I was thinking about rbtree. I.e. either from libbsd or from STL (which would 
require some carefulness with it's allocators)

Original comment by alkondratenko on 6 Jul 2013 at 7:57

GoogleCodeExporter commented 9 years ago

Original comment by alkondratenko on 6 Jul 2013 at 11:12

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Forgot to say: I'll take a look your stuff soon. Closest Saturday latest. 
Thanks a lot for taking your time to do something about this issue.

Original comment by alkondratenko on 8 Jul 2013 at 5:21

GoogleCodeExporter commented 9 years ago

I think this is good enough. And I think it's quite strong candidate for 2.2 
release.

I was thinking that maybe we should look at entire page spans allocator as a 
whole. And implement efficient and scalable data structure for it. Our 
competition is using rb-trees and b-trees. We can also consider doing simpler, 
dl-malloc-like organization of page span size class bins.

And btw independently from tcmalloc and for some couchbase server experiment 
I've implemented a simple rb-tree based allocator here: 
https://github.com/alk/minimalloc

Anyway I think dealing with large_ spans is a good start and we can always 
continue improving later.

Having said all that I think we should also double check if there are other 
things we should consider improving instead or in addition to scalable large_ 
spans searching.

Particularly I'm wondering how come allocating more than 1 meg chunks can be a 
notice-able bottleneck even after your optimization ? My understanding is given 
that using that big chunk is probably going to be at least hundreds of 
thousands of cycles, optimized malloc-ing of those chunks should not be even 
notice-able on profile. Unless either I'm missing something (like big chunks 
are allocated, yet only fraction of chunks are actually used) or perhaps there 
are some other reasons of slowness.

Original comment by alkondratenko on 13 Jul 2013 at 10:15

GoogleCodeExporter commented 9 years ago

It's hard to know the answer to that question without seeing the test program 
you're using.

I have some other ideas, like I think we should add some size classes above 
1Mb, but this is still important and needs solving either way.

I have some patches that take a running sample of large allocations and attempt 
to determine which additional size classes are worth adding to the thread 
caches based on that data (and then adds them). I think there's something 
there, but a lot more thought needs to go in to it.

I do think that this is a good solution for now, and we can think about adding 
more stuff later, unless you've got a better idea that we can implement 
immediately?

Incidentally, I'd like to get a lot more involved in this project. Is there 
anything you need help with?

Original comment by jamesgol...@gmail.com on 15 Jul 2013 at 11:16

GoogleCodeExporter commented 9 years ago

I was wondering why in your reported case amount of time spent allocating span 
out of large_ has dropped only to 0.4%.

It's good that you have a tool to record page span allocations. I think that 
would be really valuable so that we can capture some particular allocation 
pattern and then explore it offline using potentially different allocation 
implementations.

So with that tool maybe we can dig more into why it's just 0.4% and not 
something far smaller.

Regarding help. Take a look at tickets. There are few OSX tickets that may need 
some help. Otherwise consider reviewing the things at the following threads: 
https://groups.google.com/forum/#!topic/google-perftools/8AZz3GNxH-o and 
https://groups.google.com/forum/#!topic/google-perftools/PQR4zQ6CQYs

Also given we're nearly at rc any further testing on any platforms you care 
about is nice help as well.

Original comment by alkondratenko on 16 Jul 2013 at 3:10

GoogleCodeExporter commented 9 years ago

I don't think that it's still particularly a bottleneck, actually - it's just 
that MySQL allocates a lot of big chunks.

For example, tcmalloc::SLL_Next(void*) is currently at 0.5% in perf top, and 
PageHeap::AllocLarge isn't on the first page at all.

Original comment by jamesgol...@gmail.com on 19 Jul 2013 at 6:35

GoogleCodeExporter commented 9 years ago

Looks like we're having some misunderstanding.

I'm simply wondering why you saw such high percentage of time spent in this 
particular code without patch. I.e. after allocating big buffer I expect 
application to do something reasonably heavy with all of it. And why with patch 
it only went to 0.4%. 

I'm wondering because who knows if there's something else that's asking for 
attention in this particular case.

Original comment by alkondratenko on 24 Jul 2013 at 2:50

GoogleCodeExporter commented 9 years ago

Been on vacation for a while, so sorry for the delay, but...

As for why GetBestFit can still be a bottleneck, perf top shows that most of 
the time in that function is being spent in this comparison: 

https://github.com/jamesgolick/gperftools/commit/011657b62e85102346f4dc2219bfa2d
22fa8a89d#L3R123

about 25% addressing the Span object:

 27.05 │      mov    0x8(%rax),%rax

and 50% mispredicting the branch:

 44.08 │      cmp    -0x30(%rbp),%rax

Did my addition of a pointer to the Span objects or the size of the skip lists 
noes cause some regression with respect to them fitting in cache or something? 
I'm not sure. Ideas?

Original comment by jamesgol...@gmail.com on 28 Sep 2013 at 6:26

GoogleCodeExporter commented 9 years ago

Interesting result. Is there any reasonably simple test I can run myself to dig 
deeper ?

Original comment by alkondratenko on 29 Sep 2013 at 2:12

GoogleCodeExporter commented 9 years ago

I was using something like this 
https://gist.github.com/jamesgolick/9a02b796693b044d21e2 to create some 
fragmentation and inflate the size of the large lists - perhaps just modify the 
thread_loops to run forever?

Original comment by jamesgol...@gmail.com on 29 Sep 2013 at 5:42

GoogleCodeExporter commented 9 years ago

Been thinking about this a lot, and I'm really not sure how to write a test to 
expose this issue.

Is there any data that I can gather from my running mysqld that you think might 
give us some clues?

Original comment by jamesgol...@gmail.com on 30 Sep 2013 at 5:04

GoogleCodeExporter commented 9 years ago

Perhaps you can try to record allocation/free requests via hooks somehow. The 
challenge is obviously going to be to efficiently and correctly record temporal 
relations between different threads.

Original comment by alkondratenko on 30 Sep 2013 at 6:19

GoogleCodeExporter commented 9 years ago

Any other ideas for how we can design a test to expose this? This ticket 
addresses a major scalability issue and it would be great to find a way to be 
able to get the patch ready for acceptance.

Original comment by jamesgol...@gmail.com on 3 Oct 2013 at 1:35

GoogleCodeExporter commented 9 years ago

Maybe you have some particular MySQL benchmark that exposes this condition ?

Original comment by alkondratenko on 3 Oct 2013 at 2:44

GoogleCodeExporter commented 9 years ago

The following demonstrates O(n) very well too: 
https://gist.github.com/alk/6956552

I've tried your patch (rebasing it against master). And:

*) it fails to compile (due to missing stdio.h include's for fprintf)

*) make check and program above crash with segmentation fault

Let's get your code in commit-ready shape (no fprintfs please and not crashes) 
and try to exercise it as much as we can.

Original comment by alkondratenko on 13 Oct 2013 at 12:31

GoogleCodeExporter commented 9 years ago

Ok - I'm on it.

Original comment by jamesgol...@gmail.com on 16 Oct 2013 at 10:00

GoogleCodeExporter commented 9 years ago

Okay, fixed. New patch is here: 
https://github.com/jamesgolick/gperftools/compare/trunk...skiplist-trunk

I'm also working on implementing a left-leaning red black tree to compare 
performance.

Original comment by jamesgol...@gmail.com on 30 Oct 2013 at 4:22

GoogleCodeExporter commented 9 years ago

Note that you can take one from jemalloc. Or from libbsd (as I did for my 
minimalloc experiment).

Unless of course you actually want to implement it yourself.

BTW jemalloc author has blob post somewhere on performance of various rbtree 
implementation options. Might be worth checking out.

Original comment by alkondratenko on 30 Oct 2013 at 4:36

GoogleCodeExporter commented 9 years ago

So, I used the jemalloc code, and got a 20% improvement over my skip list using 
your random_mallocer as a benchmark. I would imagine that the reduced space 
complexity would make an even bigger difference in a more realistic test 
because there would be so much less cache contention walking the tree.

https://github.com/jamesgolick/gperftools/compare/trunk...llrb

The only thing is that we probably don't want to use the jemalloc code since 
it's implemented as an 800-line set of macros... If we want to go in this 
direction, we can: 1) expand the jemalloc macros and clean up the code, 2) find 
another rbtree to use, 3) actually implement our own.

Original comment by jamesgol...@gmail.com on 1 Nov 2013 at 4:33

GoogleCodeExporter commented 9 years ago

I'll need clean patches without traces of "internal" fixes (like fprintf).

Also please do rebase against latest master:

git rebase --onto googlecode/master jamesgolick/trunk

Original comment by alkondratenko on 17 Nov 2013 at 3:05

GoogleCodeExporter commented 9 years ago

llrb functions as well as rb.h should be included in .cc not .h

Original comment by alkondratenko on 17 Nov 2013 at 3:33

GoogleCodeExporter commented 9 years ago

There are also fixable warnings:

src/page_heap.h: In constructor 'tcmalloc::PageHeap::PageHeap()':
src/page_heap.h:303:11: warning: 'tcmalloc::PageHeap::scavenge_counter_' will 
be initialized after [-Wreorder]
   int64_t scavenge_counter_;
           ^
src/page_heap.h:243:10: warning:   'size_t 
tcmalloc::PageHeap::large_lists_size_' [-Wreorder]
   size_t large_lists_size_;
          ^
src/page_heap.cc:64:1: warning:   when initialized here [-Wreorder]
 PageHeap::PageHeap()
 ^
In file included from src/static_vars.h:43:0,
                 from src/page_heap.cc:43:
src/page_heap.h:306:7: warning: 'tcmalloc::PageHeap::release_index_' will be 
initialized after [-Wreorder]
   int release_index_;
       ^
src/page_heap.h:246:8: warning:   'bool tcmalloc::PageHeap::using_large_llrb_' 
[-Wreorder]
   bool using_large_llrb_;
        ^
src/page_heap.cc:64:1: warning:   when initialized here [-Wreorder]
 PageHeap::PageHeap()
 ^

Original comment by alkondratenko on 17 Nov 2013 at 3:35

GoogleCodeExporter commented 9 years ago

In general I do like the code. Just please, make it super clean. Without 
unrelated changes etc. I believe c-macro-based red-black tree code from 
jemalloc is fine.

BTW what LLRB means?

Also I'd not bother with linked list and kLargeLLRBThreshold. Consider making 
everything simpler by always keeping large spans in red-black tree.

I think we're close to merge-able code. Thanks a lot for your effort here.

Original comment by alkondratenko on 17 Nov 2013 at 3:46

GoogleCodeExporter commented 9 years ago

There's compilation error:

In file included from src/page_heap.cc:41:0:
src/page_heap.cc: In member function 'tcmalloc::Span* 
tcmalloc::PageHeap::AllocLarge(Length)':
src/page_heap.cc:219:10: error: 'bestNormal' was not declared in this scope
   ASSERT(bestNormal == NULL);
          ^
src/internal_logging.h:112:9: note: in definition of macro 'CHECK_CONDITION'
   if (!(cond)) {                                                         \
         ^
src/page_heap.cc:219:3: note: in expansion of macro 'ASSERT'
   ASSERT(bestNormal == NULL);
   ^
In file included from src/llrb.h:41:0,
                 from src/page_heap.h:45,
                 from src/static_vars.h:43,
                 from src/page_heap.cc:43:

This happens when -DNDEBUG is not passed to preprocessor

Original comment by alkondratenko on 17 Nov 2013 at 3:52

GoogleCodeExporter commented 9 years ago

Sorry to take so long on this - been busy with other stuff.

To address your feedback, I've fixed that assertion issue, and the init order 
warnings on PageHeap's constructor.

The warning in src/internal_logging.h wasn't related to my branch and seems to 
have been fixed since then. 

It's currently impossible to include rb.h in llrb.cc because we need some of 
the macros for typedefs that need to be in llrb.h. If you want me to try to 
hack up the rb.h stuff, I can.

Let me know if there's anything else you want changed.

I've rebased against master.

https://github.com/jamesgolick/gperftools/compare/googlecode-master...llrb

Original comment by ja...@packagecloud.io on 17 Jun 2014 at 4:01

GoogleCodeExporter commented 9 years ago

Will take a look. BTW what you think about TLSF? It might solve (almost) same 
problem with possibly less complexity.

Original comment by alkondratenko on 17 Jun 2014 at 7:00

GoogleCodeExporter commented 9 years ago

Just skimmed through the TLSF paper. That's a really interesting model - I 
think it's something that's definitely worth investigating, but it seems like 
it would be a substantial restructuring of tcmalloc to do it properly.

Original comment by ja...@packagecloud.io on 17 Jun 2014 at 8:03

GoogleCodeExporter commented 9 years ago

I was thinking about applying it only for page spans. I.e. instead of tree. But 
given you're done with your code I'm not sure anymore.

Original comment by alkondratenko on 17 Jun 2014 at 8:05

GoogleCodeExporter commented 9 years ago

Ya - I think we should try to get my code merged in and then investigate this 
for a future release. I can probably take it on sometime in the next month or 
so. It's a cool data structure.

Original comment by ja...@packagecloud.io on 17 Jun 2014 at 8:15

GoogleCodeExporter commented 9 years ago

But let me note that tree approach has additional advantage of being able to 
implement address-ordered best-fit instead of just "almost-best-fit".

Original comment by alkondratenko on 17 Jun 2014 at 8:15

GoogleCodeExporter commented 9 years ago

Indeed.

Original comment by ja...@packagecloud.io on 17 Jun 2014 at 8:20

GoogleCodeExporter commented 9 years ago

Hey - did you have any time to look at this? I've got some more time to work on 
it tomorrow.

Original comment by ja...@packagecloud.io on 19 Jun 2014 at 7:59

GoogleCodeExporter commented 9 years ago

Haven't had time to look at it yet. Saturday is the time when I'll for sure be 
able to handle it. And I cannot promise anything else. Main work and other life 
constraints.

Original comment by alkondratenko on 19 Jun 2014 at 8:03

GoogleCodeExporter commented 9 years ago

Cool - no prob. Thanks!

Original comment by ja...@packagecloud.io on 19 Jun 2014 at 8:25

GoogleCodeExporter commented 9 years ago

Indeed testing it is a big problem. It's likely that llrb code path is not even 
exercised once during make check.

So I've produced another test program to try to test it (in fact I forgot about 
my previous program). And with that program I'm still able to crash it. Program 
can be found here: https://gist.github.com/alk/6d8a907fccc8f18b414d. Note that 
it doesn't fit into 32-bit address space and requires 64-bit system.

Here's other feedback.

I think it's not useful to have chain of commits reflect evolution of this 
work. It'll just add confusion to folks doing git history reading. One combined 
commit looks fine. Or if you can split it in a way that improves reviewing your 
work and _reading_ history, feel free to do it as well (but it usually requires 
a bit of git-fu that is seemingly not comfortable for most).

I'm also seeing quite few compilation warnings introduced by this patch. It 
seems to be static functions introduced by llrb macros so they are harmless, 
but they're still warnings.

There are also few places where trailing whitespace is introduced. Surely it'll 
be trivial to fix.

I also think that switching llrb path on only on certain free list size is just 
adding complexity and greatly complicates testing. Are you sure that it is 
really needed? Why not just use llrb always?

Original comment by alkondratenko on 21 Jun 2014 at 9:49

GoogleCodeExporter commented 9 years ago

I fixed the compiler warnings using __attribute__((unused)).

When you are ready to accept the patch, I will squash all the commits in to 1 
or you can do it when you merge.

The trailing whitespace is not really there - just looks like that in the 
github compare view.

I pushed a commit that removes the switching to llrb path. PageHeap uses llrb 
all the time now.

I'm not able to repro your crash with your test program. I tried.

https://github.com/jamesgolick/gperftools/compare/googlecode-master...llrb

Original comment by ja...@packagecloud.io on 26 Jun 2014 at 8:20

GoogleCodeExporter commented 9 years ago

btw - I spent some time digging in to tlsf and even implemented a basic version 
of the data structure in a branch, and my conclusion is that it doesn't really 
fit in to the existing structure of tcmalloc.

The biggest thing is that it would be a pretty substantial tradeoff in terms of 
fragmentation because of the necessary rounding up of allocation sizes and 
because we can't address-order our best fit algorithm. Conceivably, we could 
use some kind of log(n) data structure instead of linked lists for each "size 
class" in the tlsf, but I'm not sure that would actually be worth the 
additional complexity, unless we can prove that log(n) is still too slow.

That said, I think it might be fun to build an allocator around the tlsf code I 
wrote, so I might do that if I can find some time.

Original comment by ja...@packagecloud.io on 6 Jul 2014 at 7:27

GoogleCodeExporter commented 9 years ago

I'm still seeing the crash on my boxes.

It is very consistent once I disable address space randomization (sysctl 
kernel.randomize_va_space=0).

Original comment by alkondratenko on 19 Jul 2014 at 8:36

GoogleCodeExporter commented 9 years ago

Note that to run it on my 8 gigs laptop I had to lower history size to 256k.

Original comment by alkondratenko on 19 Jul 2014 at 8:47

GoogleCodeExporter commented 9 years ago

K - just gonna spin up a big VM to try to repeat it.

Original comment by jamesgol...@gmail.com on 19 Jul 2014 at 8:48

GoogleCodeExporter commented 9 years ago

Regarding tlsf. I was actually thinking to generalize PageHeap's spanlist into 
tlsf. And make sure to give it more bits of precision. At least 8 so that 1-256 
pages spans are located located precisely.

You might be still right of course.

Original comment by alkondratenko on 19 Jul 2014 at 8:50

GoogleCodeExporter commented 9 years ago

Actually, I've been thinking that using tlsf for thread caches might be 
interesting.

Original comment by jamesgol...@gmail.com on 19 Jul 2014 at 8:54

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7b8c210 in tcmalloc::DLL_Remove (span=span@entry=0xa02210) at 
src/span.cc:80
80    span->next->prev = span->prev;
(gdb) bt
#0  0x00007ffff7b8c210 in tcmalloc::DLL_Remove (span=span@entry=0xa02210) at 
src/span.cc:80
#1  0x00007ffff7b8ab14 in tcmalloc::PageHeap::RemoveFromFreeList 
(this=this@entry=0xa82000, span=span@entry=0xa02210) at src/page_heap.cc:408
#2  0x00007ffff7b8ab4c in tcmalloc::PageHeap::Carve (this=0xa82000, 
span=0xa02210, n=1964) at src/page_heap.cc:231
#3  0x00007ffff7b8b921 in tcmalloc::PageHeap::New (this=0xa82000, 
n=n@entry=1964) at src/page_heap.cc:121
#4  0x00007ffff7b9b8c4 in do_malloc_pages (size=16089088, heap=0xb0d0c8) at 
src/tcmalloc.cc:1081
#5  do_malloc_no_errno (size=16087463) at src/tcmalloc.cc:1114
#6  do_malloc (size=16087463) at src/tcmalloc.cc:1119
#7  do_malloc_or_cpp_alloc (size=16087463) at src/tcmalloc.cc:1039
#8  tc_malloc (size=16087463) at src/tcmalloc.cc:1578
#9  0x000000000040080a in main () at /root/alk-malloc-test.c:55
(gdb)

Original comment by alkondratenko on 19 Jul 2014 at 9:10

GoogleCodeExporter commented 9 years ago

Yup - I got it repeating. I'll get it fixed.

Original comment by jamesgol...@gmail.com on 19 Jul 2014 at 9:10

GoogleCodeExporter commented 9 years ago

The bug was in the test program. It crashes exactly the same way on master as 
well because it double frees pointers. Fix is here: 
https://gist.github.com/jamesgolick/91673ddc31ce6d7b371c/revisions

Original comment by jamesgol...@gmail.com on 20 Jul 2014 at 1:40

GoogleCodeExporter commented 9 years ago

Ah. It is subtle indeed. I'm sorry for my fault. Thanks for figuring this out.

Original comment by alkondratenko on 21 Jul 2014 at 2:34

GoogleCodeExporter commented 9 years ago

No prob. There's still one bug right now, where the test program attempts to 
free an invalid pointer. I am trying to figure out whether it's another bug in 
the test program or not right now. Running this test program against my llrb 
branch takes a minute or two. Running against master takes 30+ hours :-), so 
I'll let you know what the result is soon.

Original comment by jamesgol...@gmail.com on 21 Jul 2014 at 2:36

xfun68 / gperftools

[tcmalloc] O(n) address-ordered best-fit over PageHeap::large_ becomes major scalability bottleneck on fragmented heap #532