yc81 / memcached

Automatically exported from code.google.com/p/memcached

Memcached "hung" in do_item_alloc #272

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What is the problem?
One night at Etsy, we started getting a flood of memcached errors. We were 
seeing out-of-memory errors when trying to set/add any value into slab #4. The 
slab held data of sizes up to 10 bytes.

We hadn't built debug symbols :( so I did some low-level checking of what 
happens in do_item_alloc(). Details here: https://gist.github.com/2634998

The walkthrough below starts at search = tails[id]; on line 106 of items.c. 
The code is here: https://github.com/memcached/memcached/blob/master/items.c

The search item has a starting ref count of 2. So calling refcount_incr on it 
makes it 3 and it entirely skips the block of code between 107 and 152 (that 
block handles expired items and forcible eviction). In the else block in line 
154, since the slab is full, slabs_alloc returns NULL. In line 156, the 
refcount is decremented back to 2.

Now in the if block starting in line 167, the refcount is still 2, so it never 
forcibly expires the item, and it stays that way forever. This check was 
recently changed from refcount != 0 to refcount != 2 
(https://github.com/memcached/memcached/commit/f4983b2068d13e5dc71fc075c35a085e904999cf#items.c)

When this happens the "leaked" tails item forever blocks any new item from 
being inserted into the list and we can't do any more operations on the cluster 
in slab #4. The only way we could fix it was to restart memcached. 
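
A minimal sketch of the flow described above, as a toy model rather than the actual items.c code. Every name here (model_item, model_item_alloc, MODEL_TAIL_REPAIR_TIME) is hypothetical, and two things are assumptions: that the eviction block is entered only when the incremented refcount equals 2, and the repair timeout value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical timeout for this model; assumed, not read from our build. */
#define MODEL_TAIL_REPAIR_TIME (3 * 3600)

/* Simplified stand-in for an LRU tail item (not memcached's real struct). */
typedef struct {
    unsigned short refcount; /* assumed: 1 = linked and otherwise unreferenced */
    long time;               /* last access time, in seconds */
} model_item;

typedef enum {
    MODEL_ALLOC_OK,
    MODEL_ALLOC_OOM,
    MODEL_ALLOC_OOM_REPAIRED
} model_result;

/* Models one allocation attempt against a slab class whose LRU tail is `tail`. */
model_result model_item_alloc(model_item *tail, bool slab_has_free_chunk, long now)
{
    assert(tail != NULL);
    bool got_memory;

    if (++tail->refcount == 2) {
        /* Nobody else holds the tail: expire or evict it and reuse its memory.
         * Modelled here by resetting the slot to a fresh, linked item. */
        tail->refcount = 1;
        tail->time = now;
        return MODEL_ALLOC_OK;
    }

    /* Tail looks busy: fall back to the slab allocator, drop our reference. */
    got_memory = slab_has_free_chunk;
    tail->refcount--;

    if (got_memory)
        return MODEL_ALLOC_OK;

    /* Last-ditch tail repair, using the "refcount != 2" check described above. */
    if (tail->refcount != 2 && tail->time + MODEL_TAIL_REPAIR_TIME < now) {
        tail->refcount = 1;
        return MODEL_ALLOC_OOM_REPAIRED;
    }
    return MODEL_ALLOC_OOM;
}
```

In this model, a tail whose refcount starts at 2 returns MODEL_ALLOC_OOM on every call while the slab is full, and its refcount is left at 2 each time, which matches the permanently stuck state described above. A tail with any other leaked count (for example 3) would be repaired once it is older than the timeout.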

What steps will reproduce the problem?
We don't exactly know how all 7 of our memcached boxes got into this state one 
night. We still get occasional blips of these errors.

What is the expected output? What do you see instead?
Set/add should work.

What version of the product are you using? On what operating system?
memcached 1.4.13 on CentOS 5

Please provide any additional information below.
In our testing environment, I haven't been able to reproduce the issue, but 
then the volume I could generate definitely couldn't match up to Production.

Original issue reported on code.google.com by keyurgov...@gmail.com on 8 May 2012 at 1:37

GoogleCodeExporter commented 9 years ago
Looks like a refcount leak bug has reappeared...

Any chance you could use mc-crusher (https://github.com/dormando/mc-crusher) 
and try to set up a config file that roughly regenerates your traffic? That 
could help you find a way to reproduce it.

In the meantime, I'll give it a good hard stare when I get a chance...

Also, what startup options were you using? Any chance I could see a "stats 
settings" and a "stats/stats items/stats slabs" dump from an instance that's 
been up a while?

Original comment by dorma...@rydia.net on 8 May 2012 at 5:12

GoogleCodeExporter commented 9 years ago
Sure, here's the various stats: https://gist.github.com/24e1cfa5b2a18d6ccf42

I'll try the crusher later tonight.

Original comment by keyurgov...@gmail.com on 8 May 2012 at 10:02

GoogleCodeExporter commented 9 years ago
Ok. It might take a week before I get any real time to troubleshoot this...

Unfortunately I'm going to have to be a downer. I spot repcached stats; any 
chance you could see if a vanilla 1.4.13 doesn't exhibit the same crash? :( I 
can't troubleshoot a refcount leak in a patched instance.

Original comment by dorma...@rydia.net on 8 May 2012 at 10:12

GoogleCodeExporter commented 9 years ago
Good point, we don't really need/use repcached. Gonna try without that, see if 
it still occurs.

Thanks for the help!

Original comment by keyurgov...@gmail.com on 8 May 2012 at 10:38

GoogleCodeExporter commented 9 years ago
Ok, I'll hold off on looking into this until you can reproduce it on a 
completely vanilla compile.

Original comment by dorma...@rydia.net on 8 May 2012 at 11:31

GoogleCodeExporter commented 9 years ago
Any updates?

Original comment by dorma...@rydia.net on 23 May 2012 at 9:03

GoogleCodeExporter commented 9 years ago
Last ping (I hope you see these updates?) If I don't hear back in a few days 
I'm going to assume it was something repcached's patch did and close out the 
bug.

thanks!

Original comment by dorma...@rydia.net on 1 Jun 2012 at 5:46

GoogleCodeExporter commented 9 years ago
So sorry for the lack of feedback. Rebuilt without repcached and we haven't 
seen the issue resurface, so I'm pretty certain this was the cause. This can be 
closed.

Original comment by mbarc...@etsy.com on 2 Jun 2012 at 2:10

GoogleCodeExporter commented 9 years ago
That makes me incredibly happy, thank you! Resolving :D

Original comment by dorma...@rydia.net on 2 Jun 2012 at 8:04