Closed GoogleCodeExporter closed 9 years ago
Not sure yet how that happens. Were you able to catch it in a debugger and
print a bit more information around the crawler and LRUs?
cache_lock is held while it_flags is being set to 1, and also when it's set to
0, and the thread that checks for != 1 is the only place that can set it to 0.
So worst case the loop gets a false-negative and skips that lru's crawler for
that iteration.
crawler_link_q() is only ever called when it first kicks off a crawl.
The crawler isn't signaled until after all new crawlers are initialized, and
lru_crawler_lock is held for the entire time the crawler is active.
This possibly means that crawler_unlink_q() is failing sometimes? If the
crawler is left in the LRU, it will be ignored by the item alloc and could fall
to the tail. Once in the tail, and another crawl kicks off, it'll find itself
in the tail and fail that assert.
Don't see how unlink_q could fail though... maybe crawler_crawl_q() has a bug
that links the crawler to itself sometimes? So after it delinks it ends up
still in the queue.
Is this happening a lot? Or have you fixed it already? If not, how are you
getting it to trigger?
Would be useful to see the state of the lru that the bad crawler was in. Is it
the only item (both head and tail?), or was anything else in there?
Original comment by dorma...@rydia.net
on 15 Dec 2014 at 6:47
I understand the issue now.
It is the same issue as #388. No thread is waiting on lru_crawler_cond, so the signal is lost and the crawler is added to the LRU again. I fixed #388 locally by initializing lru_crawler before options parsing, and added a check on crawler->it_flags at the same time.
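
To illustrate the it_flags check described above, here is a minimal sketch, not the actual memcached code. The `node` struct, the `*_sketch` function names, and the global `head`/`tail` pointers are hypothetical simplifications, and all locking (cache_lock, lru_crawler_lock) is left out. It only shows why linking a crawler that is still in the queue trips `assert(it != *tail)`, and how checking the flag first avoids that.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an item/crawler in one LRU. */
typedef struct node {
    struct node *next;
    struct node *prev;
    int it_flags;          /* assumed: 1 while the crawler is linked */
} node;

/* Hypothetical single LRU queue; memcached keeps one per slab class. */
static node *head = NULL;
static node *tail = NULL;

/* Link the crawler at the tail. The assert mirrors the one in the report. */
static void crawler_link_q_sketch(node *it) {
    assert(it != tail);    /* fires if the crawler is already the tail */
    it->prev = tail;
    it->next = NULL;
    if (tail != NULL) tail->next = it;
    tail = it;
    if (head == NULL) head = it;
}

/* Remove the crawler from the queue and clear its links. */
static void crawler_unlink_q_sketch(node *it) {
    if (head == it) head = it->next;
    if (tail == it) tail = it->prev;
    if (it->next != NULL) it->next->prev = it->prev;
    if (it->prev != NULL) it->prev->next = it->next;
    it->next = NULL;
    it->prev = NULL;
}

/* Start a crawl only if the crawler is not already linked.
 * Returns 1 if the crawler was linked, 0 if the request was skipped. */
static int start_crawl_sketch(node *crawler) {
    if (crawler->it_flags != 0) {
        return 0;          /* still active: skip instead of re-linking */
    }
    crawler->it_flags = 1;
    crawler_link_q_sketch(crawler);
    return 1;
}

/* Finish a crawl: unlink first, then mark the crawler as idle. */
static void finish_crawl_sketch(node *crawler) {
    crawler_unlink_q_sketch(crawler);
    crawler->it_flags = 0;
}
```

In this sketch, calling `start_crawl_sketch()` twice without an intervening `finish_crawl_sketch()` returns 0 the second time rather than reaching the assertion. Without the flag check, the second call would pass a crawler that is already the tail into the link function.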
----------------
Starting program: /root/memcached/memcached-debug -u root -p 4444 -o lru_crawler,slab_reassign
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff7fe4700 (LWP 26038)]
[New Thread 0x7ffff732b700 (LWP 26039)]
[New Thread 0x7ffff692a700 (LWP 26040)]
[New Thread 0x7ffff5f29700 (LWP 26041)]
[New Thread 0x7ffff5528700 (LWP 26042)]
[New Thread 0x7ffff4b27700 (LWP 26043)]
[New Thread 0x7ffff4126700 (LWP 26044)]
[New Thread 0x7ffff3725700 (LWP 26045)]
memcached-debug: items.c:649: crawler_link_q: Assertion `it != *tail' failed.
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff4126700 (LWP 26044)]
0x0000003fcfa32635 in raise () from /lib64/libc.so.6
(gdb) where
#0 0x0000003fcfa32635 in raise () from /lib64/libc.so.6
#1 0x0000003fcfa33e15 in abort () from /lib64/libc.so.6
#2 0x0000003fcfa2b75e in __assert_fail_base () from /lib64/libc.so.6
#3 0x0000003fcfa2b820 in __assert_fail () from /lib64/libc.so.6
#4 0x00000000004242a7 in crawler_link_q (it=0x641070) at items.c:649
#5 0x0000000000425a1f in lru_crawler_crawl (slabs=0x432fbe "all")
at items.c:919
#6 0x0000000000420438 in slab_maintenance_thread (arg=0x0) at slabs.c:737
#7 0x0000003fd02079d1 in start_thread () from /lib64/libpthread.so.0
#8 0x0000003fcfae886d in clone () from /lib64/libc.so.6
(gdb) f 4
#4 0x00000000004242a7 in crawler_link_q (it=0x641070) at items.c:649
649 assert(it != *tail);
(gdb) p it->slabs_clsid
$1 = 1 '\001'
(gdb) p *tail
$2 = (item *) 0x641070
(gdb) p *(item*)tail
$3 = {next = 0x641070, prev = 0x0, h_next = 0x0, time = 0, exptime = 0,
nbytes = 0, refcount = 0, nsuffix = 0 '\000', it_flags = 0 '\000',
slabs_clsid = 0 '\000', nkey = 0 '\000', data = 0x640a08}
(gdb) p lru_crawler_cond
$4 = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0,
__woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0},
__size = '\000' <repeats 47 times>, __align = 0}
Original comment by Z.W.Chan...@gmail.com
on 15 Dec 2014 at 9:09
Excellent. Last night I reproduced this with a test and pushed a patch for the
next release:
https://github.com/memcached/memcached/commit/4ceefe3ea03aae8d25538b00098a5730b991a9fd
I'll close out this bug.
Original comment by dorma...@rydia.net
on 15 Dec 2014 at 9:14
Original issue reported on code.google.com by
Z.W.Chan...@gmail.com
on 5 Dec 2014 at 8:36