error of lru_crawler - Githubissues

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

I am working on a requirement to release memcached's memory back to the system. 
When I call lru_crawler_crawl("all") from a timer every 5 seconds (the 5-second 
interval is only for testing), the following error is reported:
memcached-debug: items.c:643: crawler_link_q: Assertion `it != *tail' failed.

In the function 'lru_crawler_crawl()', I think it should check the value of 
'crawlers[sid].it_flags' before calling 'crawler_link_q((item *)&crawlers[sid])'.
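
To illustrate the kind of check suggested above, here is a minimal sketch. It is not the memcached source: mini_item, mini_lru, mini_link_tail and mini_start_crawl are hypothetical stand-ins, and only the idea of testing it_flags before linking comes from this report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for memcached's item struct. */
typedef struct mini_item {
    struct mini_item *next;
    struct mini_item *prev;
    unsigned char it_flags;   /* 1 while this crawler is linked into the LRU */
} mini_item;

/* Hypothetical stand-in for one LRU's head/tail pointers. */
typedef struct {
    mini_item *head;
    mini_item *tail;
} mini_lru;

/* Link an item at the tail. Like the failing assertion in this report,
 * it refuses to link an item that is already the tail. */
void mini_link_tail(mini_lru *lru, mini_item *it) {
    assert(it != lru->tail);
    it->next = NULL;
    it->prev = lru->tail;
    if (lru->tail != NULL) {
        lru->tail->next = it;
    }
    lru->tail = it;
    if (lru->head == NULL) {
        lru->head = it;
    }
}

/* Start a crawl only if the crawler is not already active.
 * Returns 1 if the crawler was linked, 0 if the request was skipped. */
int mini_start_crawl(mini_lru *lru, mini_item *crawler) {
    if (crawler->it_flags != 0) {
        return 0;             /* still linked from an earlier crawl */
    }
    crawler->it_flags = 1;
    mini_link_tail(lru, crawler);
    return 1;
}
```

With this guard, a second crawl request on a crawler that is still linked is skipped instead of reaching the assertion.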

What is the expected output? What do you see instead?

What version of the product are you using? On what operating system?
1.4.21 (most recent code)

Please provide any additional information below.
OS: CentOS 6.5
Starting command: /root/memcached/memcached-debug -m 32 -I 100k -o slab_reassign 
lru_crawler slab_automove=1 -p 42872 -U 34756 -u root

Original issue reported on code.google.com by Z.W.Chan...@gmail.com on 5 Dec 2014 at 8:36

GoogleCodeExporter commented 9 years ago

Not sure yet how that happens. Were you able to catch it in a debugger and 
print a bit more information around the crawler and LRUs?

cache_lock is held when it_flags is being set to 1, and also when it's set to 
0, and the thread that checks for != 1 is the only place that can set it to 0. 
So in the worst case the loop gets a false negative and skips that LRU's crawler 
for that iteration.

crawler_link_q() is only ever called when it first kicks off a crawl.

The crawler isn't signaled until after all new crawlers are initialized, and 
lru_crawler_lock is held for the entire time the crawler is active.

This possibly means that crawler_unlink_q() is failing sometimes? If the 
crawler is left in the LRU, it will be ignored by the item alloc and could fall 
to the tail. Once in the tail, and another crawl kicks off, it'll find itself 
in the tail and fail that assert.

I don't see how unlink_q could fail though... maybe crawler_crawl_q() has a bug 
that sometimes links the crawler to itself? Then after it unlinks, it still ends 
up in the queue.

Is this happening a lot? Or have you fixed it already? If not, how are you 
getting it to trigger?

Would be useful to see the state of the lru that the bad crawler was in. Is it 
the only item (both head and tail?), or was anything else in there?
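
For anyone trying to answer that question from a debugger or a test build, a small helper along these lines could classify where the crawler sits in its LRU. This is a sketch under stated assumptions: lru_node, lru_pos and classify_position are hypothetical names, not memcached code, and max_steps is only there so a corrupted, self-linked list cannot loop forever.

```c
#include <stddef.h>

/* Hypothetical minimal list node; the real item struct has more fields. */
typedef struct lru_node {
    struct lru_node *next;
    struct lru_node *prev;
} lru_node;

enum lru_pos { POS_UNLINKED, POS_ONLY, POS_HEAD, POS_TAIL, POS_MIDDLE };

/* Report where `it` sits in the list described by head and tail.
 * The walk stops after max_steps nodes to survive a corrupted list. */
enum lru_pos classify_position(const lru_node *head, const lru_node *tail,
                               const lru_node *it, size_t max_steps) {
    const lru_node *p;
    size_t steps = 0;

    if (it == head && it == tail) {
        return POS_ONLY;      /* both head and tail: the only item */
    }
    if (it == head) {
        return POS_HEAD;
    }
    if (it == tail) {
        return POS_TAIL;
    }
    for (p = head; p != NULL && steps < max_steps; p = p->next, steps++) {
        if (p == it) {
            return POS_MIDDLE;
        }
    }
    return POS_UNLINKED;
}
```

A result of POS_TAIL or POS_ONLY for a crawler before a new crawl starts would match the stale-crawler theory above.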

Original comment by dorma...@rydia.net on 15 Dec 2014 at 6:47


GoogleCodeExporter commented 9 years ago

I understand the issue now.
It is the same issue as #388. No thread is waiting on lru_crawler_cond, so the crawler is added to the LRU again. I fixed #388 locally by initializing lru_crawler before options parsing, and added a check on crawler->it_flags at the same time.
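
As background on why an un-awaited condition variable matters: POSIX says pthread_cond_signal has no effect when no thread is blocked on the condition variable, so the wakeup is not remembered for a later waiter. The following is a minimal sketch of that behavior only, not memcached code; signal_then_wait is a hypothetical helper, and it assumes a platform where clock_gettime(CLOCK_REALTIME, ...) is available.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Signal a condition variable that nobody is waiting on, then wait on it
 * for wait_ms milliseconds. Returns the pthread_cond_timedwait result;
 * ETIMEDOUT means the earlier signal was not remembered. */
int signal_then_wait(long wait_ms) {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    struct timespec deadline;
    int rc;

    pthread_mutex_init(&lock, NULL);
    pthread_cond_init(&cond, NULL);

    pthread_cond_signal(&cond);   /* no waiter yet, so this does nothing */

    /* Build an absolute deadline wait_ms from now. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += wait_ms / 1000;
    deadline.tv_nsec += (wait_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&lock);
    rc = pthread_cond_timedwait(&cond, &lock, &deadline);
    pthread_mutex_unlock(&lock);

    pthread_cond_destroy(&cond);
    pthread_mutex_destroy(&lock);
    return rc;
}
```

Barring a spurious wakeup, which POSIX permits, the wait is expected to time out. That is consistent with the all-zero lru_crawler_cond in the gdb output below, where __nwaiters is 0.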

----------------
Starting program: /root/memcached/memcached-debug -u root -p 4444 -o 
lru_crawler,slab_reassign
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff7fe4700 (LWP 26038)]
[New Thread 0x7ffff732b700 (LWP 26039)]
[New Thread 0x7ffff692a700 (LWP 26040)]
[New Thread 0x7ffff5f29700 (LWP 26041)]
[New Thread 0x7ffff5528700 (LWP 26042)]
[New Thread 0x7ffff4b27700 (LWP 26043)]
[New Thread 0x7ffff4126700 (LWP 26044)]
[New Thread 0x7ffff3725700 (LWP 26045)]
memcached-debug: items.c:649: crawler_link_q: Assertion `it != *tail' failed.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff4126700 (LWP 26044)]
0x0000003fcfa32635 in raise () from /lib64/libc.so.6
(gdb) where
#0  0x0000003fcfa32635 in raise () from /lib64/libc.so.6
#1  0x0000003fcfa33e15 in abort () from /lib64/libc.so.6
#2  0x0000003fcfa2b75e in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003fcfa2b820 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000004242a7 in crawler_link_q (it=0x641070) at items.c:649
#5  0x0000000000425a1f in lru_crawler_crawl (slabs=0x432fbe "all")
    at items.c:919
#6  0x0000000000420438 in slab_maintenance_thread (arg=0x0) at slabs.c:737
#7  0x0000003fd02079d1 in start_thread () from /lib64/libpthread.so.0
#8  0x0000003fcfae886d in clone () from /lib64/libc.so.6
(gdb) f 4
#4  0x00000000004242a7 in crawler_link_q (it=0x641070) at items.c:649
649     assert(it != *tail);
(gdb)  p it->slabs_clsid
$1 = 1 '\001'
(gdb) p *tail
$2 = (item *) 0x641070
(gdb) p *(item*)tail
$3 = {next = 0x641070, prev = 0x0, h_next = 0x0, time = 0, exptime = 0, 
  nbytes = 0, refcount = 0, nsuffix = 0 '\000', it_flags = 0 '\000', 
  slabs_clsid = 0 '\000', nkey = 0 '\000', data = 0x640a08}
(gdb) p lru_crawler_cond
$4 = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, 
    __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, 
  __size = '\000' <repeats 47 times>, __align = 0}

Original comment by Z.W.Chan...@gmail.com on 15 Dec 2014 at 9:09


GoogleCodeExporter commented 9 years ago

Excellent. Last night I reproduced this with a test and pushed a patch for the 
next release: 
https://github.com/memcached/memcached/commit/4ceefe3ea03aae8d25538b00098a5730b991a9fd

I'll close out this bug.

Original comment by dorma...@rydia.net on 15 Dec 2014 at 9:14

Changed state: Duplicate

zsswebdesign / memcached

error of lru_crawler #387