Open GoogleCodeExporter opened 8 years ago
Thanks for the report. The stack trace shows that this happens while the list
is being converted from a ziplist to a linked list. Can you tell us more
about the average length of your lists and the length of the elements inside them?
Also, could you provide INFO output and a dump of the configuration file? I'm
investigating in the meantime.
Original comment by pcnoordh...@gmail.com
on 5 May 2011 at 7:31
You mention the crashes happen randomly. Does this mean with the same stack
trace or not? If only one of the three machines is showing problems while every
machine is subject to the same access patterns, this might have something to do
with corrupt memory or something like that. Do the machines have ECC memory?
If the stack trace is identical across crashes, we can assume there's a bug
in the code somewhere and can work from there. Could you post these details?
Ps. Never mind the INFO output: I overlooked it for the previous post, and went
immediately to the stack trace ;-).
Original comment by pcnoordh...@gmail.com
on 5 May 2011 at 8:13
Thanks for your answer. I'm nuaaknighte's colleague and work with him on
Redis. Our guess is that the problem occurs like this: when the length of a
list exceeds the limit we set in the configuration, it is unzipped to a normal
data type, as you mention in the docs; then, when some operation makes the list
shorter than the limit, it is zipped again, and the crash happens. That is only
our guess, but I hope it helps you identify and solve the problem.
Original comment by 7842...@qq.com
on 5 May 2011 at 8:48
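For readers following the guess above, here is a minimal sketch of a threshold-based encoding decision. The type and function names (`list_encoding`, `list_limits`, `encoding_after_push`) are hypothetical and are not taken from the Redis source; only the two limits are meant to mirror the `list-max-ziplist-entries` and `list-max-ziplist-value` settings. The sketch models only the ziplist-to-linked-list direction and is not the actual Redis logic.

```c
#include <stddef.h>

/* Hypothetical names for illustration; not the Redis implementation. */
typedef enum { ENC_ZIPLIST, ENC_LINKEDLIST } list_encoding;

typedef struct {
    size_t max_entries; /* assumed to mirror list-max-ziplist-entries */
    size_t max_value;   /* assumed to mirror list-max-ziplist-value (bytes) */
} list_limits;

/* Return the encoding a list would have after pushing one value of
 * value_len bytes onto a list that currently holds cur_len elements. */
list_encoding encoding_after_push(list_encoding cur, size_t cur_len,
                                  size_t value_len, list_limits lim) {
    /* This sketch never converts a linked list back to a ziplist. */
    if (cur == ENC_LINKEDLIST) return ENC_LINKEDLIST;
    /* A single oversized value forces the conversion. */
    if (value_len > lim.max_value) return ENC_LINKEDLIST;
    /* So does growing past the configured number of entries. */
    if (cur_len + 1 > lim.max_entries) return ENC_LINKEDLIST;
    return ENC_ZIPLIST;
}
```

With example limits of 512 entries and 64 bytes, pushing a 5-byte value onto a 10-element ziplist keeps the ziplist encoding, while pushing a 65-byte value switches to the linked list.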
It seems like there is a problem in the ziplist implementation or in the
conversion from ziplist to normal list, but to confirm this it would be very
useful to have some past backtraces, as Pieter suggested. If the stack trace
always looks similar, we can narrow our investigation to the ziplist
implementation and/or to the ziplist -> linkedlist conversion.
Thanks for your help!
Salvatore
Original comment by anti...@gmail.com
on 5 May 2011 at 8:52
There is another thing I have to mention. When nuaaknighte said 'It seems to
crash randomly.', he meant that the time of the crash is random, not the
commands being run. In fact, before the last crash (yesterday) we had dumped
the data, but Redis crashed again while we were restoring it. We tried several
times to restore the data, and it crashed every time at a similar percentage of
the restore, so we guess that some particular commands cause the crash.
Finally, thank you again for reading.
Original comment by 7842...@qq.com
on 5 May 2011 at 8:58
Give us the dump that causes the crash and we can fix the whole problem in
minutes :)
Original comment by anti...@gmail.com
on 5 May 2011 at 9:00
We have tested 2.2.1, 2.2.2, and 2.2.4; none of them solves this problem.
However, only one server crashes. The difference between the servers is that
only the crashing Redis instance has to operate on overlength lists.
Original comment by 7842...@qq.com
on 5 May 2011 at 9:02
T_T Unfortunately, we don't have the dump file that caused the crash, because
it was overwritten after we restarted Redis...
Original comment by 7842...@qq.com
on 5 May 2011 at 9:06
OK, it would have been helpful, but it is not a 'sine qua non' condition for
solving the problem. It will just be a bit harder. Could you please send us the
exact redis-server 2.2.4 binary that generated the stack trace you posted
above? This can be helpful in order to turn the offsets in the stack trace into
source code lines.
Original comment by anti...@gmail.com
on 5 May 2011 at 9:09
Pieter: I just disassembled the same version on a Linux box and I think I got
the same offsets as our user's binary. The segfault appears to be here:
628 p = ZIPLIST_ENTRY_TAIL(zl);
629 return (p[0] == ZIP_END) ? NULL : p;
Basically p points to an invalid address, so p[0] causes the exception:
0x0000000000416320 <ziplistNext+0>: push %rbx
0x0000000000416321 <ziplistNext+1>: mov %rsi,%rbx
0x0000000000416324 <ziplistNext+4>: sub $0x20,%rsp
0x0000000000416328 <ziplistNext+8>: cmpb $0xff,(%rsi)
0x000000000041632b <ziplistNext+11>: je 0x416350 <ziplistNext+48>
0x000000000041632d <ziplistNext+13>: mov %rsp,%rdi
0x0000000000416330 <ziplistNext+16>: callq 0x416140 <zipEntry>
0x0000000000416335 <ziplistNext+21>: mov 0xc(%rsp),%eax
0x0000000000416339 <ziplistNext+25>: add 0x10(%rsp),%eax
0x000000000041633d <ziplistNext+29>: lea (%rbx,%rax,1),%rax
0x0000000000416341 <ziplistNext+33>: cmpb $0xff,(%rax)
In the stack trace we have:
[25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(ziplistNext+0x1d)
[0x4141cd]
and ziplistNext+0x1d is what you see in the disassembly at ziplistNext+29.
For some lame reason the stack trace uses hexadecimal offsets and gdb decimal
ones... :)
Not enough to fix the problem, but I wanted to share this info with you.
Salvatore
Original comment by anti...@gmail.com
on 5 May 2011 at 10:01
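As a small aid for matching the hexadecimal offsets in a stack trace against the decimal offsets that gdb prints, here is a minimal sketch. The helper name `frame_offset` is hypothetical; it only assumes that a frame contains a `+0x` marker followed by hexadecimal digits, as in the trace quoted above.

```c
#include <stdlib.h>
#include <string.h>

/* Extract the offset from a backtrace frame such as
 * "/home/redis3/bin/redis-server(ziplistNext+0x1d)".
 * Returns the offset as a decimal value, or -1 if no "+0x" marker is found.
 * Hypothetical helper, for illustration only. */
long frame_offset(const char *frame) {
    const char *plus = strstr(frame, "+0x");
    if (plus == NULL) return -1;
    /* Skip "+0x" and parse the hexadecimal digits that follow;
     * strtol stops at the first non-hex character (here the ')'). */
    return strtol(plus + 3, NULL, 16);
}
```

For the frame above this yields 29, which is the `ziplistNext+29` line in the gdb disassembly.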
@nuaaknighte: could you please provide a list of the commands used against
lists? For instance, do you use LREM, LINSERT, LPUSH, RPUSH, ... or what? The
list of all the commands used will be very helpful. Another useful piece of
information is the kind of values you store in the lists: are they numbers or
strings? What is the average size of the strings, more or less? Thank you.
Salvatore
P.S. If you obtain a new dump.rdb file that is able to crash the server, please
save it!!! :)
Original comment by anti...@gmail.com
on 5 May 2011 at 10:06
Another thing we need is the output of:
redis-cli config GET '*'
thank you
Original comment by anti...@gmail.com
on 5 May 2011 at 10:17
Problem fixed in all the branches: 2.2, 2.4, unstable, 2.2-scripting, scripting.
Releasing 2.2.5 after adding the regression test.
Thanks to the original reporter of the problem and to Pieter Noordhuis for the
actual fix!
Salvatore
Original comment by anti...@gmail.com
on 5 May 2011 at 1:31
I'm still seeing a very similar crash on 2.2.6:
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: ======= Ooops! Redis 2.2.6
got signal: -11- =======
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]:
redis_version:2.2.6#015#012redis_git_sha1:00000000#015#012redis_git_dirty:0#015#
012arch_bits:32#015#012multiplexing_api:epoll#015#012process_id:25679#015#012upt
ime_in_seconds:1762#015#012uptime_in_days:0#015#012lru_clock:483061#015#012used_
cpu_sys:0.95#015#012used_cpu_user:0.60#015#012used_cpu_sys_childrens:3.06#015#01
2used_cpu_user_childrens:0.58#015#012connected_clients:6#015#012connected_slaves
:0#015#012client_longest_output_list:0#015#012client_biggest_input_buf:0#015#012
blocked_clients:1#015#012used_memory:43397212#015#012used_memory_human:41.39M#01
5#012used_memory_rss:29003776#015#012mem_fragmentation_ratio:0.67#015#012use_tcm
alloc:0#015#012loading:0#015#012aof_enabled:0#015#012changes_since_last_save:106
#015#012bgsave_in_progress:0#015#012last_save_time:1305064608#015#012bgrewriteao
f_in_progress:0#015#012total_connections_received:42#015#012total_commands_proce
ssed:2866#015#012expired_keys:0#015#012evicted_keys:0#015#012keyspace_hits:836#0
15#012keyspace_misses:1941#015#012hash_max_zipmap_entries:512#015#012hash_max_zi
pmap_value:64#015#012pubsub_channels:1#015#012pubsub_patterns:0#015#012vm_enable
d:1#015#012role:master#015#012vm_conf_max_memory:270582939#015#012vm_conf_page_s
ize:32#015#012vm_conf_pages:134217728#015#012vm_stats_used_pages:0#015#012vm_sta
ts_swapped_ob
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]:
/usr/local/bin/redis-server(handleClientsWaitingListPush+0xb6) [0x8068346]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]:
/usr/local/bin/redis-server(pushGenericCommand+0x131) [0x8068ae1]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]:
/usr/local/bin/redis-server(lpushCommand+0x19) [0x8068b49]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]:
/usr/local/bin/redis-server(call+0x25) [0x80555e5]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]:
/usr/local/bin/redis-server(processCommand+0x260) [0x8055930]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]:
/usr/local/bin/redis-server(processInputBuffer+0x5f) [0x805e30f]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]:
/usr/local/bin/redis-server(readQueryFromClient+0x70) [0x805e400]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]:
/usr/local/bin/redis-server(aeProcessEvents+0x132) [0x80520a2]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]:
/usr/local/bin/redis-server(aeMain+0x37) [0x80522b7]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]:
/usr/local/bin/redis-server(main+0x112) [0x8057752]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]:
/lib/tls/i686/nosegneg/libc.so.6(__libc_start_main+0xe6) [0xb756bbd6]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]:
/usr/local/bin/redis-server() [0x80516a1]
$ uname -a
Linux 2.6.32-312-ec2 #24-Ubuntu SMP Fri Jan 7 18:31:34 UTC 2011 i686 GNU/Linux
Attached are the server binary, the configuration used and the database as it
was on disk after the crash.
Original comment by j...@dwim.org
on 10 May 2011 at 10:47
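The INFO block in the log above is hard to read because the syslog daemon appears to have escaped the CR and LF characters as `#015` and `#012` (their octal codes). A minimal sketch of a decoder follows. The function name `decode_syslog_escapes` is hypothetical, and the sketch assumes that the line wraps introduced by the issue tracker have been removed first, so that each escape is contiguous.

```c
#include <stddef.h>

/* Decode "#ooo" escapes (three octal digits, value 0..255) found in src
 * and write the result to dst, which must hold at least strlen(src) + 1
 * bytes. Returns the number of bytes written, not counting the final NUL.
 * Hypothetical helper, for illustration only. */
size_t decode_syslog_escapes(const char *src, char *dst) {
    size_t n = 0;
    while (*src != '\0') {
        if (src[0] == '#' &&
            src[1] >= '0' && src[1] <= '3' &&
            src[2] >= '0' && src[2] <= '7' &&
            src[3] >= '0' && src[3] <= '7') {
            /* Combine the three octal digits into one byte. */
            dst[n++] = (char)(((src[1] - '0') << 6) |
                              ((src[2] - '0') << 3) |
                               (src[3] - '0'));
            src += 4;
        } else {
            /* Anything else is copied through unchanged. */
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Applied to `redis_version:2.2.6#015#012arch_bits:32`, this puts each INFO field back on its own line.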
Thanks for reporting, this looks like a different issue. Do you use BLPOP or
other blocking list operations? Thanks for everything you attached, this is
useful.
Original comment by anti...@gmail.com
on 11 May 2011 at 7:00
The bug is now fixed, sorry for the inconvenience.
You can either use the latest 2.2 branch on github, or simply download 2.2.6 and
remove list 820 from t_list.c:
Where there is:
unblockClientWaitingData(receiver);
redisAssert(ln != listFirst(clients));
Make sure that only the first line remains active:
unblockClientWaitingData(receiver);
// redisAssert(ln != listFirst(clients));
(In the above example I just commented out the line, but you can remove it if
you wish).
Redis 2.2.7 with this fix will be released today.
Original comment by anti...@gmail.com
on 11 May 2011 at 7:53
Sorry, I meant "line 820", not "list 820"...
Original comment by anti...@gmail.com
on 11 May 2011 at 8:00
Original issue reported on code.google.com by
nuaaknighte@gmail.com
on 5 May 2011 at 4:10