One of our Redis servers crashes frequently

GoogleCodeExporter commented 8 years ago

What version of Redis you are using, in what kind of Operating System?
Redis versions: 2.2.1/2.2.2/2.2.4 
System: Red Hat Enterprise Linux Server release 5.5

What is the problem you are experiencing?
We have dozens of Redis in 3 servers. 
Every service is robust, except one crashes frequently.
We have tested 2.2.1/2.2.2/2.2.4, none solves this problem.

What steps will reproduce the problem?
Not very clearly. It seems to crash randomly. 

Do you have an INFO output? Please past it here.

If it is a crash, can you please paste the stack trace that you can find in
the log file or on standard output? This is really useful for us!
[25846] 01 May 20:46:48 # ======= Ooops! Redis 2.2.4 got signal: -11- =======
[25846] 01 May 20:46:48 # redis_version:2.2.4
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:25846
uptime_in_seconds:775965
uptime_in_days:8
lru_clock:401976
used_cpu_sys:12823.11
used_cpu_user:9897.50
used_cpu_sys_childrens:0.00
used_cpu_user_childrens:0.00
connected_clients:12
connected_slaves:0
client_longest_output_list:1
client_biggest_input_buf:0
blocked_clients:0
used_memory:3046435192
used_memory_human:2.84G
used_memory_rss:3699609600
mem_fragmentation_ratio:1.21
use_tcmalloc:0
loading:0
aof_enabled:0
changes_since_last_save:98054217
bgsave_in_progress:0
last_save_time:1303478043
total_connections_received:28092933
total_commands_processed:1684214170
expired_keys:278925
evicted_keys:0
keyspace_hits:1674902439
keyspace_misses:7533554
hash_max_zipmap_entries:512
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:master
allocation_stats:2=6449197,6=139,7=79,8=4322558,9=614635210,10=657357700,
[25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(ziplistNext+0x1d) 
[0x4141cd]
[25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(ziplistNext+0x1d) 
[0x4141cd]
[25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(listTypeNext+0xa0) 
[0x41da20]
[25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(listTypeConvert+0x66) 
[0x41dad6]
[25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(listTypePush+0xea) 
[0x41e03a]
[25846] 01 May 20:46:48 # 
/home/redis3/bin/redis-server(pushGenericCommand+0xac) [0x41e83c]
[25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(call+0x23) [0x40db13]
[25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(processCommand+0x202) 
[0x40ddc2]
[25846] 01 May 20:46:48 # 
/home/redis3/bin/redis-server(processInputBuffer+0x54) [0x415f14]
[25846] 01 May 20:46:48 # 
/home/redis3/bin/redis-server(readQueryFromClient+0x6b) [0x415fdb]
[25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(aeProcessEvents+0x173) 
[0x40aa83]
[25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(aeMain+0x2e) [0x40ac9e]
[25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(main+0xed) [0x40f84d]
[25846] 01 May 20:46:48 # /lib64/libc.so.6(__libc_start_main+0xf4) 
[0x3789e1d994]
[25846] 01 May 20:46:48 # /home/redis3/bin/redis-server [0x40a079]

Please provide any additional information below.

Original issue reported on code.google.com by nuaaknighte@gmail.com on 5 May 2011 at 4:10

GoogleCodeExporter commented 8 years ago

Thanks for the report. The stack trace shows that this happens while the list 
is being converted from a ziplist to a linked list. Can you tell some more 
about the average length of your lists and length of the elements inside it? 
Also, could you provide INFO output and a dump of the configuration file? I'm 
investigating in the mean time.

Original comment by pcnoordh...@gmail.com on 5 May 2011 at 7:31

GoogleCodeExporter commented 8 years ago

You mention the crashes happen randomly, does this mean with the same stack 
trace or not?  When one of the three machines is showing problems, while every 
machine is subject to the same access patterns, this might hav something to do 
with corrupt memory of something like that. Do the machines have ECC memory? 
When the stack trace is identical across crashes, we can assume there's a bug 
in the code somewhere and can work from there. Could you post these details?

Ps. Never mind the INFO output: I overlooked it for the previous post, and went 
immediately to the stack trace ;-).

Original comment by pcnoordh...@gmail.com on 5 May 2011 at 8:13

GoogleCodeExporter commented 8 years ago

thanks for ur answer, im the nuaaknighte's colleague who works with him on 
Redis, and we guess the problem might be that when the lengh of a list over the 
conf which we set, then it should be unzipped to a normal date type which u 
mentioned in the doc, and then because of some op this list became shorter than 
the conf it will be zipped again, the problem happened. thats only our guess, 
and i hope it will help u to identify and to solve it.

Original comment by 7842...@qq.com on 5 May 2011 at 8:48

GoogleCodeExporter commented 8 years ago

It seems like there is a problem in the ziplist implementation or in the 
conversion from ziplist to normal list, but to confirm this it will be very 
useful to have some past back trace as Pieter suggested. If the stack trace 
always looks similar we can narrow our investigation to the ziplist 
implementation and/or to the ziplist -> linkedlist conversion.

Thanks for your help!
Salvatore

Original comment by anti...@gmail.com on 5 May 2011 at 8:52

GoogleCodeExporter commented 8 years ago

there is another thing i have to mention, nuaaknighte said 'It seems to crash 
randomly.' he means the time of it happened is randomly but not the op 
commands. actually, last time (yesterday)before it crashed we have dumped the 
data, but during we restore the data it crashed again. And we tried several 
times to restore the data, but it crashed everytime when it restored in similer 
percent, so we guess there is might be some special commands cause it to be 
crashed, in the last, thank u for reading it again.

Original comment by 7842...@qq.com on 5 May 2011 at 8:58

GoogleCodeExporter commented 8 years ago

Give us this dump that cause the crash and we can fix the whole problem in 
minutes :)

Original comment by anti...@gmail.com on 5 May 2011 at 9:00

GoogleCodeExporter commented 8 years ago

We have tested 2.2.1/2.2.2/2.2.4, none solves this problem. but the only one 
server crashed. the different between them is: only the crashed redis has to 
operate overlength list.

Original comment by 7842...@qq.com on 5 May 2011 at 9:02

GoogleCodeExporter commented 8 years ago

T_T badly, we dont have that dump file which cause the crash because it was 
overwrited after we restart the redis...

Original comment by 7842...@qq.com on 5 May 2011 at 9:06

GoogleCodeExporter commented 8 years ago

ok it was helpful but not a 'sine qua none' condition to solve the problem. 
Will just be a bit harder. Please can you send use the exact redis-server 2.2.4 
binary that generated the stack trace you posted above? This can be helpful in 
order to turn the offsets in the stack trace into source code lines.

Original comment by anti...@gmail.com on 5 May 2011 at 9:09

GoogleCodeExporter commented 8 years ago

Pieter: I just disassembled the same version in the Linux box and I think I got 
the same offsets of our user's binary. The segfault appears to be here:

628         p = ZIPLIST_ENTRY_TAIL(zl);
629         return (p[0] == ZIP_END) ? NULL : p;

Basically p points to an invalid address, so p[0] causes the exception:

0x0000000000416320 <ziplistNext+0>: push   %rbx
0x0000000000416321 <ziplistNext+1>: mov    %rsi,%rbx
0x0000000000416324 <ziplistNext+4>: sub    $0x20,%rsp
0x0000000000416328 <ziplistNext+8>: cmpb   $0xff,(%rsi)
0x000000000041632b <ziplistNext+11>:    je     0x416350 <ziplistNext+48>
0x000000000041632d <ziplistNext+13>:    mov    %rsp,%rdi
0x0000000000416330 <ziplistNext+16>:    callq  0x416140 <zipEntry>
0x0000000000416335 <ziplistNext+21>:    mov    0xc(%rsp),%eax
0x0000000000416339 <ziplistNext+25>:    add    0x10(%rsp),%eax
0x000000000041633d <ziplistNext+29>:    lea    (%rbx,%rax,1),%rax
0x0000000000416341 <ziplistNext+33>:    cmpb   $0xff,(%rax)

In the stack trace we have:

[25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(ziplistNext+0x1d) 
[0x4141cd]

and ziplistNext+0x1d is what you see in the disassemble at ziplistNext+29

For some lame reason the stack trace uses hexadecimal offsets and gdb decimal 
ones... :)

Not enough to fix the problem but I wanted to show this info with you.

Salvatore

Original comment by anti...@gmail.com on 5 May 2011 at 10:01

GoogleCodeExporter commented 8 years ago

@nuaaknighte: please could you provide a list of commands used against lists?

For instance do you use LREM, LINSERT, LPUSH, RPUSH, ... or what? The list of 
all the commands used will be very helpful. Also another useful information is 
the kind of values you store inside the strings. Are numbers or strings? What 
is the average size of the string more or less? Thank you.

Salvatore

p.s. please if you obtain a new dump.rdb file that is able to crash the server 
save it!!! :)

Original comment by anti...@gmail.com on 5 May 2011 at 10:06

GoogleCodeExporter commented 8 years ago

Another thing we need is the output of:

redis-cli config GET '*'

thank you

Original comment by anti...@gmail.com on 5 May 2011 at 10:17

GoogleCodeExporter commented 8 years ago

Problem fixed in all the branches: 2.2, 2.4, unstable, 2.2-scripting, scripting.

Releasing 2.2.5 after adding the regression test.

Thanks to the original reporter of the problem and to Pieter Noordhuis for the 
actual fix!

Salvatore

Original comment by anti...@gmail.com on 5 May 2011 at 1:31

GoogleCodeExporter commented 8 years ago

I'm still seeing a very similar crash on 2.2.6:

May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: ======= Ooops! Redis 2.2.6 
got signal: -11- =======
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: 
redis_version:2.2.6#015#012redis_git_sha1:00000000#015#012redis_git_dirty:0#015#
012arch_bits:32#015#012multiplexing_api:epoll#015#012process_id:25679#015#012upt
ime_in_seconds:1762#015#012uptime_in_days:0#015#012lru_clock:483061#015#012used_
cpu_sys:0.95#015#012used_cpu_user:0.60#015#012used_cpu_sys_childrens:3.06#015#01
2used_cpu_user_childrens:0.58#015#012connected_clients:6#015#012connected_slaves
:0#015#012client_longest_output_list:0#015#012client_biggest_input_buf:0#015#012
blocked_clients:1#015#012used_memory:43397212#015#012used_memory_human:41.39M#01
5#012used_memory_rss:29003776#015#012mem_fragmentation_ratio:0.67#015#012use_tcm
alloc:0#015#012loading:0#015#012aof_enabled:0#015#012changes_since_last_save:106
#015#012bgsave_in_progress:0#015#012last_save_time:1305064608#015#012bgrewriteao
f_in_progress:0#015#012total_connections_received:42#015#012total_commands_proce
ssed:2866#015#012expired_keys:0#015#012evicted_keys:0#015#012keyspace_hits:836#0
15#012keyspace_misses:1941#015#012hash_max_zipmap_entries:512#015#012hash_max_zi
pmap_value:64#015#012pubsub_channels:1#015#012pubsub_patterns:0#015#012vm_enable
d:1#015#012role:master#015#012vm_conf_max_memory:270582939#015#012vm_conf_page_s
ize:32#015#012vm_conf_pages:134217728#015#012vm_stats_used_pages:0#015#012vm_sta
ts_swapped_ob
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: 
/usr/local/bin/redis-server(handleClientsWaitingListPush+0xb6) [0x8068346]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: 
/usr/local/bin/redis-server(pushGenericCommand+0x131) [0x8068ae1]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: 
/usr/local/bin/redis-server(lpushCommand+0x19) [0x8068b49]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: 
/usr/local/bin/redis-server(call+0x25) [0x80555e5]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: 
/usr/local/bin/redis-server(processCommand+0x260) [0x8055930]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: 
/usr/local/bin/redis-server(processInputBuffer+0x5f) [0x805e30f]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: 
/usr/local/bin/redis-server(readQueryFromClient+0x70) [0x805e400]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: 
/usr/local/bin/redis-server(aeProcessEvents+0x132) [0x80520a2]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: 
/usr/local/bin/redis-server(aeMain+0x37) [0x80522b7]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: 
/usr/local/bin/redis-server(main+0x112) [0x8057752]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: 
/lib/tls/i686/nosegneg/libc.so.6(__libc_start_main+0xe6) [0xb756bbd6]
May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: 
/usr/local/bin/redis-server() [0x80516a1]

$ uname -a
Linux 2.6.32-312-ec2 #24-Ubuntu SMP Fri Jan 7 18:31:34 UTC 2011 i686 GNU/Linux

Attached are the server binary, the configuration used and the database as it 
was on disk after the crash.

Original comment by j...@dwim.org on 10 May 2011 at 10:47

Attachments:

GoogleCodeExporter commented 8 years ago

Thanks for reporting, this looks like a different issue. Do you use BLPOP or 
other blocking list operations? Thanks for everything you attached, this is 
useful.

Original comment by anti...@gmail.com on 11 May 2011 at 7:00

GoogleCodeExporter commented 8 years ago

The bug is now fixed, sorry for the inconvenient.

You can either use latest 2.2 branch on github, or simply download 2.2.6 and 
remove list 820 from t_list.c:

Where there is:

        unblockClientWaitingData(receiver);
        redisAssert(ln != listFirst(clients));

Make sure it will actually be just the first line:

        unblockClientWaitingData(receiver);
        // redisAssert(ln != listFirst(clients));

(In the above example I just commented the line, but you can remove it if you 
wish).

Redis 2.2.7 with this fix will be released today.

Original comment by anti...@gmail.com on 11 May 2011 at 7:53

GoogleCodeExporter commented 8 years ago

Sorry I mean "line 820" not "list 820"...

Original comment by anti...@gmail.com on 11 May 2011 at 8:00

yst726 / redis

One of our Redis servers crashes frequently #543