Automatically exported from code.google.com/p/redis

100% CPU usage, very slow response/timeout #425

GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
I'm not sure I can give you a proper bug report, but let me create this issue, 
and if you have no idea at all, just close it.

What steps will reproduce the problem?
1. we have 2 SET-based queues and 7 SET-based caches
2. the queues are quite small, typically holding <10 items (each <10 bytes)
3. the caches total about 800 MB, with values typically between 100 bytes and 8 KB
4. we do fewer than ~1000 queries per second according to the MONITOR command
5. sometimes Redis' CPU usage climbs to 100%, and stays there for 30-60s
6. while the CPU usage is around 100%, responses are very slow, sometimes taking
over 30s, sometimes timing out; the CLI benchmark tool reports about 50 GET
queries per second, while it typically reports 45,000
7. I suspected the issue was related to background saving, but it is not
8. we're using the factory default settings: no virtual memory or anything
like that

redis_version:2.0.2
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:29610
uptime_in_seconds:6006
uptime_in_days:0
connected_clients:20
connected_slaves:0
blocked_clients:0
used_memory:904343768
used_memory_human:862.45M
changes_since_last_save:715
bgsave_in_progress:0
last_save_time:1294150490
bgrewriteaof_in_progress:0
total_connections_received:18036
total_commands_processed:3992865
expired_keys:0
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:master
db0:keys=15,expires=0

It's Ubuntu Linux on an 8-core machine. Redis' memory usage is 10% of system memory.

Original issue reported on code.google.com by barthazi...@gmail.com on 5 Jan 2011 at 1:31

GoogleCodeExporter commented 8 years ago
Hello, are you sure you don't run commands like KEYS from time to time? Or
other similarly slow commands, like big set intersections or the like?

By the way, the simplest way to figure out what is happening is the following:
when it blocks, use the script you'll find at

http://poormansprofiler.org/

and post the output here. Run it a few times while Redis is stalled. This
script is able to tell us where Redis is spending so much time.
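
For reference, the poor man's profiler boils down to sampling gdb backtraces
in a loop and aggregating them. A minimal sketch, assuming the server binary
is named redis-server (the actual script on the site above may differ in
detail):

    pid=$(pidof redis-server)
    for i in $(seq 1 10); do
        # grab one stack snapshot of the running server
        gdb -batch -ex "set pagination 0" -ex "thread apply all bt" -p "$pid" 2>/dev/null
        sleep 0.1
    done | grep '^#' | awk '{ $1=""; $2=""; print }' | sort | uniq -c | sort -rn | head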

Thank you!
Salvatore

Original comment by anti...@gmail.com on 5 Jan 2011 at 1:38

GoogleCodeExporter commented 8 years ago
I confirm the issue.

Redis 2.0.4

Linux RHEL5 2.6.18 SMP, 16 cores

CPU usage ~ 100%

==> redis-benchmark:

====== PING ======
  10002 requests completed in 30.17 seconds
  50 parallel clients
  3 bytes payload
  keep alive: 1
......
331.57 requests per second (60343.38 requests per second when we disable the
new services that use Redis intensively)
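
For reference, figures like these come from redis-benchmark; the run above
was presumably invoked along these lines, where -c sets the number of
parallel clients and -n the total request count (the exact arguments used
are an assumption):

    $ redis-benchmark -c 50 -n 100000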

==> poormansprofiler output:

3 options:

1) 1 at,call,processCommand,processInputBuffer,readQueryFromClient,aeProcessEvents,aeMain,main
2) 1 "results:all:qZv9GAOK"),keysCommand,call,processCommand,processInputBuffer,readQueryFromClient,aeProcessEvents,aeMain,main
3) 1 dictNext,keysCommand,call,processCommand,processInputBuffer,readQueryFromClient,aeProcessEvents,aeMain,main

It also produces the following warnings:

warning: .dynamic section for "/lib64/libm.so.6" is not at the expected address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/lib64/libpthread.so.0" is not at the expected address

warning: difference appears to be caused by prelink, adjusting expectations

warning: .dynamic section for "/lib64/libc.so.6" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/lib64/ld-linux-x86-64.so.2" is not at the expected address

warning: difference appears to be caused by prelink, adjusting expectations

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff177fc000

When Redis works fine, the profiler output is as follows:

1 __epoll_wait_nocancel,aeApiPoll,flags=3),aeMain,main

with only one warning:

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fffa4ded000

Original comment by eugene.t...@gmail.com on 22 Jan 2011 at 12:15

GoogleCodeExporter commented 8 years ago
As expected, your poor man's profiler output shows that you are using the KEYS
command, which should never be used by clients in production code, only for
debugging.

As you can see in the documentation, KEYS is O(N) in both space and time, and
will trash Redis performance.

You can also check our documentation here:

http://redis.io/commands/keys

You can read there: "Warning: consider KEYS as a command that should only be 
used in production environments with extreme care. It may ruin performance when 
it is executed against large databases. This command is intended for debugging 
and special operations, such as changing your keyspace layout. Don't use KEYS 
in your regular application code. If you're looking for a way to find keys in a 
subset of your keyspace, consider using sets.".
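
As a minimal illustration of that set-based pattern (the key names here are
hypothetical, not from the reporter's setup):

    $ redis-cli SADD users:index user:1    # add to the index whenever the key is created
    $ redis-cli SADD users:index user:2
    $ redis-cli SMEMBERS users:index       # enumerate just that subset instead of KEYS users:*
    $ redis-cli SREM users:index user:2    # keep the index in sync when a key is deleted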

Therefore the issue is not valid.

Original comment by anti...@gmail.com on 22 Jan 2011 at 1:39

GoogleCodeExporter commented 8 years ago
P.S. Thank you for reporting anyway. While in this case the problem is clear,
in other cases the same macroscopic behavior may be due to something actually
wrong in the server.

Cheers,
Salvatore

Original comment by anti...@gmail.com on 22 Jan 2011 at 1:40

GoogleCodeExporter commented 8 years ago
Hi Salvatore,

Thanks for the prompt reply. We'll get rid of the KEYS command and report back
with the results.

Cheers
Eugene

Original comment by eugene.t...@gmail.com on 22 Jan 2011 at 7:00

GoogleCodeExporter commented 8 years ago
I was the original reporter. We had not used KEYS or commands like that, but to
be sure, I checked for it when I was trying to debug the problem. Anyway, we
had only about 10 keys, so I guess KEYS would be quite fast in this
situation. :)

Unfortunately we have since replaced the code that was based on Redis, so I
can't profile it. I'll try to find some time to reproduce it, but I can't say
it will be soon.

Original comment by barthazi...@gmail.com on 22 Jan 2011 at 7:13

GoogleCodeExporter commented 8 years ago
I hate to contradict the OP's statements, but it really smells like a BGSAVE
operation is overloading the (otherwise already fairly heavily loaded) box and
causing other commands to run slowly.

How many processors in the machine? How much memory does it have? What 
operations are being performed? Anything else running on that box?

Original comment by josiah.c...@gmail.com on 23 Jan 2011 at 8:03

GoogleCodeExporter commented 8 years ago
Barthazi, @Josiah: BGSAVE can't overload the server, as it runs in a separate
process. What can matter is running out of memory because of BGSAVE, which
makes the instance start swapping; that of course leads to a much slower
server, but it's a different problem.

Anyway, the only reliable way to debug these problems is to use a profiler to
check where the time is spent, together with the usual system tools, to check
whether the system is out of memory, whether there is OS buffer/page I/O, and
so forth. So far, every single time it was possible to understand what was
really going on, it was either an out-of-memory condition or the user doing
something strange like KEYS, big set operations like intersections, or huge
MULTI/EXEC calls with hundreds of commands inside.

I added the poor man's profiler script to the utils directory in the unstable
branch, file: utils/whatisdoing.sh

This is so far the best we can do to debug this kind of issue.
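
For the system-side checks mentioned above, the standard tools are enough; for
example (a sketch, any equivalent tool works):

    $ free -m     # is the box out of memory, or already dipping into swap?
    $ vmstat 1    # si/so columns show swap-in/out, bi/bo show block I/O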

Cheers,
Salvatore

Original comment by anti...@gmail.com on 23 Jan 2011 at 9:21

GoogleCodeExporter commented 8 years ago
I didn't see that it was an 8 core server, or that Redis was using 10% of the 
system memory.

If it's using EBS at Amazon (or some other network storage, wherever it is),
operations may be slow during a BGSAVE due to network contention...

Original comment by josiah.c...@gmail.com on 24 Jan 2011 at 1:54

GoogleCodeExporter commented 8 years ago
Works fine without the KEYS command.

Original comment by eugene.t...@gmail.com on 28 Jan 2011 at 2:35

GoogleCodeExporter commented 8 years ago
I would also like to ask for help in finding out why our Redis is using 100%
CPU. Apologies if I'm resurrecting an old thread.

The output for the poor man's profiler with 100 samples is

warning: Selected architecture i386:x86-64 is not compatible with reported
target architecture i386

warning: Architecture rejected target-supplied description

(the two warnings above are repeated six times)
    101 
     22 __accept_nocancel,anetGenericAccept,anetTcpAccept,acceptTcpHandler,aeProcessEvents,aeMain,main
     20 open64,_IO_file_fopen,??,redisLog,acceptTcpHandler,aeProcessEvents,aeMain,main
     16 __write_nocancel,sendReplyToClient,aeProcessEvents,aeMain,main
     12 epoll_wait,aeApiPoll,flags=<value,aeMain,main
      8 ??,??
      7 gettimeofday,gettimeofday,aeGetTime,aeProcessEvents,aeMain,main
      3 __read_nocancel,read,optimized,aeProcessEvents,aeMain,main
      3 gettimeofday,gettimeofday,aeGetTime,processTimeEvents,flags=<value,aeMain,main
      2 vfprintf,__vsnprintf_chk,vsnprintf,"",,anetGenericAccept,anetTcpAccept,acceptTcpHandler,aeProcessEvents,aeMain,main
      1 __vsnprintf_chk,vsnprintf,"",,anetGenericAccept,anetTcpAccept,acceptTcpHandler,aeProcessEvents,aeMain,main
      1 __vsnprintf_chk,vsnprintf,"\001\200\255\373\nsg20\ae",,anetGenericAccept,anetTcpAccept,acceptTcpHandler,aeProcessEvents,aeMain,main
      1 time,redisLog,acceptTcpHandler,aeProcessEvents,aeMain,main
      1 ??,??,redisLog,acceptTcpHandler,aeProcessEvents,aeMain,main
      1 epoll_ctl,aeApiAddEvent,fd=750,,_installWriteEvent,addReply,call,processCommand,processInputBuffer,readQueryFromClient,aeProcessEvents,aeMain,main
      1 epoll_ctl,aeApiAddEvent,fd=117,,_installWriteEvent,addReply,lookupKeyReadOrReply,getGenericCommand,call,processCommand,processInputBuffer,readQueryFromClient,aeProcessEvents,aeMain,main
      1 ??,??,??

We are using Redis 2.2.11.

Original comment by lpu...@insynchq.com on 5 Jul 2011 at 11:11

GoogleCodeExporter commented 8 years ago
I'm seeing similar behavior on Redis 2.4.7. Everything's going fine, and then
suddenly Redis jumps from 10% to 95% CPU usage and performance slows to a
crawl. The only commands running against the DB are GET, SET, EXPIRE, and
DELETE:

https://github.com/martinrusev/django-redis-sessions/blob/master/redis_sessions/session.py
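
That session backend boils down to a pattern like the following, shown here
with redis-cli for illustration (the key name is hypothetical; 1209600s is
Django's default two-week session age):

    $ redis-cli SET session:abc123 "<serialized session>"
    $ redis-cli EXPIRE session:abc123 1209600
    $ redis-cli GET session:abc123
    $ redis-cli DEL session:abc123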

At the point of failure there are lots of open connections:

$ netstat -tn|grep 6379|wc -l
1184

even though I stopped sending traffic to the redis server.  In addition, the 
following command just hangs and doesn't return:

$ redis-cli info

Is there a way to log all commands to see when it dies?
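
For what it's worth, MONITOR (mentioned earlier in the thread) streams every
command the server executes, and on 2.2.12 and later SLOWLOG records commands
that exceeded a configured latency threshold:

    $ redis-cli monitor         # live stream of every command (adds noticeable overhead)
    $ redis-cli slowlog get 10  # the ten most recent slow commands (Redis >= 2.2.12)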

Original comment by tob...@caktusgroup.com on 24 Feb 2012 at 9:47

GoogleCodeExporter commented 8 years ago
This appears to be a duplicate of #648:
http://code.google.com/p/redis/issues/detail?id=648

Original comment by findch...@gmail.com on 18 Apr 2012 at 1:08