wooga / eredis

Erlang Redis client
MIT License

PubSub client can get overloaded #28

Closed pmembrey closed 12 years ago

pmembrey commented 12 years ago

Hi guys,

I've picked up an issue with the current pub/sub client. From what I can tell, the process handling the TCP connection waits for the controlling process (the one that acks receipt of a message) to respond before it processes the next message. I guess that's to provide a level of flow control.

In my case though, we received a large batch of messages in one go (around 7,000 or so, I think). This filled up the message queue of the process handling the TCP connection (it's a gen_server), which then continues to grow and grow until I kill it.

I'm not sure how to fix this apart from adding a queue internally to the gen_server so that it serves messages from that rather than keeping messages in the TCP queue...

In short, I'm receiving a burst of messages which fills up the message queue on the gen_server which never manages to recover...
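
For reference, the subscriber is set up more or less like the README example. A rough sketch from memory (the channel name is just a placeholder):

    {ok, Sub} = eredis_sub:start_link(),

    %% The calling process becomes the controlling process and has to ack
    %% each message before the client sends it the next one.
    eredis_sub:controlling_process(Sub),
    eredis_sub:subscribe(Sub, [<<"events">>]),

    %% The subscription confirmation itself also needs an ack.
    receive
        {subscribed, _Channel, Sub} ->
            eredis_sub:ack_message(Sub)
    end.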

Any ideas?

knutin commented 12 years ago

Hi,

The behaviour you are describing is very surprising to me, as the design choices were made specifically to prevent this exact problem.

eredis_sub_client, which handles the connection, does not wait for a reply from the controlling process before receiving more messages on the socket from Redis. It does wait for an ack before sending the next message to the controlling process; in the meantime it queues messages from Redis internally. Pushing the queuing down to the TCP layer would force Redis to do the queuing, which can easily kill an instance by running out of memory. Instead, we do the queuing inside eredis, where the user can configure whether to drop messages or kill the client when a maximum queue size is reached (the defaults are drop and infinity).
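
To be explicit about those knobs, the maximum queue size and the overflow behaviour are given when starting the client, something along these lines (argument order from memory, so double-check against eredis_sub.erl):

    %% Cap the internal queue at 10000 messages and drop anything beyond that;
    %% passing exit instead of drop would kill the client when the cap is hit.
    {ok, Sub} = eredis_sub:start_link("127.0.0.1", 6379, "", 100, 10000, drop).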

The only way for the mailbox in eredis_sub_client to grow that big would be if we were using {active, true} on the socket. Maybe the problem is the performance of the queue module when dealing with big queues. I use queue:len/1, which is linear; this could be changed to keep a manual counter. process_info/1 should reveal exactly what is going on: if there are messages in the mailbox, process_info will show them.
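
Something like this against the eredis_sub_client pid should be enough to tell (Pid being whatever your supervisor started):

    %% Asking for specific items avoids printing the whole (potentially huge) mailbox.
    erlang:process_info(Pid, [message_queue_len, current_function, total_heap_size]).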

I would be curious to know exactly what is happening, as this behaviour is precisely what we don't want.

pmembrey commented 12 years ago

Hi,

Here's the process_info output. Maybe it's not the client itself but something else; I'm afraid I'm far from an expert. I've removed the data in the tcp messages as it's confidential, but the rest is intact.

[{current_function,{eredis_parser,do_parse_multibulk,3}},
 {initial_call,{proc_lib,init_p,5}},
 {status,runnable},
 {message_queue_len,15142},
 {messages,[{tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,<<...>>},
            {tcp,#Port<0.5638>,...},
            {tcp,...},
            {...}|...]},
 {links,[<0.103.0>,#Port<0.5638>]},
 {dictionary,[{'$ancestors',[testapp_redis,testapp_sup,
                             <0.52.0>]},
              {'$initial_call',{eredis_sub_client,init,1}}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.51.0>},
 {total_heap_size,34727220},
 {heap_size,4108475},
 {stack_size,21},
 {reductions,269218575},
 {garbage_collection,[{min_bin_vheap_size,46368},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,10}]},
 {suspending,[]}]

Hope that makes sense...

knutin commented 12 years ago

You're right: these are messages from the port filling up the mailbox. Just to rule out something silly, which version of eredis are you using?

I would like to reproduce it. I will try dumping a large number of messages into Redis. How is your consumer behaving? How fast can it ack each message?
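
For the reproduction I'm thinking of something simple along these lines, just firing a burst of publishes at a channel (channel name and message count are arbitrary):

    %% Publish a burst of messages as fast as possible to simulate the spike.
    {ok, C} = eredis:start_link(),
    [eredis:q(C, ["PUBLISH", "events", integer_to_list(N)]) || N <- lists:seq(1, 10000)].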

pmembrey commented 12 years ago

Hi,

It's the latest from the GitHub master branch (1.0.1 in the app file).

The receiver should be pretty fast. It pulls the message, spawns a process to handle it and then acks it. That's pretty much it.

In general I've seen it quite happy under a load of thousands of messages per second. However, it is possible for it to get a "burst" of, say, 10,000 messages or so. It looks like the sudden burst overwhelms the process to the point where it just doesn't stand a chance.

The app is always under a load of at least a thousand messages, and if it gets backed up a little, the continual load could easily stop it from recovering. gen_server does a selective receive, which makes the problem much worse (it scans the whole message queue), slowing things down even more while new messages keep arriving...
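
In spirit the receiver is just this (simplified; handle_message/2 stands in for our real handler):

    loop(Sub) ->
        receive
            {message, Channel, Payload, _ClientPid} ->
                %% Hand the real work off so the ack goes back straight away.
                spawn(fun() -> handle_message(Channel, Payload) end),
                eredis_sub:ack_message(Sub),
                loop(Sub)
        end.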

knutin commented 12 years ago

Ok. I just put together a small example and I'm able to reproduce the behaviour you are seeing. I will investigate further.

knutin commented 12 years ago

I found a problem and was able to fix at least the case I could reproduce. Let me know if you still experience this issue or run into any other problems. Thanks for reporting.

pmembrey commented 12 years ago

Deployed the update. I'll know by this time tomorrow whether this resolves the issue I came across. I'll post back as soon as I know.

Thanks for the very fast fix!

pmembrey commented 12 years ago

Hi,

I can confirm that I didn't see any problems today, even though I drastically increased the traffic (more than doubled it). Historically it would fail a couple of times a day; now, even with the additional pressure, it's solid as a rock.

Guess we can sign this one off :)