wistia / elixir_nsq

An NSQ client for Elixir and Erlang, written in Elixir.
MIT License

Timeouts on both Consumers and Producers #3

Closed alfredbaudisch closed 8 years ago

alfredbaudisch commented 8 years ago

Even at low volumes, both my consumer and producer time out from time to time, as per the logs below. Are there any config params that could be tweaked to avoid this?

Publisher:

13:49:15.706 [error] GenServer NsqPublisher.Publisher terminating
** (stop) exited in: GenServer.call(#PID<0.10321.0>, {:pub, "message"}, 5000)
    ** (EXIT) time out
    (elixir) lib/gen_server.ex:604: GenServer.call/3
    (elixir_nsq) lib/nsq/producer.ex:205: NSQ.Producer.pub/2
    (nsq_publisher) lib/publisher.ex:23: NsqPublisher.Publisher.handle_cast/2
    (stdlib) gen_server.erl:615: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:681: :gen_server.handle_msg/5
    (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3

13:49:15.707 [error] GenServer #PID<0.10321.0> terminating
** (Protocol.UndefinedError) protocol String.Chars not implemented for {:pub, "messages", "message"}
    (elixir) lib/string/chars.ex:3: String.Chars.impl_for!/1

    (elixir) lib/string/chars.ex:17: String.Chars.to_string/1
    (elixir_nsq) lib/nsq/connection.ex:217: NSQ.Connection.cmd/3
    (elixir_nsq) lib/nsq/producer.ex:256: NSQ.Producer.do_pub/3
    (stdlib) gen_server.erl:629: :gen_server.try_handle_call/4
    (stdlib) gen_server.erl:661: :gen_server.handle_msg/5
    (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3

Consumer:

13:49:25.968 [error] GenServer #PID<0.6130.4> terminating
** (FunctionClauseError) no function clause matching in NSQ.ConnInfo.conn_id/2
    (elixir_nsq) lib/nsq/conn_info.ex:9: NSQ.ConnInfo.conn_id(#PID<0.6129.4>, "parent:#PID<0.6129.4>:conn:192.168.0.63:4150")
    (elixir_nsq) lib/nsq/consumer/connections.ex:184: NSQ.Consumer.Connections.stop_connection/3
    (elixir_nsq) lib/nsq/consumer/connections.ex:166: anonymous fn/3 in NSQ.Consumer.Connections.stop_connections/3
    (elixir) lib/enum.ex:1623: Enum."-reduce/3-lists^foldl/2-0-"/3
    (elixir_nsq) lib/nsq/consumer/connections.ex:165: NSQ.Consumer.Connections.stop_connections/3
    (elixir_nsq) lib/nsq/consumer/connections.ex:95: NSQ.Consumer.Connections.update/3
    (elixir_nsq) lib/nsq/consumer/connections.ex:82: NSQ.Consumer.Connections.discover_nsqds_and_connect/2

    (elixir_nsq) lib/nsq/consumer/connections.ex:51: NSQ.Consumer.Connections.refresh/1

13:49:25.968 [error] Task #PID<0.10873.5> started from #PID<0.10871.5> terminating
** (stop) exited in: GenServer.call(#PID<0.6135.4>, {:send, "FIN 06fbac5b95a01000\n"}, 5000)
    ** (EXIT) no process
    (elixir) lib/gen_server.ex:604: GenServer.call/3
    (elixir_nsq) lib/nsq/connection/buffer.ex:101: NSQ.Connection.Buffer.send!/2
    (elixir_nsq) lib/nsq/message.ex:99: NSQ.Message.fin/1
    (elixir_nsq) lib/nsq/message.ex:208: NSQ.Message.respond_to_nsq/2
    (elixir_nsq) lib/nsq/message.ex:158: NSQ.Message.process_without_timeout/1
    (elixir) lib/task/supervised.ex:94: Task.Supervised.do_apply/2
    (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Function: #Function<1.124640675/0 in NSQ.Message.process/1>
    Args: []
MaxPower15 commented 8 years ago

Hi @alfredbaudisch, thanks for the report! Based on the stacktrace, these look to me like crashes/bugs, possibly in elixir_nsq. Can you let me know what's in your mix.exs file for elixir_nsq? That way I can trace these line numbers to see exactly what's happening.

A separate question: How frequently does this happen? And when it happens, does it reconnect and continue processing within a minute or two, or is it dead in the water?

alfredbaudisch commented 8 years ago

@MaxPower15 thanks for the prompt answer! In my mix.exs I'm adding the library from the repository, because the Hex package complains about ibrowse.app: {:elixir_nsq, github: "wistia/elixir_nsq"}.
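
For reference, that dependency entry would sit in mix.exs roughly like this (a minimal sketch):

    # mix.exs -- pulling elixir_nsq straight from GitHub instead of Hex
    defp deps do
      [
        {:elixir_nsq, github: "wistia/elixir_nsq"}
      ]
    end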

How frequently does this happen?

It happens randomly, roughly every 20k messages or so.

Does it reconnect and continue processing within a minute or two, or is it dead in the water?

I have a Supervisor for both, so my supervisor restarts them. I also tried running the consumer without a Supervisor, and it reconnected after the random timeout.
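
A minimal sketch of the kind of supervision described here (module names are assumptions based on the logs, not the actual sample code; Elixir 1.3-era Supervisor.Spec style, matching the snippets later in this thread):

    defmodule MyApp.Supervisor do
      use Supervisor

      def start_link do
        Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
      end

      def init(:ok) do
        children = [
          worker(NsqPublisher.Publisher, []),   # GenServer wrapping NSQ.Producer, seen in the logs above
          worker(NsqConsumer.Consumer, [])      # hypothetical wrapper around NSQ.Consumer
        ]

        # :one_for_one -> a crashed worker is restarted independently,
        # which is the "my supervisor restarts them" behavior described here
        supervise(children, strategy: :one_for_one)
      end
    end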

MaxPower15 commented 8 years ago

Ah, in that case, can you tell me what's in the mix.lock for it? Looking for the exact commit SHA.

alfredbaudisch commented 8 years ago

The SHA: 676b746672fc800272e07a4ca4a725b5d58c46b3

alfredbaudisch commented 8 years ago

Additional information:

MaxPower15 commented 8 years ago

Thanks! Are you operating in a docker cluster where nsqd's may come up / go down semi-frequently? For example, something with kubernetes or a spot fleet?

My initial guess is that NSQ.Producer.pub is failing because the nsqd connection is disappearing in a way that it doesn't handle well, and that NSQ.Consumer is failing because the code path to test disappearing connections from nsqlookupd just wasn't well tested and is bugged. Does this sound possible given your setup?

alfredbaudisch commented 8 years ago

@MaxPower15

Does this sound possible given your setup?

Yes.

But I just ran it on dedicated VPS images instead of Docker, and I'm getting this randomly as well:

[error] GenServer #PID<0.320.0> terminating
** (stop) exited in: GenServer.call(#PID<0.321.0>, {:pub, "foo"}, 5000)
    ** (EXIT) an exception was raised:
        ** (Enum.EmptyError) empty error
            (elixir) lib/enum.ex:1553: Enum.random/1
            (elixir_nsq) lib/nsq/producer.ex:172: NSQ.Producer.random_connection_pid/1
            (elixir_nsq) lib/nsq/producer.ex:255: NSQ.Producer.do_pub/3
            (stdlib) gen_server.erl:615: :gen_server.try_handle_call/4
            (stdlib) gen_server.erl:647: :gen_server.handle_msg/5
            (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: {:EXIT, #PID<0.319.0>, {{%Enum.EmptyError{message: "empty error"}, [{Enum, :random, 1, [file: 'lib/enum.ex', line: 1553]}, {NSQ.Producer, :random_connection_pid, 1, [file: 'lib/nsq/producer.ex', line: 172]}, {NSQ.Producer, :do_pub, 3, [file: 'lib/nsq/producer.ex', line: 255]}, {:gen_server, :try_handle_call, 4, [file: 'gen_server.erl', line: 615]}, {:gen_server, :handle_msg, 5, [file: 'gen_server.erl', line: 647]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}]}, {GenServer, :call, [#PID<0.321.0>, {:pub, "foo"}, 5000]}}}
State: {:state, {#PID<0.320.0>, NSQ.Producer.Supervisor}, :one_for_one, [{:child, #PID<0.341.0>, NSQ.Producer, {NSQ.Producer, :start_link, ["messages", %NSQ.Config{lookupd_poll_interval: 60000, message_handler: nil, dial_timeout: 1000, low_rdy_idle_timeout: 10000, output_buffer_size: 16384, rdy_redistribute_interval: 5000, tls_cert: nil, max_requeue_delay: 900000, backoff_strategy: :exponential, tls_min_version: nil, backoff_multiplier: 1000, write_timeout: 1000, deflate_level: 6, heartbeat_interval: 30000, tls_key: nil, sample_rate: 0, tls_v1: false, max_reconnect_attempts: 30, tls_insecure_skip_verify: false, max_attempts: 5, max_in_flight: 2500, client_id: nil, nsqlookupds: [], lookupd_poll_jitter: 0.3, msg_timeout: 60000, read_timeout: 60000, auth_secret: "", user_agent: nil, snappy: false, hostname: nil, deflate: false, max_backoff_duration: 120000, nsqds: ["127.0.0.1:4150"], event_manager: nil, rdy_retry_delay: 5000, output_buffer_timeout: 250}]}, :permanent, 5000, :worker, [NSQ.Producer]}], :undefined, 3, 5, [-576460339], 0, NSQ.Producer.Supervisor, {"messages", %NSQ.Config{lookupd_poll_interval: 60000, message_handler: nil, dial_timeout: 1000, low_rdy_idle_timeout: 10000, output_buffer_size: 16384, rdy_redistribute_interval: 5000, tls_cert: nil, max_requeue_delay: 900000, backoff_strategy: :exponential, tls_min_version: nil, backoff_multiplier: 1000, write_timeout: 1000, deflate_level: 6, heartbeat_interval: 30000, tls_key: nil, sample_rate: 0, tls_v1: false, max_reconnect_attempts: 30, tls_insecure_skip_verify: false, max_attempts: 5, max_in_flight: 2500, client_id: nil, nsqlookupds: [], lookupd_poll_jitter: 0.3, msg_timeout: 60000, read_timeout: 60000, auth_secret: "", user_agent: nil, snappy: false, hostname: nil, deflate: false, max_backoff_duration: 120000, nsqds: ["127.0.0.1:4150"], event_manager: nil, rdy_retry_delay: 5000, output_buffer_timeout: 250}}}
MaxPower15 commented 8 years ago

That one seems similar in that it's a problem with how we handle unexpected numbers of connections (in this case: 0). Also something we should fix. Based on the connections being empty--and since it isn't an immediate error--I would guess the nsqlookupd you're connected to suddenly reported no nsqds. Was the configuration pointing to nsqlookupd in this most recent case, or nsqd directly?

I will look into handling these errors in the next couple of days or so. If we're handling responses from nsqlookupd incorrectly, that would be bad. But my hunch is that we're handling cases we didn't expect poorly. On the bright side, if your supervisors are set up such that these crash and recover, I think the overall behavior should be pretty much the same as if we handled the case correctly--just less noisy, hehe. Then again, if it's not completely crashing, then that's bad news and something I should look deeper into.

Thanks again for the info!

MaxPower15 commented 8 years ago

Hi @alfredbaudisch, some updates on debugging.

First, I created a "stress test" locally where I feed in 25,000 messages and verify they were all received. This seemed to be working without error, although I'm not sure I'm operating under the same conditions as you.

For your most recent producer error, I'm looking primarily at this line and the code it depends on: https://github.com/wistia/elixir_nsq/blob/master/lib/nsq/producer.ex#L172. For that error to occur, get_connections/1 must be returning an empty list.
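
To make the failure mode concrete, here's a rough paraphrase of that selection step (not the exact library source; the {conn_id, pid} shape is inferred from the conn_id format in the consumer stacktrace above):

    # Rough paraphrase of the failing path (not the exact NSQ.Producer source).
    # `connections` stands in for the result of get_connections/1; when it is
    # empty, Enum.random/1 raises Enum.EmptyError, exactly as in the report above.
    defmodule ProducerSketch do
      def random_connection_pid(connections) do
        # each entry is assumed to be a {conn_id, pid} tuple
        {_conn_id, pid} = Enum.random(connections)
        pid
      end
    end

So ProducerSketch.random_connection_pid([]) reproduces the Enum.EmptyError seen in the crash.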

In your stacktrace, the thing that stands out to me is that the failing GenServer.call is targeting #PID<0.321.0>, but the reported state of the supervisor lists its pid as #PID<0.320.0>. Given the state, I'd expect GenServer.call(pid(0, 320, 0), ...) to work. #PID<0.320.0> should be a Supervisor process, and we query its children to get available connections using Supervisor.which_children/1. elixir_nsq does not start/stop supervisors on its own--that's what you're doing when you call NSQ.Producer.start_link--so my hunch is that there's some duplicate producer being started up somewhere else in your app. Does that ring any bells for you?

I still need to look into the consumer behavior. If my analysis doesn't seem right to you, it would be great to get a short code snippet along with your NSQ version and config to help me reproduce!

alfredbaudisch commented 8 years ago

@MaxPower15 everything makes sense, especially the part about get_connections/1 returning an empty list. If my nsq instance keeps dropping, then it has nothing to do with the library. But it drops both in a container and when installed directly on the machine.

I prepared a sample repository for both the Publisher and Consumer, reproducing the setup, which should also help answer your second point: https://github.com/alfredbaudisch/elixir_nsq_timeout_test

MaxPower15 commented 8 years ago

Thanks! For the consumer issue, I think that should be fixed with https://github.com/wistia/elixir_nsq/commit/78ff4044d84276f461b77c7acb9ac206e0b44822. None of the tests had covered stop_connection, and the input we were passing into it was wrong.

Looking at that sample code now, thanks.

alfredbaudisch commented 8 years ago

With my sample code I'm consistently able to reproduce the Producer timeout around the 20k-30k message mark when running: ab -n 40000 -c 100 "http://192.168.0.63:4000/api/publish".

{'** Generic server ~p terminating \n** Last message in was ~p~n** When Server state == ~p~n** Reason for termination == ~n** ~p~n', [NsqPublisher.Publisher, {:"$gen_cast", {:publish, "message"}}, #PID<0.415.0>, {{:timeout, {GenServer, :call, [#PID<0.416.0>, {:pub, "message"}, 5000]}}, [{GenServer, :call, 3, [file: 'lib/gen_server.ex', line: 604]}, {NSQ.Producer, :pub, 2, [file: 'lib/nsq/producer.ex', line: 205]}, {NsqPublisher.Publisher, :handle_cast, 2, [file: 'lib/publisher.ex', line: 23]}, {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 615]}, {:gen_server, :handle_msg, 5, [file: 'gen_server.erl', line: 681]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 240]}]}]})

It's the same timeout as the initial issue posted, but directly reproducible now.

alfredbaudisch commented 8 years ago

I also just noticed that while the stress tests are running, if I refresh the nsqadmin dashboard too much, the NSQ loading logo gets stuck and then the nsq instances crash, generating the GenServer timeout! Looks like a bug in NSQ itself! Not related to the loading logo of course, but it shows that NSQ gets stuck on something it should handle without breaking a sweat: opening the dashboard.

alfredbaudisch commented 8 years ago

Basically so far:

  1. GenServer timeout when running a stress test (a stress test that isn't even that heavy).
  2. nsq producers crashing when the nsqadmin dashboard is opened, thus timing out the producer GenServer.

2 definitely seems like something related to NSQ itself, but what about 1? Could that be related to some configuration tweaking that the library isn't handling well?

MaxPower15 commented 8 years ago

Hm, I'm a bit at a loss on this one so far. I cloned your project and followed the directions. Ran the stress test, but it seems like ab is timing out before I can get an error to appear:

➜  ~ ab -n 40000 -s 60 -c 100 "http://127.0.0.1:4000/api/publish"
This is ApacheBench, Version 2.3 <$Revision: 1663405 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 4000 requests
Completed 8000 requests
Completed 12000 requests
Completed 16000 requests
apr_socket_recv: Operation timed out (60)
Total of 16441 requests completed

I think I'm exhausting the max open connections for phoenix. 😛 I'm sometimes able to get it to finish, but no errors there either. If it's exhausting phoenix, something about that feels weird, because everything seems to be going through your GenServer.cast, so it should all be totally async. The only other thing I can think of is that maybe there's some big GC happening that slows it down in random places? The message queue must be pretty large.

I tested this on OSX Yosemite with NSQ from docker (per your instructions), as well as my local version (0.3.6) with the same results.

If you have any more suggestions/ideas for how to reproduce this, let me know!

alfredbaudisch commented 8 years ago

Oh, I'm not getting the ab timeout the way you did.

so it should all be totally async.

Yes!

I'll see if there is something more where I can help.

alfredbaudisch commented 8 years ago

I added Benchwarmer to the Publisher ({ :benchwarmer, "~> 0.0.2" }) and then ran this in IEx multiple times:

Benchwarmer.benchmark fn -> NsqPublisher.Publisher.publish("foo") end

More than a million messages were queued, and no crash whatsoever.

So it seems that what's really causing the timeouts is publishing via the Phoenix endpoint. But I still don't understand why, since in either case (IEx and Phoenix) the call to NSQ is async!

Only got this once in the Consumer:

19:58:41.547 [error] error: Unknown frame type 1694499120
<<54, 102, 100, 51, 55, 53, 55, 48, 51, 53, 102, 54, 48, 48, 48, 102, 111, 111, 0, 0, 0, 33, 0, 0, 0, 2, 20, 128, 150>>

19:59:41.499 [error] GenServer #PID<0.156.0> terminating
** (stop) exited in: GenServer.call(#PID<0.154.0>, {:recv, 2800223088}, 60000)
    ** (EXIT) time out

EDITED: I scheduled more than 20M messages with this method and saw no timeouts whatsoever. It definitely only happens when publishing from the API endpoint, which is actually a real use case for our project, scheduling messages from API calls :(!

MaxPower15 commented 8 years ago

Yeah, I'm tempted to close this because it seems like the root bottleneck is not with elixir_nsq, but I think this is such a common use case that it should be addressed.

As you saw, I'm having issues maxing out phoenix using ab--I'm looking into that separately. But since it prevents me from doing more testing, I was wondering how it fares if you use NSQ.Producer.pub directly in the controller? This would make publishing not completely async like you have it now, but the async piece seems to me like it might duplicate the purpose of NSQ, which is to be the thing that queues the incoming data. IMO, having backpressure there would be nice so you can avoid maxing out the message queue, and so you know when you need to spin up more publisher servers.

MaxPower15 commented 8 years ago

Found the issue causing my timeouts. It wasn't exactly related to phoenix:

Using the -k flag and changing my sysctl settings allowed me to do larger benchmarks. I did 200k requests in 216 seconds:

➜  ~ ab -k -n 200000 -s 60 -c 100 "http://127.0.0.1:4000/api/publish"
This is ApacheBench, Version 2.3 <$Revision: 1663405 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 20000 requests
Completed 40000 requests
Completed 60000 requests
Completed 80000 requests
Completed 100000 requests
Completed 120000 requests
Completed 140000 requests
Completed 160000 requests
Completed 180000 requests
Completed 200000 requests
Finished 200000 requests

Server Software:
Server Hostname:        127.0.0.1
Server Port:            4000

Document Path:          /api/publish
Document Length:        4 bytes

Concurrency Level:      100
Time taken for tests:   215.702 seconds
Complete requests:      200000
Failed requests:        0
Keep-Alive requests:    198050
Total transferred:      53153200 bytes
HTML transferred:       800000 bytes
Requests per second:    927.21 [#/sec] (mean)
Time per request:       107.851 [ms] (mean)
Time per request:       1.079 [ms] (mean, across all concurrent requests)
Transfer rate:          240.64 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       3
Processing:    19  108  62.9     94     867
Waiting:       19  108  62.9     94     867
Total:         19  108  62.9     94     867

Percentage of the requests served within a certain time (ms)
  50%     94
  66%    125
  75%    143
  80%    151
  90%    178
  95%    200
  98%    257
  99%    340
 100%    867 (longest request)

No errors for me unfortunately, so still having trouble reproducing.

For that "Unknown frame type" error, that would mean either there's some issue with NSQ.Connection.Buffer (this is where incoming/outgoing data is gathered and flushed when we have a complete NSQ packet), or the data being sent over the NSQ socket was corrupted. Looking at the elixir_nsq code, I'm not immediately sure how it could be dealing with it incorrectly--all we do is wait for 4 bytes for the size of the message, then wait until we have the full message. Rinse and repeat. That makes me think maybe some data coming through the socket was corrupt. That shouldn't happen frequently, but I've seen it in production.

MaxPower15 commented 8 years ago

BTW, I just did a benchmark using NSQ.Producer.pub directly. I did it by removing the NsqPublisher.Publisher worker and replacing it with

    supervisor(NSQ.Producer.Supervisor, ["messages", %NSQ.Config{nsqds: ["127.0.0.1:4150"]}, [name: NSQ.Producer.Supervisor]])

Then I used NSQ.Producer.pub(NSQ.Producer.Supervisor, "message") in the /api/publish action. It seems a little faster than using GenServer.cast; this run took 162 seconds:

➜  ~ ab -k -n 200000 -s 60 -c 100 "http://127.0.0.1:4000/api/publish"
This is ApacheBench, Version 2.3 <$Revision: 1663405 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 20000 requests
Completed 40000 requests
Completed 60000 requests
Completed 80000 requests
Completed 100000 requests
Completed 120000 requests
Completed 140000 requests
Completed 160000 requests
Completed 180000 requests
Completed 200000 requests
Finished 200000 requests

Server Software:
Server Hostname:        127.0.0.1
Server Port:            4000

Document Path:          /api/publish
Document Length:        4 bytes

Concurrency Level:      100
Time taken for tests:   162.443 seconds
Complete requests:      200000
Failed requests:        0
Keep-Alive requests:    198048
Total transferred:      53153152 bytes
HTML transferred:       800000 bytes
Requests per second:    1231.20 [#/sec] (mean)
Time per request:       81.222 [ms] (mean)
Time per request:       0.812 [ms] (mean, across all concurrent requests)
Transfer rate:          319.54 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       3
Processing:    15   81  48.4     59     591
Waiting:       15   81  48.4     59     591
Total:         15   81  48.4     59     591

Percentage of the requests served within a certain time (ms)
  50%     59
  66%     92
  75%    114
  80%    121
  90%    141
  95%    155
  98%    187
  99%    222
 100%    591 (longest request)

This makes sense to me if sending things through the socket wasn't the bottleneck. If it wasn't, then there's just less going on. I'm not absolutely sure that would hold up in production, since message sizes will probably be larger, but it should also make the required memory less spiky.
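
For reference, the controller-side change being benchmarked here would look roughly like this (a sketch; the module name, use line, and response body are guesses at the sample repo's layout):

    # Hypothetical Phoenix controller action publishing synchronously.
    defmodule NsqPublisher.PublishController do
      use NsqPublisher.Web, :controller

      def publish(conn, _params) do
        # blocks the request process until nsqd accepts the message,
        # so NSQ itself does the queueing and the caller gets backpressure
        NSQ.Producer.pub(NSQ.Producer.Supervisor, "message")
        text(conn, "ok")
      end
    end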

alfredbaudisch commented 8 years ago

@MaxPower15 sorry for the delay!

But since it prevents me from doing more testing, I was wondering how it fares if you use NSQ.Producer.pub directly in the controller?

That worked!

all we do is wait for 4 bytes for the size of the message, then wait until we have the full message. Rinse and repeat. That makes me think maybe some data coming through the socket was corrupt. That shouldn't happen frequently, but I've seen it in production.

I had never seen that error before anywhere and found little information about it on Google. At the same time, it only happened once across all the stress tests, but it's good to know that it may be related to data going through the socket.

Using the -k and changing my sysctl allowed me to do larger benchmarks. I did 200k requests in 216 seconds / Then I used NSQ.Producer.pub(NSQ.Producer.Supervisor, "message") in the /api/publish. Seems a little bit faster than using GenServer.cast, took 162 seconds to run

Getting roughly the same results here now. Also, moving the pub out of the GenServer, along the lines of your tips, solved the timeout issues.

As an experiment I also tried wrapping each .pub in a Task:

Task.start(fn -> NSQ.Producer.pub(NSQ.Producer.Supervisor, "message") end)

And no timeouts either.

alfredbaudisch commented 8 years ago

@MaxPower15 Thanks for all the detailed help and attention! I think we can wrap this up! Publishing directly from the Controller is working.

MaxPower15 commented 8 years ago

Thanks @alfredbaudisch, glad that worked for you! Note that I made a similar internal change to use GenServer.call here: https://github.com/wistia/elixir_nsq/commit/5934956d1805c586adc0e94315056c0bedc8718f. This should enforce backpressure when it makes sense instead of allowing a process mailbox to potentially overflow and crash or lose messages, which is my hunch as to what happened.
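
To illustrate the cast-vs-call distinction with the wrapper from the logs earlier in this thread (its real implementation is an assumption, not the sample repo's actual code):

    defmodule NsqPublisher.Publisher do
      use GenServer

      def start_link, do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
      def init(:ok), do: {:ok, nil}

      # cast: returns immediately; under load, messages pile up in this process's
      # mailbox faster than NSQ.Producer.pub/2 can drain them (the overflow
      # scenario hypothesized above).
      def publish_async(msg), do: GenServer.cast(__MODULE__, {:publish, msg})

      # call: the caller blocks until the publish finishes, so the HTTP request
      # process itself applies backpressure instead of the mailbox growing unbounded.
      def publish(msg), do: GenServer.call(__MODULE__, {:publish, msg})

      def handle_cast({:publish, msg}, state) do
        NSQ.Producer.pub(NSQ.Producer.Supervisor, msg)
        {:noreply, state}
      end

      def handle_call({:publish, msg}, _from, state) do
        {:reply, NSQ.Producer.pub(NSQ.Producer.Supervisor, msg), state}
      end
    end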