wistia / nsq-ruby

NSQ Ruby client
MIT License

Memory leak when creating and terminating many producers. #34

Closed leklund closed 7 years ago

leklund commented 7 years ago

While tracking down memory leaks in our Sidekiq workers I've discovered some problems with nsq-ruby. It appears to leak memory when creating and terminating large numbers of Nsq::Producer objects. The script I'm using is a bit contrived, but it mimics the behavior of a Sidekiq worker that was creating a new Producer per job.

First, there seems to be an issue with closing sockets. When the following code is run, it works for about 30k iterations before blowing up with Too many open files.

require 'nsq'
require 'get_process_mem'

ENV['NSQD_TCP_ADDRESS'] = 'nsqd:1910'

opts = {
  nsqd: ENV['NSQD_TCP_ADDRESS'],
  topic: 'some_topic',
}

n = 100_000

nsq_producer = Nsq::Producer.new(opts)

n.times do |i|
  # Create and terminate a fresh producer on every iteration, mimicking a
  # Sidekiq worker that builds a new Producer per job.
  nsq_producer = Nsq::Producer.new(opts)
  nsq_producer.write("{'message':'OHAI'}")

  nsq_producer.terminate

  # Log resident memory (KB) every 100 iterations.
  puts "#{i}\t#{GetProcessMem.new.kb}" if i % 100 == 0
end

puts "done\t#{GetProcessMem.new.kb}"

Output:

0        16636.0
100      18892.0
200      19992.0
300      20304.0
400      20372.0
500      20404.0
600      20412.0
700      20428.0
800      20444.0
900      20460.0
1000     20476.0
...
10000    43220.0
10100    43228.0
10200    43232.0
10300    43252.0
...
31900    69476.0
32000    69476.0
32100    69476.0
32200    70116.0
32300    71068.0
32400    71948.0
32500    73172.0
32600    75168.0
32700    77520.0
32800    79824.0
32900    82180.0
33000    84532.0
/home/vagrant/.rvm/gems/ruby-2.3.0/gems/nsq-ruby-2.0.3/lib/nsq/connection.rb:335:in `initialize': Too many open files - getaddrinfo (Errno::EMFILE)
    from /home/vagrant/.rvm/gems/ruby-2.3.0/gems/nsq-ruby-2.0.3/lib/nsq/connection.rb:335:in `new'
    from /home/vagrant/.rvm/gems/ruby-2.3.0/gems/nsq-ruby-2.0.3/lib/nsq/connection.rb:335:in `open_connection'
    from /home/vagrant/.rvm/gems/ruby-2.3.0/gems/nsq-ruby-2.0.3/lib/nsq/connection.rb:68:in `initialize'
    from /home/vagrant/.rvm/gems/ruby-2.3.0/gems/nsq-ruby-2.0.3/lib/nsq/client_base.rb:90:in `new'
    from /home/vagrant/.rvm/gems/ruby-2.3.0/gems/nsq-ruby-2.0.3/lib/nsq/client_base.rb:90:in `add_connection'
    from /home/vagrant/.rvm/gems/ruby-2.3.0/gems/nsq-ruby-2.0.3/lib/nsq/producer.rb:25:in `block in initialize'
    from /home/vagrant/.rvm/gems/ruby-2.3.0/gems/nsq-ruby-2.0.3/lib/nsq/producer.rb:25:in `each'
    from /home/vagrant/.rvm/gems/ruby-2.3.0/gems/nsq-ruby-2.0.3/lib/nsq/producer.rb:25:in `initialize'

When I add @socket.close before this line https://github.com/wistia/nsq-ruby/blob/master/lib/nsq/connection.rb#L360, the script runs to completion, but memory usage continues to grow.
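Roughly the shape of that change, for reference (a sketch only; the method name and surrounding structure are assumptions based on the linked file, not the exact patch):

# lib/nsq/connection.rb (sketch, not the exact diff)
def close_connection
  # Explicitly close the TCP socket so its file descriptor is returned to
  # the OS immediately, instead of leaking until GC finalizes the object.
  @socket.close if @socket && !@socket.closed?
  @socket = nil
end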

Memory usage holds steady for stretches and then jumps significantly in places as the loop runs.

0        18628.0
100      20880.0
200      21960.0
300      22340.0
400      22444.0
500      22456.0
...
15700    47768.0
15800    48808.0
15900    50768.0
16000    53160.0
16100    55460.0
16200    58108.0
16300    60908.0
16400    63720.0
16500    66272.0
16600    67216.0
16700    68196.0
16800    69256.0
16900    70420.0
...
62700    141256.0
62800    143492.0
62900    145812.0
63000    148052.0
63100    150248.0
63200    152448.0
63300    154256.0
63400    156032.0
63500    157796.0
63600    159560.0
63700    161336.0
63800    163072.0
63900    164848.0
64000    166628.0
64100    168348.0
64200    170096.0
64300    172320.0
64400    174676.0
64500    177012.0
64600    179376.0
64700    181712.0
64800    184056.0
64900    186372.0
...
99500    208380.0
99600    208380.0
99700    208388.0
99800    208396.0
99900    208400.0
done     208408.0

I'm opening a PR for the @socket.close bug. I haven't found an obvious fix for the memory leak. My workaround for our app is to reuse producers instead of re-creating them as needed.
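For anyone else hitting this from Sidekiq, the reuse approach looks roughly like this (a sketch; the worker class and memoized helper are illustrative, not part of nsq-ruby):

require 'sidekiq'
require 'nsq'

class PublishWorker
  include Sidekiq::Worker

  # Share one long-lived producer per process instead of building a new
  # connection for every job. The memoization here isn't strictly
  # thread-safe; a mutex could guard the first build if that matters.
  def self.producer
    @producer ||= Nsq::Producer.new(
      nsqd: ENV['NSQD_TCP_ADDRESS'],
      topic: 'some_topic'
    )
  end

  def perform(payload)
    self.class.producer.write(payload)
  end
end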

Tested using:

ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-linux]
nsq-ruby 2.0.3
bschwartz commented 7 years ago

Fixed via #36. Thanks @leklund!

bschwartz commented 7 years ago

Oops, I was going quickly and didn't realize there were two issues in one here.

It is disconcerting that this would cause lots of memory usage.

Like you said, reusing producers is an easy fix for this, but I'd still like to take a look at what's causing this memory not to be freed up.

bschwartz commented 7 years ago

@leklund Looks like the culprit is the at_exit handler that each new producer registers. The simplest solution would be to just remove it, which I think should generally be fine. Any particular opinions on removing it, @freerobby or @alieander?

Here's the relevant line: https://github.com/wistia/nsq-ruby/blob/master/lib/nsq/producer.rb#L31
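For the curious, the retention is easy to demonstrate in isolation: each at_exit block is a closure over self, and the VM holds every registered block until the process exits, so terminated producers can never be garbage collected. A standalone sketch (not nsq-ruby code):

class Leaky
  def initialize
    # This block closes over `self`, and Ruby keeps every at_exit proc
    # alive until the process exits -- so no Leaky instance is ever freed.
    at_exit { cleanup }
  end

  def cleanup; end
end

10_000.times { Leaky.new }
GC.start
puts ObjectSpace.each_object(Leaky).count # => 10000, all still reachable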

leklund commented 7 years ago

@bschwartz I can't think of any reason why removing that handler would cause issues.

freerobby commented 7 years ago

@bschwartz Will take a closer look -- my only initial question is: if we remove at_exit, are we assuming people will close their connections manually?

@leklund Thanks for the socket fix. I wonder if I should add a test that sets a low open file limit (e.g. via ulimit -n) and opens and closes a bunch of connections to ensure we don't regress here.
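Something like this could be the skeleton of that test (a sketch under assumptions: a local nsqd at 127.0.0.1:4150, and Process.setrlimit standing in for ulimit -n so the check stays self-contained):

require 'nsq'

# Lower the per-process open-file limit so a descriptor leak fails fast
# with EMFILE instead of taking ~30k iterations to surface.
Process.setrlimit(Process::RLIMIT_NOFILE, 256)

1_000.times do
  producer = Nsq::Producer.new(nsqd: '127.0.0.1:4150', topic: 'test')
  producer.terminate
end

puts 'no descriptor leak detected'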

bschwartz commented 7 years ago

@freerobby Yes, exactly. I think that's probably fine as we already instruct folks to terminate their producers and consumers in the docs.

I didn't realize it would hold on to the memory like that when I initially added the at_exit handlers. Should have realized that!

I'm going to remove them when I get a chance.