zk-ruby / zk

A high-level wrapper for Apache ZooKeeper
MIT License

Thread hangs on exit #50

Open tsilen opened 12 years ago

tsilen commented 12 years ago

Recently we've run into an issue where a process using zk and redis_failover sometimes fails to die on exit and instead hangs in an infinite sched_yield() loop, consuming all available CPU. A SIGKILL is required to get rid of it. I suspect it has to do with the program exiting without properly closing the connection first, but it should still die cleanly regardless.

gdb output gives this:

#0  0x00007fb7e6765a67 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fb7e6bd86f7 in gvl_yield (vm=0x1bfdf20, th=<optimized out>) at thread_pthread.c:125
#2  0x00007fb7e6bdac16 in rb_thread_schedule_limits (limits_us=0) at thread.c:1025
#3  rb_thread_schedule_limits (limits_us=0) at thread.c:1033
#4  rb_thread_schedule () at thread.c:1035
#5  0x00007fb7e6bdad5f in rb_thread_terminate_all () at thread.c:375
#6  0x00007fb7e6abf89e in ruby_cleanup (ex=0) at eval.c:140
#7  0x00007fb7e6abfa25 in ruby_run_node (n=0x24f0428) at eval.c:244
#8  0x00000000004007fb in main (argc=3, argv=0x7fff7725e948) at main.c:38

After adding some debug code on the Ruby side to capture a backtrace when the process is hung, I was able to get this:

Thread TID-t26i0
ruby-1.9.3-p194/lib/ruby/1.9.1/thread.rb:71:in `wait'
shared/bundle/ruby/1.9.1/gems/zk-1.7.1/lib/zk/threadpool.rb:268:in `worker_thread_body'

When trying to reproduce it without redis_failover, I was able to get it to hang in a similar way, but in a different place:

Thread TID-ccaag
ruby-1.9.3-p194/lib/ruby/1.9.1/thread.rb:71:in `wait'
shared/bundle/ruby/1.9.1/gems/zookeeper-1.3.0/lib/zookeeper/common/queue_with_pipe.rb:59:in `pop'
shared/bundle/ruby/1.9.1/gems/zookeeper-1.3.0/lib/zookeeper/common.rb:56:in `get_next_event'
shared/bundle/ruby/1.9.1/gems/zookeeper-1.3.0/lib/zookeeper/common.rb:94:in `dispatch_thread_body'

and

Thread TID-alg44
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/thread.rb:71:in `wait'
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/monitor.rb:110:in `wait'
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/monitor.rb:122:in `wait_while'
shared/bundle/ruby/1.9.1/gems/zk-1.7.1/lib/zk/client/threaded.rb:533:in `block in reconnect_thread_body'
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/monitor.rb:211:in `mon_synchronize'

The code in all of these places is similar: each thread is blocked in a @cond.wait call.

Any ideas? Ruby is 1.9.3-p194, but it also happens at least on 1.9.3-p0. ZK is 1.7.1, zookeeper 1.3.0. OS: Linux (Ubuntu 12.04).

At least on Ruby 1.9.3 you can pass a timeout to ConditionVariable#wait; maybe that would help.
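
A minimal sketch of the 1.9 form (the second argument to wait is a timeout in seconds; 1.8's ConditionVariable#wait does not accept one):

  require 'thread'

  mutex = Mutex.new
  cond  = ConditionVariable.new

  mutex.synchronize do
    cond.wait(mutex, 5)  # returns after 5 seconds even if nobody ever signals
  end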

With the following script I was able to get it to hang in a similar way quite often:

#!/usr/bin/env ruby
require 'rubygems'
require 'zk'
$stdout.sync = true
@zkservers = "localhost:2181"

# Dump backtraces of all live threads when the process receives SIGTTIN.
trap 'TTIN' do
  Thread.list.each do |thread|
    puts "Thread TID-#{thread.object_id.to_s(36)}"
    puts thread.backtrace.join("\n")
  end
end

def do_something
  zk = ZK.new(@zkservers)
  puts zk.children('/').inspect 
  sleep 1
end

puts "Pid: #{$$}"
count = 50
stack = []
(0..count).each do |i|
  stack << Thread.new { do_something }
end
sleep rand(0) # sleep a random fraction of a second, then fall off the end of the script

Run it in a shell loop, "while true; do ./test.rb; done", until it gets stuck; sending it kill -TTIN then prints the backtraces of the still-alive threads.

tsilen commented 12 years ago

One workaround would be calling close! in an at_exit { } block.
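
A minimal sketch of that workaround, assuming the application holds on to its ZK instance:

  zk = ZK.new(@zkservers)
  at_exit { zk.close! }  # tear down the connection's threads before the VM exits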

slyphon commented 12 years ago

Just wanted to let you know I got this. I'm not working today, but I'll try to get a fix together by the EOW.

ryanlecompte commented 12 years ago

Any update on this one?

maxjustus commented 12 years ago

+1

slyphon commented 12 years ago

Hey guys, sorry, I've been bogged down with holidays and nonsense. I'll try to get to this one this week.

slyphon commented 12 years ago

After a quick look, it seems the problem is (as always) 1.8.7 compatibility (yes, I know, talk to @eric), where only Monitor's condition variable takes a timeout. I'll have to review it a little more closely. I like the at_exit { zk.close! } idea; I'm just not sure whether it would cause a memory leak, since it keeps a reference to the zk instance. It shouldn't matter much, given that most people only use one or two zk instances, but still.
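
For reference, a minimal sketch of the Monitor-based pattern that works on both 1.8 and 1.9 (names here are illustrative, not ZK's actual internals):

  require 'monitor'

  lock = Monitor.new
  cond = lock.new_cond  # MonitorMixin's condition variable takes a timeout even on 1.8

  lock.synchronize do
    cond.wait(5)  # wakes up after 5 seconds even if never signaled
  end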

I'll try to review the places where wait() is being called and at least get a branch together that people can test against.

ryanlecompte commented 12 years ago

Thanks for taking a look! Regarding most people not having many ZK instances: that might not always be true if you're using redis_failover with a connection pool (as is done in Sidekiq, for example). Sidekiq lets you create a connection pool with e.g. 10 or 20 Redis clients, and each RedisFailover::Client instance will contain an underlying ZK instance.

ryanlecompte commented 12 years ago

By the way, I just updated (in master) RedisFailover::Client so that you can pass in an existing ZK client instance. You would use it like this:

  zk = ZK.new('localhost:2181,localhost:2182,localhost:2183,localhost:2184,localhost:2185')
  pool = ConnectionPool.new(:size => 20) { RedisFailover::Client.new(:zk => zk) }
  pool.with { |client| client.get('foo') }  

slyphon commented 12 years ago

This is probably how it should be done, just in terms of resource usage (open file handles, etc.).

slyphon commented 12 years ago

Ok, so I think the test @tsilen came up with exposes a bug in Ruby itself. I've run into issues before with at-exit GC and thread primitives (in 1.8, but still). If you modify the above torture test:

#!/usr/bin/env ruby
require 'rubygems'
require 'zk'
require 'logger'

$stdout.sync = true
$log = Logger.new($stderr).tap { |n| n.level = Logger::DEBUG }

@zkservers = "localhost:2181"

trap 'TTIN' do
  Thread.list.each do |thread|
    puts "Thread TID-#{thread.object_id.to_s(36)}"
    puts thread.backtrace.join("\n")
  end
end

def do_something
  ZK.open(@zkservers) do |zk|
    thr_id = Thread.current.object_id.to_s(36)
    $log.debug { "#{thr_id} #{zk.children('/').inspect}" }
  end
  sleep 1
end

puts "Pid: #{$$}"
count = 50
stack = []
(0..count).each do |i|
  stack << Thread.new { do_something }
end
# sleep rand(0)

# Wait (up to 3 seconds each) for all workers before falling off the end of the script.
stack.each do |th|
  begin
    th.join(3)
  rescue Exception => e
    puts "kill: #{e.class} #{e.message}"
  end
end

By waiting for the threads to exit before "falling off the end of the script", I can't get it to hang.

tsilen commented 12 years ago

Yes, the issue seems to happen when a thread is alive and in some specific state at the end of the script, so waiting for them to exit should also do the trick. I did some quick testing with the updated script and didn't get it to hang either, but I'll test a bit more tomorrow.

slyphon commented 12 years ago

Ok, cool. I'm like 99% sure that this is a bug in the way ruby does ConditionVariable cleanup.

tsilen commented 12 years ago

Looks like close! in at_exit indeed doesn't work in all cases. This is quite a big problem, because, for example, Rails (Rack) processes often hang when they should exit after being released from e.g. Passenger's pool, and they keep eating a lot of resources. I think this is something that has to be handled within zk.

slyphon commented 12 years ago

Yeah, you need to close your connections before "falling off the end of the script" as above. You cannot rely on Ruby to shut down these clients properly, because it doesn't clean up POSIX thread primitives correctly. I tested in an at_exit and it didn't work; I had to wait for all of the clients to finish closing and join all threads before reaching the end.

I'm not sure how to handle this in terms of Passenger, but I'd imagine there's some kind of shutdown hook you can use.

Again, there's really nothing I can do on the ZK end, as I don't control the main thread. The only other thing you can do is do your work inside a ZK.open() block, which ensures the connection is shut down when the block returns.
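
That is, the same pattern as in the modified test above:

  ZK.open(@zkservers) do |zk|
    # do work here; the connection and its threads are closed when the block returns
    zk.children('/')
  end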

mperham commented 11 years ago

Can you modify the code to use 1.9's condvar API when running on 1.9?

mperham commented 11 years ago

We are seeing Sidekiq hang occasionally on exit in production, and yep, we're running redis_failover.

slyphon commented 11 years ago

If you close ZK before the process exits, the hang doesn't happen, because all the underlying POSIX stuff gets cleaned up. Clean shutdown is key.

The reason I have to use Monitor is that @eric, who's been a major contributor, is still running 1.8, and I've agreed to maintain backwards compatibility. 1.8's ConditionVariable doesn't have a timeout option, which makes this really difficult.

eric commented 11 years ago

Looking at the 1.9 code for monitor.rb, it looks like it's just using a standard ConditionVariable and Mutex under the covers. There shouldn't be any real difference in how it acts on 1.9.

slyphon commented 11 years ago

Except that ConditionVariable and Mutex in 1.9 are backed by real POSIX thread primitives, and 1.9 doesn't clean them up properly before exiting, so it deadlocks.

mperham commented 11 years ago

If you are suggesting there's a bug in Ruby 1.9, that's quite possible. I found a bug in ConditionVariable last year, made a reproducible test case, and had it fixed in HEAD the next morning.

eric commented 11 years ago

Yes, the consensus is that there is a bug in the 1.9 Mutex/ConditionVariable implementation that causes the VM to hang if any thread is still waiting on a ConditionVariable when the process exits.
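
A minimal reproducer of that scenario (whether it actually hangs depends on the Ruby build, per the reports above):

  require 'thread'

  m = Mutex.new
  c = ConditionVariable.new

  Thread.new { m.synchronize { c.wait(m) } }  # this thread blocks in wait forever
  sleep 0.1
  # Fall off the end of the script with a thread stuck in ConditionVariable#wait;
  # on affected 1.9 builds the VM can spin in sched_yield() here instead of exiting.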

Does sidekiq do graceful shutdowns? If so, it's likely you could work around this by calling close! on the ZK object before exiting.

mperham commented 11 years ago

Where's the redmine ticket for the bug?

Sidekiq knows nothing about ZK. It's using the redis client, which is wrapped by the redis_failover client. Sidekiq does not bother to close sockets/clients before shutting down, and redis_failover does not provide a close/shutdown operation.

eric commented 11 years ago

Someone will have to pipe up if they've isolated the problem into a reproducible test case, but from reading this thread, you and @tsilen have described it as an "occasional" issue, which would lead me to believe neither of you has isolated it.

We're all volunteers here. I run Ruby 1.8 and haven't run into this specific issue, but hopefully one of the self-motivated people on this thread who runs 1.9 and has been impacted by it can find the time to isolate it and submit a redmine ticket.

In the meantime, it may be worth investigating a mechanism in sidekiq for users of the library to register callbacks for performing a clean shutdown, which would allow a user to gracefully shut down the redis_failover or zk client.
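
A hypothetical shape for such a hook (illustrative only, not an actual Sidekiq API):

  # Library side: a registry of shutdown callbacks.
  module ShutdownHooks
    def self.hooks
      @hooks ||= []
    end

    def self.register(&block)
      hooks << block
    end

    # The library would call this once, just before the process exits.
    def self.run_all
      hooks.each { |hook| hook.call }
    end
  end

  # User side: register a graceful teardown, e.g.
  #   ShutdownHooks.register { redis_failover_client.shutdown }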

ryanlecompte commented 11 years ago

FYI, redis_failover does provide the RedisFailover::Client#shutdown method, which closes the ZK connection and the underlying Redis clients.
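
A sketch of wiring that into shutdown (with the caveat from earlier in the thread that at_exit alone may not cover every case):

  require 'zk'
  require 'redis_failover'

  client = RedisFailover::Client.new(:zk => ZK.new('localhost:2181'))

  at_exit { client.shutdown }  # closes the ZK connection and the underlying Redis clients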

slyphon commented 11 years ago

"Where's the redmine ticket for the bug?"

You're kidding.

mperham commented 11 years ago

Ryan, FYI, it's not listed here (the docs page linked from your readme): http://rubydoc.info/github/ryanlecompte/redis_failover/master/RedisFailover/Client