zombocom / puma_worker_killer

Automatically restart Puma cluster workers based on max RAM available

Continued cycling/killing of puma workers (wrong ones?) #14

Closed travisp closed 4 years ago

travisp commented 9 years ago

Running 5 Puma workers on a 2X Heroku dyno, we hit a situation (after many hours, once memory ran out) where puma_worker_killer continually killed and cycled the workers with indexes 1-3. Memory was not dropping to an acceptable level, and puma_worker_killer never tried sending a TERM to the other workers. My best guess is that the reported memory usage for the individual workers was wrong (i.e. largest_worker was wrong), so it was killing the wrong ones. This has happened multiple times for us.

It seems the potential mitigating solutions are:

schneems commented 9 years ago

Check out https://github.com/schneems/puma_worker_killer/issues/8#issuecomment-107673407

It's due to the way copy-on-write (CoW) optimized Ruby apps work. It sounds like you've got a leak or other memory problem that PWK is covering up. Have you seen https://github.com/schneems/derailed_benchmarks ? Maybe try to decrease your overall memory use.

travisp commented 9 years ago

I don't disagree that we've likely got a memory leak that becomes an issue after several hours. We're doing what we can to identify it, but it would be nice if PWK could help in the meantime, as the README suggests:

If you have a memory leak in your code, finding and plugging it can be a herculean effort. Instead what if you just killed your processes when they got to be too large? The Puma Worker Killer does just that. Similar to Unicorn Worker Killer but for the Puma web server.

Would you be open to some sort of max_requests option, similar to unicorn worker killer?

schneems commented 9 years ago

Would you be open to some sort of max_requests option, similar to unicorn worker killer?

I honestly don't know. It would depend on the implementation. The UWK relies on some metaprogramming to open up Unicorn classes to record the count. My gut instinct is that it won't fit into the current core codebase and might be a better separate lib but IDK.

Would killing off workers after x minutes instead of x requests work for your use case? I think that would be much easier: spawn a new thread, record the current time, and every 10 seconds wake up to see if it's past the maximum time. When that occurs, do a rolling worker restart (kill each worker with a slight delay between them so throughput doesn't just go away), as sketched below.
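
As a rough illustration of that idea (a minimal sketch only, not PWK's actual implementation; the workers list, the max_age value, and the 5-second stagger are all placeholder assumptions):

Thread.new do
  started_at = Time.now
  max_age    = 4 * 3600 # seconds before a rolling restart kicks in; illustrative value

  loop do
    sleep 10                           # wake up every 10 seconds
    next unless Time.now - started_at > max_age

    workers.each do |worker|           # assumed handle to the cluster's worker list
      Process.kill(:TERM, worker.pid)  # Puma's master re-spawns the killed worker
      sleep 5                          # stagger the kills so throughput doesn't drop to zero
    end
    started_at = Time.now              # reset the timer after a full rolling restart
  end
end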

travisp commented 9 years ago

@schneems I think allowing a rolling worker restart after some period of time could be a very helpful solution. We've managed to mostly control the problem for now through some pretty aggressive tuning of Ruby GC settings (at a moderately large cost to performance), but I could see this still being useful to us and others at times.

schneems commented 9 years ago

Great, I think that such a feature will be fairly straightforward. I'm off on paternity leave right now and will be back after July 15th. If you don't hear from me before then, would you mind pinging this thread?

forrestblount commented 9 years ago

I would also be interested in this rolling restart solution. We use some pdf manipulation libraries that always end up chewing into swap.

schneems commented 9 years ago

Take a look at https://github.com/schneems/puma_worker_killer/pull/16 . If it looks good and the tests pass, I'll merge it into master and you can try it out.

forrestblount commented 9 years ago

Doesn't look like the tests pass, but I've set it up in our staging environment and will let you know if I have any issues.

forrestblount commented 9 years ago

Because I'm using this on Heroku I added an initializer with only one line: PumaWorkerKiller.enable_rolling_restart(4 * 3600)

It doesn't look like it is restarting every few hours though. Did I misunderstand the documentation?

schneems commented 9 years ago

Are you using any kind of logging add-on like Papertrail? You can grep for "PumaWorkerKiller: Rolling Restart". You can also try it locally with the frequency set to something really low, like 10 seconds, just to verify it's set up correctly.
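
For a quick local check, a one-line initializer with a deliberately tiny interval works (10 seconds is only for verification, not production; the file path is just a suggestion):

# config/initializers/puma_worker_killer.rb
PumaWorkerKiller.enable_rolling_restart(10) # restart roughly every 10 seconds so the log line shows up quickly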

forrestblount commented 9 years ago

We use Logentries. It's definitely not working for me. Even if I use the full config, I never see rolling restarts, even locally with 10s.

Things I've tried:

PumaWorkerKiller.config do |config|
  config.ram = 6144 # mb
  config.frequency = 5 # seconds
  config.percent_usage = 0.98
  config.rolling_restart_frequency = 10 # 4 * 3600
end
PumaWorkerKiller.start

Could it be related to my puma server configuration? Wondering if the @cluster isn't getting set correctly -- that seems to be the only place where the Rolling Restart would kill itself.

Here's my puma config (v 2.12.2):

workers Integer(ENV['WEB_CONCURRENCY'] || 2)
threads_count = Integer(ENV['MAX_THREADS'] || 5)
threads threads_count, threads_count

preload_app!

rackup DefaultRackup
port ENV['PORT'] || 3000
environment ENV['RACK_ENV'] || 'development'

on_worker_boot do
  ActiveSupport.on_load(:active_record) do
    ActiveRecord::Base.establish_connection
  end
end

Thanks in advance for your help.

schneems commented 9 years ago

I realized I didn't call start on the AutoReap.new instance. It's fixed in the branch and should be executing now. Unfortunately my tests are pretty non-deterministic; once I get those sorted out I'll merge. Thanks for the feedback and your patience.

forrestblount commented 9 years ago

Thanks for creating this helpful tool! I can confirm it's now working as expected.

schneems commented 9 years ago

:rocket: I'm curious: what do you think a good default reap time would be? I currently have it set to 12 hours. I don't want to bring down your workers more than needed, since it does affect throughput when it happens, but I also don't want to wait so long that you're swapping. Twice a day seemed like an okay guess.

Second question: did you see any noticeable dip in throughput when the reaper kicks in? I put some randomization in there so not every server tries to restart at the same time, but I'm not sure if it's enough (it's plus/minus 5 seconds). If you're seeing a big drop in throughput we might want to make that delta larger. You might not see a problem until you add a large number of servers, or maybe it's not that bad.
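
Roughly, the randomization amounts to adding a per-process splay to the restart timer, something like this (illustrative only, not the exact PWK code; frequency stands for the configured restart interval):

splay = rand(-5.0..5.0)   # current window is plus/minus 5 seconds; a larger delta would just widen this range
sleep(frequency + splay)  # each server waits a slightly different time before starting its rolling restart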

forrestblount commented 9 years ago

12 hours isn't bad for a default, but it really depends on the size of the dyno and number of workers running. For us to stay on the 2x dynos, I've got it set to run every 4 hours in staging.

Because we're typically running < 5 dynos, I am seeing some issues with throughput. I think a delta of 1-3 minutes would work better here, the idea being that only 1-2 dynos are ever out of commission at the same time. Is this a moot point if the app has preboot enabled? I've only been testing in staging so far (where I have noticed throughput dropping to near zero, but where we typically run 1-2 dynos).


jvenezia commented 8 years ago

Hi,

I'm trying the exact same thing as @forrestblount in a local development environment. I can't see any restarts in the server logs, and @cluster.running? is always returning nil, which I assume prevents my workers from being restarted.

Any idea why? Thanks!

schneems commented 8 years ago

The thread that basically sleeps for 12 hours is probably either not getting started, or getting started in a weird place. Is preload_app! set in your puma config? I think when it isn't enabled there's an edge case we're not handling properly.

jvenezia commented 8 years ago

Hi, I'm using app preload. Here are my puma settings:

workers Integer(ENV['PUMA_WORKERS'] || 2)
threads_count = Integer(ENV['PUMA_WORKER_THREADS'] || 5)
threads threads_count, threads_count

preload_app!

rackup DefaultRackup
port ENV['PORT'] || 3000
environment ENV['RACK_ENV'] || 'development'

on_worker_boot do
  ActiveSupport.on_load(:active_record) do
    config = ActiveRecord::Base.configurations[Rails.env] || Rails.application.config.database_configuration[Rails.env]
    config['pool'] = threads_count
    ActiveRecord::Base.establish_connection(config)
  end
end

schneems commented 8 years ago

Can you give me an example app that reproduces this in development?