zombocom / puma_worker_killer

Automatically restart Puma cluster workers based on max RAM available
749 stars 77 forks source link

Rolling restart uses pids that were killed instead of new pids #46

Closed nritholtz closed 4 years ago

nritholtz commented 7 years ago

Here is the related puma_worker_killer section from our puma config:

before_fork do
  ActiveRecord::Base.connection_pool.disconnect!
  unless Rails.env.development?
    PumaWorkerKiller.config do |config|
      config.ram           = 512 # mb
      config.frequency     = 10    # seconds
      config.percent_usage = 0.98
      config.rolling_restart_frequency = 6 * 3600 # 6 hours in seconds
    end
    PumaWorkerKiller.start
  end
end

Just for informational sake, I am running puma with 2 workers with min 1 max 5 threads.

I found that the Rolling Restart feature was sending TERM to a pid that was already killed due to Out of memory, instead of the newly spawned pid.

Here is the related excerpt from our logs:

00:51:49 [15] PumaWorkerKiller: Out of memory. 2 workers consuming total: 502.4765625 mb out of max: 501.76 mb. Sending TERM to pid 3047 consuming 177.96484375 mb.
...........................
03:47:37 [15] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 493.609375 mb out of max: mb. Sending TERM to pid 3047.

Here is the related kill/boot-up excerpt:

00:51:49 [15] PumaWorkerKiller: Out of memory. 2 workers consuming total: 502.4765625 mb out of max: 501.76 mb. Sending TERM to pid 3047 consuming 177.96484375 mb.

00:51:49 [15] - Worker 1 (pid: 3940) booted, phase: 0

00:51:59 [15] PumaWorkerKiller: Consuming 456.82421875 mb with master and 2 workers.

So from what I see, the Rolling Restart should have actually killed pid 3940 instead of 3047.

schneems commented 7 years ago

I've not tested running both OOM and rolling restarts together. I would believe there's a bug here.

prashantvithani commented 7 years ago

It happens with only Rolling restart as well

Look at the PIDs to which TERM is sent

[15908] PumaWorkerKiller: Consuming 3078.51953125 mb with master and 2 workers.
[15908] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 3078.51953125 mb. Sending TERM to pid 15916.
[15908] PumaWorkerKiller: Consuming 3078.53125 mb with master and 2 workers.
[15908] PumaWorkerKiller: Consuming 3078.53125 mb with master and 2 workers.
[15908] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 3078.53125 mb. Sending TERM to pid 15913.
[15908] PumaWorkerKiller: Consuming 3078.56640625 mb with master and 2 workers.
[15908] PumaWorkerKiller: Consuming 3078.56640625 mb with master and 2 workers.

The PIDs of the workers are same

deploy   15908  0.0  3.6 2120036 277748 ?      Sl   04:26   0:04 puma 3.8.2 (unix:///var/www/fbdash/shared/tmp/sockets/puma.sock) [20170801151810]                                   
deploy   15913  0.6 20.2 3504200 1554988 ?     Sl   04:26   4:57 puma: cluster worker 0: 15908 [20170801151810]                                                                      
deploy   15916  0.6 17.2 3189164 1319880 ?     Sl   04:26   4:35 puma: cluster worker 1: 15908 [20170801151810] 
schneems commented 7 years ago

A bit of trivia: PIDs can be recycled by the OS, so even if one is killed a new one may have the exact same PID number.

That's not what I think is happening though.

I'm looking at this example:

[15908] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 3078.51953125 mb. Sending TERM to pid 15916.
[15908] PumaWorkerKiller: Consuming 3078.53125 mb with master and 2 workers.
[15908] PumaWorkerKiller: Consuming 3078.53125 mb with master and 2 workers.
[15908] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 3078.53125 mb. Sending TERM to pid 15913.

Those are two different numbers 15913 and 15916.

I'm not sure where the other numbers in the second code paste are coming from. What am I looking at here:

deploy   15908  0.0  3.6 2120036 277748 ?      Sl   04:26   0:04 puma 3.8.2 (unix:///var/www/fbdash/shared/tmp/sockets/puma.sock) [20170801151810]                                   
deploy   15913  0.6 20.2 3504200 1554988 ?     Sl   04:26   4:57 puma: cluster worker 0: 15908 [20170801151810]                                                                      
deploy   15916  0.6 17.2 3189164 1319880 ?     Sl   04:26   4:35 puma: cluster worker 1: 15908 [20170801151810] 
prashantvithani commented 7 years ago
User      PID
deploy   15908  0.0  3.6 2120036 277748 ?      Sl   04:26   0:04 puma 3.8.2 (unix:///var/www/fbdash/shared/tmp/sockets/puma.sock) [20170801151810]                                   
deploy   15913  0.6 20.2 3504200 1554988 ?     Sl   04:26   4:57 puma: cluster worker 0: 15908 [20170801151810]                                                                      
deploy   15916  0.6 17.2 3189164 1319880 ?     Sl   04:26   4:35 puma: cluster worker 1: 15908 [20170801151810]

👆 This is the output of ps aux | grep puma after Rolling Restart was done.

brandonkboswell commented 6 years ago

I have experienced this as well. Unfortunately didn't get a screenshot. I will attempt to get another.

brandonkboswell commented 6 years ago

Here's an example of PWK trying to kill off a PID that appears to already have been attempted to be killed.


16:07:32 web.1  | [78523] * Listening on tcp://0.0.0.0:5000
16:07:32 web.1  | [78523] Use Ctrl-C to stop
16:07:32 web.1  | [78523] - Worker 0 (pid: 78544) booted, phase: 0
16:07:32 web.1  | [78523] - Worker 1 (pid: 78545) booted, phase: 0
16:07:49 web.1  | [78523] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 138.421875 mb out of max:  mb. Sending TERM to pid 78545.
16:07:49 web.1  | [78523] - Worker 1 (pid: 78627) booted, phase: 0
16:08:49 web.1  | [78523] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 140.87109375 mb out of max:  mb. Sending TERM to pid 78544.
16:08:49 web.1  | [78523] - Worker 0 (pid: 78896) booted, phase: 0
16:10:07 web.1  | [78523] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 142.6328125 mb out of max:  mb. Sending TERM to pid 78627.
16:10:07 web.1  | [78523] - Worker 1 (pid: 79303) booted, phase: 0
16:11:07 web.1  | [78523] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 141.8203125 mb out of max:  mb. Sending TERM to pid 78544.
ivanovv commented 6 years ago

Another report of the same issue.

It looks like that the easiest way to repro this one is to have 1 worker and set the restart interval to something like 10s.

Let me know if there is anything I can help with.

#!/usr/bin/env puma

environment 'development'
threads 3,3

workers 1

worker_timeout 30
prune_bundler

before_fork do
  require 'puma_worker_killer'
  PumaWorkerKiller.enable_rolling_restart(30)
end

One worker:

* Pruning Bundler environment
[72542] Puma starting in cluster mode...
[72542] * Version 3.11.3 (ruby 2.3.3-p222), codename: Love Song
[72542] * Min threads: 3, max threads: 3
[72542] * Environment: development
[72542] * Process workers: 1
[72542] * Phased restart available
[72542] * Listening on tcp://0.0.0.0:9292
[72542] Use Ctrl-C to stop
[72549] + Gemfile in context: /Users/user/code/project/Gemfile
[72542] - Worker 0 (pid: 72549) booted, phase: 0
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 179.38671875 mb. Sending TERM to pid 72549.
[72560] + Gemfile in context: /Users/user/code/project/Gemfile
[72542] - Worker 0 (pid: 72560) booted, phase: 0
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 189.171875 mb. Sending TERM to pid 72549.
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 189.671875 mb. Sending TERM to pid 72560.
[72602] + Gemfile in context: /Users/user/code/project/Gemfile
[72542] - Worker 0 (pid: 72602) booted, phase: 0
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 181.3046875 mb. Sending TERM to pid 72560.
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 152.2109375 mb. Sending TERM to pid 72602.
[73130] + Gemfile in context: /Users/user/code/project/Gemfile
[72542] - Worker 0 (pid: 73130) booted, phase: 0
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 179.41015625 mb. Sending TERM to pid 72602.
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 152.9453125 mb. Sending TERM to pid 73130.
[73195] + Gemfile in context: /Users/user/code/project/Gemfile
[72542] - Worker 0 (pid: 73195) booted, phase: 0
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 178.2421875 mb. Sending TERM to pid 73130.
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 179.00390625 mb. Sending TERM to pid 73195.
[73249] + Gemfile in context: /Users/user/code/project/Gemfile
^C[72542] - Gracefully shutting down workers...
[73249] Early termination of worker
[72542] === puma shutdown: 2018-04-22 11:46:59 +0300 ===
[72542] - Goodbye!

Two workers:

#!/usr/bin/env puma

environment 'development'
threads 3,3

workers 2

worker_timeout 30
prune_bundler
before_fork do
  require 'puma_worker_killer'
  PumaWorkerKiller.enable_rolling_restart(30)
end
* Pruning Bundler environment
[73262] Puma starting in cluster mode...
[73262] * Version 3.11.3 (ruby 2.3.3-p222), codename: Love Song
[73262] * Min threads: 3, max threads: 3
[73262] * Environment: development
[73262] * Process workers: 2
[73262] * Phased restart available
[73262] * Listening on tcp://0.0.0.0:9292
[73262] Use Ctrl-C to stop
[73357] + Gemfile in context: /Users/user/code/project/Gemfile
[73358] + Gemfile in context: /Users/user/code/project/Gemfile
[73262] - Worker 0 (pid: 73357) booted, phase: 0
[73262] - Worker 1 (pid: 73358) booted, phase: 0
[73262] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 371.06640625 mb. Sending TERM to pid 73358.
[73380] + Gemfile in context: /Users/user/code/project/Gemfile
[73262] - Worker 1 (pid: 73380) booted, phase: 0
[73262] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 359.234375 mb. Sending TERM to pid 73357.
[73409] + Gemfile in context: /Users/user/code/project/Gemfile
[73262] - Worker 0 (pid: 73409) booted, phase: 0
[73262] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 390.625 mb. Sending TERM to pid 73357.
[73262] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 405.01171875 mb. Sending TERM to pid 73380.
[73472] + Gemfile in context: /Users/user/code/project/Gemfile
[73262] - Worker 1 (pid: 73472) booted, phase: 0
[73262] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 392.73046875 mb. Sending TERM to pid 73380.
[73262] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 392.875 mb. Sending TERM to pid 73409.
[73528] + Gemfile in context: /Users/user/code/project/Gemfile
[73262] - Worker 0 (pid: 73528) booted, phase: 0
[73262] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 360.69921875 mb. Sending TERM to pid 73472.
[73559] + Gemfile in context: /Users/user/code/project/Gemfile
[73262] - Worker 1 (pid: 73559) booted, phase: 0
^C[73262] - Gracefully shutting down workers...
[73262] === puma shutdown: 2018-04-22 11:56:56 +0300 ===
[73262] - Goodbye!
* Pruning Bundler environment
[73624] Puma starting in cluster mode...
[73624] * Version 3.11.3 (ruby 2.3.3-p222), codename: Love Song
[73624] * Min threads: 3, max threads: 3
[73624] * Environment: development
[73624] * Process workers: 2
[73624] * Phased restart available
[73624] * Listening on tcp://0.0.0.0:9292
[73624] Use Ctrl-C to stop
[73629] + Gemfile in context: /Users/user/code/project/Gemfile
[73630] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 220.21875 mb. Sending TERM to pid 73629.
[73629] Early termination of worker
[73639] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] - Worker 1 (pid: 73630) booted, phase: 0
[73624] - Worker 0 (pid: 73639) booted, phase: 0
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 364.67578125 mb. Sending TERM to pid 73630.
[73655] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] - Worker 1 (pid: 73655) booted, phase: 0
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 402.80859375 mb. Sending TERM to pid 73630.
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 465.234375 mb. Sending TERM to pid 73639.
[73688] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] - Worker 0 (pid: 73688) booted, phase: 0
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 408.94921875 mb. Sending TERM to pid 73655.
[73713] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] - Worker 1 (pid: 73713) booted, phase: 0
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 365.1171875 mb. Sending TERM to pid 73639.
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 368.0234375 mb. Sending TERM to pid 73713.
[74213] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] - Worker 1 (pid: 74213) booted, phase: 0
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 372.5546875 mb. Sending TERM to pid 73688.
[74245] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] - Worker 0 (pid: 74245) booted, phase: 0
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 393.4375 mb. Sending TERM to pid 73688.
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 430.11328125 mb. Sending TERM to pid 74213.
[74286] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] - Worker 1 (pid: 74286) booted, phase: 0
^C[73624] - Gracefully shutting down workers...
[73624] === puma shutdown: 2018-04-22 12:11:59 +0300 ===
[73624] - Goodbye!
schneems commented 4 years ago

Merged in that PR