Closed nritholtz closed 4 years ago
I've not tested running both OOM and rolling restarts together. I would believe there's a bug here.
It happens with only Rolling restart as well
Look at the PIDs to which TERM is sent
[15908] PumaWorkerKiller: Consuming 3078.51953125 mb with master and 2 workers.
[15908] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 3078.51953125 mb. Sending TERM to pid 15916.
[15908] PumaWorkerKiller: Consuming 3078.53125 mb with master and 2 workers.
[15908] PumaWorkerKiller: Consuming 3078.53125 mb with master and 2 workers.
[15908] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 3078.53125 mb. Sending TERM to pid 15913.
[15908] PumaWorkerKiller: Consuming 3078.56640625 mb with master and 2 workers.
[15908] PumaWorkerKiller: Consuming 3078.56640625 mb with master and 2 workers.
The PIDs of the workers are same
deploy 15908 0.0 3.6 2120036 277748 ? Sl 04:26 0:04 puma 3.8.2 (unix:///var/www/fbdash/shared/tmp/sockets/puma.sock) [20170801151810]
deploy 15913 0.6 20.2 3504200 1554988 ? Sl 04:26 4:57 puma: cluster worker 0: 15908 [20170801151810]
deploy 15916 0.6 17.2 3189164 1319880 ? Sl 04:26 4:35 puma: cluster worker 1: 15908 [20170801151810]
A bit of trivia: PIDs can be recycled by the OS, so even if one is killed a new one may have the exact same PID number.
That's not what I think is happening though.
I'm looking at this example:
[15908] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 3078.51953125 mb. Sending TERM to pid 15916.
[15908] PumaWorkerKiller: Consuming 3078.53125 mb with master and 2 workers.
[15908] PumaWorkerKiller: Consuming 3078.53125 mb with master and 2 workers.
[15908] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 3078.53125 mb. Sending TERM to pid 15913.
Those are two different numbers 15913
and 15916
.
I'm not sure where the other numbers in the second code paste are coming from. What am I looking at here:
deploy 15908 0.0 3.6 2120036 277748 ? Sl 04:26 0:04 puma 3.8.2 (unix:///var/www/fbdash/shared/tmp/sockets/puma.sock) [20170801151810]
deploy 15913 0.6 20.2 3504200 1554988 ? Sl 04:26 4:57 puma: cluster worker 0: 15908 [20170801151810]
deploy 15916 0.6 17.2 3189164 1319880 ? Sl 04:26 4:35 puma: cluster worker 1: 15908 [20170801151810]
User PID
deploy 15908 0.0 3.6 2120036 277748 ? Sl 04:26 0:04 puma 3.8.2 (unix:///var/www/fbdash/shared/tmp/sockets/puma.sock) [20170801151810]
deploy 15913 0.6 20.2 3504200 1554988 ? Sl 04:26 4:57 puma: cluster worker 0: 15908 [20170801151810]
deploy 15916 0.6 17.2 3189164 1319880 ? Sl 04:26 4:35 puma: cluster worker 1: 15908 [20170801151810]
👆 This is the output of ps aux | grep puma
after Rolling Restart was done.
I have experienced this as well. Unfortunately didn't get a screenshot. I will attempt to get another.
Here's an example of PWK trying to kill off a PID that appears to already have been attempted to be killed.
16:07:32 web.1 | [78523] * Listening on tcp://0.0.0.0:5000
16:07:32 web.1 | [78523] Use Ctrl-C to stop
16:07:32 web.1 | [78523] - Worker 0 (pid: 78544) booted, phase: 0
16:07:32 web.1 | [78523] - Worker 1 (pid: 78545) booted, phase: 0
16:07:49 web.1 | [78523] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 138.421875 mb out of max: mb. Sending TERM to pid 78545.
16:07:49 web.1 | [78523] - Worker 1 (pid: 78627) booted, phase: 0
16:08:49 web.1 | [78523] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 140.87109375 mb out of max: mb. Sending TERM to pid 78544.
16:08:49 web.1 | [78523] - Worker 0 (pid: 78896) booted, phase: 0
16:10:07 web.1 | [78523] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 142.6328125 mb out of max: mb. Sending TERM to pid 78627.
16:10:07 web.1 | [78523] - Worker 1 (pid: 79303) booted, phase: 0
16:11:07 web.1 | [78523] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 141.8203125 mb out of max: mb. Sending TERM to pid 78544.
Another report of the same issue.
It looks like that the easiest way to repro this one is to have 1 worker and set the restart interval to something like 10s.
Let me know if there is anything I can help with.
#!/usr/bin/env puma
environment 'development'
threads 3,3
workers 1
worker_timeout 30
prune_bundler
before_fork do
require 'puma_worker_killer'
PumaWorkerKiller.enable_rolling_restart(30)
end
One worker:
* Pruning Bundler environment
[72542] Puma starting in cluster mode...
[72542] * Version 3.11.3 (ruby 2.3.3-p222), codename: Love Song
[72542] * Min threads: 3, max threads: 3
[72542] * Environment: development
[72542] * Process workers: 1
[72542] * Phased restart available
[72542] * Listening on tcp://0.0.0.0:9292
[72542] Use Ctrl-C to stop
[72549] + Gemfile in context: /Users/user/code/project/Gemfile
[72542] - Worker 0 (pid: 72549) booted, phase: 0
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 179.38671875 mb. Sending TERM to pid 72549.
[72560] + Gemfile in context: /Users/user/code/project/Gemfile
[72542] - Worker 0 (pid: 72560) booted, phase: 0
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 189.171875 mb. Sending TERM to pid 72549.
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 189.671875 mb. Sending TERM to pid 72560.
[72602] + Gemfile in context: /Users/user/code/project/Gemfile
[72542] - Worker 0 (pid: 72602) booted, phase: 0
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 181.3046875 mb. Sending TERM to pid 72560.
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 152.2109375 mb. Sending TERM to pid 72602.
[73130] + Gemfile in context: /Users/user/code/project/Gemfile
[72542] - Worker 0 (pid: 73130) booted, phase: 0
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 179.41015625 mb. Sending TERM to pid 72602.
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 152.9453125 mb. Sending TERM to pid 73130.
[73195] + Gemfile in context: /Users/user/code/project/Gemfile
[72542] - Worker 0 (pid: 73195) booted, phase: 0
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 178.2421875 mb. Sending TERM to pid 73130.
[72542] PumaWorkerKiller: Rolling Restart. 1 workers consuming total: 179.00390625 mb. Sending TERM to pid 73195.
[73249] + Gemfile in context: /Users/user/code/project/Gemfile
^C[72542] - Gracefully shutting down workers...
[73249] Early termination of worker
[72542] === puma shutdown: 2018-04-22 11:46:59 +0300 ===
[72542] - Goodbye!
Two workers:
#!/usr/bin/env puma
environment 'development'
threads 3,3
workers 2
worker_timeout 30
prune_bundler
before_fork do
require 'puma_worker_killer'
PumaWorkerKiller.enable_rolling_restart(30)
end
* Pruning Bundler environment
[73262] Puma starting in cluster mode...
[73262] * Version 3.11.3 (ruby 2.3.3-p222), codename: Love Song
[73262] * Min threads: 3, max threads: 3
[73262] * Environment: development
[73262] * Process workers: 2
[73262] * Phased restart available
[73262] * Listening on tcp://0.0.0.0:9292
[73262] Use Ctrl-C to stop
[73357] + Gemfile in context: /Users/user/code/project/Gemfile
[73358] + Gemfile in context: /Users/user/code/project/Gemfile
[73262] - Worker 0 (pid: 73357) booted, phase: 0
[73262] - Worker 1 (pid: 73358) booted, phase: 0
[73262] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 371.06640625 mb. Sending TERM to pid 73358.
[73380] + Gemfile in context: /Users/user/code/project/Gemfile
[73262] - Worker 1 (pid: 73380) booted, phase: 0
[73262] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 359.234375 mb. Sending TERM to pid 73357.
[73409] + Gemfile in context: /Users/user/code/project/Gemfile
[73262] - Worker 0 (pid: 73409) booted, phase: 0
[73262] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 390.625 mb. Sending TERM to pid 73357.
[73262] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 405.01171875 mb. Sending TERM to pid 73380.
[73472] + Gemfile in context: /Users/user/code/project/Gemfile
[73262] - Worker 1 (pid: 73472) booted, phase: 0
[73262] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 392.73046875 mb. Sending TERM to pid 73380.
[73262] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 392.875 mb. Sending TERM to pid 73409.
[73528] + Gemfile in context: /Users/user/code/project/Gemfile
[73262] - Worker 0 (pid: 73528) booted, phase: 0
[73262] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 360.69921875 mb. Sending TERM to pid 73472.
[73559] + Gemfile in context: /Users/user/code/project/Gemfile
[73262] - Worker 1 (pid: 73559) booted, phase: 0
^C[73262] - Gracefully shutting down workers...
[73262] === puma shutdown: 2018-04-22 11:56:56 +0300 ===
[73262] - Goodbye!
* Pruning Bundler environment
[73624] Puma starting in cluster mode...
[73624] * Version 3.11.3 (ruby 2.3.3-p222), codename: Love Song
[73624] * Min threads: 3, max threads: 3
[73624] * Environment: development
[73624] * Process workers: 2
[73624] * Phased restart available
[73624] * Listening on tcp://0.0.0.0:9292
[73624] Use Ctrl-C to stop
[73629] + Gemfile in context: /Users/user/code/project/Gemfile
[73630] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 220.21875 mb. Sending TERM to pid 73629.
[73629] Early termination of worker
[73639] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] - Worker 1 (pid: 73630) booted, phase: 0
[73624] - Worker 0 (pid: 73639) booted, phase: 0
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 364.67578125 mb. Sending TERM to pid 73630.
[73655] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] - Worker 1 (pid: 73655) booted, phase: 0
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 402.80859375 mb. Sending TERM to pid 73630.
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 465.234375 mb. Sending TERM to pid 73639.
[73688] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] - Worker 0 (pid: 73688) booted, phase: 0
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 408.94921875 mb. Sending TERM to pid 73655.
[73713] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] - Worker 1 (pid: 73713) booted, phase: 0
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 365.1171875 mb. Sending TERM to pid 73639.
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 368.0234375 mb. Sending TERM to pid 73713.
[74213] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] - Worker 1 (pid: 74213) booted, phase: 0
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 372.5546875 mb. Sending TERM to pid 73688.
[74245] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] - Worker 0 (pid: 74245) booted, phase: 0
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 393.4375 mb. Sending TERM to pid 73688.
[73624] PumaWorkerKiller: Rolling Restart. 2 workers consuming total: 430.11328125 mb. Sending TERM to pid 74213.
[74286] + Gemfile in context: /Users/user/code/project/Gemfile
[73624] - Worker 1 (pid: 74286) booted, phase: 0
^C[73624] - Gracefully shutting down workers...
[73624] === puma shutdown: 2018-04-22 12:11:59 +0300 ===
[73624] - Goodbye!
Merged in that PR
Here is the related puma_worker_killer section from our puma config:
Just for informational sake, I am running puma with 2 workers with min 1 max 5 threads.
I found that the Rolling Restart feature was sending TERM to a pid that was already killed due to Out of memory, instead of the newly spawned pid.
Here is the related excerpt from our logs:
Here is the related kill/boot-up excerpt:
So from what I see, the Rolling Restart should have actually killed pid 3940 instead of 3047.