apeschel opened this issue 2 years ago
Here are the relevant log lines on the crash. They match what was already documented in the Patroni issue.
2021-12-15 01:05:18,874 INFO: doing crash recovery in a single user mode
2021-12-15 01:05:19,205 ERROR: Crash recovery finished with code=1
2021-12-15 01:05:19,206 INFO: stdout=
2021-12-15 01:05:19,206 INFO: stderr=2021-12-15 01:05:19 UTC [8817]: [1-1] 61b93f4e.2271 0 FATAL: pre-existing shared memory block (key 131086, ID 1) is still in use
2021-12-15 01:05:19 UTC [8817]: [2-1] 61b93f4e.2271 0 HINT: Terminate any old server processes associated with data directory "/home/postgres/pgdata/pgroot/data".
If Patroni is killed with -9, something is terribly wrong. This is a clear indicator of OOM, and my guess would be that you set memory.limits too low, probably something like 100m-200m.
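For illustration, a minimal sketch of raising the per-cluster values on the postgresql custom resource (the cluster name and the numbers are placeholders, not taken from this report):

    # Placeholder cluster name and values; raise the per-cluster memory request/limit.
    kubectl patch postgresql acid-example-cluster --type merge -p '
    spec:
      resources:
        requests:
          memory: 500Mi
        limits:
          memory: 1Gi
    '

These per-cluster settings take precedence over the operator-wide defaults discussed below.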
It looks like your theory on why the kill -9 is happening is likely correct. The purple line in this image is the memory limit, and the blue line is the used memory.
However, that raises more concerns.
default_memory_limit: memory limits for the Postgres containers, unless overridden by cluster-specific settings. The default is 500Mi.
This isn't a particularly large or heavily used database instance. In fact, I would say it is far smaller and much less used than the average database being managed with postgres-operator. It seems the default postgres-operator memory limit is unreasonably low.
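For reference, if the operator is configured through its ConfigMap, the operator-wide default could presumably be raised along these lines (the ConfigMap name, namespace, and values are assumptions on my part):

    # Assumed ConfigMap name/namespace; bump the operator-wide memory defaults.
    kubectl -n default patch configmap postgres-operator --type merge -p '
    data:
      default_memory_request: "500Mi"
      default_memory_limit: "1Gi"
    '
    # The operator typically has to be restarted to pick up the new defaults.
    kubectl -n default rollout restart deployment postgres-operator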
This doesn't seem like an ideal outcome from spilo hitting an OOM error. Hitting a patroni restart infinite loop seems like a strictly worse option than letting the spilo container die so Kubernetes can restart it cleanly.
The Patroni issue you are referring to is actually a different one. There they are killing the postmaster process, not patroni.
Maybe in your case OOM kills not only Patroni but also one or many postgres processes.
If only Patroni is killed, it is nearly instantly restarted by runit, but according to the Patroni logs the postmaster wasn't running.
Locked shared memory (FATAL: pre-existing shared memory block (key 131086, ID 1) is still in use) is an indicator that one or many postgres processes are still alive. But Patroni will not try to kill them because from the Patroni perspective it is not possible to reliably identify these processes.
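As a rough illustration, this is how one could check by hand from inside the container whether leftover postgres processes and their shared memory segment are still around (generic commands, nothing Spilo-specific):

    # List any postgres processes still alive in the container.
    ps -o pid,ppid,etime,cmd -C postgres
    # Show System V shared memory segments; the key from the FATAL message
    # (131086 above) should still be listed while old backends are attached.
    ipcs -m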
This doesn't seem like an ideal outcome from spilo hitting an OOM error
When there is memory contention, one or another process in the container is killed by the OOM killer. There is nothing we can do about it. You need to understand why there is memory contention and either eliminate it or increase memory requests/limits.
Maybe in your case OOM kills not only Patroni but also one or many postgres processes.
I tracked the OOM message down in the logs, and can verify this is the case - Patroni and a Postgres process both got OOMKilled.
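For anyone else trying to confirm the same thing: the oom-killer messages normally end up in the node's kernel log, so here is a sketch, assuming access to the node that hosted the pod:

    # Look for oom-killer activity in the kernel log (wording varies by kernel).
    dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
    # Or on systemd-based nodes:
    journalctl -k | grep -i oom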
There is nothing we can do about it. You need to understand why there is memory contention and either eliminate it or increase memory requests/limits.
We've already increased the memory limit in this case, but there's always going to be a risk of OOM. It would be nice to see a better fallback than an infinite Patroni restart loop when an OOM does happen. Here's one potential solution:
There are other solutions too. This premise doesn't hold for the spilo container, for example:
Patroni will not try to kill them because from the Patroni perspective it is not possible to reliably identify these processes.
This is strictly not a valid assumption - the only postgres processes in the spilo container will be those managed by patroni.
This is strictly not a valid assumption - the only postgres processes in the spilo container will be those managed by patroni.
There is a false assumption in your logic. Patroni isn't aware of the environment it is running in. On top of that, we need the ability to temporarily stop Patroni, and we don't want the pod/container to be restarted because liveness probes are failing.
Here's one potential solution:
Well... Again, you are making an assumption that Patroni OOM and Postgres OOM are closely connected events, while they are not. In most cases it is only Postgres that is killed by OOM, which might cause the behavior you observe.
Patroni being killed with -9 is not a problem at all, because it is instantly restarted by runit. After that Patroni figures out that Postgres is up and running and everything goes back to normal.
We are seeing the Spilo pods occasionally end up in a deadlock where Patroni keeps restarting over and over, but failing because Postgres is already running.
This issue is already known on Patroni's side: https://github.com/zalando/patroni/issues/1733
However, it appears that @CyberDem0n decided to close it without resolution due to his belief the user was using Patroni incorrectly.
Unfortunately, it appears normal operations for the postgres-operator are also causing this bug to trigger.
We're not doing anything special for this to occur - it seems to just happen on its own, and it's not clear exactly what the trigger is.
There's only one log entry prior to this occurring:
This lines up with what @CyberDem0n mentions in the Patroni issue, that kill -9 on Patroni can trigger this bug. However, no one is manually running kill -9 on our side - this appears to be something happening either within Spilo or from Kubernetes itself. As I mention in the other issue as well, simply doing a pkill postgresql within the deadlocked pod is enough to fix the issue.
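For completeness, a sketch of that workaround run from outside the pod (the pod name is a placeholder, and depending on how the processes are named, pkill postgres or pkill -f postgres may be what actually matches):

    # Placeholder pod name; SIGTERM the leftover postgres processes so Patroni's
    # next start attempt can acquire the shared memory block cleanly.
    kubectl exec -it my-cluster-0 -- pkill postgres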
So here are the issues at play: