apeschel opened this issue 2 years ago
Here are the relevant log lines on the crash. They match what was already documented in the Patroni issue.
2021-12-15 01:05:18,874 INFO: doing crash recovery in a single user mode
2021-12-15 01:05:19,205 ERROR: Crash recovery finished with code=1
2021-12-15 01:05:19,206 INFO: stdout=
2021-12-15 01:05:19,206 INFO: stderr=2021-12-15 01:05:19 UTC [8817]: [1-1] 61b93f4e.2271 0 FATAL: pre-existing shared memory block (key 131086, ID 1) is still in use
2021-12-15 01:05:19 UTC [8817]: [2-1] 61b93f4e.2271 0 HINT: Terminate any old server processes associated with data directory "/home/postgres/pgdata/pgroot/data".
If Patroni is killed with -9, something is terribly wrong. This is a clear indicator of OOM, and my guess would be that you set memory.limits too low, probably something like 100m-200m.
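For illustration, a minimal sketch of raising the per-cluster values on the postgresql custom resource (the cluster name and the numbers are placeholders, not taken from this report):

    # Placeholder cluster name and values; raise the per-cluster memory request/limit.
    kubectl patch postgresql acid-example-cluster --type merge -p '
    spec:
      resources:
        requests:
          memory: 500Mi
        limits:
          memory: 1Gi
    '

These per-cluster settings take precedence over the operator-wide defaults discussed below.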
It looks like your theory on why the kill -9 is happening is likely correct. The purple line in this image is the memory limit, and the blue line is the used memory.
However, that raises more concerns.
default_memory_limit: memory limits for the Postgres containers, unless overridden by cluster-specific settings. The default is 500Mi.
This isn't a particularly large or heavily used database instance. In fact, I would say it is far smaller and much less used than the average database being managed with postgres-operator. It seems the default postgres-operator memory limit is unreasonably low.
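For reference, if the operator is configured through its ConfigMap, the operator-wide default could presumably be raised along these lines (the ConfigMap name, namespace, and values are assumptions on my part):

    # Assumed ConfigMap name/namespace; bump the operator-wide memory defaults.
    kubectl -n default patch configmap postgres-operator --type merge -p '
    data:
      default_memory_request: "500Mi"
      default_memory_limit: "1Gi"
    '
    # The operator typically has to be restarted to pick up the new defaults.
    kubectl -n default rollout restart deployment postgres-operator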
This doesn't seem like an ideal outcome from spilo hitting an OOM error. Hitting a patroni restart infinite loop seems like a strictly worse option than letting the spilo container die so Kubernetes can restart it cleanly.
The Patroni issue you are referring to is actually a different one. There they are killing the postmaster process, not patroni.
Maybe in your case OOM kills not only Patroni but also one or many postgres processes.
If only Patroni is killed, it is nearly instantly restarted by runit, but according to the Patroni logs the postmaster wasn't running.
Locked shared memory (FATAL: pre-existing shared memory block (key 131086, ID 1) is still in use) is an indicator that one or many postgres processes are still alive. But Patroni will not try to kill them because from the Patroni perspective it is not possible to reliably identify these processes.
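As a rough illustration, this is how one could check by hand from inside the container whether leftover postgres processes and their shared memory segment are still around (generic commands, nothing Spilo-specific):

    # List any postgres processes still alive in the container.
    ps -o pid,ppid,etime,cmd -C postgres
    # Show System V shared memory segments; the key from the FATAL message
    # (131086 above) should still be listed while old backends are attached.
    ipcs -m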
This doesn't seem like an ideal outcome from spilo hitting an OOM error
When there is memory contention, one or another process in the container is killed by the OOM killer. There is nothing we can do about it. You need to understand why there is memory contention and either eliminate it or increase memory requests/limits.
Maybe in your case OOM kills not only Patroni but also one or many postgres processes.
I tracked the OOM message down in the logs, and can verify this is the case - Patroni and a Postgres process both got OOMKilled.
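For anyone else trying to confirm the same thing: the oom-killer messages normally end up in the node's kernel log, so here is a sketch, assuming access to the node that hosted the pod:

    # Look for oom-killer activity in the kernel log (wording varies by kernel).
    dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
    # Or on systemd-based nodes:
    journalctl -k | grep -i oom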
There is nothing we can do about it. You need to understand why there is memory contention and either eliminate it or increase memory requests/limits.
We've already increased the memory limit in this case, but there's always going to be a risk of OOM. It would be nice to see a better fallback than an infinite Patroni restart loop when an OOM does happen. Here's one potential solution:
There are other solutions too. This premise doesn't hold for the spilo container, for example:
Patroni will not try to kill them because from the Patroni perspective it is not possible to reliably identify these processes.
This is strictly not a valid assumption - the only postgres processes in the spilo container will be those managed by patroni.
This is strictly not a valid assumption - the only postgres processes in the spilo container will be those managed by patroni.
There is a false assumption in your logic. Patroni isn't aware of the environment it is running in. On top of that, we need the ability to temporarily stop Patroni, and we don't want the pod/container to be restarted because liveness probes are failing.
Here's one potential solution:
Well... Again, you are making an assumption that Patroni OOM and Postgres OOM are closely connected events, while they are not. In most cases it is only Postgres that is killed by OOM, which might cause the behavior you observe.
Patroni being killed with -9 is not a problem at all, because it is instantly restarted by runit. After that Patroni figures out that Postgres is up and running and everything goes back to normal.
We are seeing the Spilo pods occasionally end up in a deadlock where Patroni keeps restarting over and over, but failing because Postgres is already running.
This issue is already known on Patroni's side: https://github.com/zalando/patroni/issues/1733
However, it appears that @CyberDem0n decided to close it without resolution due to his belief the user was using Patroni incorrectly.
Unfortunately, it appears normal operations for the postgres-operator are also causing this bug to trigger.
We're not doing anything special for this to occur - it seems to just happen on its own, and it's not clear exactly what the trigger is.
There's only one log entry prior to this occurring:
This lines up with what @CyberDem0n mentions in the Patroni issue, that kill -9 on Patroni can trigger this bug. However, no one is manually running kill -9 on our side - this appears to be something happening either within Spilo or from Kubernetes itself. As I mention in the other issue as well, simply doing a pkill postgresql within the deadlocked pod is enough to fix the issue.
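For completeness, a sketch of that workaround run from outside the pod (the pod name is a placeholder, and depending on how the processes are named, pkill postgres or pkill -f postgres may be what actually matches):

    # Placeholder pod name; SIGTERM the leftover postgres processes so Patroni's
    # next start attempt can acquire the shared memory block cleanly.
    kubectl exec -it my-cluster-0 -- pkill postgres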
So here are the issues at play: