zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

A cordon of a node triggers a db failover #1400

Open sabinematthys opened 3 years ago

sabinematthys commented 3 years ago

With Spilo 13, Patroni 2.0, and postgres-operator 1.6.0, I noticed that cordoning the node running the master triggers a DB failover. Before the cordon we have:

+--------------------------+---------------+---------+---------+----+-----------+
| Member                   | Host          | Role    | State   | TL | Lag in MB |
+--------------------------+---------------+---------+---------+----+-----------+
| db-cluster-0             | ip1           | Replica | running | 35 |         0 |
| db-cluster-1             | ip2           | Leader  | running | 35 |           |
| db-cluster-2             | ip3           | Replica | running | 35 |         0 |
+--------------------------+---------------+---------+---------+----+-----------+

Then I cordon the node hosting the master pod:

kubectl cordon $(kubectl get pods -l application=spilo -l spilo-role=master -o=jsonpath='{.items[0].spec.nodeName}')

Afterwards I see:

Postgres-operator pod log

time="2021-03-09T16:53:47Z" level=info msg="moving pods: node \"/cluster-c\" became unschedulable and does not have a ready label: map[]" pkg=controller
time="2021-03-09T16:53:47Z" level=info msg="starting process to migrate master pod \"db-cluster-0\"" cluster-name=db-cluster pkg=cluster worker=0
time="2021-03-09T16:53:47Z" level=debug msg="Waiting for any replica pod to become ready" cluster-name=db-cluster pkg=cluster worker=0
time="2021-03-09T16:53:47Z" level=debug msg="Found 2 running replica pods" cluster-name=db-cluster pkg=cluster worker=0
time="2021-03-09T16:53:47Z" level=info msg="check failed: pod \"db-cluster-1\" is already on a live node" cluster-name=db-cluster pkg=cluster worker=0
time="2021-03-09T16:53:47Z" level=debug msg="switching over from \"db-cluster-0\" to \"db-cluster-1\"" cluster-name=db-cluster pkg=cluster worker=0
time="2021-03-09T16:53:47Z" level=debug msg="making POST http request: http://<ip>:8008/failover" cluster-name=db-cluster pkg=cluster worker=0
time="2021-03-09T16:53:47Z" level=debug msg="subscribing to pod \"db-cluster-1\"" cluster-name=db-cluster pkg=cluster worker=0
time="2021-03-09T16:53:48Z" level=debug msg="successfully switched over from \"db-cluster-0\" to \"db-cluster-1\"" cluster-name=db-cluster pkg=cluster worker=0
time="2021-03-09T16:53:48Z" level=debug msg="unsubscribing from pod \"db-cluster-1\" events" cluster-name=db-cluster pkg=cluster worker=0
time="2021-03-09T16:53:48Z" level=info msg="1/1 master pods have been moved out from the \"/cluster-c\" node" pkg=controller

Master pod log (from another test, for easier log follow-up; this time db-cluster-1 was the master before the cordon):

2021-03-09 17:12:20,284 INFO: Lock owner: db-cluster-1; I am db-cluster-1
2021-03-09 17:12:20,382 INFO: no action.  i am the leader with the lock
2021-03-09 17:12:29,729 INFO: received failover request with leader=db-cluster-1 candidate=db-cluster-0 scheduled_at=None
2021-03-09 17:12:29,745 INFO: Got response from db-cluster-0 http://<ipxxx>:8008/patroni: {"state": "running", "postmaster_start_time": "2021-03-09 16:53:51.014 UTC", "role": "replica", "server_version": 120006, "cluster_unlocked": false, "xlog": {"received_location": 10114689640, "replayed_location": 10114689640, "replayed_timestamp": "2021-03-09 17:12:29.584 UTC", "paused": false}, "timeline": 37, "database_system_identifier": "6934990847317057616", "patroni": {"version": "2.0.2", "scope": "db-cluster"}}
2021-03-09 17:12:29,879 INFO: Lock owner: db-cluster-1; I am db-cluster-1
2021-03-09 17:12:29,931 INFO: Got response from db-cluster-0 http://<ipxxx>:8008/patroni: {"state": "running", "postmaster_start_time": "2021-03-09 16:53:51.014 UTC", "role": "replica", "server_version": 120006, "cluster_unlocked": false, "xlog": {"received_location": 10114691432, "replayed_location": 10114691432, "replayed_timestamp": "2021-03-09 17:12:29.876 UTC", "paused": false}, "timeline": 37, "database_system_identifier": "6934990847317057616", "patroni": {"version": "2.0.2", "scope": "db-cluster"}}
2021-03-09 17:12:30,078 INFO: manual failover: demoting myself
2021-03-09 17:12:30,626 INFO: Leader key released
2021-03-09 17:12:30,627 INFO: Lock owner: db-cluster-0; I am db-cluster-1
2021-03-09 17:12:30,628 INFO: manual failover: demote in progress
2021-03-09 17:12:30,771 INFO: Lock owner: db-cluster-0; I am db-cluster-1
2021-03-09 17:12:30,771 INFO: manual failover: demote in progress
2021-03-09 17:12:31,837 INFO: Lock owner: db-cluster-0; I am db-cluster-1
2021-03-09 17:12:31,838 INFO: manual failover: demote in progress
2021-03-09 17:12:31,987 INFO: Lock owner: db-cluster-0; I am db-cluster-1
2021-03-09 17:12:31,987 INFO: manual failover: demote in progress
2021-03-09 17:12:32,633 INFO: Local timeline=37 lsn=2/5B000028
2021-03-09 17:12:32,644 INFO: master_timeline=38
2021-03-09 17:12:32,648 INFO: master: history=34        2/570000A0      no recovery target specified
35      2/580000A0      no recovery target specified
36      2/5A0000A0      no recovery target specified
37      2/5B0000A0      no recovery target specified
2021-03-09 17:12:32,649 INFO: closed patroni connection to the postgresql cluster
2021-03-09 17:12:33 UTC [32193]: [1-1] 6047ac80.7dc1 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2021-03-09 17:12:33 UTC [32193]: [2-1] 6047ac80.7dc1 0     LOG:  pg_stat_kcache.linux_hz is set to 500000
2021-03-09 17:12:33 UTC [32193]: [3-1] 6047ac80.7dc1 0     LOG:  starting PostgreSQL 12.6 (Ubuntu 12.6-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
2021-03-09 17:12:33 UTC [32193]: [4-1] 6047ac80.7dc1 0     LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-03-09 17:12:33 UTC [32193]: [5-1] 6047ac80.7dc1 0     LOG:  listening on IPv6 address "::", port 5432
2021-03-09 17:12:33 UTC [32193]: [6-1] 6047ac80.7dc1 0     LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-03-09 17:12:33,012 INFO: postmaster pid=32193
2021-03-09 17:12:33 UTC [32193]: [7-1] 6047ac80.7dc1 0     LOG:  redirecting log output to logging collector process
2021-03-09 17:12:33 UTC [32193]: [8-1] 6047ac80.7dc1 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - accepting connections
2021-03-09 17:12:42,489 INFO: Lock owner: db-cluster-0; I am db-cluster-1
2021-03-09 17:12:42,489 INFO: does not have lock

This seems to be a new feature (introduced between 1.5 and 1.6) that I don't really understand: unschedulable only means new pods cannot be scheduled on the node, so why does the status of the pods already running there have to change? In which cases is that behavior useful? (Just to understand why it was implemented.) In our case we use cordon to prevent pods from being scheduled on the master's node (mainly for tests), but that is only one of several ways we use cordon on that node. How can we keep doing that if the master switches nodes as soon as its node is cordoned? Is there a way to disable this feature? (We didn't have this problem in the previous version.)
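For what it's worth, a cordon by itself only sets spec.unschedulable on the node object and does not touch running pods, which is what makes the operator-driven switchover surprising here. Reusing the selectors from the reproduction above, what the operator reacts to can be inspected like this (the labels printed by the second command appear to be what the operator compares against its node_readiness_label setting, which is empty here, hence the "map[]" in the log):

# node hosting the current master (same lookup as in the reproduction above)
NODE=$(kubectl get pods -l application=spilo -l spilo-role=master -o=jsonpath='{.items[0].spec.nodeName}')
# spec.unschedulable flips to true after the cordon; the labels are what the
# operator's "ready label" check looks at
kubectl get node "$NODE" -o jsonpath='{.spec.unschedulable}{"\n"}{.metadata.labels}{"\n"}'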

Samusername commented 3 years ago

Additional comment. Yes, there are separate cordon and drain operations:

kubectl cordon my-node   # Mark my-node as unschedulable
kubectl drain my-node    # Drain my-node in preparation for maintenance

I noticed the same: when a k8s node is cordoned, the pod gets moved to another k8s node in the DB cluster. That is incorrect behaviour: we may cordon healthy k8s nodes, and pods should not be moved to other k8s nodes in that situation. I would like to keep DB pods on the same k8s nodes for as long as possible, for the stability of the DB cluster. "Drain" is the operation that is meant for moving pods away from a k8s node.

sabinematthys commented 3 years ago

Indeed, cordon is not the only operation that can lead to a DB failover (any operation that leaves the node in a not-"Ready" state triggers it). But in the case of a drain, I think it would fail over even without the postgres-operator's intervention, since the pod itself is evicted. It is only the cordon case that is annoying for us.

machine424 commented 3 years ago

@sabinematthys, podAntiAffinity (https://github.com/zalando/postgres-operator/blob/v1.6.1/docs/administrator.md#enable-pod-anti-affinity) may help you spread your pods.
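A sketch of what enabling that could look like with the ConfigMap-based operator configuration (the ConfigMap name and namespace below are the defaults from the operator manifests and may differ in your deployment; the operator may need a restart to pick up the change):

# enable_pod_antiaffinity spreads pods of one Postgres cluster across nodes
# using the given topology key
kubectl -n default patch configmap postgres-operator --type merge \
  -p '{"data":{"enable_pod_antiaffinity":"true","pod_antiaffinity_topology_key":"kubernetes.io/hostname"}}'

Note that this only influences scheduling; it does not change the cordon-triggered switchover discussed here.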

But I kind of agree with you: we may sometimes cordon a "healthy" node for various legitimate reasons, and we don't want the node being Unschedulable (https://github.com/zalando/postgres-operator/blob/v1.6.1/pkg/controller/node.go#L79) to drive the operator to move pods off it.

Samusername commented 3 years ago

Regarding the cordon case (of a healthy k8s node, or even multiple k8s nodes):

Would it be ok for this behaviour to be changed or made configurable? Pods should not be moved in that case.

https://github.com/zalando/postgres-operator/blob/v1.6.1/pkg/controller/node.go#L74 https://github.com/zalando/postgres-operator/blob/v1.6.1/pkg/controller/node.go#L79

Or is there a reason for the functionality? (Drain related?)

sabinematthys commented 3 years ago

To whom are you asking the question? I suppose not to me, since that was indeed my initial request, but it works for me.

Samusername commented 3 years ago

I just realized: if only one k8s node is cordoned, then only the leader role is transferred to another pod/DB node. Pods are not moved in that case.

A pod would be moved if, for example, the k8s nodes of all DB nodes are cordoned.
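One way to watch this from the outside, using the labels from the reproduction above: the NODE column stays the same while the spilo-role label moves between pods.

# -L adds a column with the value of the spilo-role label; -o wide shows the
# node each pod is scheduled on
kubectl get pods -l application=spilo -L spilo-role -o wide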

desaintmartin commented 10 months ago

This still seems to be an issue. The only solution we found at the time (thanks @machine424) was to forbid the operator from getting/listing/watching nodes, which has worked without issues, apart from some annoying error logs.
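Concretely, that workaround amounts to stripping node access from the operator's RBAC; a sketch, assuming the default ClusterRole name from the operator manifests:

# locate the rule granting get/list/watch on nodes, then remove it; the
# operator will log errors about not being able to list nodes, but it stops
# reacting to cordon/uncordon events
kubectl get clusterrole postgres-operator -o yaml | grep -B2 -A4 nodes
kubectl edit clusterrole postgres-operator   # delete the rule for the "nodes" resource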

The solution listed at https://github.com/zalando/postgres-operator/issues/2277#issuecomment-1507039074 (thanks @FxKu) does not seem to be generic enough for all use cases (at least, for our use case, where we don't have specific readiness labels).
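For reference, the workaround referenced there is built around the operator's node_readiness_label setting: as far as I can tell, a cordoned node that still carries the configured readiness label is not treated as a node that is about to go away. A sketch, where the label key/value and the ConfigMap name/namespace are examples only:

# the same key:value pair has to be configured in the operator and present on
# nodes that are merely cordoned, not decommissioned
kubectl -n default patch configmap postgres-operator --type merge \
  -p '{"data":{"node_readiness_label":"lifecycle-status:ready"}}'
kubectl label node <node-name> lifecycle-status=ready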