zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Ignoring some fields' differences to avoid unneeded pod switchovers #2436

Open j4m3s-s opened 11 months ago

j4m3s-s commented 11 months ago


Every 10 to 30 minutes my DB pods are restarted, which makes the DB unavailable for a couple of seconds. I think I've traced the issue to the way the operator compares the pods running in the cluster with what it expects to find. I have software running in my cluster that patches pod images to redirect them to a caching proxy (say docker.io/library/toto -> mycachingproxy/docker.io/library/toto), which I think causes this.
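
For illustration, here is roughly what the mismatch looks like on a container spec (a minimal sketch reusing the made-up image names above; the container layout is simplified): the statefulset template keeps the original reference while the running pod reports the proxied one, so the operator sees a difference on every sync.

# container image as generated by the operator in the statefulset template
containers:
  - name: postgres
    image: docker.io/library/toto
# the same container after the mutating webhook has rewritten the image
containers:
  - name: postgres
    image: mycachingproxy/docker.io/library/toto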

So my question is: is my analysis correct? If so, I'd be glad to post an MR to fix this issue.

Thanks in advance.

Logs from the operator:

postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:00:31Z" level=info msg="cluster has been synced" cluster-name=systems-services-harbor/harbor-db pkg=controller worker=0
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:53Z" level=info msg="SYNC event has been queued" cluster-name=systems-services-harbor/harbor-db pkg=controller worker=0
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:53Z" level=info msg="there are 1 clusters running" pkg=controller
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:53Z" level=info msg="syncing of the cluster started" cluster-name=systems-services-harbor/harbor-db pkg=controller worker=0
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:53Z" level=debug msg="team API is disabled" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:53Z" level=debug msg="team API is disabled" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:53Z" level=info msg="syncing secrets" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:53Z" level=debug msg="syncing master service" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:53Z" level=debug msg="final load balancer source ranges as seen in a service spec (not necessarily applied): [\"127.0.0.1/32\"]" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:53Z" level=debug msg="syncing replica service" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:54Z" level=debug msg="syncing volumes using \"pvc\" storage resize mode" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:54Z" level=info msg="volume claims do not require changes" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:54Z" level=debug msg="syncing statefulsets" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:54Z" level=debug msg="mark rolling update annotation for harbor-db-0: reason pod not yet restarted due to lazy update" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:54Z" level=debug msg="mark rolling update annotation for harbor-db-1: reason pod not yet restarted due to lazy update" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:55Z" level=debug msg="syncing Patroni config" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:55Z" level=debug msg="making GET http request: http://10.233.69.103:8008/config" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:55Z" level=debug msg="making GET http request: http://10.233.67.21:8008/config" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:55Z" level=debug msg="making GET http request: http://10.233.69.103:8008/patroni" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:55Z" level=debug msg="making GET http request: http://10.233.67.21:8008/patroni" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:55Z" level=debug msg="performing rolling update" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:55Z" level=info msg="there are 2 pods in the cluster to recreate" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:29:55Z" level=debug msg="subscribing to pod \"systems-services-harbor/harbor-db-0\"" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:11Z" level=info msg="pod \"systems-services-harbor/harbor-db-0\" has been recreated" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:11Z" level=debug msg="unsubscribing from pod \"systems-services-harbor/harbor-db-0\" events" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:11Z" level=debug msg="making GET http request: http://10.233.67.21:8008/cluster" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:11Z" level=debug msg="switching over from \"harbor-db-1\" to \"systems-services-harbor/harbor-db-0\"" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:11Z" level=debug msg="subscribing to pod \"systems-services-harbor/harbor-db-0\"" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:11Z" level=debug msg="making POST http request: http://10.233.67.21:8008/failover" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:13Z" level=debug msg="successfully switched over from \"harbor-db-1\" to \"systems-services-harbor/harbor-db-0\"" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:14Z" level=debug msg="unsubscribing from pod \"systems-services-harbor/harbor-db-0\" events" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:14Z" level=info msg="recreating old master pod \"systems-services-harbor/harbor-db-1\"" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:14Z" level=debug msg="subscribing to pod \"systems-services-harbor/harbor-db-1\"" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:30Z" level=info msg="pod \"systems-services-harbor/harbor-db-1\" has been recreated" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:30Z" level=debug msg="unsubscribing from pod \"systems-services-harbor/harbor-db-1\" events" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:30Z" level=debug msg="syncing pod disruption budgets" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:30Z" level=debug msg="syncing roles" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:30Z" level=debug msg="closing database connection" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:30Z" level=debug msg="syncing databases" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:30Z" level=debug msg="closing database connection" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:30Z" level=debug msg="syncing prepared databases with schemas" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:30Z" level=debug msg="syncing connection pooler (master, replica) from (true, nil) to (true, nil)" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:30Z" level=debug msg="final load balancer source ranges as seen in a service spec (not necessarily applied): [\"127.0.0.1/32\"]" cluster-name=systems-services-harbor/harbor-db pkg=cluster
postgres-operator-56fb8b4bdb-zp7nw postgres-operator time="2023-10-03T12:30:30Z" level=info msg="cluster has been synced" cluster-name=systems-services-harbor/harbor-db pkg=controller worker=0
j4m3s-s commented 11 months ago

Relevant master logs:

harbor-db-0 postgres 2023-10-03 12:59:40.591 UTC [25] LOG {ticks: 0, maint: 0, retry: 0}
harbor-db-0 postgres 2023-10-03 12:59:44,406 INFO: no action. I am (harbor-db-0), the leader with the lock
harbor-db-0 postgres 2023-10-03 12:59:54,406 INFO: no action. I am (harbor-db-0), the leader with the lock
harbor-db-0 postgres 2023-10-03 13:00:04,465 INFO: no action. I am (harbor-db-0), the leader with the lock
harbor-db-0 postgres 2023-10-03 13:00:10.593 UTC [25] LOG {ticks: 0, maint: 0, retry: 0}
harbor-db-0 postgres 2023-10-03 13:00:12,521 INFO: received failover request with leader=harbor-db-0 candidate=harbor-db-1 scheduled_at=None
harbor-db-0 postgres 2023-10-03 13:00:12,531 INFO: Got response from harbor-db-1 http://10.233.67.152:8008/patroni: {"state": "running", "postmaster_start_time": "2023-10-03 13:00:11.433821+00:00", "role": "replica", "server_version": 130010, "xlog": {"received_location": 20401094656, "replayed_location": 20401094656, "replayed_timestamp": "2023-10-03 13:00:03.787290+00:00", "paused": false}, "timeline": 990, "dcs_last_seen": 1696338012, "database_system_identifier": "7276438021957627967", "patroni": {"version": "3.0.1", "scope": "harbor-db"}}
harbor-db-0 postgres 2023-10-03 13:00:12,630 INFO: Got response from harbor-db-1 http://10.233.67.152:8008/patroni: {"state": "running", "postmaster_start_time": "2023-10-03 13:00:11.433821+00:00", "role": "replica", "server_version": 130010, "xlog": {"received_location": 20401094656, "replayed_location": 20401094656, "replayed_timestamp": "2023-10-03 13:00:03.787290+00:00", "paused": false}, "timeline": 990, "dcs_last_seen": 1696338012, "database_system_identifier": "7276438021957627967", "patroni": {"version": "3.0.1", "scope": "harbor-db"}}
harbor-db-0 postgres 2023-10-03 13:00:12,578 INFO: Lock owner: harbor-db-0; I am harbor-db-0
harbor-db-0 postgres 2023-10-03 13:00:12,684 INFO: manual failover: demoting myself
harbor-db-0 postgres 2023-10-03 13:00:12,685 INFO: Demoting self (graceful)
harbor-db-0 postgres 2023-10-03 13:00:13,948 INFO: Leader key released
harbor-db-0 postgres 2023-10-03 13:00:13,952 INFO: Lock owner: harbor-db-1; I am harbor-db-0
harbor-db-0 postgres 2023-10-03 13:00:13,952 INFO: manual failover: demote in progress
harbor-db-0 postgres 2023-10-03 13:00:13,953 INFO: Lock owner: harbor-db-1; I am harbor-db-0
harbor-db-0 postgres 2023-10-03 13:00:13,953 INFO: manual failover: demote in progress
harbor-db-0 postgres 2023-10-03 13:00:14,096 INFO: Lock owner: harbor-db-1; I am harbor-db-0
harbor-db-0 postgres 2023-10-03 13:00:14,096 INFO: manual failover: demote in progress
harbor-db-0 postgres 2023-10-03 13:00:15,105 INFO: Lock owner: harbor-db-1; I am harbor-db-0
harbor-db-0 postgres 2023-10-03 13:00:15,105 INFO: manual failover: demote in progress
harbor-db-0 postgres 2023-10-03 13:00:16 UTC [740]: [1-1] 651c1060.2e4 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
harbor-db-0 postgres 2023-10-03 13:00:16 UTC [740]: [2-1] 651c1060.2e4 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
harbor-db-0 postgres 2023-10-03 13:00:16 UTC [740]: [3-1] 651c1060.2e4 0     LOG:  redirecting log output to logging collector process
harbor-db-0 postgres 2023-10-03 13:00:16 UTC [740]: [4-1] 651c1060.2e4 0     HINT:  Future log output will appear in directory "../pg_log".
harbor-db-0 postgres /var/run/postgresql:5432 - no response
harbor-db-0 postgres /var/run/postgresql:5432 - accepting connections
harbor-db-0 postgres /var/run/postgresql:5432 - accepting connections
harbor-db-0 postgres /etc/runit/runsvdir/default/patroni: finished with code=0 signal=0
harbor-db-0 postgres stopping /etc/runit/runsvdir/default/patroni
harbor-db-0 postgres timeout: finish: .: (pid 756) 1815s, want down
harbor-db-0 postgres ok: down: patroni: 0s, normally up
harbor-db-0 postgres ok: down: /etc/service/patroni: 0s, normally up
harbor-db-0 postgres 2023-10-03 13:00:24.881 UTC [25] LOG Got SIGTERM, fast exit
harbor-db-0 postgres ok: down: /etc/service/pgqd: 1s, normally up
JuanRamino commented 3 months ago

Hi there,

Same problem for me. I have a mutating webhook which changes the image of the container in the pod. The operator wants to sync the pod back to the original state, rescheduling it every 30 minutes. Is there a way to avoid this situation?

Thanks

janpipan commented 3 weeks ago

This can be solved by changing the default image either in the operatorconfigurations CRD or in your postgresql CRD. The mutating webhook should then ignore resources that already have the correct image registry configured. Also keep in mind that you have to change the image name of every sidecar container as well.
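
For example, assuming the caching proxy prefix from the original report and a Spilo tag picked only for illustration, something along these lines should make the operator's desired image already match what the webhook produces, so the comparison no longer flags a difference:

# OperatorConfiguration CRD: default image for all clusters managed by the operator
apiVersion: acid.zalan.do/v1
kind: OperatorConfiguration
metadata:
  name: postgresql-operator-configuration
configuration:
  docker_image: mycachingproxy/ghcr.io/zalando/spilo-15:3.0-p1

# ...or per cluster, in the postgresql manifest
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: harbor-db
spec:
  dockerImage: mycachingproxy/ghcr.io/zalando/spilo-15:3.0-p1

If I remember the field names correctly, the pooler and logical backup images (connection_pooler_image and logical_backup_docker_image in the operator configuration) and any custom sidecar images need the same prefix as well.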