zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

System ID mismatch when creating a new cluster #2090

Open falanger opened 2 years ago

falanger commented 2 years ago

Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.8.2
Where do you run it - cloud or metal? Kubernetes or OpenShift? Self hosted K8S

Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-08-05T16:28:52Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

Are you running Postgres Operator in production? Yes
Type of issue? Bug report

Hi everyone. We have deployed the postgres-operator to the postgres namespace, and I want to create a new PostgreSQL cluster using it:

apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"acid.zalan.do/v1","kind":"postgresql","metadata":{"annotations":{},"name":"acid-small-01"},"spec":{"allowedSourceRanges":[],"databases":{"graph":"graph"},"enableConnectionPooler":false,"enableMasterLoadBalancer":false,"enableReplicaLoadBalancer":false,"numberOfInstances":4,"postgresql":{"version":"14"},"resources":{"limits":{"cpu":"3500m","memory":"7000Mi"},"requests":{"cpu":"3500m","memory":"7000Mi"}},"teamId":"acid","users":{"graph":["superuser"]},"volume":{"iops":8000,"size":"2000Gi","storageClass":"lowlatency8k","throughput":600}}}
  name: acid-small-01
spec:
  allowedSourceRanges: []
  databases:
    graph: graph
  enableConnectionPooler: false
  enableMasterLoadBalancer: false
  enableReplicaLoadBalancer: false
  numberOfInstances: 4
  postgresql:
    version: "14"
  resources:
    limits:
      cpu: 3500m
      memory: 7000Mi
    requests:
      cpu: 3500m
      memory: 7000Mi
  teamId: acid
  users:
    graph:
    - superuser

After running kubectl apply -f acid-small-01.yaml I see two pods in the default namespace:

NAME                                            READY   STATUS    RESTARTS   AGE
acid-small-01-0                                 1/1     Running   0          3m51s
acid-small-01-1                                 0/1     Running   0          3m30s

As you can see, the second pod is not ready. The relevant logs:

2022-10-25 00:33:43,107 INFO: Lock owner: acid-small-01-0; I am acid-small-01-1
2022-10-25 00:33:43,108 INFO: Still starting up as a standby.
2022-10-25 00:33:43,108 CRITICAL: system ID mismatch, node acid-small-01-1 belongs to a different cluster: 7158241270150770770 != 7157772675274813522
2022-10-25 00:33:43,108 INFO: establishing a new patroni connection to the postgres cluster
2022-10-25 00:33:43,891 INFO: establishing a new patroni connection to the postgres cluster
2022-10-25 00:33:43,893 WARNING: Retry got exception: 'connection problems'
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: sleeping 30 seconds
2022-10-25 00:34:14,354 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-10-25 00:34:14,426 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-10-25 00:34:14,438 CRITICAL: system ID mismatch, node acid-small-01-1 belongs to a different cluster: 7158241270150770770 != 7157772675274813522
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: sleeping 60 seconds
2022-10-25 00:35:14,920 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-10-25 00:35:14,956 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-10-25 00:35:14,966 CRITICAL: system ID mismatch, node acid-small-01-1 belongs to a different cluster: 7158241270150770770 != 7157772675274813522
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: sleeping 90 seconds
2022-10-25 00:36:45,825 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-10-25 00:36:45,901 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-10-25 00:36:45,911 CRITICAL: system ID mismatch, node acid-small-01-1 belongs to a different cluster: 7158241270150770770 != 7157772675274813522
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: sleeping 120 seconds

So the second pod is stuck after this message: CRITICAL: system ID mismatch, node acid-small-01-1 belongs to a different cluster: 7158241270150770770 != 7157772675274813522

Steps to fix:

  1. Login to the faulty pod: kubectl exec -i -t acid-small-01-1 -- /bin/bash
  2. Disable auto failover: patronictl pause
  3. Restart patroni service: sv restart patroni
  4. Reinit the member: patronictl reinit acid-small-01 acid-small-01-1
  5. Enable auto failover: patronictl resume
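
For reference, the same recovery can be driven from outside the pod with kubectl. This is only a sketch of the steps above; the cluster and member names are taken from this issue and may need adjusting for your setup:

# Sketch: run the recovery steps listed above via kubectl.
CLUSTER=acid-small-01
POD=acid-small-01-1

kubectl exec "$POD" -- patronictl pause "$CLUSTER"                    # disable auto failover
kubectl exec "$POD" -- sv restart patroni                             # restart the patroni service
kubectl exec "$POD" -- patronictl reinit --force "$CLUSTER" "$POD"    # wipe and re-bootstrap this member
kubectl exec "$POD" -- patronictl resume "$CLUSTER"                   # re-enable auto failover
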
ntcong commented 1 year ago

Do you by any chance have a WAL-E backup activated from an earlier deployment?
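
One way to check that (a sketch, assuming the stock Spilo image, which passes backup settings to WAL-E/WAL-G through environment variables) is to look for those variables on a member pod:

# List backup-related environment variables on a member pod.
# Spilo configures WAL-E/WAL-G via variables such as WAL_S3_BUCKET or WALG_*.
kubectl exec acid-small-01-0 -- env | grep -iE 'wal|backup|clone' | sort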

anikin-aa commented 1 year ago

Most likely this happens because there are old PostgreSQL data files on the acid-small-01-1 pod's volume.

What type of storage are you using?
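
A quick way to verify this (a sketch; /home/postgres/pgdata/pgroot/data is the default data directory in the Spilo image, adjust if yours differs) is to compare the system identifier stored on each member's volume:

# Read the system identifier baked into each member's data directory (use the full
# /usr/lib/postgresql/14/bin/pg_controldata path if the binary is not on PATH).
# A mismatch confirms the follower's volume holds data from an older cluster.
kubectl exec acid-small-01-0 -- pg_controldata /home/postgres/pgdata/pgroot/data | grep 'system identifier'
kubectl exec acid-small-01-1 -- pg_controldata /home/postgres/pgdata/pgroot/data | grep 'system identifier'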

SNThrailkill commented 1 year ago

Any ideas on how to fix this? Just delete the data directory and reinit the node?

jonathon2nd commented 1 year ago

This happens to me too. No matter what I delete, I cannot recreate a new cluster with the same name. I have deleted the PV and the namespace itself, but no luck.

Is this stored somewhere in the operator that I can delete?

jonathon2nd commented 1 year ago

OK, so my searching led me to this: https://github.com/zalando/patroni/issues/1744

What I did was delete nodes 1 and 2 and their PVs after the initial cluster creation. After that it worked. This is with the operator upgraded to 1.9.0.
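
For anyone hitting the same dead end: besides the volumes and any WAL backup bucket, Patroni also stores its cluster state, including the initialize key that holds the system ID, as annotations on a <clustername>-config Endpoints (or ConfigMap) object. A sketch of how to inspect and, only if you really are destroying and recreating the cluster from scratch, remove that state along with the data volumes (object names assume the defaults and the cluster name from this issue):

# Show the Patroni DCS object holding the old system ID (the "initialize" annotation).
kubectl -n default get endpoints acid-small-01-config -o yaml

# Only when deliberately destroying and recreating the cluster:
kubectl -n default delete postgresql acid-small-01
kubectl -n default delete endpoints acid-small-01-config
kubectl -n default delete pvc pgdata-acid-small-01-0 pgdata-acid-small-01-1   # default pgdata-<pod> naming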

mensylisir commented 1 year ago

Same issue here.

mensylisir commented 1 year ago

I had the same issue and fixed it with the steps below.

Steps to fix:

Login to the faulty pod: kubectl exec -i -t acid-small-01-1 -- /bin/bash
Disable auto failover: patronictl pause
Restart patroni service: sv restart patroni
Reinit the member: patronictl reinit acid-small-01 acid-small-01-1
Enable auto failover: patronictl resume

When will the operator fix this?

mensylisir commented 1 year ago

@anikin-aa

I had the same issue and fixed it using the following steps:

Login to the faulty pod: kubectl exec -i -t acid-small-01-1 -- /bin/bash
Disable auto failover: patronictl pause
Restart patroni service: sv restart patroni
Reinit the member: patronictl reinit acid-small-01 acid-small-01-1
Enable auto failover: patronictl resume

When will the operator fix this?

mensylisir commented 1 year ago

Most likely this happens because there are old PostgreSQL data files on the acid-small-01-1 pod's volume.

What type of storage are you using?

It's a new Postgres cluster and there are no old data files, but I have the same issue.

Gerrit-K commented 11 months ago

We had the same issue on a fresh cluster, though with a slightly different configuration, so this might not apply to everyone here.

After the initial creation, we figured that the PVC was too big, so we needed to downsize it, which required removing the PVC. We thought this would delete all data and let the operator recreate the cluster from scratch, but we forgot we were using automated WAL backups and didn't think about clearing the backup bucket along with the PVC.

While the leader was bootstrapped just fine (creating a new cluster with a new system ID), it caused issues when the first follower attempted a bootstrap. I assume it was seeing a mixture of two different clusters in the WAL archive, which led to a broken state.

After we eventually realized this, we paused patroni as described above, cleared the bucket, created a clean, new backup from the leader and then reinitialized the follower. After that, everything worked as expected.
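
In case it helps others, a rough sketch of that recovery (assuming WAL-G in the stock Spilo image and a paused Patroni; the envdir path and the example bucket path may differ in your setup):

# 1. Clear the stale WAL archive for this cluster (example path, adjust to your bucket layout):
#    aws s3 rm s3://my-backup-bucket/spilo/acid-small-01/ --recursive

# 2. Push a fresh base backup from the current leader so followers bootstrap from clean data.
kubectl exec acid-small-01-0 -- bash -c 'envdir /run/etc/wal-e.d/env wal-g backup-push /home/postgres/pgdata/pgroot/data'

# 3. Reinitialize the broken follower, then resume Patroni.
kubectl exec acid-small-01-0 -- patronictl reinit --force acid-small-01 acid-small-01-1
kubectl exec acid-small-01-0 -- patronictl resume acid-small-01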

Rohmilchkaese commented 7 months ago

This is still an issue for sure