falanger opened this issue 2 years ago
Do you by any chance have a WAL-E backup activated from an earlier deployment?
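For anyone checking the same thing, one way to see whether WAL-E/WAL-G archiving is configured in the Spilo container is to inspect its environment (pod name taken from this thread; the exact variable names depend on the setup):

```bash
# If any WAL-E/WAL-G variables are set, an earlier deployment's backups
# may still be sitting in the configured bucket.
kubectl exec acid-small-01-0 -- env | grep -iE 'wale|walg|wal_'
```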
Most likely this happens because there are old PostgreSQL data files for the acid-small-01-1 pod.
What type of storage are you using?
Any ideas on how to fix this? Just delete the data directory and reinit the node?
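Before deleting anything blindly, it may be worth checking whether volumes from an earlier deployment with the same cluster name are still around; a rough sketch (the pgdata-&lt;cluster&gt;-&lt;ordinal&gt; PVC naming is the operator's default):

```bash
# Leftover claims/volumes that may still carry the old cluster's data
kubectl get pvc --all-namespaces | grep acid-small-01
kubectl get pv | grep acid-small-01
```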
This happens to me too. No matter what I delete, I cannot recreate a new cluster with the same name. I have deleted the PV and the namespace itself, but no luck.
Is this stored somewhere in the operator that I can delete?
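For what it's worth, Patroni keeps the cluster state, including the initialize key that records the system ID, in Kubernetes objects named after the cluster rather than inside the operator itself. A sketch for inspecting them (label and object names assumed from the operator's and Patroni's defaults):

```bash
# Objects Patroni uses as its DCS when running on Kubernetes
kubectl get endpoints,configmaps -l cluster-name=acid-small-01

# The -config object's "initialize" annotation holds the system identifier
kubectl get endpoints acid-small-01-config -o yaml | grep initialize
```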
OK, so my searching led me to this: https://github.com/zalando/patroni/issues/1744
What I did was delete nodes 1 and 2 and their PVs after the initial cluster creation. After that it worked. This was after upgrading to 1.9.0.
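If it helps, the deletes described above would look roughly like this (PVC names assume the operator's default pgdata-&lt;cluster&gt;-&lt;ordinal&gt; scheme; this discards the replicas' data, so only do it while the leader is healthy):

```bash
# Drop the replicas' volumes and pods; the StatefulSet recreates the pods and
# Patroni re-bootstraps them from the leader onto fresh volumes.
kubectl delete pvc pgdata-acid-small-01-1 pgdata-acid-small-01-2 --wait=false
kubectl delete pod acid-small-01-1 acid-small-01-2
```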
Same issue here.
Same issue here, fixed using the following steps.

Steps to fix:

1. Log in to the faulty pod: `kubectl exec -i -t acid-small-01-1 -- /bin/bash`
2. Disable auto failover: `patronictl pause`
3. Restart the Patroni service: `sv restart patroni`
4. Reinit the member: `patronictl reinit acid-small-01 acid-small-01-1`
5. Re-enable auto failover: `patronictl resume`
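After the resume step, `patronictl list` should show the member back in a running/streaming state; a quick check from outside the pod could look like this (pod and cluster names as used in this thread):

```bash
kubectl exec acid-small-01-0 -- patronictl list acid-small-01
```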
When will the operator fix this?
@anikin-aa
> Most likely this happens because there are old PostgreSQL data files for the acid-small-01-1 pod. What type of storage are you using?
It's a new Postgres cluster and there are no old data files, but I have the same issue.
We had the same issue on a fresh cluster, though, with a slightly different configuration, so this might not apply to everyone here.
After the initial creation, we figured that the PVC was too big, so we needed to downsize it, which required a removal of the PVC. We thought that this would delete all data and let it recreate the cluster from scratch, but we forgot we were using automated WAL backups and didn't think about clearing the backup bucket along with the PVC.
While the leader was bootstrapped just fine (creating a new cluster with a new system ID), it caused issues when the first follower attempted a bootstrap. I assume it was seeing a mixture of two different clusters in the WAL archive, which led to a broken state.
After we eventually realized this, we paused patroni as described above, cleared the bucket, created a clean, new backup from the leader and then reinitialized the follower. After that, everything worked as expected.
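For anyone in the same spot, the recovery described above might look roughly like the sketch below; the bucket path is a placeholder and the envdir location is an assumption based on a typical Spilo/WAL-G setup:

```bash
# 1. Pause Patroni so the follower is not re-bootstrapped mid-cleanup
kubectl exec acid-small-01-0 -- patronictl pause acid-small-01

# 2. Clear the stale WAL archive (placeholder bucket/prefix)
aws s3 rm --recursive s3://my-wal-bucket/spilo/acid-small-01/

# 3. Push a fresh base backup from the current leader
kubectl exec acid-small-01-0 -- bash -c 'envdir /run/etc/wal-e.d/env wal-g backup-push "$PGDATA"'

# 4. Reinitialize the follower and re-enable automatic failover
kubectl exec acid-small-01-0 -- patronictl reinit acid-small-01 acid-small-01-1 --force
kubectl exec acid-small-01-0 -- patronictl resume acid-small-01
```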
This is still an issue for sure
- Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.8.2
- Where do you run it - cloud or metal? Kubernetes or OpenShift? Self-hosted K8s
- Are you running Postgres Operator in production? Yes
- Type of issue? Bug report
Hi everyone,

We have deployed the postgres-operator to a `postgres` namespace, and I want to create a new PostgreSQL cluster using this operator. After running

`kubectl apply -f acid-small-01.yaml`

I see two pods in the `default` namespace, but the second pod is not ready. The relevant logs show that the second pod is stuck after this message:

CRITICAL: system ID mismatch, node acid-small-01-1 belongs to a different cluster: 7158241270150770770 != 7157772675274813522
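For context, the two numbers are the system identifier recorded in Patroni's DCS when the cluster was initialized and the one in the local data directory's control file; both can be inspected directly (a sketch, pod and object names from this thread, the -config endpoint name assumed from Patroni's Kubernetes defaults):

```bash
# System identifier of the stuck member's local data directory
kubectl exec acid-small-01-1 -- bash -c 'pg_controldata "$PGDATA" | grep "system identifier"'

# System identifier the Patroni cluster was initialized with
kubectl get endpoints acid-small-01-config -o jsonpath='{.metadata.annotations.initialize}'
```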
Steps to fix:

1. `kubectl exec -i -t acid-small-01-1 -- /bin/bash`
2. `patronictl pause`
3. `sv restart patroni`
4. `patronictl reinit acid-small-01 acid-small-01-1`
5. `patronictl resume`
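The same sequence can also be run non-interactively, which can be handy when scripting the fix; this is just a sketch wrapping the commands above in kubectl exec (cluster and pod names taken from this issue):

```bash
#!/usr/bin/env bash
set -euo pipefail

CLUSTER=acid-small-01      # Patroni scope / cluster name
MEMBER=acid-small-01-1     # member reporting the system ID mismatch

kubectl exec "$MEMBER" -- patronictl pause "$CLUSTER"
kubectl exec "$MEMBER" -- sv restart patroni
kubectl exec "$MEMBER" -- patronictl reinit "$CLUSTER" "$MEMBER" --force
kubectl exec "$MEMBER" -- patronictl resume "$CLUSTER"
```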