zalando / spilo

Highly available elephant herd: HA PostgreSQL cluster using Docker
Apache License 2.0

Stuck at waiting for leader to bootstrap #690

Closed vishrantgupta closed 2 years ago

vishrantgupta commented 2 years ago

This is the image I am using:

repository: registry.opensource.zalan.do/acid/spilo-14
tag: 2.1-p3

and the helm chart is https://github.com/helm/charts/tree/master/incubator/patroni

When I install it using helm install patroni-postgres ., the cluster gets stuck waiting for leader election:

 🐢 5947 ~> k logs -f patroni-postgres-0
2022-01-20 04:59:09,905 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2022-01-20 04:59:11,913 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup
2022-01-20 04:59:11,915 - bootstrapping - INFO - No meta-data available for this provider
2022-01-20 04:59:11,916 - bootstrapping - INFO - Looks like your running local
2022-01-20 04:59:11,959 - bootstrapping - INFO - Configuring crontab
2022-01-20 04:59:11,960 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2022-01-20 04:59:11,960 - bootstrapping - INFO - Configuring pgqd
2022-01-20 04:59:11,960 - bootstrapping - INFO - Configuring patroni
2022-01-20 04:59:11,968 - bootstrapping - INFO - Writing to file /run/postgres.yml
2022-01-20 04:59:11,969 - bootstrapping - INFO - Configuring wal-e
2022-01-20 04:59:11,969 - bootstrapping - INFO - Configuring pam-oauth2
2022-01-20 04:59:11,969 - bootstrapping - INFO - No PAM_OAUTH2 configuration was specified, skipping
2022-01-20 04:59:11,969 - bootstrapping - INFO - Configuring certificate
2022-01-20 04:59:11,969 - bootstrapping - INFO - Generating ssl self-signed certificate
2022-01-20 04:59:12,033 - bootstrapping - INFO - Configuring standby-cluster
2022-01-20 04:59:12,033 - bootstrapping - INFO - Configuring log
2022-01-20 04:59:12,033 - bootstrapping - INFO - Configuring bootstrap
2022-01-20 04:59:12,033 - bootstrapping - INFO - Configuring pgbouncer
2022-01-20 04:59:12,033 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2022-01-20 04:59:12,275 INFO: Selected new K8s API server endpoint https://10.180.13.96:6443
2022-01-20 04:59:12,297 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-01-20 04:59:12,299 INFO: Lock owner: None; I am patroni-postgres-0
2022-01-20 04:59:12,401 INFO: waiting for leader to bootstrap
2022-01-20 04:59:22,803 INFO: Lock owner: None; I am patroni-postgres-0
2022-01-20 04:59:22,804 INFO: waiting for leader to bootstrap
2022-01-20 04:59:32,804 INFO: Lock owner: None; I am patroni-postgres-0
2022-01-20 04:59:32,804 INFO: waiting for leader to bootstrap
2022-01-20 04:59:42,803 INFO: Lock owner: None; I am patroni-postgres-0
2022-01-20 04:59:42,803 INFO: waiting for leader to bootstrap
2022-01-20 04:59:52,802 INFO: Lock owner: None; I am patroni-postgres-0

The /home/postgres/pgdata/pgroot/data directory is empty.

These are the running processes:

root@patroni-postgres-0:/home/postgres/pgdata/pgroot/data# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 04:59 ?        00:00:00 /usr/bin/dumb-init -c --rewrite 1:0 -- /bin/sh /launch.sh
root           7       1  0 04:59 ?        00:00:00 /bin/sh /launch.sh
root          32       7  0 04:59 ?        00:00:00 /usr/bin/runsvdir -P /etc/service
root          33      32  0 04:59 ?        00:00:00 runsv pgqd
root          34      32  0 04:59 ?        00:00:00 runsv patroni
postgres      35      33  0 04:59 ?        00:00:00 /bin/bash /scripts/patroni_wait.sh --role master -- /usr/bin/pgqd /home/postgres/pgq_ticker.ini
postgres      36      34  0 04:59 ?        00:00:00 /usr/bin/python3 /usr/local/bin/patroni /home/postgres/postgres.yml
root          65       0  0 05:01 pts/0    00:00:00 bash
postgres      97      35  0 05:05 ?        00:00:00 sleep 60
root          98      65  0 05:05 pts/0    00:00:00 ps -ef
CyberDem0n commented 2 years ago

Congrats, you killed your cluster. Patroni relies not only on PGDATA; it also keeps the cluster state externally (in K8s Endpoints or ConfigMaps). Hence it knows that a cluster with this SCOPE already existed and refuses to initialize a new one even though PGDATA is empty. You can verify this by running patronictl list.

Please learn more about Patroni by going through the tutorial: https://github.com/patroni-training/2019
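With the Kubernetes DCS, that externally kept state can be inspected directly. A sketch, assuming the default namespace, the default cluster-name scope label, and Endpoints as the DCS objects (substitute configmaps if your setup uses those instead):

```shell
# List the Kubernetes objects Patroni created for this SCOPE
kubectl get endpoints -l cluster-name=patroni-postgres

# The "initialize" annotation on the <scope>-config object holds the
# system identifier of the previously bootstrapped cluster; if it is
# set, Patroni considers the cluster already initialized
kubectl get endpoints patroni-postgres-config \
  -o jsonpath='{.metadata.annotations.initialize}'
```

If the second command prints an identifier, that is the old cluster Patroni is refusing to re-initialize over.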

vishrantgupta commented 2 years ago

This was a completely new installation using the Helm chart https://github.com/helm/charts/tree/master/incubator/patroni and the spilo image. Why would PGDATA be empty? Where does it get initialized?

patronictl list output

root@patroni-postgres-2:/home/postgres# patronictl list
+ Cluster: patroni-postgres (7053102555993124938) --------+----+-----------+
| Member             | Host           | Role    | State   | TL | Lag in MB |
+--------------------+----------------+---------+---------+----+-----------+
| patroni-postgres-0 | 10.233.90.15   | Replica | stopped |    |   unknown |
| patroni-postgres-1 | 10.233.96.211  | Replica | stopped |    |   unknown |
| patroni-postgres-2 | 10.233.105.110 | Replica | stopped |    |   unknown |
+--------------------+----------------+---------+---------+----+-----------+
CyberDem0n commented 2 years ago

This was a completely new installation using the Helm chart

Sorry, but I trust the patronictl list output, which clearly shows that a cluster named patroni-postgres already existed. I can even tell when it was created: Fri Jan 14 17:40:07 2022

vishrantgupta commented 2 years ago

I did create a cluster with the name patroni-postgres on Fri Jan 14 17:40:07 2022, but I deleted the Helm deployment using helm uninstall patroni-postgres, followed by deleting the PVC and PV (LocalVolume). I thought Postgres had been completely removed, but it looks like that is not true.

Then I redeployed the Helm chart with the same name patroni-postgres; however, deploying the chart with a different name fixes the issue.

I also tried patronictl remove patroni-postgres, but the outcome is the same:

root@patroni-postgres-2:/home/postgres# patronictl remove patroni-postgres
+ Cluster: patroni-postgres (7053102555993124938) --------+----+-----------+
| Member             | Host           | Role    | State   | TL | Lag in MB |
+--------------------+----------------+---------+---------+----+-----------+
| patroni-postgres-0 | 10.233.90.15   | Replica | stopped |    |   unknown |
| patroni-postgres-1 | 10.233.96.211  | Replica | stopped |    |   unknown |
| patroni-postgres-2 | 10.233.105.110 | Replica | stopped |    |   unknown |
+--------------------+----------------+---------+---------+----+-----------+
Please confirm the cluster name to remove: patroni-postgres
You are about to remove all information in DCS for patroni-postgres, please type: "Yes I am aware": Yes I am aware
root@patroni-postgres-2:/home/postgres# patronictl list
+ Cluster: patroni-postgres (uninitialized) ----+---------+----+-----------+
| Member             | Host           | Role    | State   | TL | Lag in MB |
+--------------------+----------------+---------+---------+----+-----------+
| patroni-postgres-0 | 10.233.90.15   | Replica | stopped |    |   unknown |
| patroni-postgres-1 | 10.233.96.211  | Replica | stopped |    |   unknown |
| patroni-postgres-2 | 10.233.105.110 | Replica | stopped |    |   unknown |
+--------------------+----------------+---------+---------+----+-----------+
CyberDem0n commented 2 years ago

Patroni relies not only on PGDATA; it also keeps the cluster state externally (in K8s Endpoints or ConfigMaps). These objects are not created by the Helm chart, and hence they are not removed when you delete everything else.
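A sketch of a full teardown under those assumptions (default namespace, default cluster-name scope label); inspect what the selector matches before deleting anything:

```shell
# Uninstall the release and remove its storage, as before
helm uninstall patroni-postgres
kubectl delete pvc -l app=patroni-postgres   # label depends on the chart

# Also remove the DCS objects Patroni itself created for this SCOPE
# (Endpoints and/or ConfigMaps, depending on the DCS configuration);
# this clears the "initialize" key so a reinstall with the same name
# can bootstrap from scratch
kubectl delete endpoints,configmaps -l cluster-name=patroni-postgres
```

Deleting these objects destroys the recorded cluster state, so only do this when the old data is genuinely meant to be gone.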