zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Cluster failed to progress beyond "initialization" after initialization and corruption #1321

Open · stevefan1999-personal opened this issue 3 years ago

stevefan1999-personal commented 3 years ago

Please answer some short questions which should help us to understand your problem / question better:

Steps to reproduce:

  1. Create a cluster
  2. Have a catastrophic event that potentially corrupts Patroni data volume (mine was OpenEBS volume attaching crashing cascades)
  3. Try to restart the cluster
  4. See a lot of these lines in the log
    2021-01-19 18:19:19,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:19:29,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:19:29,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:19:39,751 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:19:39,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:19:49,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:19:49,750 INFO: waiting for leader to bootstrap
    2021-01-19 18:19:59,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:19:59,750 INFO: waiting for leader to bootstrap
    2021-01-19 18:20:09,751 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:20:09,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:20:19,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:20:19,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:20:29,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:20:29,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:20:39,751 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:20:39,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:20:49,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:20:49,750 INFO: waiting for leader to bootstrap
    2021-01-19 18:20:59,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:20:59,750 INFO: waiting for leader to bootstrap
    2021-01-19 18:21:09,751 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:21:09,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:21:19,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:21:19,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:21:29,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:21:29,750 INFO: waiting for leader to bootstrap
    2021-01-19 18:21:39,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:21:39,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:21:49,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:21:49,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:21:59,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:21:59,750 INFO: waiting for leader to bootstrap    

    ...and it keeps going ad infinitum. Increasing or decreasing the number of cluster instances does not help (see the diagnostic sketch below).
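
Before digging further, a few commands can show what Patroni and the operator think the cluster state is. A minimal diagnostic sketch, assuming the operator's default labels and the pod/cluster names from the logs above (acid-test-0 / acid-test):

    kubectl exec -it acid-test-0 -- patronictl list           # member roles, state and timeline
    kubectl get endpoints -l cluster-name=acid-test -o yaml   # Patroni keeps leader/config state in annotations
    kubectl logs acid-test-0                                   # full Spilo/Patroni log, not just the tail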

My speculation is that Patroni is now stuck in limbo because of corruption introduced while I was repairing the damage from my network outage (the outage made OpenEBS fail, so I had to run fsck manually on all PV[C]s). This is a test database and I haven't made any backups yet (for some reason backups didn't work as expected; I can't even open the backup tab in the UI), but I still want to rescue it because, from what I observed, the pgdata behind it is perfectly intact.

I've seen in an article that removing an etcd key would help bypass the "initialization" process. This, however, makes me wonder whether Patroni under the k8s operator ever uses the internal etcd of the k8s master nodes at all.
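
For reference, Patroni as deployed by this operator (Spilo) normally uses the Kubernetes API itself as its DCS rather than the control plane's etcd; cluster state, including the "initialize" key, is kept as annotations on the <cluster>-config endpoints object (or configmap, depending on configuration). A hedged sketch of what the etcd-key trick from the article would map to under that default setup:

    kubectl get endpoints acid-test-config -o yaml   # look for the "initialize" annotation
    # Clearing that annotation is roughly the equivalent of removing the etcd key.
    # Destructive: only for a cluster you are prepared to re-bootstrap.
    # kubectl annotate endpoints acid-test-config initialize-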

stevefan1999-personal commented 3 years ago

I have solved the problem myself. I spotted that inside the container, after switching to the postgres account with su - postgres, I could not access the pgdata folder. And it's all because:

d--------- 4 root     root 4096 Dec 21 12:38 pgdata

Given that, I reset the permissions to 777 and voilà, it works again. Maybe I should direct this issue back to upstream Patroni instead?
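
For reference, 777 works as a blunt instrument on the mount point, but PostgreSQL itself insists on 0700 (or 0750 on v11+) for the data directory proper and will refuse to start otherwise. A tighter fix, sketched under the assumption of the Spilo layout /home/postgres/pgdata/pgroot/data and run as root inside the container:

    chown -R postgres:postgres /home/postgres/pgdata
    chmod 750 /home/postgres/pgdata               # the mount point only needs to be traversable by postgres
    chmod 700 /home/postgres/pgdata/pgroot/data   # PostgreSQL rejects anything looser on the data directory itself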

stevefan1999-personal commented 3 years ago

I tried manually starting the server via pg_ctl:

$ PGDATA=~/pgdata/pgroot/data pg_ctl start

It didn't work at first: the log file showed that the "recovery.conf" style of recovery configuration is not supported (?). So I deleted that file and it seems to work now. I have no idea why this is happening.
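
A likely explanation (not confirmed from this log alone): PostgreSQL 12 and later no longer read recovery.conf and refuse to start if one is present in the data directory. Its former settings moved into postgresql.conf / postgresql.auto.conf, and the recovery mode is now selected by empty signal files, roughly:

    touch "$PGDATA/standby.signal"    # start as a standby (what standby_mode = 'on' used to do)
    touch "$PGDATA/recovery.signal"   # perform targeted recovery, then act as a primary
    # primary_conninfo, restore_command and recovery_target_* now go into postgresql.conf instead.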