zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Cluster failed to progress beyond "initialization" after initialization and corruption #1321

Open · stevefan1999-personal opened this issue 3 years ago

stevefan1999-personal commented 3 years ago

Please answer some short questions which should help us to understand your problem / question better:

Steps to reproduce:

  1. Create a cluster
  2. Have a catastrophic event that potentially corrupts Patroni data volume (mine was OpenEBS volume attaching crashing cascades)
  3. Try to restart the cluster
  4. See a lot of these lines in the log
    2021-01-19 18:19:19,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:19:29,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:19:29,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:19:39,751 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:19:39,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:19:49,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:19:49,750 INFO: waiting for leader to bootstrap
    2021-01-19 18:19:59,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:19:59,750 INFO: waiting for leader to bootstrap
    2021-01-19 18:20:09,751 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:20:09,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:20:19,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:20:19,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:20:29,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:20:29,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:20:39,751 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:20:39,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:20:49,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:20:49,750 INFO: waiting for leader to bootstrap
    2021-01-19 18:20:59,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:20:59,750 INFO: waiting for leader to bootstrap
    2021-01-19 18:21:09,751 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:21:09,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:21:19,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:21:19,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:21:29,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:21:29,750 INFO: waiting for leader to bootstrap
    2021-01-19 18:21:39,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:21:39,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:21:49,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:21:49,751 INFO: waiting for leader to bootstrap
    2021-01-19 18:21:59,750 INFO: Lock owner: None; I am acid-test-0
    2021-01-19 18:21:59,750 INFO: waiting for leader to bootstrap    

    ...and it keeps going ad infinitum. Increasing or decreasing the number of cluster instances does not help (see the diagnostic sketch below).
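
Before digging further, a few commands can show what Patroni and the operator think the cluster state is. A minimal diagnostic sketch, assuming the operator's default labels and the pod/cluster names from the logs above (acid-test-0 / acid-test):

    kubectl exec -it acid-test-0 -- patronictl list           # member roles, state and timeline
    kubectl get endpoints -l cluster-name=acid-test -o yaml   # Patroni keeps leader/config state in annotations
    kubectl logs acid-test-0                                   # full Spilo/Patroni log, not just the tail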

My speculation is that Patroni is now stuck in limbo because of corruption introduced while I was repairing the damage from my network outage (the outage made OpenEBS fail, so I had to run fsck manually on all PV[C]s). This is a test database and I haven't made any backups yet (for some reason backups didn't work as expected; I can't even open the backup tab in the UI), but I still want to rescue it because, from what I observed, the pgdata behind it is perfectly intact.

I've seen in an article that removing an etcd key would help bypass the "initialization" process. This, however, makes me wonder whether Patroni under the k8s operator ever uses the internal etcd of the k8s master nodes at all.
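
For reference, Patroni as deployed by this operator (Spilo) normally uses the Kubernetes API itself as its DCS rather than the control plane's etcd; cluster state, including the "initialize" key, is kept as annotations on the <cluster>-config endpoints object (or configmap, depending on configuration). A hedged sketch of what the etcd-key trick from the article would map to under that default setup:

    kubectl get endpoints acid-test-config -o yaml   # look for the "initialize" annotation
    # Clearing that annotation is roughly the equivalent of removing the etcd key.
    # Destructive: only for a cluster you are prepared to re-bootstrap.
    # kubectl annotate endpoints acid-test-config initialize-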

stevefan1999-personal commented 3 years ago

I have solved the problem myself. I spotted that inside the container, after switching to the postgres account with su - postgres, I could not access the pgdata folder. And it's all because:

d--------- 4 root     root 4096 Dec 21 12:38 pgdata

Given that, I reset the permissions to 777 and voilà, it works again. Maybe I should direct this issue back to upstream Patroni instead?
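
For reference, 777 works as a blunt instrument on the mount point, but PostgreSQL itself insists on 0700 (or 0750 on v11+) for the data directory proper and will refuse to start otherwise. A tighter fix, sketched under the assumption of the Spilo layout /home/postgres/pgdata/pgroot/data and run as root inside the container:

    chown -R postgres:postgres /home/postgres/pgdata
    chmod 750 /home/postgres/pgdata               # the mount point only needs to be traversable by postgres
    chmod 700 /home/postgres/pgdata/pgroot/data   # PostgreSQL rejects anything looser on the data directory itself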

stevefan1999-personal commented 3 years ago

I tried manually starting the server via pg_ctl:

$ PGDATA=~/pgdata/pgroot/data pg_ctl start

It didn't work at first: the log file showed that the "recovery.conf" style of recovery configuration is not supported (?). So I deleted that file and it seems to work now. I have no idea why this is happening.
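
A likely explanation (not confirmed from this log alone): PostgreSQL 12 and later no longer read recovery.conf and refuse to start if one is present in the data directory. Its former settings moved into postgresql.conf / postgresql.auto.conf, and the recovery mode is now selected by empty signal files, roughly:

    touch "$PGDATA/standby.signal"    # start as a standby (what standby_mode = 'on' used to do)
    touch "$PGDATA/recovery.signal"   # perform targeted recovery, then act as a primary
    # primary_conninfo, restore_command and recovery_target_* now go into postgresql.conf instead.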