zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

[question] pg_wal eats disk because of inactive replication slot #2012

Open tydra-wang opened 2 years ago

tydra-wang commented 2 years ago


The first time I found my PostgreSQL unavailable because the PVC in a pod was 100% full, I just expanded the PVC. However, it failed again a few days later.

Finally I found out this may be caused by inactive replication slots. Running select * from pg_replication_slots in the master pod, I saw two inactive replication slots. I fixed it by manually recreating the pods and PVCs of the two replicas (kubectl delete pod and pvc), and it went back to normal. The master cleaned up the WAL once all replication slots were active again.
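For anyone checking the same symptom, here is a minimal sketch of a narrower query than select *: it lists only the inactive slots and roughly how much WAL each one is holding back. It assumes PostgreSQL 10 or newer (for pg_current_wal_lsn and pg_wal_lsn_diff) and that psql can connect as a superuser from inside the master pod. The function names and the PSQL_USER variable are hypothetical, not something the operator provides.

```shell
#!/bin/sh
# Sketch: list inactive replication slots and the WAL each one holds back.
# Assumption: PostgreSQL 10+ and a superuser connection from the master pod.

# Print the SQL so it can be inspected or piped into psql.
inactive_slots_query() {
    cat <<'SQL'
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active;
SQL
}

# Run the query. PSQL_USER is a hypothetical variable, defaulting to "postgres".
list_inactive_slots() {
    inactive_slots_query | psql -U "${PSQL_USER:-postgres}" -At
}
```

Calling list_inactive_slots in the master pod should print one line per inactive slot. A large retained_wal value on an inactive slot is consistent with pg_wal growing until the disk is full.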

I have a few questions about this problem:

Thanks!

related issue:

aikoven commented 1 year ago

Stumbled upon the same problem.

Used patronictl list to find out that two replicas were lagging:

root@my-postgres-0:/home/postgres# patronictl list
+ Cluster: my-postgres (7166512203438432325) -----+----+-----------+
| Member        | Host        | Role    | State   | TL | Lag in MB |
+---------------+-------------+---------+---------+----+-----------+
| my-postgres-0 | 10.0.102.49 | Leader  | running |  3 |           |
| my-postgres-1 | 10.0.103.13 | Replica | running |  2 |       515 |
| my-postgres-2 | 10.0.106.68 | Replica | running |  2 |       515 |
+---------------+-------------+---------+---------+----+-----------+
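To spot this automatically, a rough sketch like the following could filter the table above for members whose lag exceeds a threshold. It assumes the column layout shown in this output (Member, Host, Role, State, TL, Lag in MB); other Patroni versions may print different columns. The function name lagging_members is hypothetical.

```shell
#!/bin/sh
# Sketch: print members from `patronictl list` output whose lag exceeds a threshold.
# Assumption: the table layout matches the output shown in this thread.

lagging_members() {
    threshold_mb="${1:-0}"
    awk -F'|' -v max="$threshold_mb" '
        # Data rows split into 8 fields: empty, Member, Host, Role, State, TL, Lag, empty.
        NF >= 7 {
            member = $2; lag = $7
            gsub(/ /, "", member); gsub(/ /, "", lag)
            # Skip the header row and rows without a numeric lag (e.g. the leader).
            if (lag ~ /^[0-9]+$/ && lag + 0 > max + 0) print member, lag
        }
    '
}
```

Usage: patronictl list | lagging_members 100 prints the name and lag of every member more than 100 MB behind, one per line.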

Then reinitialized these replicas using patronictl reinit (note that this will delete all data on the replica and pull it from the leader):

root@my-postgres-0:/home/postgres# patronictl reinit my-postgres my-postgres-1
root@my-postgres-0:/home/postgres# patronictl reinit my-postgres my-postgres-2

After a while, the WAL size decreased.