zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License
4.37k stars 980 forks source link

consistent fails of logical backup #2746

Open nickmansrob opened 2 months ago

nickmansrob commented 2 months ago

Please, answer some short questions which should help us to understand your problem / question better?

Since we moved our production database to this operator, the logical backup hasn't succeeded. We get following error:

pg_dump: error: Dumping the contents of table "courses" failed: PQgetResult() failed.
pg_dump: detail: Error message from server: ERROR:  canceling statement due to conflict with recovery
DETAIL:  User query might have needed to see row versions that must be removed.

The table is not always the same, but it is always a table where there's lots of activity.

We figured this error is due to data getting deleted from the replica (we have 1 master and 2 replicas) and pg_dump trying to still copy it, or something like that; we aren't experts in Postgres replication unfortunately. We are aware there are certain postgres parameters which we can set to try to avoid this, but we aren't sure which one, and we don't want the hassle of finding it out just to fix a cronjob which runs one time a day.

As the cronjob runs at night, when our load is fairly low (but still high enough to let the backup fail), we would like to use the master for this, as such problems do not arise then. Now, this is hardcoded in the backup script to use the replicas when possible.

I therefore propose an extra config parameter to force the usage of the master node. If there are other suggestions which are easier to implement or just a recommendation on what parameter we should use, that would also be great.