Open davidkarlsen opened 3 years ago
ok, I did a spilo-resync on the replica, and now it failed over just fine.
so this is not an operator problem, but if anyone knows a good metric to check for when WAL sync is lagging, then we can stay on top of it.
@davidkarlsen how did you execute spilo-resync on the replica? there seems to be no doc about this
I no longer work at the place and don't have my notes, but exec into one of the pods, run ps xa and see which users the processes run as, then su to the spilo one (I think it was a separate user, patroni or the like, or it was the postgres user). Then run spilo / patroni with --help
Thanks, I have fallen back to simply bootstrapping the node again by renaming the data dir. Will check out your comments though.
Please, answer some short questions which should help us to understand your problem / question better?
Which image of the operator are you using? 1.6.2
Where do you run it - cloud or metal? Kubernetes or OpenShift? [AWS K8s | GCP ... | Bare Metal K8s] openshift
Are you running Postgres Operator in production? [yes | no] no
Type of issue? [Bug report, question, feature request, etc.] question/bug
I am cordoning the node which hosts the master pod, and the operator reports:
The reason for it failing seems to be replication lag:
my manifest is:
What could cause the replication lag, and why is the replica not catching up? The database is basically idle. Is there a metric one can track in order to raise alerts when the lag is too large?
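On the metric question, one possible approach is to compare WAL positions (LSNs) between the primary and the replica and alert on the byte difference. Below is a minimal sketch, not a definitive implementation. It assumes the LSN strings have already been fetched, for example from `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on the replica, or from the `replay_lsn` column of `pg_stat_replication`. The helper names and the 1 MiB threshold are hypothetical and only for illustration; the threshold should be chosen to match whatever lag limit your failover configuration (e.g. Patroni's `maximum_lag_on_failover`) uses.

```python
# Sketch: compute replication lag in bytes from two PostgreSQL LSN strings.
# An LSN is printed as "HI/LO", two hexadecimal numbers; the 64-bit WAL
# position is (HI << 32) + LO.

def lsn_to_int(lsn: str) -> int:
    """Convert an LSN string such as '0/3000060' to an integer byte position."""
    hi, lo = lsn.strip().split("/")
    return (int(hi, 16) << 32) + int(lo, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica is behind the primary (never negative)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn))


def should_alert(primary_lsn: str, replica_lsn: str,
                 threshold_bytes: int = 1024 * 1024) -> bool:
    """True when the lag exceeds the threshold (hypothetical default: 1 MiB)."""
    return lag_bytes(primary_lsn, replica_lsn) > threshold_bytes


if __name__ == "__main__":
    # Example values, made up for illustration only.
    primary = "0/5000000"
    replica = "0/3000060"
    print(lag_bytes(primary, replica), should_alert(primary, replica))
```

The same difference can be computed server-side with `pg_wal_lsn_diff()`; doing it in a small script like this is only convenient when the two positions come from separate connections or from an exporter.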