zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License
4.35k stars 980 forks

constant `Archive % does not exist.` when a standby instance #2455

Open dennislapchenko opened 1 year ago

dennislapchenko commented 1 year ago

When the instance is promoted to master, these logs disappear, of course. We have seen similar logs when all instances in a cluster were replicas and couldn't acquire the leader (old Endpoints in play).

Maybe this log is not normal? But I have spent A LOT of time going through all the issues where these history and archive files aren't there.

SNThrailkill commented 1 year ago

I'm seeing this exact issue too. I'm unsure whether there is a workaround. I find that if I go back far enough I can restore, but it keeps recurring.

rocket357 commented 12 months ago

Unless I'm misunderstanding the issue, this is the PostgreSQL instance looking for the next, not-yet-available WAL file. It doesn't exist yet because the primary hasn't pushed it to the bucket. Once the primary has pushed it, the standby will find it, apply it, and start logging that the following WAL file can't be found.

Unless I'm totally off the mark, that seems to be what is going on. Perhaps quieting down the logging on it could help, but I think it's working as intended (assuming someone associated with the project can confirm my understanding).
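
To make "the next WAL file" concrete, here is a minimal sketch of how the next segment name follows from the current one. It assumes the standard 24-hex-digit WAL file name (8 digits timeline, 8 digits log number, 8 digits segment number) and the default 16 MB segment size, which gives 256 segments per log number. The function name `next_wal_segment` is made up for illustration and is not part of PostgreSQL or the operator.

```python
def next_wal_segment(name: str, segments_per_log: int = 256) -> str:
    """Return the WAL segment name that follows `name` on the same timeline.

    Assumes the default 16 MB wal_segment_size, i.e. 256 segments per log number.
    """
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL segment name")
    timeline = name[:8]
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)

    seg += 1
    if seg >= segments_per_log:
        # Segment counter wraps around and the log number increments.
        log, seg = log + 1, 0
    return f"{timeline}{log:08X}{seg:08X}"


if __name__ == "__main__":
    print(next_wal_segment("000000010000000000000003"))
    print(next_wal_segment("0000000100000000000000FF"))
```

If the standby complains about one segment and, after the primary archives it, starts complaining about the name this function returns, that matches the behavior described above.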

FxKu commented 11 months ago

Sounds like what @rocket357 already pointed out. Since this is a standby cluster, it only streams changes from the WAL archive, and the file is not there (anymore?). If the standby cluster is broken anyway, have you tried to set up a new one, @dennislapchenko? Does the streaming work in general?

Do you have retention in place for your WAL archive that could explain a missing history file? What was the series of events leading up to this situation?

rocket357 commented 11 months ago

An easy check to see if this is broken (or working as intended) is to create a dummy user in the primary database your standby is pulling WALs from, then drop the dummy user. Once archive_timeout has passed, you should see the WAL file the standby was complaining about finally found, downloaded, and applied, and the logged error should then switch to the next expected WAL file. That means the standby picked up the expected WAL file and moved on, but since the primary hasn't archived the following one yet, the standby starts logging that this one can't be found. This is working as intended.

If you write to the primary (e.g. create/drop a dummy user) and the standby doesn't change which WAL file it is complaining about after archive_timeout seconds pass, something is actually broken in the standby.
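
As a rough way to automate that check, the sketch below scans standby log lines and reports whether the missing segment name keeps advancing. The log pattern is an assumption modeled on the message in the issue title (`Archive '<name>' does not exist`), so adjust the regex to whatever your logs actually print. The helper names `missing_segments` and `is_advancing` are hypothetical.

```python
import re

# Assumed message format, modeled on the issue title; only full 24-hex-digit
# WAL segment names match, so timeline ".history" lookups are ignored.
MISSING_RE = re.compile(r"Archive '([0-9A-F]{24})' does not exist")


def missing_segments(log_lines):
    """Return the WAL segments reported missing, collapsing consecutive repeats."""
    segments = []
    for line in log_lines:
        match = MISSING_RE.search(line)
        if match and (not segments or segments[-1] != match.group(1)):
            segments.append(match.group(1))
    return segments


def is_advancing(log_lines):
    """True if at least two different segments were reported, in increasing order."""
    segments = missing_segments(log_lines)
    # Fixed-width uppercase hex names compare correctly as plain strings.
    return len(segments) >= 2 and all(a < b for a, b in zip(segments, segments[1:]))


if __name__ == "__main__":
    sample = [
        "ERROR: Archive '000000010000000000000003' does not exist.",
        "ERROR: Archive '000000010000000000000003' does not exist.",
        "ERROR: Archive '000000010000000000000004' does not exist.",
    ]
    print(missing_segments(sample))
    print("advancing" if is_advancing(sample) else "stuck on one segment")
```

Feed it the standby's log lines from before and after the dummy write: if it reports the segment as stuck well past archive_timeout, that points to the broken case described above.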