constant `Archive % does not exist.` when a standby instance

dennislapchenko commented 1 year ago

Please, answer some short questions which should help us to understand your problem / question better?

Which image of the operator are you using? e.g. postgres-operator-1.10:stable
Where do you run it - cloud or metal? Kubernetes or OpenShift? EKS/GKE
Are you running Postgres Operator in production? yet

Type of issue? When an instance is in standby, after a successful bootstrap it constantly prints logs like these:

gitlab-db-0 postgres 2023-10-20 15:32:34,871 INFO: no action. I am (gitlab-db-0), the standby leader with the lock
gitlab-db-0 postgres ERROR: 2023/10/20 15:32:35.291547 Archive '000000010000000A000000DA' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:32:35.960781 Archive '00000002.history' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:32:39.279282 Archive '000000010000000A000000DA' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:32:40.207244 Archive '00000002.history' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:32:44.311141 Archive '000000010000000A000000DA' does not exist.
gitlab-db-0 postgres 2023-10-20 15:32:44,667 INFO: no action. I am (gitlab-db-0), the standby leader with the lock
gitlab-db-0 postgres ERROR: 2023/10/20 15:32:45.226837 Archive '00000002.history' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:32:49.296045 Archive '000000010000000A000000DA' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:32:50.111457 Archive '00000002.history' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:32:54.326532 Archive '000000010000000A000000DA' does not exist.
gitlab-db-0 postgres 2023-10-20 15:32:54,713 INFO: no action. I am (gitlab-db-0), the standby leader with the lock
gitlab-db-0 postgres ERROR: 2023/10/20 15:32:55.124100 Archive '00000002.history' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:32:59.275486 Archive '000000010000000A000000DA' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:33:00.092755 Archive '00000002.history' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:33:04.019019 Archive '000000010000000A000000DA' does not exist.
gitlab-db-0 postgres 2023-10-20 15:33:04,709 INFO: no action. I am (gitlab-db-0), the standby leader with the lock
gitlab-db-0 postgres ERROR: 2023/10/20 15:33:04.811841 Archive '00000002.history' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:33:09.324308 Archive '000000010000000A000000DA' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:33:09.882814 Archive '00000002.history' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:33:14.333704 Archive '000000010000000A000000DA' does not exist.
gitlab-db-0 postgres 2023-10-20 15:33:14,711 INFO: no action. I am (gitlab-db-0), the standby leader with the lock
gitlab-db-0 postgres ERROR: 2023/10/20 15:33:14.875350 Archive '00000002.history' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:33:19.440107 Archive '000000010000000A000000DA' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:33:20.276423 Archive '00000002.history' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:33:24.113552 Archive '000000010000000A000000DA' does not exist.
gitlab-db-0 postgres 2023-10-20 15:33:24,713 INFO: no action. I am (gitlab-db-0), the standby leader with the lock
gitlab-db-0 postgres ERROR: 2023/10/20 15:33:24.913769 Archive '00000002.history' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:33:29.340052 Archive '000000010000000A000000DA' does not exist.
gitlab-db-0 postgres ERROR: 2023/10/20 15:33:30.152439 Archive '00000002.history' does not exist.

When the instance is promoted to master, these logs disappear, of course. We have seen similar logs when all instances in a cluster were replicas and couldn't acquire the leader lock (old Endpoints in play).

Maybe this log is not normal? I have spent A LOT of time looking through all the issues where these history and archive files aren't there.

Some general remarks when posting a bug report:

Please, check the operator, pod (Patroni) and postgresql logs first. When copy-pasting many log lines please do it in a separate GitHub gist together with your Postgres CRD and configuration manifest.
If you feel this issue might be more related to the Spilo docker image or Patroni, consider opening issues in the respective repos.

SNThrailkill commented 1 year ago

I'm seeing this exact issue too. Unsure if there is a workaround or anything. I find that if I go back far enough I can restore, but it keeps recurring.

rocket357 commented 12 months ago

Unless I'm misunderstanding the issue, this is the postgresql instance looking for the next, not-yet-available WAL file. It doesn't exist yet because it hasn't been pushed by the primary to the bucket. Once it has been, the standby will find it, apply it, and start logging that the next WAL file can't be found.

Unless I'm totally off the mark, that seems to be what is going on. Perhaps quieting down the logging on it could help, but I think it's working as intended (assuming someone associated with the project can confirm my understanding).
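
To make the "next expected WAL file" idea concrete, here is a minimal sketch of how the two names in the log above relate to each other. It assumes the default 16 MB WAL segment size; the function names are made up for illustration and are not part of Patroni, Spilo, or the operator. A WAL segment name is 24 hex digits (timeline, log number, segment number), and the `.history` file the standby keeps asking for is the one for the next timeline, which only exists after a promotion.

```python
import re

# A WAL segment file name is 24 upper-case hex digits:
# 8 for the timeline, 8 for the log number, 8 for the segment number.
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def next_wal_segment(name: str, wal_segment_size: int = 16 * 1024 * 1024) -> str:
    """Return the name of the WAL segment that follows `name`.

    Assumes the default 16 MB segment size unless told otherwise.
    """
    if not WAL_NAME.match(name):
        raise ValueError(f"not a WAL segment name: {name!r}")
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    # Each log number covers 4 GiB of WAL, so with 16 MB segments the
    # segment number runs from 0x00 to 0xFF before the log number increments.
    segments_per_log = 0x100000000 // wal_segment_size
    seg += 1
    if seg >= segments_per_log:
        seg = 0
        log += 1
    return f"{timeline:08X}{log:08X}{seg:08X}"


def next_timeline_history(name: str) -> str:
    """Return the history file name for the timeline after the one in `name`."""
    timeline = int(name[0:8], 16)
    return f"{timeline + 1:08X}.history"


if __name__ == "__main__":
    current = "000000010000000A000000DA"
    print(next_wal_segment(current))       # the segment expected after this one
    print(next_timeline_history(current))  # the history file of the next timeline
```

With the segment from the log, `000000010000000A000000DA`, the history file requested alongside it is `00000002.history`, which matches the second repeating error line: the standby is on timeline 1 and is checking whether a timeline 2 has appeared.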

FxKu commented 11 months ago

Sounds like what @rocket357 already pointed out. Since this is a standby cluster, it only streams changes from the WAL archive, and the file is not there (anymore?). If the standby cluster is broken anyway, have you tried to set up a new one, @dennislapchenko? Does the streaming work in general?

Do you have retention in place for your WAL archive that could explain a missing history file? What was the series of events leading up to this situation?

rocket357 commented 11 months ago

An easy check to see if this is broken (or working as intended) is to create a dummy user in the primary database your standby is pulling WALs from, then drop the dummy user. Once archive_timeout has passed, you should see the WAL file the standby is complaining about finally found, downloaded, and applied, and the logged error should then switch to the next expected WAL file. That means the standby picked up the expected WAL file and moved on to the next one, but since the primary hasn't archived that one yet, the standby will start logging that it can't be found. This is working as intended.

If you write to the primary (i.e. create/drop a dummy user) and it doesn't update what WAL file it is complaining about after archive_timeout seconds pass, something is actually broken in the standby.
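
The check described above can also be done on a captured log instead of by eye. The following is a rough sketch, assuming the log lines look like the ones in the original report; the helper names are hypothetical. It collects the WAL segment names from the `does not exist` errors and reports whether the name ever changes. The `.history` lines are ignored on purpose, because that file stays missing until a promotion happens.

```python
import re

# Matches only WAL segment names (24 hex digits), not '<timeline>.history'.
MISSING = re.compile(r"Archive '([0-9A-F]{24})' does not exist\.")


def missing_segments(log_text: str) -> list[str]:
    """Return the missing WAL segment names in order, collapsing repeats."""
    seen: list[str] = []
    for match in MISSING.finditer(log_text):
        name = match.group(1)
        if not seen or seen[-1] != name:
            seen.append(name)
    return seen


def standby_is_advancing(log_text: str) -> bool:
    """True if the standby moved on to a different WAL segment in this log."""
    return len(missing_segments(log_text)) > 1


if __name__ == "__main__":
    import sys

    text = sys.stdin.read()
    print("segments requested:", missing_segments(text))
    print("advancing:", standby_is_advancing(text))
```

This only tells you something if the captured log covers a write on the primary plus at least archive_timeout seconds afterwards. The one-minute excerpt in the original report shows a single segment, which on its own is consistent with an idle primary.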

zalando / postgres-operator

constant `Archive % does not exist.` when a standby instance #2455