juanignacionavarro8 opened this issue 1 year ago
I don't know if this fixes your problem, but I saw here that the WALG_SSH_PREFIX parameter includes a ":" after the IP address, and I can't see that in your config.
So maybe it should be: WALG_SSH_PREFIX: "ssh://A.A.A.A:/home/administrator/zalando"
Hi Tim, thanks for your reply!
In the variable WALG_SSH_PREFIX, the value A.A.A.A.A stands in for the IP of the VM I am using. I don't think that is the error: while getting that variable set correctly I watched different errors appear in the files under /home/postgres/pgdata/pgroot/pg_log/*.log until I achieved correct communication between my postgres pods and the VM where I store the WAL logs.
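For anyone following the same steps, this is roughly how I kept an eye on those errors; the log path is the one mentioned above, and the exact file names depend on log rotation inside the pod:

```bash
# Inside the affected postgres pod: show the most recent entries of the
# Postgres/WAL-G logs while adjusting the SSH-related variables.
# (File names depend on log rotation; *.log matches them all.)
tail -n 100 /home/postgres/pgdata/pgroot/pg_log/*.log
```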
I did not find the root cause of this error, but I was able to fix my cluster and get it back up and running. To fix it, you have to restore a backup through WAL-G (or WAL-E, for those who use it). To do this, I listed the available backups from inside the postgres pod that had problems, using the command:
envdir "/run/etc/wal-e.d/env" wal-g backup-list
Since the postgres data inside a pod is located (by default) in the path /home/postgres/pgdata/pgroot, I started by creating a backup of the directory contents with the command:
cp -r /home/postgres/pgdata/pgroot /home/postgres/pgroot-backup
I then obtained one of the backups I had listed above by running:
wal-g backup-fetch pgdata-backup
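For completeness, as far as I know backup-fetch takes a destination directory and a backup name. A minimal sketch, assuming the same envdir wrapper and /home/postgres/pgdata-backup as the destination used in the next step; LATEST is a placeholder for the newest backup, or use a name printed by backup-list:

```bash
# Fetch a base backup into a scratch directory inside the pod.
# LATEST selects the most recent backup; replace it with a specific
# backup name from `wal-g backup-list` if needed.
envdir "/run/etc/wal-e.d/env" wal-g backup-fetch /home/postgres/pgdata-backup LATEST
```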
Next, I replaced the previous data with the new data (corresponding to the backup) using the command:
cp -r /home/postgres/pgdata-backup/* /home/postgres/pgdata/pgroot/
Once that finishes, the Patroni member for that pod has to be reinitialized using the command:
patronictl reinit
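In its full form, patronictl reinit expects the cluster name and the member to reinitialize. A sketch, assuming the cluster name "postgresql" from my manifest and a hypothetical member name:

```bash
# Reinitialize the failed member from the current leader.
# "postgresql" is the cluster name from my manifest; "postgresql-1" is a
# hypothetical member name -- use the pod that had the problem.
patronictl reinit postgresql postgresql-1
```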
Then you can watch the failed instance rejoin the PostgreSQL cluster and take one of the roles defined in my cluster (using the command "patronictl list", my cluster in a healthy state shows a leader, a synchronous replica with the role "Sync Standby", and an asynchronous replica with the role "Replica").
To sum up for my cluster configuration: you can restore a backup using the tool that manages the WAL logs and reinitialize the Patroni member for that instance to bring the PostgreSQL cluster back to a consistent and healthy state. The trade-off is losing the data written between the last daily backup and the moment the postgres pods lost sync with the VM where I store my WAL logs.
Please, answer some short questions which should help us to understand your problem / question better?
The CRD manifest is as follows:

kind: "postgresql"
apiVersion: "acid.zalan.do/v1"
metadata:
  name: "postgresql"
  namespace: "postgres-databases"
spec:
  teamId: "devops-cluster"
  numberOfInstances: 3
  enableMasterLoadBalancer: false
  enableMasterPoolerLoadBalancer: false
  enableReplicaLoadBalancer: false
  enableReplicaPoolerLoadBalancer: false
  enableConnectionPooler: true
  enableReplicaConnectionPooler: false
  tolerations:
  postgresql:
    version: "15"
  patroni:
    synchronous_mode: true
    synchronous_mode_strict: true
    synchronous_node_count: 1
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    pg_hba:
The ConfigMap to configure some variables about backups and restore is as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ssh-walg-config
data:
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
  BACKUP_SCHEDULE: 00 15 *
  WALG_SSH_PREFIX: "ssh://A.A.A.A/home/administrator/zalando"
  SSH_PORT: "22"
  SSH_USERNAME: "administrator"
  SSH_PRIVATE_KEY_PATH: "/home/postgres/ssh-secret/ssh_secret.pem"
  BACKUP_NUM_TO_RETAIN: "6"
  CLONE_USE_WALG_RESTORE: "true"
  PATRONI_LOG_LEVEL: DEBUG
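For reference, this ConfigMap only reaches the database pods if the operator is told to inject it; in my understanding that is done through the pod_environment_configmap setting of the operator configuration. A minimal sketch, assuming the OperatorConfiguration CRD is used (with a ConfigMap-based operator setup the key name is the same):

```yaml
apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
  name: postgres-operator-configuration
configuration:
  kubernetes:
    # namespace/name of the ConfigMap whose keys are injected as
    # environment variables into every database pod
    pod_environment_configmap: "postgres-databases/ssh-walg-config"
```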
I am running a high availability postgres cluster on a bare metal kubernetes cluster with a master node and 3 worker nodes. Each node of the kubernetes cluster runs in a virtual machine with Ubuntu 20.04.4 LTS.
Over the weekend, the virtual machines are shut down. When the VMs are powered back on on Monday, the database cluster does not return to a consistent, operable state. The log file located in the /home/postgres/pgdata/pgroot/pg_log directory shows the following:
2023-08-07 19:01:58 UTC [17924]: [5-1] 64d13fa5.4604 0 LOG: ending log output to stderr
2023-08-07 19:01:58 UTC [17924]: [6-1] 64d13fa5.4604 0 HINT: Future log output will go to log destination "csvlog".
ERROR: 2023/08/07 19:07:04.066092 Archive '0000000E.history' does not exist.
INFO: 2023/08/07 19:07:08.111082 WAL-prefetch file: 0000000D00000002000000D3
INFO: 2023/08/07 19:07:08.135188 WAL-prefetch file: 0000000D00000002000000D2
INFO: 2023/08/07 19:07:08.135316 WAL-prefetch file: 0000000D00000002000000D1
INFO: 2023/08/07 19:07:08.145629 WAL-prefetch file: 0000000D00000002000000D0
ERROR: 2023/08/07 19:07:08.388779 Archive '0000000D00000002000000D2' does not exist.
ERROR: 2023/08/07 19:07:08.451587 Archive '0000000D00000002000000D3' does not exist.
ERROR: 2023/08/07 19:07:08.475535 Archive '0000000D00000002000000D1' does not exist.
ERROR: 2023/08/07 19:07:08.498712 Archive '0000000D00000002000000D0' does not exist.
ERROR: 2023/08/07 19:07:09.770492 Archive '0000000D00000002000000D0' does not exist.
ERROR: 2023/08/07 19:07:11.349694 Archive '0000000E.history' does not exist.
ERROR: 2023/08/07 19:07:12.782602 Archive '0000000D00000002000000D0' does not exist.
ERROR: 2023/08/07 19:07:14.147067 Archive '0000000E.history' does not exist.
ERROR: 2023/08/07 19:07:17.712170 Archive '0000000D00000002000000D0' does not exist.
INFO: 2023/08/07 19:07:20.641808 WAL-prefetch file: 0000000D00000002000000D0
INFO: 2023/08/07 19:07:20.648000 WAL-prefetch file: 0000000D00000002000000D1
INFO: 2023/08/07 19:07:20.652181 WAL-prefetch file: 0000000D00000002000000D3
INFO: 2023/08/07 19:07:20.652225 WAL-prefetch file: 0000000D00000002000000D2
ERROR: 2023/08/07 19:07:20.788793 Archive '0000000D00000002000000D3' does not exist.
ERROR: 2023/08/07 19:07:20.817984 Archive '0000000D00000002000000D2' does not exist.
ERROR: 2023/08/07 19:07:20.842892 Archive '0000000D00000002000000D0' does not exist.
ERROR: 2023/08/07 19:07:20.873720 Archive '0000000D00000002000000D1' does not exist.
On another virtual machine, with IP A.A.A.A.A, located outside the cluster, where the WAL files and backups are stored, I have WAL records older than the specific WAL I am looking to restore. Below are the latest WAL records stored:
-rw-rw-r-- 1 administrator administrator 477623 Aug 5 23:07 0000000D00000002000000B6.lz4
-rw-rw-r-- 1 administrator administrator 488396 Aug 5 23:37 0000000D00000002000000B7.lz4
-rw-rw-r-- 1 administrator administrator 476858 Aug 6 00:07 0000000D00000002000000B8.lz4
-rw-rw-r-- 1 administrator administrator 504733 Aug 6 00:37 0000000D00000002000000B9.lz4
-rw-rw-r-- 1 administrator administrator 487496 Aug 6 01:08 0000000D00000002000000BA.lz4
-rw-rw-r-- 1 administrator administrator 491863 Aug 6 01:38 0000000D00000002000000BB.lz4
-rw-rw-r-- 1 administrator administrator 509570 Aug 6 02:08 0000000D00000002000000BC.lz4
-rw-rw-r-- 1 administrator administrator 497968 Aug 6 02:38 0000000D00000002000000BD.lz4
-rw-rw-r-- 1 administrator administrator 491471 Aug 6 03:08 0000000D00000002000000BE.lz4
-rw-rw-r-- 1 administrator administrator 497475 Aug 6 03:38 0000000D00000002000000BF.lz4
-rw-rw-r-- 1 administrator administrator 508636 Aug 6 04:08 0000000D00000002000000C0.lz4
-rw-rw-r-- 1 administrator administrator 494385 Aug 6 04:38 0000000D00000002000000C1.lz4
-rw-rw-r-- 1 administrator administrator 470476 Aug 6 05:09 0000000D00000002000000C2.lz4
-rw-rw-r-- 1 administrator administrator 410598 Aug 6 05:39 0000000D00000002000000C3.lz4
-rw-rw-r-- 1 administrator administrator 379164 Aug 6 06:09 0000000D00000002000000C4.lz4
-rw-rw-r-- 1 administrator administrator 380659 Aug 6 06:39 0000000D00000002000000C5.lz4
-rw-rw-r-- 1 administrator administrator 485107 Aug 6 07:09 0000000D00000002000000C6.lz4
-rw-rw-r-- 1 administrator administrator 507628 Aug 6 07:39 0000000D00000002000000C7.lz4
-rw-rw-r-- 1 administrator administrator 487685 Aug 6 08:09 0000000D00000002000000C8.lz4
-rw-rw-r-- 1 administrator administrator 496655 Aug 6 08:40 0000000D00000002000000C9.lz4
-rw-rw-r-- 1 administrator administrator 497773 Aug 6 09:10 0000000D00000002000000CA.lz4
-rw-rw-r-- 1 administrator administrator 497554 Aug 6 09:40 0000000D00000002000000CB.lz4
-rw-rw-r-- 1 administrator administrator 505118 Aug 6 10:10 0000000D00000002000000CC.lz4
-rw-rw-r-- 1 administrator administrator 505563 Aug 6 10:40 0000000D00000002000000CD.lz4
-rw-rw-r-- 1 administrator administrator 478637 Aug 6 11:10 0000000D00000002000000CE.lz4
-rw-rw-r-- 1 administrator administrator 7 Aug 6 11:13 0000000D00000002000000CF.lz4
-rw-rw-r-- 1 administrator administrator 222 Aug 4 15:07 0000000D.history.lz4
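In case it helps with the questions below, these are the checks I would run to compare what the storage actually holds against what the restore is asking for; a sketch, assuming the Spilo defaults for the data directory and a wal-g version that ships the wal-show command:

```bash
# Inside the pod: show which WAL segments wal-g can see on the SSH storage,
# grouped per timeline, to compare against the segments the errors ask for.
envdir "/run/etc/wal-e.d/env" wal-g wal-show

# Timeline the local instance believes it is on (data directory path is the
# Spilo default; adjust it if yours differs, and use the full pg_controldata
# path under /usr/lib/postgresql/15/bin if it is not on PATH).
pg_controldata /home/postgres/pgdata/pgroot/data | grep -i timeline
```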
Why is the restore looking for WAL files newer than the WALs stored on the storage external to the cluster? What could be the reason for this inconsistency? How can I solve this problem?
Thanks in advance for your help; I will be happy to provide more information to help you further detail this situation.