zalando / spilo

Highly available elephant herd: HA PostgreSQL cluster using Docker
Apache License 2.0

WAL-E clone GCS issue #935

Open ggramal opened 12 months ago

ggramal commented 12 months ago

Hello everyone. First, I want to thank you for your great PostgreSQL HA solutions and the Kubernetes operator. Unfortunately, we have an issue with restoring (cloning) from WAL-E backups in GCS.

Environment

Spilo image - ghcr.io/zalando/spilo-15:3.0-p1
Postgres operator - registry.opensource.zalan.do/acid/postgres-operator:v1.10.1

Postgres CRD

kind: postgresql
metadata:
  name: test
spec:
  clone:
    uid: "<UID>"
    cluster: "prod"
    timestamp: "2023-10-05T18:06:52+00:00"
 .....

When the container starts, it logs these errors:

2023-10-10 16:24:59,174 INFO: No PostgreSQL configuration items changed, nothing to reload.
2023-10-10 16:24:59,180 INFO: Lock owner: None; I am test
2023-10-10 16:24:59,213 INFO: trying to bootstrap a new cluster
2023-10-10 16:24:59,213 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-prod" python3 /scripts/clone_with_wale.py --recovery-target-time="2023-10-05T18:06:52+00:00"
2023-10-10 16:24:59,422 INFO: Trying gs://somebucket/spilo/prod/<UID>/wal/15/ for clone
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-list".
        STRUCTURED: time=2023-10-10T16:24:59.724227-00 pid=204
2023-10-10 16:25:00,304 ERROR: Clone failed
Traceback (most recent call last):
  File "/scripts/clone_with_wale.py", line 185, in main
    run_clone_from_s3(options)
  File "/scripts/clone_with_wale.py", line 166, in run_clone_from_s3
    backup_name, update_envdir = find_backup(options.recovery_target_time, env)
  File "/scripts/clone_with_wale.py", line 153, in find_backup
    backup = choose_backup(backup_list, recovery_target_time)
  File "/scripts/clone_with_wale.py", line 74, in choose_backup
    if last_modified < recovery_target_time:
TypeError: can't compare offset-naive and offset-aware datetimes

We analyzed the Spilo source code a bit and found the root cause. The script clone_with_wale.py executes the wal-e backup-list command and parses its output to get the timestamp. The output is returned in this format:

name    last_modified   expanded_size_bytes wal_segment_backup_start    wal_segment_offset_backup_start wal_segment_backup_stop wal_segment_offset_backup_stop
base_00000005000000000000001B_00000040  2021-06-23 01:00:14.498000+00:00        00000005000000000000001B    00000040

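So the timestamp here should be 2021-06-23 01:00:14.498000+00:00, but only the first part (2021-06-23) is used when it is compared to the recovery timestamp. The script splits each line on any whitespace, so the space inside the timestamp breaks it into two fields. The date-only value has no UTC offset, while the recovery target time does, so the comparison fails with this error: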

TypeError: can't compare offset-naive and offset-aware datetimes
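
A minimal reproduction of the parsing problem, as a sketch rather than the actual Spilo code: `HEADER` and `ROW` are assumed tab-separated reconstructions of the backup-list output above (with `expanded_size_bytes` assumed empty), `parse_rows` is a hypothetical helper, and `datetime.fromisoformat` stands in for whatever date parser the script really uses.

```python
import csv
from datetime import datetime

# Assumed tab-separated reconstruction of the backup-list output shown above.
HEADER = "\t".join([
    "name", "last_modified", "expanded_size_bytes",
    "wal_segment_backup_start", "wal_segment_offset_backup_start",
    "wal_segment_backup_stop", "wal_segment_offset_backup_stop",
])
ROW = "\t".join([
    "base_00000005000000000000001B_00000040",
    "2021-06-23 01:00:14.498000+00:00",
    "",  # expanded_size_bytes: assumed empty in the sample row
    "00000005000000000000001B",
    "00000040",
])


def parse_rows(lines, split):
    """Hypothetical helper: normalise each line with `split`, then read it as TSV."""
    normalised = ["\t".join(split(line)) for line in lines]
    return list(csv.DictReader(normalised, dialect="excel-tab"))


# Splitting on any whitespace cuts the timestamp at its inner space.
by_whitespace = parse_rows([HEADER, ROW], lambda line: line.split())
# Splitting on tabs only keeps the timestamp in one field.
by_tab = parse_rows([HEADER, ROW], lambda line: line.split("\t"))

target = datetime.fromisoformat("2023-10-05T18:06:52+00:00")  # offset-aware
truncated = datetime.fromisoformat(by_whitespace[0]["last_modified"])  # offset-naive
full = datetime.fromisoformat(by_tab[0]["last_modified"])  # offset-aware

try:
    _ = truncated < target
    comparison_error = None
except TypeError as exc:
    comparison_error = str(exc)

print(repr(by_whitespace[0]["last_modified"]))
print(repr(by_tab[0]["last_modified"]))
print(comparison_error)
print(full < target)
```

With the whitespace split, the time and offset shift into the `expanded_size_bytes` column, which is why only the date reaches the comparison.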

We fixed this issue by building a custom Spilo image with this patch applied:

diff --git a/postgres-appliance/bootstrap/clone_with_wale.py b/postgres-appliance/bootstrap/clone_with_wale.py
index e8d3196..e6c6b12 100755
--- a/postgres-appliance/bootstrap/clone_with_wale.py
+++ b/postgres-appliance/bootstrap/clone_with_wale.py
@@ -62,7 +62,7 @@ def fix_output(output):
             if started:
                 line = line.replace(' modified ', ' last_modified ')
         if started:
-            yield '\t'.join(line.split())
+            yield '\t'.join(line.split('\t'))

 def choose_backup(backup_list, recovery_target_time):
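
One caveat with this patch, shown as a sketch and not a tested change: `line.split('\t')` leaves a line that contains no tabs as a single field. If some backup tool prints space-separated columns (an assumption, not something verified here), a fallback would be needed. `split_fields` below is a hypothetical helper, not part of Spilo.

```python
def split_fields(line):
    """Hypothetical helper: split on tabs when present, else on runs of whitespace."""
    if "\t" in line:
        # Tab-separated output: keep spaces inside fields and keep empty fields.
        return line.split("\t")
    # Assumed space-separated output: fall back to the original behaviour.
    return line.split()


tab_line = "base_1\t2021-06-23 01:00:14.498000+00:00\t\t0000001B"
space_line = "name  last_modified  expanded_size_bytes"

print(split_fields(tab_line))
print(split_fields(space_line))
```

The fallback only helps when the space-separated timestamps contain no inner space; otherwise the same truncation would occur.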

We can make a PR to fix the issue in the original image, but we are not sure that

ggramal commented 12 months ago

I guess there are other people having the same issue:
https://github.com/zalando/spilo/pull/301#issuecomment-871236466
https://github.com/zalando/spilo/pull/301#issuecomment-1151125706