Hello everyone. First, I want to thank you for your great PostgreSQL HA solutions and Kubernetes operator. Unfortunately, we have an issue with restoring (cloning) from WAL-E backups in GCS.
2023-10-10 16:24:59,174 INFO: No PostgreSQL configuration items changed, nothing to reload.
2023-10-10 16:24:59,180 INFO: Lock owner: None; I am test
2023-10-10 16:24:59,213 INFO: trying to bootstrap a new cluster
2023-10-10 16:24:59,213 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-prod" python3 /scripts/clone_with_wale.py --recovery-target-time="2023-10-05T18:06:52+00:00"
2023-10-10 16:24:59,422 INFO: Trying gs://somebucket/spilo/prod/<UID>/wal/15/ for clone
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "backup-list".
STRUCTURED: time=2023-10-10T16:24:59.724227-00 pid=204
2023-10-10 16:25:00,304 ERROR: Clone failed
Traceback (most recent call last):
  File "/scripts/clone_with_wale.py", line 185, in main
    run_clone_from_s3(options)
  File "/scripts/clone_with_wale.py", line 166, in run_clone_from_s3
    backup_name, update_envdir = find_backup(options.recovery_target_time, env)
  File "/scripts/clone_with_wale.py", line 153, in find_backup
    backup = choose_backup(backup_list, recovery_target_time)
  File "/scripts/clone_with_wale.py", line 74, in choose_backup
    if last_modified < recovery_target_time:
TypeError: can't compare offset-naive and offset-aware datetimes
We analyzed the Spilo source code and found the root cause.
The script clone_with_wale.py runs the wal-e backup-list command and parses its output to get the backup timestamp. The timestamp in that output should be 2021-06-23 01:00:14.498000+00:00, but because fix_output splits each line on any whitespace, only the first part (2021-06-23) is used when it is compared with the recovery target time. The date-only value is parsed into a datetime without a timezone, while the recovery target time carries a UTC offset, so the comparison fails with
TypeError: can't compare offset-naive and offset-aware datetimes
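To make the failure concrete, here is a minimal sketch of the two splitting strategies. It is not the Spilo code: the backup name and WAL segment in `row` are made-up placeholders, only the timestamp and the recovery target time come from this issue, and `datetime.fromisoformat` stands in for whatever parser the script actually uses.

```python
from datetime import datetime

# Hypothetical backup-list row. Fields are tab-separated, and the GCS-style
# timestamp (the one quoted in this issue) contains a space.
row = "base_0001\t2021-06-23 01:00:14.498000+00:00\t000000010000000000000002"


def last_modified(fields):
    """Parse the second column (assumed here to be last_modified)."""
    return datetime.fromisoformat(fields[1])


recovery_target_time = datetime.fromisoformat("2023-10-05T18:06:52+00:00")

whitespace_fields = row.split()   # current behaviour: also splits inside the timestamp
tab_fields = row.split("\t")      # patched behaviour: the timestamp stays whole

naive = last_modified(whitespace_fields)  # date only, so tzinfo is None
aware = last_modified(tab_fields)         # full timestamp with a UTC offset

try:
    naive < recovery_target_time
    comparison_error = None
except TypeError as exc:
    comparison_error = str(exc)

print("whitespace split:", whitespace_fields[1], "| tzinfo:", naive.tzinfo)
print("tab split:       ", tab_fields[1], "| tzinfo:", aware.tzinfo)
print("naive comparison error:", comparison_error)
print("aware comparison works:", aware < recovery_target_time)
```

With the whitespace split, the time and offset land in a separate column, so every following column is shifted by one as well.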
We fixed this issue by building a custom Spilo image with this patch applied:
diff --git a/postgres-appliance/bootstrap/clone_with_wale.py b/postgres-appliance/bootstrap/clone_with_wale.py
index e8d3196..e6c6b12 100755
--- a/postgres-appliance/bootstrap/clone_with_wale.py
+++ b/postgres-appliance/bootstrap/clone_with_wale.py
@@ -62,7 +62,7 @@ def fix_output(output):
             if started:
                 line = line.replace(' modified ', ' last_modified ')
         if started:
-            yield '\t'.join(line.split())
+            yield '\t'.join(line.split('\t'))
 
 
 def choose_backup(backup_list, recovery_target_time):
We can open a PR to fix the issue in the original image, but we are not sure that
this repo is still maintained
this will not break the S3 WAL-E backups (I guess there should be tests in CI that check that)
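One way to probe the second concern before opening a PR is a small regression sketch like the one below. All three sample rows are hypothetical: the S3-style timestamp format and the space-padded variant are assumptions we have not verified against real wal-e output, and `normalize_old` / `normalize_new` are illustrative names for the line before and after the patch.

```python
def normalize_old(line):
    """Current behaviour: split on any run of whitespace, rejoin with tabs."""
    return "\t".join(line.split())


def normalize_new(line):
    """Patched behaviour: split on tabs only, rejoin with tabs."""
    return "\t".join(line.split("\t"))


def field_count(line):
    """Number of tab-separated columns a downstream reader would see."""
    return len(line.split("\t"))


# Hypothetical sample rows (name, last_modified, wal_segment).
samples = {
    "s3_tabbed": "base_0001\t2021-06-23T01:00:14.000Z\t000000010000000000000002",
    "s3_padded": "base_0001  2021-06-23T01:00:14.000Z  000000010000000000000002",
    "gcs_tabbed": "base_0001\t2021-06-23 01:00:14.498000+00:00\t000000010000000000000002",
}

results = {
    name: (field_count(normalize_old(line)), field_count(normalize_new(line)))
    for name, line in samples.items()
}

for name, (old_count, new_count) in results.items():
    print(f"{name}: old={old_count} columns, new={new_count} columns")
```

In this sketch the patch fixes the GCS row and leaves a tab-separated S3 row unchanged, but it would collapse a row whose columns are padded with spaces into a single field. If any backend produces such output, the fix would need to handle both cases.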
Environment
Spilo image: ghcr.io/zalando/spilo-15:3.0-p1
Postgres operator: registry.opensource.zalan.do/acid/postgres-operator:v1.10.1
When the container starts, it logs the errors shown above.