Description:
Base backup uploads to my storage service fail intermittently: the multipart upload is rejected with "InvalidRequest: The request body was too small" (HTTP 400), and the backup only succeeds after several retries. Here are the details of my setup and the logs I'm seeing:
Which image of the operator are you using?
ghcr.io/zalando/spilo-15:3.0-p1
Where do you run it - cloud or metal? Kubernetes or OpenShift?
Bare Metal K8s
Are you running Postgres Operator in production?
yes
Error Logs:
2024-06-18 08:07:34,425 INFO: no action. I am (<redacted>-0), the leader with the lock
ERROR: 2024/06/18 08:07:36.956490 failed to upload 'spilo/<redacted>/a1b0f288-129f-43f1-bd25-3b268451bef8/wal/15/basebackups_005/base_0000000D000007BF000000A3/tar_partitions/part_001.tar.lz4' to bucket '<redacted>': MultipartUpload: upload multipart failed
upload id: 4_z6e98243d0341d44787fc0c18_f229374ab973055a1_d20240618_m080701_c005_v0501020_t0002_u01718698021399
caused by: InvalidRequest: The request body was too small
status code: 400, request id: 26e2e482f367dfc6, host id: aZbw4SjRfZPkz3zHVNCo3PDcZY/Jjajgb
ERROR: 2024/06/18 08:07:36.956505 upload: could not upload 'base_0000000D000007BF000000A3/tar_partitions/part_001.tar.lz4'
ERROR: 2024/06/18 08:07:36.956511 failed to upload 'spilo/<redacted>/a1b0f288-129f-43f1-bd25-3b268451bef8/wal/15/basebackups_005/base_0000000D000007BF000000A3/tar_partitions/part_001.tar.lz4' to bucket '<redacted>': MultipartUpload: upload multipart failed
upload id: 4_z6e98243d0341d44787fc0c18_f229374ab973055a1_d20240618_m080701_c005_v0501020_t0002_u01718698021399
caused by: InvalidRequest: The request body was too small
status code: 400, request id: 26e2e482f367dfc6, host id: aZbw4SjRfZPkz3zHVNCo3PDcZY/Jjajgb
ERROR: 2024/06/18 08:07:36.956527 Unable to continue the backup process because of the loss of a part 1.
2024-06-18 08:07:44,428 INFO: no action. I am (<redacted>-0), the leader with the lock
Temporary Workaround:
The backups succeed after several retries, which I manually trigger using:
envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh /home/postgres/pgdata/pgroot/data
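For now the manual retries could be scripted. Below is a minimal sketch of a retry wrapper with exponential backoff. The function name retry and the variables MAX_ATTEMPTS and BASE_DELAY are hypothetical names I made up for illustration. They are not Spilo or operator settings. Note that this reruns the whole backup command on failure rather than retrying only the failed part.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rerun a command until it succeeds or attempts run out.
# MAX_ATTEMPTS and BASE_DELAY are illustrative names, not Spilo settings.
retry() {
  local max_attempts="${MAX_ATTEMPTS:-5}"
  local delay="${BASE_DELAY:-30}"   # seconds before the first retry
  local attempt=1

  while true; do
    # Run the command exactly as passed in; stop on the first success.
    "$@" && return 0

    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi

    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))            # double the wait after each failure
  done
}

# Usage with the command from the workaround above:
# retry envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh /home/postgres/pgdata/pgroot/data
```

This only papers over the problem from the outside. A retry inside the upload path itself would avoid restarting the whole base backup when a single part fails.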
Request:
Can you add retry logic for such errors when interacting with the storage service API? This would greatly enhance reliability by automatically handling transient issues.
Type of issue?
question / feature request