zalando/postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Backups intermittently fail to upload due to "request body too small" error (wal-g > backblaze s3 storage) #2668

Open · rrrru opened this issue 3 months ago

rrrru commented 3 months ago

Description: Backups intermittently fail to upload to my storage service (Backblaze B2, reached through wal-g's S3 storage backend) and only succeed after several retries. Here are the details of my setup and the logs I'm seeing:

Which image of the operator are you using? ghcr.io/zalando/spilo-15:3.0-p1

Where do you run it - cloud or metal? Kubernetes or OpenShift? Bare Metal K8s

Are you running Postgres Operator in production? yes

Type of issue? question / feature request

Configuration details:

CLONE_METHOD:                     CLONE_WITH_WALG
CLONE_USE_WALG_RESTORE:           true
CLONE_WALG_DISABLE_S3_SSE:        true
CLONE_WALG_DOWNLOAD_CONCURRENCY:  10
USE_WALG_BACKUP:                  true
USE_WALG_RESTORE:                 true
WALG_DISABLE_S3_SSE:              true
BACKUP_NUM_TO_RETAIN:             14
AWS_ENDPOINT:                     https://s3.us-east-005.backblazeb2.com
AWS_REGION:                       us-east-005
...
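
For reference, these variables end up in the wal-g environment directory (/run/etc/wal-e.d/env) that Spilo maintains inside the pod. A quick way to double-check what wal-g actually sees, and whether it can reach the bucket at all, is to exec into the pod and run wal-g through envdir, the same way the manual workaround further below does. This is only a rough sketch: the pod name is a placeholder and the paths are the Spilo defaults from this report.

# exec into the running Spilo pod (placeholder name)
kubectl exec -it <cluster-name>-0 -- bash

# inspect the environment wal-g is started with
ls /run/etc/wal-e.d/env
cat /run/etc/wal-e.d/env/AWS_ENDPOINT

# run wal-g with exactly that environment to confirm it can list backups in the bucket
envdir /run/etc/wal-e.d/env wal-g backup-list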

Error Logs:

2024-06-18 08:07:34,425 INFO: no action. I am (<redacted>-0), the leader with the lock
ERROR: 2024/06/18 08:07:36.956490 failed to upload 'spilo/<redacted>/a1b0f288-129f-43f1-bd25-3b268451bef8/wal/15/basebackups_005/base_0000000D000007BF000000A3/tar_partitions/part_001.tar.lz4' to bucket '<redacted>': MultipartUpload: upload multipart failed
        upload id: 4_z6e98243d0341d44787fc0c18_f229374ab973055a1_d20240618_m080701_c005_v0501020_t0002_u01718698021399
caused by: InvalidRequest: The request body was too small
        status code: 400, request id: 26e2e482f367dfc6, host id: aZbw4SjRfZPkz3zHVNCo3PDcZY/Jjajgb
ERROR: 2024/06/18 08:07:36.956505 upload: could not upload 'base_0000000D000007BF000000A3/tar_partitions/part_001.tar.lz4'
ERROR: 2024/06/18 08:07:36.956511 failed to upload 'spilo/<redacted>/a1b0f288-129f-43f1-bd25-3b268451bef8/wal/15/basebackups_005/base_0000000D000007BF000000A3/tar_partitions/part_001.tar.lz4' to bucket '<redacted>': MultipartUpload: upload multipart failed
        upload id: 4_z6e98243d0341d44787fc0c18_f229374ab973055a1_d20240618_m080701_c005_v0501020_t0002_u01718698021399
caused by: InvalidRequest: The request body was too small
        status code: 400, request id: 26e2e482f367dfc6, host id: aZbw4SjRfZPkz3zHVNCo3PDcZY/Jjajgb
ERROR: 2024/06/18 08:07:36.956527 Unable to continue the backup process because of the loss of a part 1.
2024-06-18 08:07:44,428 INFO: no action. I am (<redacted>-0), the leader with the lock

Temporary Workaround: The backups succeed after several retries, which I manually trigger using:

envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh /home/postgres/pgdata/pgroot/data

Request: Can you add retry logic for such errors when interacting with the storage service API? This would greatly enhance reliability by automatically handling transient issues.
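
In the meantime, a small wrapper that retries the manual trigger a few times approximates this on my side. A rough sketch, using the Spilo paths from the workaround above; the retry count and sleep interval are arbitrary:

#!/bin/bash
# Retry the Spilo base backup a few times before giving up.
# Paths are the Spilo defaults from the workaround above.
set -u
for attempt in 1 2 3 4 5; do
    if envdir /run/etc/wal-e.d/env /scripts/postgres_backup.sh /home/postgres/pgdata/pgroot/data; then
        echo "backup succeeded on attempt ${attempt}"
        exit 0
    fi
    echo "backup attempt ${attempt} failed, retrying in 60s" >&2
    sleep 60
done
echo "backup still failing after ${attempt} attempts" >&2
exit 1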

rrrru commented 3 months ago

Perhaps I chose the wrong place to open this issue and should have filed it against wal-g instead.