zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/

cloning existing cluster - Failed to bootstrap cluster - Clone failed #1358

Open arbe opened 3 years ago

arbe commented 3 years ago


I have WAL backups in S3 and everything is working fine: I can see base backups created twice a day, along with all the related WAL files.

I tried to set up a cluster by cloning an existing one from the backup, declaring a new postgresql manifest:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: mydb-postgresql-clone
spec:
  clone:
    uid: "633b6c86-a4d8-460d-8a54-fbf4128b74c8"
    cluster: "mydb-postgresql"
    timestamp: "2021-02-08T13:05:00+01:00"

  teamId: "mydb"
  volume:
    size: 2Gi
  numberOfInstances: 2
  postgresql:
    version: "13"

Once I create this in the cluster and the Postgres pods are up, I can see from the logs that it can't find the backups:

2021-02-10 13:19:34,231 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-simplrapi-postgresql" python3 /scripts/clone_with_wale.py --recovery-target-time="2021-02-08T13:05:00+01:00"
2021-02-10 13:19:34,407 INFO: Trying s3://some-s3-name/spilo/mydb-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/13/ for clone
INFO: 2021/02/10 13:19:34.544441 No backups found
2021-02-10 13:19:34,547 INFO: Trying s3://some-s3-name/spilo/mydb-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/12/ for clone
INFO: 2021/02/10 13:19:34.678884 No backups found
2021-02-10 13:19:34,683 INFO: Trying s3://some-s3-name/spilo/mydb-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/11/ for clone
INFO: 2021/02/10 13:19:34.861163 No backups found
2021-02-10 13:19:34,870 INFO: Trying s3://some-s3-name/spilo/simplrapi-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/10/ for clone
INFO: 2021/02/10 13:19:34.970748 No backups found
2021-02-10 13:19:34,976 INFO: Trying s3://some-s3-name/spilo/mydb-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/9.6/ for clone
INFO: 2021/02/10 13:19:35.060207 No backups found
2021-02-10 13:19:35,067 INFO: Trying s3://some-s3-name/spilo/mydb-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/9.5/ for clone
INFO: 2021/02/10 13:19:35.170388 No backups found
2021-02-10 13:19:35,178 INFO: Trying s3://postgres-wal-develop/mydb/simplrapi-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/ for clone
INFO: 2021/02/10 13:19:35.285631 No backups found
2021-02-10 13:19:35,292 ERROR: Clone failed

And after 5 attempts:

patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/run/service/patroni: finished with code=1 signal=0
/run/service/patroni: exceeded maximum number of restarts 5
stopping /run/service/patroni
timeout: finish: .: (pid 524) 9s, want down

But it should have found backups in the following directory: s3://some-s3-name/spilo/mydb-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/13/, as it contains a "wal_005/" folder and a "basebackups_005/" folder with backups taken before the given timestamp.

What am I doing wrong?

FxKu commented 3 years ago

Hm, which Spilo version are you using, @arbe ?

arbe commented 3 years ago

Spilo version is registry.opensource.zalan.do/acid/spilo-13:2.0-p2.

Default for postgres-operator v1.6.0 I guess?

FxKu commented 3 years ago

I wonder why there are different WAL backup paths listed in the logs?

What do you have configured for wal_s3_bucket? Can you create a backup manually and see it in the bucket?

envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh "/home/postgres/pgdata/pgroot/data"
arbe commented 3 years ago

I wonder why there are different WAL backup paths listed in the logs?

I'm sorry, this is me doing a terrible job of masking the real names... it is one single WAL backup path all the time.

What do you have configured for log_s3_bucket?

I guess you mean wal_s3_bucket:

...
configuration:
  aws_or_gcp:
    wal_s3_bucket: "postgres-wal-develop"
...

I'm configuring it through the OperatorConfiguration CRD, if that matters.

Can you create a backup manually and see it in the bucket?

I'm getting this error:

root@mydb-postgresql-0:/home/postgres# envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh "/home/postgres/pgdata/pgroot/data"
2021-02-16 10:34:25.057 - /scripts/postgres_backup.sh - I was called as: /scripts/postgres_backup.sh /home/postgres/pgdata/pgroot/data
psql: error: FATAL:  role "root" does not exist
2021-02-16 10:34:25.172 - /scripts/postgres_backup.sh - ERROR: Recovery state unknown: 
arbe commented 3 years ago

Hi again, I just realized I need to run that script manually as the postgres user, so here we go:

postgres@mydb-postgresql-0:~$ envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh "/home/postgres/pgdata/pgroot/data"
2021-02-16 10:40:03.046 - /scripts/postgres_backup.sh - I was called as: /scripts/postgres_backup.sh /home/postgres/pgdata/pgroot/data
2021-02-16 10:40:03.228 - /scripts/postgres_backup.sh - producing a new backup
INFO: 2021/02/16 10:40:03.240638 Doing full backup.
INFO: 2021/02/16 10:40:03.249726 Calling pg_start_backup()
INFO: 2021/02/16 10:40:03.262873 Walking ...
INFO: 2021/02/16 10:40:03.263488 Starting part 1 ...
INFO: 2021/02/16 10:40:03.849450 Finished writing part 1.
INFO: 2021/02/16 10:40:04.235786 Starting part 2 ...
INFO: 2021/02/16 10:40:04.235822 /global/pg_control
INFO: 2021/02/16 10:40:04.236614 Finished writing part 2.
INFO: 2021/02/16 10:40:04.237224 Calling pg_stop_backup()
INFO: 2021/02/16 10:40:06.277327 Starting part 3 ...
INFO: 2021/02/16 10:40:06.292405 backup_label
INFO: 2021/02/16 10:40:06.292481 tablespace_map
INFO: 2021/02/16 10:40:06.292549 Finished writing part 3.
INFO: 2021/02/16 10:40:06.534766 Wrote backup with name base_00000003000000010000003B

I can now see the backup created in S3, at a path like:

s3://postgres-wal-develop/spilo/mydb-postgresql/633b6c86-a4d8-460d-8a54-fbf4128b74c8/wal/13/wal_005/

There I can see two files: 00000003000000010000003B.00000028.backup.lz4 and 00000003000000010000003B.lz4.

FxKu commented 3 years ago

So have you initialized a new clone and checked that it can read this backup? The clone is in the same AWS account, right?

arbe commented 3 years ago

Yes, I've initialized a new clone in the same AWS account, and in the same Kubernetes cluster in fact... the clone manifest is even created in the same namespace as the original cluster.

All I saw were the logs I provided, which do not seem to complain about access to the backups; they just report that no suitable backup is found to clone from.

zagr0 commented 3 years ago

Hi, we are facing the same issue: not able to clone from S3 WALs with operator 1.6.0.

When we try to clone, we get this in the new cluster's logs:

...

2021-03-29 11:28:21,737 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-test-db" python3 /scripts/clone_with_wale.py --recovery-target-time="2021-03-29T10:14:46.000+00:00"
2021-03-29 11:28:22,229 INFO: Trying s3://postgres-operator-backups-test-bucket/spilo/test-db/3076b492-fd12-41c5-9631-6aa5fb2cb249/wal/13/ for clone
2021-03-29 11:28:32,180 INFO: Lock owner: None; I am acid-test-restore-0
2021-03-29 11:28:32,180 INFO: not healthy enough for leader race
2021-03-29 11:28:32,191 INFO: bootstrap in progress
...
INFO: 2021/03/29 11:41:43.771577 No backups found
2021-03-29 11:41:43,775 INFO: Trying s3://postgres-operator-backups-test-bucket/spilo/test-db/3076b492-fd12-41c5-9631-6aa5fb2cb249/wal/12/ for clone
2021-03-29 11:41:52,179 INFO: Lock owner: None; I am acid-test-restore-0
2021-03-29 11:41:52,179 INFO: not healthy enough for leader race
2021-03-29 11:41:52,180 INFO: bootstrap in progress
...

and so on until the timeout. All the files are present in the S3 bucket under the required path:

$ aws s3 ls s3://postgres-operator-backups-test-bucket/spilo/test-db/3076b492-fd12-41c5-9631-6aa5fb2cb249/wal/13/
                           PRE basebackups_005/
                           PRE wal_005/

$ aws s3 ls s3://postgres-operator-backups-test-bucket/spilo/test-db/3076b492-fd12-41c5-9631-6aa5fb2cb249/wal/13/basebackups_005/
                           PRE base_000000A200000001000000F8_00000040/
                           PRE base_000000A200000001000000FA_00000040/
                           PRE base_000000A200000001000000FC_00000040/
                           PRE base_000000A200000001000000FE_00000040/
                           PRE base_000000A50000000200000003_00000040/
                           PRE base_000000A70000000200000009_00000040/
2021-03-26 15:55:48        216 base_000000A200000001000000F8_00000040_backup_stop_sentinel.json
2021-03-27 02:30:05        216 base_000000A200000001000000FA_00000040_backup_stop_sentinel.json
2021-03-28 02:30:06        216 base_000000A200000001000000FC_00000040_backup_stop_sentinel.json
2021-03-29 03:30:06        216 base_000000A200000001000000FE_00000040_backup_stop_sentinel.json
2021-03-29 09:34:20        216 base_000000A50000000200000003_00000040_backup_stop_sentinel.json
2021-03-29 13:31:41        216 base_000000A70000000200000009_00000040_backup_stop_sentinel.json

$ aws s3 ls s3://postgres-operator-backups-test-bucket/spilo/test-db/3076b492-fd12-41c5-9631-6aa5fb2cb249/wal/13/wal_005/
2021-03-29 03:30:04      76707 000000A100000001000000F5.lzo
2021-03-29 08:51:09     142836 000000A100000001000000F7.partial.lzo
2021-03-26 15:54:07       1444 000000A2.history.lzo
2021-03-29 08:51:07    1490515 000000A200000001000000F7.lzo
2021-03-26 15:55:47        312 000000A200000001000000F8.00000028.backup.lzo
2021-03-29 08:51:08     175665 000000A200000001000000F8.lzo
2021-03-29 09:32:37     343698 000000A200000001000000F9.lzo
2021-03-27 02:30:05        305 000000A200000001000000FA.00000028.backup.lzo
2021-03-29 09:34:17      76694 000000A200000001000000FA.lzo
2021-03-29 09:38:44      76707 000000A200000001000000FB.lzo
2021-03-28 02:30:05        305 000000A200000001000000FC.00000028.backup.lzo
2021-03-29 11:38:35      76721 000000A200000001000000FC.lzo
2021-03-29 12:05:25      76711 000000A200000001000000FD.lzo
2021-03-29 09:34:19        305 000000A200000001000000FE.00000028.backup.lzo
2021-03-29 12:55:28      76694 000000A200000001000000FE.lzo
2021-03-29 13:30:39      76707 000000A200000001000000FF.lzo
2021-03-29 08:51:08      76625 000000A20000000200000000.lzo
2021-03-29 08:51:10      76608 000000A20000000200000001.partial.lzo
2021-03-29 08:51:07       1454 000000A3.history.lzo
2021-03-29 09:07:01    1274300 000000A30000000200000001.partial.lzo
2021-03-29 09:06:59       1471 000000A4.history.lzo
2021-03-29 09:32:37    2392151 000000A40000000200000001.lzo
2021-03-29 09:32:38    1488224 000000A40000000200000002.partial.lzo
2021-03-29 09:32:37       1478 000000A5.history.lzo
2021-03-29 09:33:34    1198921 000000A50000000200000002.lzo
2021-03-29 09:34:19        312 000000A50000000200000003.00000028.backup.lzo
2021-03-29 09:34:18     176675 000000A50000000200000003.lzo
2021-03-29 10:04:18     376483 000000A50000000200000004.lzo
2021-03-29 12:04:18      79851 000000A50000000200000005.lzo
2021-03-29 12:34:19      88923 000000A50000000200000006.lzo
2021-03-29 13:11:12     100914 000000A50000000200000007.partial.lzo
2021-03-29 13:11:10       1496 000000A6.history.lzo
2021-03-29 13:30:37    1262131 000000A60000000200000007.lzo
2021-03-29 13:30:40      76726 000000A60000000200000008.partial.lzo
2021-03-29 13:30:39       1505 000000A7.history.lzo
2021-03-29 13:31:05     957392 000000A70000000200000008.lzo
2021-03-29 13:31:41        310 000000A70000000200000009.00000028.backup.lzo
2021-03-29 13:31:41     194633 000000A70000000200000009.lzo
2021-03-29 14:01:40     254363 000000A7000000020000000A.lzo
riskingh commented 3 years ago

Hi! I have the same issue, and for some reason the cloned cluster has AWS_INSTANCE_PROFILE=true in /run/etc/wal-e.d/env-clone-..., even though I declare AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in a pod_environment_configmap. Because of this, wal-e backup-list can't see any backups. What is the correct way of passing the access key to a cloned cluster?

EDIT: never mind, it's described in the docs that I need to use clone.s3_access_key_id.
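
For reference, a hedged sketch of what the clone section could look like when the source cluster's WAL location and S3 credentials have to be given explicitly (s3_wal_path, s3_access_key_id and s3_secret_access_key are fields of the clone section; the path and key values below are placeholders, not taken from this thread):

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: mydb-postgresql-clone
spec:
  teamId: "mydb"
  numberOfInstances: 2
  volume:
    size: 2Gi
  postgresql:
    version: "13"
  clone:
    cluster: "mydb-postgresql"
    uid: "633b6c86-a4d8-460d-8a54-fbf4128b74c8"
    timestamp: "2021-02-08T13:05:00+01:00"
    # full WAL path of the source cluster, if it differs from the operator's
    # global wal_s3_bucket configuration (placeholder value)
    s3_wal_path: "s3://custom/path/to/bucket"
    # credentials for reading the source bucket (placeholder values)
    s3_access_key_id: "AKIA..."
    s3_secret_access_key: "..."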

dekelmidlink commented 3 years ago

Same for me

ienliven commented 2 years ago

Any updates? I am facing the same problem.

munjalpatel commented 2 years ago

I am also running into the same issue. Any updates here?

Grey1406 commented 2 years ago

We also encountered a similar problem when creating a cloned cluster.

The documentation (https://postgres-operator.readthedocs.io/en/latest/user/#how-to-clone-an-existing-postgresql-cluster) says:

If your source cluster uses a WAL location different from the global configuration you can specify the full path under s3_wal_path. For [Google Cloud Platform](https://postgres-operator.readthedocs.io/en/latest/administrator/#google-cloud-platform-setup) or [Azure](https://postgres-operator.readthedocs.io/en/latest/administrator/#azure-setup) it can only be set globally with [custom Pod environment variables](https://postgres-operator.readthedocs.io/en/latest/administrator/#custom-pod-environment-variables) or locally in the Postgres manifest's [env](https://postgres-operator.readthedocs.io/en/latest/administrator/#via-postgres-cluster-manifest) section.

That is, a section like this should be quite enough, as long as the backups themselves work:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster-clone
spec:
  clone:
    uid: "efd12e58-5786-11e8-b5a7-06148230260c"
    cluster: "acid-minimal-cluster"
    timestamp: "2017-12-19T12:40:33+01:00"

But practice shows that this is not so: for cloning to work, it is also mandatory to pass CLONE_* variables through custom pod environment variables (see the ConfigMap sketch at the end of this comment): https://postgres-operator.readthedocs.io/en/latest/administrator/#custom-pod-environment-variables

This raises the question: is something misconfigured in my operator or cluster configuration? Or am I doing everything right, and this requirement is simply not stated explicitly in the documentation?

P.S. With the CLONE_* variables specified, cloning works flawlessly.

P.P.S. If I described something unclearly, please forgive my English.
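
A minimal sketch of the ConfigMap mentioned above, assuming WAL-E/WAL-G on S3; the ConfigMap name and all values are placeholders, and the exact set of CLONE_* variables a given Spilo version honours should be checked against its documentation:

apiVersion: v1
kind: ConfigMap
metadata:
  # name referenced by pod_environment_configmap in the operator configuration
  name: postgres-pod-config
  namespace: default
data:
  # settings of the *source* cluster, repeated with the CLONE_ prefix (placeholder values)
  CLONE_WAL_S3_BUCKET: "postgres-wal-develop"
  CLONE_AWS_REGION: "eu-central-1"
  CLONE_AWS_ACCESS_KEY_ID: "AKIA..."
  CLONE_AWS_SECRET_ACCESS_KEY: "..."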

maaft commented 1 year ago

And what do I do when using an Azure backup location? I can't find any docs about Azure-specific CLONE_* variables.

Kapersyx commented 1 year ago

The documentation is split into parts: the guide to cloning is under the user section, and a note about having to prepend env vars with CLONE_ is under the administration section: https://postgres-operator.readthedocs.io/en/latest/administrator/#restoring-physical-backups
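
Regarding the Azure question above, the same CLONE_ prefix pattern presumably applies, e.g. via the cluster manifest's env section. A hedged sketch only: the unprefixed names are the usual WAL-G Azure variables (WALG_AZ_PREFIX, AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_ACCESS_KEY), and their CLONE_-prefixed counterparts are an assumption rather than something the docs spell out; all values are placeholders.

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster-clone
spec:
  ...
  env:
    # assumed CLONE_-prefixed counterparts of the WAL-G Azure variables
    - name: CLONE_WALG_AZ_PREFIX
      value: "azure://my-container/spilo/source-cluster/<uid>/wal/13"
    - name: CLONE_AZURE_STORAGE_ACCOUNT
      value: "mystorageaccount"
    - name: CLONE_AZURE_STORAGE_ACCESS_KEY
      value: "..."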