zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Disaster Recovery #321

Open aamederen opened 6 years ago

aamederen commented 6 years ago

As far as I understand, the operator boots Spilo containers which have wal-e installed. Therefore, it would be useful to be able to provide wal-e parameters to make it ship periodic backups and wal to an object store like S3.

It would be nice to be able to:

alexeyklyukin commented 6 years ago

define the log shipping target and necessary info/credentials in a secret

There is a WAL S3 bucket parameter defined in the operator configuration. No secrets are supported yet, although WAL-E relies on the instance profile, and there is support for the https://github.com/jtblin/kube2iam project in the form of an annotation that the operator supplies for the pods (controlled by the kube_iam_role parameter). See https://github.com/zalando-incubator/postgres-operator/blob/master/docs/operator_parameters.md for details on the operator parameters. Support for defining custom secrets for the pods is coming up, which would allow you to define AWS variables (like AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY).

define backup period to let postgres operator trigger periodic backups automatically
define a timespan for keeping backups and wal

Backups can be configured in Spilo (there is no need to trigger them from the operator). However, the operator currently doesn't make them configurable, deferring to the Spilo defaults of once a day for BACKUP_SCHEDULE and 2 for BACKUP_NUM_TO_RETAIN. See https://github.com/zalando/spilo/blob/master/postgres-appliance/scripts/configure_spilo.py.

and the operator could delete them automatically after they are old enough

The operator will not do anything with the backups; this is the task of WAL-E, called from Spilo (the operator is just an engine that sets environment variables for Spilo and creates/maintains Kubernetes objects). WAL-E already removes backups and WAL segments older than BACKUP_NUM_TO_RETAIN.

NB: as of yesterday, the Spilo behavior has been changed to treat BACKUP_NUM_TO_RETAIN as a number of days instead of a number of backups.
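To illustrate why that change matters, here is a minimal sketch of the two interpretations of a retention value. It is not Spilo's or WAL-E's actual code; the function names `keep_by_count` and `keep_by_days` are hypothetical, and backups are modeled simply as completion timestamps.

```python
from datetime import datetime, timedelta

def keep_by_count(backups, num_to_retain):
    """Count-based retention: keep only the newest `num_to_retain` backups."""
    if num_to_retain <= 0:
        return []
    return sorted(backups)[-num_to_retain:]

def keep_by_days(backups, days_to_retain, now):
    """Day-based retention: keep every backup taken within the last `days_to_retain` days."""
    cutoff = now - timedelta(days=days_to_retain)
    return sorted(b for b in backups if b >= cutoff)

# Hypothetical example: one backup every 12 hours over the last 5 days.
now = datetime(2019, 1, 10)
backups = [now - timedelta(hours=12 * i) for i in range(1, 11)]

print(len(keep_by_count(backups, 2)))      # 2 backups survive
print(len(keep_by_days(backups, 2, now)))  # 4 backups survive
```

With one backup per day the two readings give similar results; with a more frequent schedule, the same value of 2 retains very different amounts of data.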

Jan-M commented 6 years ago

As the word backup is also ambiguous, logical backups are not handled currently and we are not of the opinion the operator should be responsible here. We use Kubernetes jobs to do this. At some point though the operator may in fact create the K8S jobs but will not execute the backups or trigger them.

aelbarkani commented 5 years ago

There is the possibility to use the embedded wal-e for backup/restore, as @alexeyklyukin said. You can customize the env variables: https://github.com/zalando-incubator/postgres-operator/pull/152. It would be great if we could use secrets instead of configmaps, though. Is there any ETA for this feature?

Jan-M commented 5 years ago

I don't think we have an ETA for this now, the AWS credentials should imho come from kube2iam or similar solution that could do the setup. Is this not possible in your case?

excavador commented 5 years ago

Hello,

Thank you for the great operator.

Are there any options to configure BACKUP_SCHEDULE and BACKUP_NUM_TO_RETAIN on the operator side yet? If not, can you provide an ETA?

With best regards, Oleg

excavador commented 5 years ago

Hello again,

Do you have any updates on the following question? https://github.com/zalando/postgres-operator/issues/321#issuecomment-480395543

FxKu commented 5 years ago

@excavador have you seen that support for logical backups was merged recently (#442)? It uses a K8s CronJob resource to create backups and push them to S3.

excavador commented 5 years ago

@FxKu @Jan-M

Dear developers,

I am very confused.

My actual problem/task:

A. I need to take backups as often as I want (for instance, every hour).
B. I want to keep as many backups as I want (for instance, indefinitely, or hourly backups for the last week, weekly backups for the last half-year, and monthly backups forever).
C. I want to be able to restore from any backup I choose.
D. Ideally, I want to be able to restore to any specific point in time. WAL-E provides a way to do it: take the latest suitable backup and apply the relevant part of the archived WAL. In this case, I want to keep WAL forever, and full backups are needed only for faster recovery.
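The tiered policy described in B (everything from the last week, one backup per week for the last half-year, one per month forever) can be sketched as follows. This is only an illustration of the request, not something Spilo, WAL-E, or the operator implements; `tiered_keep` is a hypothetical name, the half-year is approximated as 182 days, and backups are modeled as timestamps.

```python
from datetime import datetime, timedelta

def tiered_keep(backups, now):
    """Return the backups to keep under the hourly/weekly/monthly policy."""
    keep = set()
    seen_weeks = set()
    seen_months = set()
    for b in sorted(backups):
        age = now - b
        # Tier 1: keep every backup from the last 7 days.
        if age <= timedelta(days=7):
            keep.add(b)
        # Tier 2: keep the first backup of each ISO week in the last half-year.
        week = tuple(b.isocalendar()[:2])
        if age <= timedelta(days=182) and week not in seen_weeks:
            seen_weeks.add(week)
            keep.add(b)
        # Tier 3: keep the first backup of each calendar month, forever.
        month = (b.year, b.month)
        if month not in seen_months:
            seen_months.add(month)
            keep.add(b)
    return sorted(keep)

# Hypothetical example: 400 daily backups ending at `now`.
now = datetime(2019, 7, 1)
backups = [now - timedelta(days=d) for d in range(400)]
kept = tiered_keep(backups, now)
print(len(backups), "->", len(kept))
```

Note that point-in-time recovery between two retained backups would additionally require the WAL archived since the earlier one, which this sketch does not model.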

What we have right now?

  1. old-school backups to S3 using WAL-E
     A. postgres-operator does NOT provide this functionality. Spilo provides a way to configure the backup schedule using the environment variable BACKUP_SCHEDULE. https://github.com/zalando/spilo/blob/master/ENVIRONMENT.rst
     B. postgres-operator does NOT provide this functionality. Spilo internally has the option BACKUP_NUM_TO_RETAIN, which is missing from the documentation. It partially addresses my request, but not fully. https://github.com/zalando/spilo/blob/96813721e9b23cbe9d2511ffdf93db2a4d73f7ed/postgres-appliance/scripts/configure_spilo.py
     C. postgres-operator provides this functionality.
     D. postgres-operator does NOT provide this functionality. Spilo does NOT provide this functionality. WAL-E provides this functionality. https://github.com/zalando/postgres-operator/issues/569

  2. logical backups
     A. postgres-operator provides this functionality.
     B. postgres-operator provides this functionality.
     C. postgres-operator does NOT provide this functionality. It's ridiculous from my point of view: why do we need a backup without the ability to restore from it? https://github.com/zalando/postgres-operator/issues/568
     D. postgres-operator does NOT provide this functionality.

I would be very happy to understand how to achieve my goals :)

Thank you so much!

Jan-M commented 5 years ago

We do not cover this use case completely.

Via Spilo, we currently only support keeping N base backups plus the continuous WAL stream for that time frame. Let's assume 7 days for now. This means PITR and restore from a base backup plus WAL is possible within those days.
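As a rough sketch of what "PITR within these days" implies: a target time is reachable only if some retained base backup was completed at or before it and the archived WAL extends at least up to it. The code below is illustrative only, not how Spilo or WAL-E select a backup; `pick_base_backup` is a hypothetical helper, and base backups are modeled by their completion timestamps.

```python
from datetime import datetime
from typing import List, Optional

def pick_base_backup(base_backups: List[datetime],
                     last_archived_wal: datetime,
                     target: datetime) -> Optional[datetime]:
    """Return the latest base backup completed at or before `target`.

    Returns None when the target is outside the recoverable window:
    either before the oldest retained base backup or after the last
    archived WAL.
    """
    if target > last_archived_wal:
        return None
    candidates = [b for b in base_backups if b <= target]
    return max(candidates) if candidates else None

# Hypothetical example: 7 retained daily base backups taken at 01:00.
base_backups = [datetime(2019, 1, day, 1, 0) for day in range(1, 8)]
last_archived_wal = datetime(2019, 1, 7, 12, 0)

print(pick_base_backup(base_backups, last_archived_wal, datetime(2019, 1, 4, 15, 30)))
print(pick_base_backup(base_backups, last_archived_wal, datetime(2018, 12, 31)))
```

The first call selects the backup from January 4; the second returns None because the target predates the oldest retained base backup.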

Logical backup serves the purpose of storing compressed smaller dumps for a longer period of time while losing point in time recovery capability beyond the 7 days. Restore from logical is not covered yet. I assume this will be another custom bootstrap or a combination of DB + 1 job deployment in the future.

@CyberDem0n can maybe comment on whether your use case of one very old base backup (let's say the initial one plus WAL) is supported and would work. There is a little bit of doubt in my mind when it comes to failover behavior and the continuity of WAL. That's why, for example, Spilo ships a base backup after a promote.

excavador commented 5 years ago

We currently only support via Spilo to keep N basebackups plus the continuous WAL stream for that time frame. Let's assume for now 7 days. This means PITR and restore from basebackup and WAL within these days is possible.

Yes, in theory. In practice, Spilo directly chooses the latest available backup for the specified date (when a cluster is "cloned" from a backup).

Logical backup serves the purpose of storing compressed smaller dumps for a longer period of time while losing point in time recovery capability beyond the 7 days. Restore from logical is not covered yet.

Yeah, it is described in https://github.com/zalando/postgres-operator/issues/568

I assume this will be another custom bootstrap or a combination of DB + 1 job deployment in the future

Ideally, I would prefer to see the ability to "clone" from a logical backup, like we already have for plain backups.

Jan-M commented 5 years ago

Yeah, for the clone I guess I was mixing spec and implementation.

excavador commented 5 years ago

@Jan-M so, in short, I just want the ability to configure backup frequency in the Postgres CRD object manifest, and flexible options to restore (using "clone") in the Postgres CRD object manifest. I don't care how exactly it is implemented, using WAL-E or logical backups, as long as I have an S3 bucket with backups (usable in some way outside the operator, just in case) and the ability to cover A, B, C, D from my comment https://github.com/zalando/postgres-operator/issues/321#issuecomment-494807643 using the Postgres CRD object.

Jan-M commented 5 years ago

Over time the operator will improve in that area, exposing more options in the manifest where it makes sense and leaving some options to be determined at the global level.

We are working on some of these issues, and others are always open for contributions.

excavador commented 5 years ago

Hi! Do you have any updates about https://github.com/zalando/postgres-operator/issues/321#issuecomment-494807643 and https://github.com/zalando/postgres-operator/issues/321#issuecomment-494839746 ?

Best regards, Oleg