zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

DB failover recovery fails when there is too much lag behind the master - how to configure restore_command? #1231

Open vfrans opened 3 years ago

vfrans commented 3 years ago

Hello,

We are running postgres-operator v1.5.0, deploying the Postgres image spilo-12:1.6-p5.

Here is the YAML we use to deploy our cluster (edited for privacy reasons):

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: db-cluster
  namespace: oursoft
spec:
  teamId: "someteam"
  volume:
    size: 1Gi
    storageClass: postgres-storageclass
  numberOfInstances: 3
  users:
    dbuser: []

  databases:
    ourdb: dbowner
    keycloak: keycloak
  patroni:
    pg_hba:
    - host all all all md5

  postgresql:
    version: "12"
    parameters:
      archive_command: "/opt/archive-utils/archive.sh %p"
      shared_buffers: "256MB"
      log_min_duration_statement: "1000"
      log_statement: "all"
      log_connections: "on"
      log_disconnections: "on"
      log_checkpoints: "on"
      checkpoint_timeout: "5min"
      max_connections: "500"
      wal_log_hints: "on"
      random_page_cost: "2"
      wal_keep_segments: "100"

We use the following ConfigMap to configure our operator:

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
  namespace: postgres-operator
data:
  api_port: "8080"
  aws_region: eu-central-1
  cluster_domain: cluster.local
  cluster_history_entries: "1000"
  cluster_labels: application:spilo
  cluster_name_label: cluster-name
  connection_pooler_image: "registry.opensource.zalan.do/acid/pgbouncer:master-8"
  db_hosted_zone: db.example.com
  debug_logging: "true"
  docker_image: {{ global_docker_repo }}/spilo-12:{{ spilo_12_version }}
  enable_master_load_balancer: "false"
  enable_pod_antiaffinity: "true"
  enable_replica_load_balancer: "false"
  enable_teams_api: "false"
  logical_backup_docker_image: "registry.opensource.zalan.do/acid/logical-backup"
  logical_backup_s3_bucket: "my-bucket-url"
  logical_backup_s3_sse: "AES256"
  logical_backup_schedule: "30 00 * * *"
  master_dns_name_format: "{cluster}.{team}.{hostedzone}"
  pdb_name_format: "postgres-{cluster}-pdb"
  pod_deletion_wait_timeout: 10m
  pod_label_wait_timeout: 10m
  pod_management_policy: "ordered_ready"
  pod_role_label: spilo-role
  pod_service_account_name: "postgres-pod"
  pod_terminate_grace_period: 5m
  ready_wait_interval: 3s
  ready_wait_timeout: 30s
  repair_period: 5m
  replica_dns_name_format: "{cluster}-repl.{team}.{hostedzone}"
  replication_username: standby
  resource_check_interval: 3s
  resource_check_timeout: 10m
  resync_period: 30m
  ring_log_lines: "100"
  secret_name_template: "{username}.{cluster}.credentials"
  spilo_privileged: "false"
  super_username: postgres
  watched_namespace: "*"  # listen to all namespaces
  workers: "8"

When we test DB failover, recovery is quick and works well as long as the database is not under heavy load.

However, when the database is under load, we see the secondary fail to resync with the master:

2020-11-16 09:49:10.885 UTC,,,5942,,5fb24b16.1736,2,,2020-11-16 09:49:10 UTC,,0,FATAL,XX000,"could not receive data from WAL stream: ERROR:  requested WAL segment 0000004A000000050000002A has already been removed",,,,,,,,,""
2020-11-16 09:49:15.687 UTC,,,5944,"[local]",5fb24b1b.1738,1,"",2020-11-16 09:49:15 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2020-11-16 09:49:15.688 UTC,"postgres","postgres",5944,"[local]",5fb24b1b.1738,2,"",2020-11-16 09:49:15 UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-11-16 09:49:15.691 UTC,,,5945,"[local]",5fb24b1b.1739,1,"",2020-11-16 09:49:15 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2020-11-16 09:49:15.691 UTC,"postgres","postgres",5945,"[local]",5fb24b1b.1739,2,"",2020-11-16 09:49:15 UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-11-16 09:49:15.873 UTC,,,5946,"[local]",5fb24b1b.173a,1,"",2020-11-16 09:49:15 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2020-11-16 09:49:15.873 UTC,"postgres","postgres",5946,"[local]",5fb24b1b.173a,2,"",2020-11-16 09:49:15 UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-11-16 09:49:15.888 UTC,,,5947,,5fb24b1b.173b,1,,2020-11-16 09:49:15 UTC,,0,LOG,00000,"started streaming WAL from primary at 5/2A000000 on timeline 74",,,,,,,,,""
2020-11-16 09:49:15.888 UTC,,,5947,,5fb24b1b.173b,2,,2020-11-16 09:49:15 UTC,,0,FATAL,XX000,"could not receive data from WAL stream: ERROR:  requested WAL segment 0000004A000000050000002A has already been removed",,,,,,,,,""
2020-11-16 09:49:20.893 UTC,,,5949,,5fb24b20.173d,1,,2020-11-16 09:49:20 UTC,,0,LOG,00000,"started streaming WAL from primary at 5/2A000000 on timeline 74",,,,,,,,,""
2020-11-16 09:49:20.893 UTC,,,5949,,5fb24b20.173d,2,,2020-11-16 09:49:20 UTC,,0,FATAL,XX000,"could not receive data from WAL stream: ERROR:  requested WAL segment 0000004A000000050000002A has already been removed",,,,,,,,,""
2020-11-16 09:49:25.686 UTC,,,5951,"[local]",5fb24b25.173f,1,"",2020-11-16 09:49:25 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2020-11-16 09:49:25.686 UTC,"postgres","postgres",5951,"[local]",5fb24b25.173f,2,"",2020-11-16 09:49:25 UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-11-16 09:49:25.690 UTC,,,5952,"[local]",5fb24b25.1740,1,"",2020-11-16 09:49:25 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2020-11-16 09:49:25.690 UTC,"postgres","postgres",5952,"[local]",5fb24b25.1740,2,"",2020-11-16 09:49:25 UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-11-16 09:49:25.897 UTC,,,5953,,5fb24b25.1741,1,,2020-11-16 09:49:25 UTC,,0,LOG,00000,"started streaming WAL from primary at 5/2A000000 on timeline 74",,,,,,,,,""
2020-11-16 09:49:25.898 UTC,,,5953,,5fb24b25.1741,2,,2020-11-16 09:49:25 UTC,,0,FATAL,XX000,"could not receive data from WAL stream: ERROR:  requested WAL segment 0000004A000000050000002A has already been removed",,,,,,,,,""
2020-11-16 09:49:26.272 UTC,,,5954,"[local]",5fb24b26.1742,1,"",2020-11-16 09:49:26 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2020-11-16 09:49:26.272 UTC,"postgres","postgres",5954,"[local]",5fb24b26.1742,2,"",2020-11-16 09:49:26 UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""

The problem is pretty clear: the needed WAL segment has been recycled by the master and can no longer be streamed.

As you might have noticed, we are using a custom archive_command to archive our WALs (we don't have access to S3, and we weren't able to configure WAL-E to archive WALs to a file system instead of S3; is there any documentation for this use case?).
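For context, WAL-E only targets object stores, but WAL-G also ships a plain file-system backend selected via the WALG_FILE_PREFIX environment variable. Below is a minimal sketch of how such variables could be injected into the Spilo pods through the operator's pod_environment_configmap option; the ConfigMap name, the /archive path and the USE_WALG_* switches are assumptions that would need to be checked against the Spilo image in use, as is whether Spilo then wires WAL-G into the archive/restore commands it generates.

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-environment   # hypothetical name
  namespace: postgres-operator
data:
  # WAL-G file-system backend: archive to / fetch from this directory,
  # which must be a volume shared by all pods of the cluster.
  WALG_FILE_PREFIX: "/archive/db-cluster"
  # Ask Spilo to use WAL-G rather than WAL-E for backup and restore,
  # if the image supports these switches.
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"

and, in the operator ConfigMap shown above:

  pod_environment_configmap: "postgres-operator/postgres-pod-environment"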

We now also need to configure the system so that the restore_command generated for recovery fetches our WALs from that archive location (and not only from the master).

Is this feasible? We tried to configure the restore_command this way:

  postgresql:
    recovery_conf:
      restore_command: '/usr/local/bin/wal-g wal-fetch "%f" "%p" --config /opt/archive-utils/walg-config.json'

But for some reason that doesn't change anything in the generated restore_command.

I realise this might be a question for Spilo/Patroni, but since we are using the operator and don't know how to pass parameters through to them, I ended up posting it here.
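For reference, one way such settings might be passed through is a sketch like the one below, assuming Spilo still honours a SPILO_CONFIGURATION environment variable and merges it into the Patroni configuration it generates (here reusing the hypothetical pod-environment ConfigMap from above). Whether Patroni keeps a user-supplied recovery_conf/restore_command for PostgreSQL 12, where recovery.conf no longer exists, would still need to be verified.

data:
  SPILO_CONFIGURATION: |
    postgresql:
      recovery_conf:
        restore_command: '/usr/local/bin/wal-g wal-fetch "%f" "%p" --config /opt/archive-utils/walg-config.json'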

Thanks a lot in advance for any insight.

Best regards,

François

Samusername commented 3 years ago

wal_keep_segments is normally increased in "requested WAL segment ... has already been removed" cases (see https://stackoverflow.com/questions/47645487/postgres-streaming-replication-error-requested-wal-segment-has-already-been-rem). But I don't know about your restore case, what the root cause is there, or what should be done about it.
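In this setup that would just mean raising the value already present in the cluster manifest's postgresql.parameters, at the cost of extra disk on the primary. The figure below is only an illustration; each segment is 16 MB by default, and on PostgreSQL 13+ the setting is replaced by wal_keep_size.

  postgresql:
    version: "12"
    parameters:
      # Keep more finished WAL segments on the primary so a replica that
      # lags or rejoins after a failover can still stream them.
      wal_keep_segments: "1000"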

haslersn commented 1 year ago

Is it possible to mount a shared filesystem and configure wal-e/wal-g to use that filesystem instead of S3?
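A hedged sketch of that idea, assuming an operator version that supports additionalVolumes and env in the cluster manifest; the ReadWriteMany PVC name, the /archive mount path and the use of WAL-G's file-system backend (WALG_FILE_PREFIX) are illustrative and untested:

spec:
  additionalVolumes:
    - name: wal-archive
      mountPath: /archive
      volumeSource:
        persistentVolumeClaim:
          claimName: wal-archive-shared   # hypothetical shared (RWX) PVC
  env:
    - name: WALG_FILE_PREFIX              # point WAL-G at the shared mount
      value: /archive/db-cluster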