zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Backup fails when sidecars are in use #1281

Open haf opened 3 years ago

haf commented 3 years ago

When sidecars that use iptables to intercept traffic are in use, the logical backup fails. Normally I can configure an operator with a k8s section that overrides the args so the container waits for the Istio sidecar proxy to be up and responsive, but here I haven't found a way to do that short of rebuilding the image. As a result, the logical backup fails when it tries to contact the k8s API server (connection refused), and no retries are implemented (adding them would be another solution).

Possible solutions:

  1. Enable randomised exponential backoff for all network requests from the backup container, so that it automatically waits for the sidecar to be up and running.
  2. Let the operator merge (override) parts of the job spec from the operator configuration manifest, so I can prepend a loop that waits for network connectivity, like so:
        command: ["sh", "-c"]
        args:
        - |
          set -e
          trap "curl --max-time 2 -s -f -XPOST http://127.0.0.1:15000/quitquitquit" EXIT
          while ! curl -s -f http://127.0.0.1:15020/healthz/ready; do sleep 1; done
          exec /dump.sh

Logs from the failing backup container:

$ k logs logical-backup-app-analytics-db-1609633800-586qh -c logical-backup
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 172.16.16.1 port 443: Connection refused
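
Option 1 above (randomised exponential backoff) could look roughly like the following POSIX shell helper. This is only a sketch, not something the backup image currently contains: the function name `retry_with_backoff` and its argument order are hypothetical, and the jitter is drawn with `awk` because plain `sh` has no `$RANDOM`.

```shell
#!/bin/sh
# Hypothetical helper (not part of the operator's backup image):
# retry a command with capped exponential backoff and random jitter.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_S MAX_DELAY_S command [args...]
retry_with_backoff() {
  max_attempts=$1
  base_delay=$2
  max_delay=$3
  shift 3
  attempt=1
  delay=$base_delay
  while :; do
    # Run the command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    # Jitter: sleep a random whole number of seconds in [0, delay].
    jitter=$(awk -v max="$delay" -v seed="$$$attempt" \
      'BEGIN { srand(seed); print int(rand() * (max + 1)) }')
    sleep "$jitter"
    # Double the delay window for the next attempt, up to the cap.
    delay=$((delay * 2))
    if [ "$delay" -gt "$max_delay" ]; then
      delay=$max_delay
    fi
    attempt=$((attempt + 1))
  done
}
```

Wrapped around the sidecar readiness check from the snippet above, a call might be `retry_with_backoff 8 1 30 curl -s -f http://127.0.0.1:15020/healthz/ready`; the attempt count and delays are illustrative values, not tested recommendations.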


haf commented 3 years ago

Ping

cristi-vlad commented 3 years ago

Same issue on my side. If I enable Istio injection, the logical backup fails to reach the API server:

logical-backup-1-1623371400-t9cc2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 10.100.200.1 port 443: Connection refused

cristi-vlad commented 3 years ago

Problem found. The logical backup container starts before the Envoy proxy is ready. Setting values.global.proxy.holdApplicationUntilProxyStarts to true in the IstioOperator solves the problem.
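
For anyone wanting to try the same fix, the setting could be expressed as an IstioOperator overlay roughly like the one below. This is a sketch assuming the `install.istio.io/v1alpha1` IstioOperator API; the function name `render_overlay` is hypothetical, and the setting applies mesh-wide rather than only to the backup job.

```shell
#!/bin/sh
# Hypothetical helper: print an IstioOperator overlay that makes application
# containers wait until the sidecar proxy has started.
render_overlay() {
  cat <<'EOF'
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        # Assumption: applied mesh-wide; start app containers only after the proxy is ready.
        holdApplicationUntilProxyStarts: true
EOF
}
```

The output can be redirected to a file (for example `render_overlay > overlay.yaml`) and passed to `istioctl install -f overlay.yaml`, or merged into an existing IstioOperator manifest.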