zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Technical Issue: Connection refused to standby leader using external etcd #2393

Open PGPAWAN opened 1 year ago

PGPAWAN commented 1 year ago

Issue: When using an external etcd as DCS, the Patroni role shows as standby_leader and connections via NodePort fail with "connection refused".

When using the Kubernetes-internal DCS instead, the role shows as master and external connections work as well.

Environment details:

ETCD

(⎈ |stg-postgres-cluster-v2)| % kubectl get pods -n postgres-etcd
NAME     READY   STATUS    RESTARTS   AGE
etcd-0   1/1     Running   0          26d
etcd-1   1/1     Running   0          26d
etcd-2   1/1     Running   0          26d
etcd-3   1/1     Running   0          26d
etcd-4   1/1     Running   0          26d

PRIMARY DB CLUSTER ( Kubernetes Internal DCS on data center 1 )

(⎈ |stg-postgres-cluster-v2)| % kubectl get pods  -l application=spilo -L spilo-role  -n pg-pgteststage  
NAME               READY   STATUS    RESTARTS   AGE   SPILO-ROLE
pg-pgteststage-0   2/2     Running   0          24d   master
pg-pgteststage-1   2/2     Running   0          24d   replica

STANDBY CLUSTER USING S3 ( External DCS on data center 2)

(⎈ |stg-postgres-cluster-v2)| % kubectl get pods  -l application=spilo -L spilo-role  -n pg-pgteststage2  
NAME                READY   STATUS    RESTARTS   AGE   SPILO-ROLE
pg-pgteststage2-0   2/2     Running   0          16m   standby_leader

Endpoint (not registered):

(⎈ |stg-postgres-cluster-v2)| ~ % kg ep
NAME                   ENDPOINTS   AGE
pg-pgteststage2        <none>      7d17h
pg-pgteststage2-repl   <none>      7d17h

(⎈ |stg-postgres-cluster-v2)| % k describe ep pg-pgteststage2
Name:         pg-pgteststage2
Namespace:    pg-pgteststage2
Labels:       application=spilo
              cluster-name=pg-pgteststage2
              spilo-role=master
              team=pg
Annotations:  <none>
Subsets:
Events:  <none>
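The empty `Subsets:` above is consistent with how the master endpoint gets its address: with the Kubernetes-native DCS, Patroni itself writes the leader pod's IP into the `<cluster-name>` endpoint, while with etcd as DCS the leader key lives only in etcd and nothing populates the Kubernetes endpoint. A minimal sketch of that distinction (an assumption drawn from this report, not operator code; the `kubectl` check in the comment uses the names from this report):

```shell
# Sketch (assumption): whether the <cluster-name> endpoint receives an
# address depends on which DCS Patroni is configured with.
endpoint_addresses_for_dcs() {
  case "$1" in
    kubernetes) echo "written by Patroni (leader IP)" ;;
    etcd)       echo "none - leader key lives in etcd only" ;;
  esac
}

endpoint_addresses_for_dcs kubernetes
endpoint_addresses_for_dcs etcd

# Cluster-side check (needs kubectl access):
#   kubectl get ep pg-pgteststage2 -n pg-pgteststage2 -o yaml
```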

POD LABEL:

(⎈ |stg-postgres-cluster-v2)| % k describe pod pg-pgteststage2-0
Name:         pg-pgteststage2-0
Namespace:    pg-pgteststage2
Priority:     0
Node:         stg-postgres-v2-worker21
Start Time:   Mon, 14 Aug 2023 17:21:55 +0530
Labels:       application=spilo
              cluster-name=pg-pgteststage2
              controller-revision-hash=pg-pgteststage2-6f6c795fcf
              spilo-role=standby_leader
              statefulset.kubernetes.io/pod-name=pg-pgteststage2-0
              team=pg

ETCD LOG:

/ # etcdctl get /service/pg-pgteststage2/members/pg-pgteststage2-0
{"conn_url":"postgres://10.22.82.84:5432/postgres","api_url":"http://10.22.82.84:8008/patroni","state":"running","role":"standby_leader","version":"2.1.4","checkpoint_after_promote":false,"xlog_location":5016387584,"timeline":4}
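The member record confirms the role Patroni registered in etcd. For quick checks, the role can be pulled out of that JSON without extra tooling; a rough sed one-liner over the record above (trimmed to the relevant keys, no jq assumed):

```shell
# Extract the registered role from the etcd member JSON (record from above).
member='{"conn_url":"postgres://10.22.82.84:5432/postgres","state":"running","role":"standby_leader","timeline":4}'
role=$(printf '%s' "$member" | sed -n 's/.*"role":"\([^"]*\)".*/\1/p')
echo "registered role: $role"   # -> registered role: standby_leader
```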

Postgres Operator Log:

time="2023-08-14T11:44:20Z" level=info msg="ADD event has been queued" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=controller worker=4
time="2023-08-14T11:51:43Z" level=info msg="creating a new Postgres cluster" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=controller worker=4
time="2023-08-14T11:51:44Z" level=warning msg="master is not running, generated master endpoint does not contain any addresses" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:45Z" level=info msg="endpoint \"pg-pgteststage2/pg-pgteststage2\" has been successfully created" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:45Z" level=debug msg="final load balancer source ranges as seen in a service spec (not necessarily applied): [\"127.0.0.1/32\"]" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:46Z" level=info msg="master service \"pg-pgteststage2/pg-pgteststage2\" has been successfully created" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:46Z" level=debug msg="final load balancer source ranges as seen in a service spec (not necessarily applied): [\"127.0.0.1/32\"]" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:47Z" level=info msg="replica service \"pg-pgteststage2/pg-pgteststage2-repl\" has been successfully created" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:47Z" level=debug msg="fetching possible additional team members for team \"pg\"" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:47Z" level=debug msg="team API is disabled" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:47Z" level=info msg="users have been initialized" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:47Z" level=info msg="syncing secrets" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:52Z" level=info msg="secrets have been successfully created" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:52Z" level=info msg="pod disruption budget \"pg-pgteststage2/postgres-pg-pgteststage2-pdb\" has been successfully created" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:52Z" level=info msg="standby cluster streaming from WAL location" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:53Z" level=info msg="Mount additional volumes: [{Name:pg-pgteststage2-tls MountPath:/tls SubPath: TargetContainers:[postgres] VolumeSource:{HostPath:nil EmptyDir:nil GCEPersistentDisk:nil AWSElasticBlockStore:nil GitRepo:nil Secret:&SecretVolumeSource{SecretName:pg-pgteststage2-tls,Items:[]KeyToPath{},DefaultMode:*416,Optional:nil,} NFS:nil ISCSI:nil Glusterfs:nil PersistentVolumeClaim:nil RBD:nil FlexVolume:nil Cinder:nil CephFS:nil Flocker:nil DownwardAPI:nil FC:nil AzureFile:nil ConfigMap:nil VsphereVolume:nil Quobyte:nil AzureDisk:nil PhotonPersistentDisk:nil Projected:nil PortworxVolume:nil ScaleIO:nil StorageOS:nil CSI:nil Ephemeral:nil}}]" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:53Z" level=debug msg="created new statefulset \"pg-pgteststage2/pg-pgteststage2\", uid: \"43e81e54-4594-4966-986c-c1774b27a869\"" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:53Z" level=info msg="statefulset \"pg-pgteststage2/pg-pgteststage2\" has been successfully created" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:53Z" level=info msg="waiting for the cluster being ready" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:51:56Z" level=debug msg="Waiting for 1 pods to become ready" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:52:35Z" level=info msg="pods are ready" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:52:35Z" level=info msg="found pod disruption budget: \"pg-pgteststage2/postgres-pg-pgteststage2-pdb\" (uid: \"306d8ba2-7475-41ff-85cd-642bc632246a\")" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:52:35Z" level=info msg="found statefulset: \"pg-pgteststage2/pg-pgteststage2\" (uid: \"43e81e54-4594-4966-986c-c1774b27a869\")" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:52:35Z" level=info msg="found secret: \"pg-pgteststage2/postgres.pg-pgteststage2.credentials.postgresql.acid.zalan.do\" (uid: \"9a5852b5-aaab-49db-b285-db23d02fa728\") namesapce: pg-pgteststage2" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:52:35Z" level=info msg="found secret: \"pg-pgteststage2/standby.pg-pgteststage2.credentials.postgresql.acid.zalan.do\" (uid: \"f2204949-e5b1-4386-96b3-8ed30eb72c98\") namesapce: pg-pgteststage2" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:52:35Z" level=info msg="found secret: \"pg-pgteststage2/pgteststage2.pg-pgteststage2.credentials.postgresql.acid.zalan.do\" (uid: \"34df9c1d-03d4-46b6-ad42-68f928000138\") namesapce: pg-pgteststage2" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:52:35Z" level=info msg="found master endpoint: \"pg-pgteststage2/pg-pgteststage2\" (uid: \"2acf17ac-82e9-408c-85ab-b1b7ec6f7787\")" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:52:35Z" level=info msg="found master service: \"pg-pgteststage2/pg-pgteststage2\" (uid: \"c6df988e-74b7-42e3-b5cf-811c6e11f794\")" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:52:35Z" level=info msg="found replica service: \"pg-pgteststage2/pg-pgteststage2-repl\" (uid: \"cd5acde3-ab36-49b5-b4c0-191fc61d1148\")" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:52:35Z" level=info msg="found pod: \"pg-pgteststage2/pg-pgteststage2-0\" (uid: \"a0ea5a57-938f-4227-b3f5-5c3b2ddfc271\")" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:52:35Z" level=info msg="found PVC: \"pg-pgteststage2/pgdata-pg-pgteststage2-0\" (uid: \"f9c14294-606c-4735-93e7-f43068b981cf\")" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:52:35Z" level=debug msg="syncing connection pooler (master, replica) from (false, nil) to (false, nil)" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=cluster worker=4
time="2023-08-14T11:52:35Z" level=info msg="cluster has been created" cluster-name=pg-pgteststage2/pg-pgteststage2 pkg=controller worker=4
PGPAWAN commented 1 year ago

Why don't we have the parameters below in the Patroni YAML when we are using an external etcd endpoint?

Patroni config with the Kubernetes-internal DCS:

users:
    zalandos:
      options:
      - CREATEDB
      - NOLOGIN
      password: ''
kubernetes:
  bypass_api_service: true
  labels:
    application: spilo
  port: tcp://10.23.0.1:443
  port_443_tcp: tcp://10.23.0.1:443
  port_443_tcp_addr: 10.23.0.1
  port_443_tcp_port: '443'
  port_443_tcp_proto: tcp
  ports:
  - name: postgresql
    port: 5432
  role_label: spilo-role
  scope_label: cluster-name
  service_host: 10.23.0.1
  service_port: '443'
  service_port_https: '443'
  use_endpoints: true
namespace: pg-pgteststage2
postgresql:
  authentication:
    replication:

Patroni config with external etcd:

users:
    zalandos:
      options:
      - CREATEDB
      - NOLOGIN
      password: ''
etcd:
  host: etcd.postgres-etcd1.svc.cluster.local:2379
postgresql:
  authentication:
    replication:
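Regarding the missing `kubernetes:` section: this looks expected rather than a bug. Patroni uses exactly one DCS, and Spilo renders either a `kubernetes:` stanza or an `etcd:` stanza depending on whether an etcd endpoint is configured. The `kubernetes:` block (with `use_endpoints: true`) is also what makes Patroni write the leader address into the `<cluster-name>` endpoint; once the DCS is etcd, that maintenance is gone, which matches the empty endpoint subsets and the refused connections above. A sketch of the two mutually exclusive stanzas, drawn from the two configs quoted here (an interpretation of this report, not operator documentation):

```yaml
# Rendered when no external etcd is configured (internal case above):
kubernetes:
  use_endpoints: true    # Patroni writes the leader IP into the endpoint
  role_label: spilo-role
---
# Rendered when an etcd endpoint is configured (external case above);
# the kubernetes: section, and with it endpoint maintenance, is absent:
etcd:
  host: etcd.postgres-etcd1.svc.cluster.local:2379
```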