zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Using more conventional k8s means - master service with selector & pod hostnames via headless service #1467

Open asafo opened 3 years ago

asafo commented 3 years ago

Hello

We are working with the Postgres Operator on K8s clusters that run Istio, and we are having various issues because there is no pod-name DNS (no headless service for the StatefulSet).

Also, though less important: why does the master service not use a selector to choose the master pod?

Thanks

AviMoto commented 3 years ago

The problem starts when the master service/endpoint uses the same name as the StatefulSet. This leaves the slave pods not accessible directly by name. See https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-hostname-and-subdomain-fields. We use mTLS in strict mode, and the slave pods are not in the K8s service registry.
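
For context, the mechanism from the linked doc is roughly the following (a sketch with illustrative names, not the operator's actual manifests): a pod only gets a per-pod DNS record when a headless service matching its subdomain exists, which is what StatefulSets normally provide via spec.serviceName.

```yaml
# Sketch only, illustrative names: the pod gets the DNS record
# my-host.my-subdomain.<namespace>.svc.cluster.local only because a headless
# Service named "my-subdomain" exists. StatefulSets rely on the same mechanism
# through spec.serviceName, which is what is missing here.
apiVersion: v1
kind: Service
metadata:
  name: my-subdomain
spec:
  clusterIP: None          # headless
  selector:
    app: demo
  ports:
    - name: postgresql
      port: 5432
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  labels:
    app: demo
spec:
  hostname: my-host
  subdomain: my-subdomain  # must match the headless Service name
  containers:
    - name: postgres
      image: postgres:15
      ports:
        - containerPort: 5432
```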

AviMoto commented 3 years ago

The operator could create 3 services:

  1. a headless service with the name of the cluster/StatefulSet; this would automatically add a hostname to each pod and could also expose the Patroni and operator API ports
  2. a master service/endpoint working as it does today, but with a different name
  3. a slave service working as it does today
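
For illustration, a minimal sketch of the headless service from item 1 could look like the following (names, labels and ports are illustrative, not what the operator generates today); the master and slave services would keep their current endpoint handling under new names:

```yaml
# Sketch only: a headless Service carrying the cluster/StatefulSet name so that
# every pod gets a stable DNS record such as
# acid-minimal-cluster-0.acid-minimal-cluster.<namespace>.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: acid-minimal-cluster        # same name as the cluster/StatefulSet
spec:
  clusterIP: None                   # headless: per-pod DNS records, no virtual IP
  selector:
    cluster-name: acid-minimal-cluster
  ports:
    - name: postgresql
      port: 5432
    - name: patroni                 # also expose the Patroni/operator API ports
      port: 8008
```
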
CyberDem0n commented 3 years ago

Also, though less important: why does the master service not use a selector to choose the master pod?

Because Patroni manages the endpoint and can therefore guarantee that there is never more than one IP. With a label selector and readiness/liveness probes it is impossible to get such behavior. Example:

  1. the leader pod becomes network partitioned and therefore demotes Postgres, but can't update its labels
  2. the other pod promotes
  3. we end up in a situation where two pods have the label role=master!

One may argue that liveness and readiness probes can help to solve this, but... How do we know where these probes are executed from? Can they really guarantee that? Another point is that the configuration of liveness/readiness probes must be aligned with the Patroni config: https://patroni.readthedocs.io/en/latest/rest_api.html
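
In other words, the pattern is roughly a selector-less Service whose Endpoints object is written by the Patroni leader itself (a sketch with illustrative names and IPs):

```yaml
# Sketch only: with no selector, Kubernetes never reconciles the Endpoints from
# pod labels; only the Patroni leader updates the object, so it can hold at
# most one address at any time.
apiVersion: v1
kind: Service
metadata:
  name: acid-minimal-cluster          # the master service
spec:
  ports:
    - name: postgresql
      port: 5432
      targetPort: 5432
  # note: no selector here
---
apiVersion: v1
kind: Endpoints
metadata:
  name: acid-minimal-cluster          # must match the Service name
subsets:
  - addresses:
      - ip: 10.2.3.4                  # written and updated by the Patroni leader
    ports:
      - name: postgresql
        port: 5432
```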

asafo commented 3 years ago

What about using the pod names instead of IP addresses? (Assuming we have created a headless service for the StatefulSet so that the pods have DNS names.)

CyberDem0n commented 3 years ago

What about using the pod names instead of IP addresses?

Guess how many times people have gotten into trouble due to misbehaving DNS? No thanks, I have played this game; you are always supposed to lose...

gorbak25 commented 2 years ago

Also, though less important: why does the master service not use a selector to choose the master pod?

Because Patroni manages the endpoint and can therefore guarantee that there is never more than one IP. With a label selector and readiness/liveness probes it is impossible to get such behavior. Example:

  1. the leader pod becomes network partitioned and therefore demotes Postgres, but can't update its labels
  2. the other pod promotes
  3. we end up in a situation where two pods have the label role=master!

One may argue that liveness and readiness probes can help to solve this, but... How do we know where these probes are executed from? Can they really guarantee that? Another point is that the configuration of liveness/readiness probes must be aligned with the Patroni config: https://patroni.readthedocs.io/en/latest/rest_api.html

@CyberDem0n This is an issue with how Spilo handles label changes and endpoint changes (https://github.com/zalando/spilo/blob/9cbf2094d7cf9eb95081c60db540b1d4e793a55f/postgres-appliance/scripts/callback_role.py#L66). Each pod is responsible for managing its own labels, but only the leader is allowed to change the endpoints. If the leader were the one managing the labels for all pods, there would never be a moment where two pods could have the label spilo-role=master. This would enable exposing the master/replica service safely via a headless service and would allow the pooler to detect an IP change, fixing #1928.
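
A rough sketch of that idea, assuming the Kubernetes Python client and the operator's default labels (hypothetical, not the current callback_role.py code):

```python
# Hypothetical sketch: after promotion, the new leader relabels every pod of
# the cluster itself, so labels and the master endpoint are changed by the
# same actor and two pods can never carry spilo-role=master at once.
from kubernetes import client, config

def relabel_cluster_pods(namespace: str, cluster_name: str, leader_pod: str) -> None:
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(
        namespace, label_selector=f"cluster-name={cluster_name}")
    for pod in pods.items:
        role = "master" if pod.metadata.name == leader_pod else "replica"
        v1.patch_namespaced_pod(
            pod.metadata.name, namespace,
            {"metadata": {"labels": {"spilo-role": role}}})
```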

CyberDem0n commented 2 years ago

and would allow the pooler to detect an IP change, fixing https://github.com/zalando/postgres-operator/issues/1928

Not really. There will always be a race condition, and clients (or pgbouncer) will connect to the wrong pod. It is unfortunately unavoidable.

The only way to solve it is to use a load balancer for connection routing that periodically calls the Patroni API. But it is important to configure the load balancer to close existing connections if the backend isn't healthy. A very simple example of an HAProxy config that does this can be found in the Patroni repo: https://github.com/zalando/patroni/blob/master/haproxy.cfg#L23-L25. The on-marked-down shutdown-sessions setting is responsible for the trick.
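
Trimmed sketch along the lines of the linked Patroni example (hostnames and ports are illustrative): HAProxy health-checks the Patroni REST API and, because of on-marked-down shutdown-sessions, drops existing client connections as soon as a backend stops reporting itself as master.

```
listen master
    bind *:5000
    option httpchk OPTIONS /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server postgresql_node1 node1:5432 maxconn 100 check port 8008
    server postgresql_node2 node2:5432 maxconn 100 check port 8008
```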

gorbak25 commented 2 years ago

@CyberDem0n I agree that races are unavoidable and that there will be a period of time after a failover when connections to the Postgres master are routed to a replica (due to the time it takes to update DNS caches/DNS zones/endpoints, etc.). I disagree that the only way to solve the problem is by using an L3/L4 load balancer; pgbouncer is fully capable of handling failovers. Let me make my idea clear, and feel free to correct me if I'm wrong.

Assume that the master Postgres instance is exposed via a headless service (ClusterIP="None") with the label selector spilo-role=master, and assume either that Spilo would update the labels when the endpoints change (the leader would be responsible for managing the labels for all pods) or that the endpoints for that headless service are managed directly by Spilo (as they are now with the existing ClusterIP service).

pgbouncer with dns_zone_check_period enabled is capable of detecting a change of the IP address behind a host by periodically querying the SOA serial of the DNS zone (this requires that pgbouncer connects to an FQDN of the service: <cluster_name>.<namespace>.svc.<cluster_domain> or just <cluster_name>.<namespace>). If the zone serial changes, pgbouncer checks whether the IP address of the host changed. Combined with server_fast_close, this ensures that when the underlying IP of the host changes, all existing DB connections are gracefully disconnected and the app is free to reconnect (see the config sketch after the list). Let me explain step by step what would happen in my configuration during a switchover/failover (based on #1928):

  1. postgres-2 was the leader
  2. postgres-2 lost the leader lock and started to demote itself
  3. postgres-pooler lost the connection to postgres-master-svc-headless (postgres-2) and tried to reconnect
  4. postgres-0 got the leader lock and started to promote itself
  5. postgres-2 finished restarting before postgres-0
  6. postgres-pooler reconnected to postgres-master-svc-headless (postgres-2)
  7. postgres-0 finished promoting
  8. Patroni updated the K8s service endpoints to point to postgres-0; postgres-master-svc-headless now points to postgres-0
  9. postgres-pooler noticed that the SOA of cluster.local changed (https://github.com/coredns/coredns/blob/95622f4c9ae65de9465b5fb022711e95be4df2ce/plugin/kubernetes/kubernetes.go#L518)
  10. postgres-pooler checked the DNS host (postgres-master-svc-headless) and noticed that the underlying IP changed (from postgres-2 to postgres-0)
  11. postgres-pooler disconnected all existing connections to postgres-2 and connected to postgres-0
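
A minimal pgbouncer sketch for the setup described above (database name, namespace and service name are hypothetical; the relevant settings are dns_zone_check_period and server_fast_close):

```ini
[databases]
; hypothetical FQDN of the headless master service
mydb = host=postgres-master-svc-headless.mynamespace.svc.cluster.local port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
; poll the DNS zone's SOA serial; on change, re-resolve hosts under that zone
dns_zone_check_period = 10
; close server connections marked for reconnect (e.g. after a DNS change)
; as soon as possible instead of waiting for the session to end
server_fast_close = 1
```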

I think this is the simplest way to provide an out-of-the-box experience with the operator. Right now, after a switchover/failover, the operator leaves the cluster in an inconsistent state, and you need to restart the connection pooler and rely on monitoring to detect switchovers/failovers :(

sylvainOL commented 2 years ago

What about using the pod names instead of IP addresses?

Guess how many times people have gotten into trouble due to misbehaving DNS? No thanks, I have played this game; you are always supposed to lose...

Well, everything in K8s is handled via DNS, so it would make sense for the operator-to-Patroni traffic to use DNS names as well. Today, when Istio is in use (and it would be the same with a lot of service meshes), we get 503 errors because Istio doesn't "understand" which workload the traffic is coming from.

nupis-DanielS commented 1 year ago

+1