asafo opened this issue 3 years ago
The problem starts when using the master service/endpoint name as the name of the StatefulSet. This leaves the slave pods not directly accessible by name. See https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-hostname-and-subdomain-fields. We use mTLS in strict mode, and slave pods are not in the k8s service registry.
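For reference, per-pod DNS only works when a headless Service matches the StatefulSet's `serviceName`; a minimal sketch of that pairing (all names here are hypothetical):

```yaml
# Hypothetical sketch: a headless Service that backs the StatefulSet's
# serviceName, giving every pod a stable DNS record of the form
# <pod>.<service>.<namespace>.svc.<cluster-domain>, which strict-mTLS
# meshes can resolve and match against their service registry.
apiVersion: v1
kind: Service
metadata:
  name: acid-demo-pods          # must differ from the master service name
spec:
  clusterIP: None               # headless: DNS returns pod IPs directly
  selector:
    application: spilo
    cluster-name: acid-demo
  ports:
    - name: postgresql
      port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: acid-demo
spec:
  serviceName: acid-demo-pods   # links pod hostnames to the headless Service
  # ...rest of the StatefulSet spec unchanged
```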
If the operator would create 3 services:
Also, though less important, why is the master service not using a selector to choose the master pod?
What about using the pod names instead of IP addresses? (Assuming we have created a headless service for the StatefulSet to have DNS names.)
> What about using the pod names instead of IP addresses?
Guess how many times people got into trouble due to misbehaving DNS? No thanks, I played this game; you are always supposed to lose...
> Also, though less important, why is the master service not using a selector to choose the master pod?
Because Patroni manages the endpoint and therefore can guarantee that there is never more than one IP. With a label selector and readiness/liveness probes it is impossible to get such behavior. Example:
- the leader pod becomes network partitioned and therefore demotes Postgres, but can't update its labels
- the other pod promotes
- we end up in a situation where two pods have the label role=master!
One may argue that liveness and readiness probes can help to solve it, but... How do we know where these probes are executed from? Can they really guarantee that? Another point is that the configuration of liveness/readiness probes must be aligned with the Patroni config: https://patroni.readthedocs.io/en/latest/rest_api.html
@CyberDem0n
This is an issue with how Spilo handles label changes and endpoint changes (https://github.com/zalando/spilo/blob/9cbf2094d7cf9eb95081c60db540b1d4e793a55f/postgres-appliance/scripts/callback_role.py#L66). Each pod is responsible for managing its own labels, but only the leader is allowed to change the endpoints. If the leader were the one to manage the labels for all pods, there would never be a moment where two pods could have the label spilo-role=master. This would enable exposing the master/replica service safely via a headless service and would allow the pooler to detect an IP change, fixing https://github.com/zalando/postgres-operator/issues/1928
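The single-writer idea above can be sketched in a few lines (pure illustration; `compute_role_labels` is a hypothetical function, not part of Spilo's actual callback script):

```python
# Hypothetical sketch: instead of each pod setting its own spilo-role label,
# the leader computes the labels for every pod. With a single writer there is
# never a moment where two pods carry the "master" label simultaneously.
def compute_role_labels(pods: list[str], leader: str) -> dict[str, str]:
    """Return the spilo-role label for each pod, decided solely by the leader."""
    return {pod: ("master" if pod == leader else "replica") for pod in pods}
```

The point of the design is that label assignment becomes one atomic decision made by the leader, rather than N independent decisions made by N pods racing each other.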
Not really. There will always be a race condition, and clients (or pgbouncer) will connect to the wrong pod. It is unfortunately unavoidable.
The only way to solve it is to use a load balancer for connection routing that periodically calls the Patroni API. But it is important to configure the load balancer to close existing connections if the backend isn't healthy. A very simple example of an HAProxy config that does it can be found in the Patroni repo: https://github.com/zalando/patroni/blob/master/haproxy.cfg#L23-L25. `on-marked-down shutdown-sessions` is responsible for the trick.
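In the spirit of the linked Patroni example, a minimal sketch of such an HAProxy frontend (server names, IPs, and ports are placeholders):

```
# Sketch: route only to the pod whose Patroni REST API reports it as leader.
# Patroni's health endpoint (default port 8008) returns 200 only on the leader.
listen postgres_master
    bind *:5000
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pod-0 10.0.0.1:5432 check port 8008
    server pod-1 10.0.0.2:5432 check port 8008
```

`on-marked-down shutdown-sessions` is what kills already-established connections the moment a backend fails its health check, instead of letting them linger on a demoted node.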
@CyberDem0n I agree that races are unavoidable and that there will be a period of time after a failover during which connections to the Postgres master are routed to a replica (due to the time it takes to update DNS caches/DNS zones/endpoints etc...). I disagree that the only way to solve the problem is by using an L3/L4 load balancer; pgbouncer is fully capable of handling failovers. Let me make my idea clear, feel free to correct me if I'm wrong.
Assume that the master Postgres instance is exposed via a headless service (`ClusterIP="None"`) with label selector `spilo-role=master`, and assume that Spilo would update the labels when the endpoints change (the leader would be responsible for managing the labels for all pods), or that the endpoints for that headless service are managed directly by Spilo (like they are now with the existing ClusterIP service).
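A minimal sketch of that headless master Service (all names are illustrative):

```yaml
# Hypothetical sketch: a headless Service whose DNS A record always points at
# the current leader, because only the leader carries the spilo-role=master
# label (or because Spilo manages the endpoints directly).
apiVersion: v1
kind: Service
metadata:
  name: acid-demo
spec:
  clusterIP: None          # headless: the A record follows the selected pod's IP
  selector:
    application: spilo
    cluster-name: acid-demo
    spilo-role: master     # only the current leader carries this label
  ports:
    - name: postgresql
      port: 5432
      targetPort: 5432
```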
Pgbouncer with `dns_zone_check_period` enabled is capable of detecting a change of the IP address of a host by periodically querying the SOA serial of the DNS zone (this requires that pgbouncer connects to an FQDN of the service: `<cluster_name>.<namespace>.svc.<cluster_domain>` or just `<cluster_name>.<namespace>`). If the zone serial changes, then pgbouncer checks whether the IP address of the host changed. Combined with `server_fast_close`, this ensures that when the underlying IP of the host changes, all existing DB connections are gracefully disconnected and the app is free to reconnect.
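A sketch of the relevant pgbouncer settings (the database host is a hypothetical headless-service FQDN; `dns_zone_check_period` and `server_fast_close` are real pgbouncer options):

```ini
; Sketch of a pgbouncer config using the mechanism described above.
[databases]
mydb = host=acid-demo.default.svc.cluster.local port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
; re-check the DNS zone's SOA serial every 10 seconds
dns_zone_check_period = 10
; close server connections gracefully when the backend changes/disappears
server_fast_close = 1
```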
Let me explain step by step what would happen in my configuration during a switchover/failover (based on #1928):
I think this is the simplest way to provide an out-of-the-box experience with the operator. Right now, after a switchover/failover the operator leaves the cluster in an inconsistent state, requiring a restart of the connection pooler and extra monitoring to detect switchovers/failovers :(
> What about using the pod names instead of IP addresses?
>
> Guess how many times people got into trouble due to misbehaving DNS? No thanks, I played this game; you are always supposed to lose...
Well, everything in k8s is handled via DNS, so it would make sense for the operator and Patroni to use DNS names as well. Today, when Istio is in use (and it would be the same with a lot of service meshes), we get a 503 error because Istio doesn't "understand" where the traffic is coming from.
+1
Hello
We are working with the Postgres Operator on K8s clusters running Istio, and we are having various issues because pod-name DNS is not used (there is no headless service for the StatefulSet).
Also, though less important, why is the master service not using a selector to choose the master pod?
Thanks