zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License
4.35k stars 980 forks

updated leader lock during demoting self because DCS is not accessible and I was a leader #2593

Closed anyafit closed 7 months ago

anyafit commented 7 months ago

We have a 3-node PostgreSQL cluster spread across different locations, and the etcd cluster runs in one of those locations. The PostgreSQL leader lost its connection to etcd and to the replica in the etcd location:

2024-03-22 14:19:22,085 INFO: Lock owner: compute-1; I am compute-1
2024-03-22 14:19:24,088 ERROR: Request to server http://etcd-1.com:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='etcd-1.com', port=2379): Read timed out. (read timeout=1.9998452216386795)",)
2024-03-22 14:19:24,088 INFO: Reconnection allowed, looking for another server.
2024-03-22 14:19:24,088 INFO: Retrying on http://etcd-2.com:2379
....
2024-03-22 14:19:31,862 ERROR: Error communicating with DCS
2024-03-22 14:19:31,863 ERROR: watchprefix failed: ProtocolError("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
2024-03-22 14:19:32,042 INFO: Got response from compute-2 http://compute-2:8006/patroni: Accepted
.....
2024-03-22 14:19:35,874 WARNING: Request failed to compute-3: POST http://compute-3:8006/patroni (HTTPConnectionPool(host='compute-3', port=8006): Max retries exceeded with url: /failsafe (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7ff7b270c6d8>, 'Connection to compute-3 timed out. (connect timeout=2)')))
2024-03-22 14:19:35,973 INFO: demoting self because DCS is not accessible and I was a leader
2024-03-22 14:19:35,973 INFO: Demoting self (offline)

But once the demotion had started, the connection to the DCS was re-established, and Patroni kept updating the leader lock until Postgres had shut down (in our case that took 10 minutes):

2024-03-22 14:19:38,757 INFO: Reconnection allowed, looking for another server.
2024-03-22 14:19:38,757 INFO: Retrying on http://etcd-2.com:2379
2024-03-22 14:19:38,992 INFO: Selected new etcd server http://etcd-2.com:2379
2024-03-22 14:19:39,190 INFO: Lock owner: compute-1; I am compute-1
2024-03-22 14:19:39,628 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-03-22 14:19:45,974 INFO: Lock owner: compute-1; I am compute-1
2024-03-22 14:19:46,173 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
.....

This looks like a bug. Shouldn't Patroni stop updating the leader lock once it is demoting itself because the DCS was not accessible?
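
For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch that lists every leader-lock update logged after the demotion started. It is only an illustration: the function name lock_updates_after_demotion is made up, the two message strings are copied from the log excerpts above, and the timestamp format is assumed to match those excerpts.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the excerpts above:
# "2024-03-22 14:19:35,973 INFO: message text"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+): (.*)$")

# Message strings copied verbatim from the log in this issue.
DEMOTE_MSG = "demoting self because DCS is not accessible and I was a leader"
UPDATE_MSG = "updated leader lock during " + DEMOTE_MSG


def lock_updates_after_demotion(lines):
    """Hypothetical helper: return (demotion_time, [lock_update_times])."""
    demoted_at = None
    updates = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip continuation lines and "....." gaps
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        msg = m.group(3)
        if msg == DEMOTE_MSG and demoted_at is None:
            demoted_at = ts  # first time the demotion is announced
        elif msg == UPDATE_MSG and demoted_at is not None:
            updates.append(ts)  # lock refreshed while demotion is in progress
    return demoted_at, updates


sample = """\
2024-03-22 14:19:35,973 INFO: demoting self because DCS is not accessible and I was a leader
2024-03-22 14:19:35,973 INFO: Demoting self (offline)
2024-03-22 14:19:39,190 INFO: Lock owner: compute-1; I am compute-1
2024-03-22 14:19:39,628 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-03-22 14:19:46,173 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
""".splitlines()

demoted_at, updates = lock_updates_after_demotion(sample)
for ts in updates:
    delay = (ts - demoted_at).total_seconds()
    print(f"leader lock updated {delay:.3f}s after demotion started")
```

Run against a full log file (for example by passing open(path) instead of sample), the gap between the demotion line and the last lock update shows how long the old leader kept holding the lock.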

anyafit commented 7 months ago

Sorry, this was meant for another project.