zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Patroni taking too long for failover #1145

Open BenchmarkingBuffalo opened 4 years ago

BenchmarkingBuffalo commented 4 years ago


2020-09-22 13:34:23,786 INFO: Lock owner: acid-minimal-cluster-0; I am acid-minimal-cluster-1
2020-09-22 13:34:23,786 INFO: does not have lock
2020-09-22 13:34:23,860 INFO: no action. i am a secondary and i am following a leader
2020-09-22 13:34:23,862 WARNING: Loop time exceeded, rescheduling immediately.
2020-09-22 13:34:25,386 WARNING: Request failed to acid-minimal-cluster-0: GET http://10.36.0.1:8008/patroni (HTTPConnectionPool(host='10.36.0.1', port=8008): Max retries exceeded with url: /patroni (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe1b8fe3828>: Failed to establish a new connection: [Errno 113] No route to host',)))
2020-09-22 13:34:25,520 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2020-09-22 13:34:25,582 INFO: promoted self to leader by acquiring session lock
2020-09-22 13:34:25,584 WARNING: Loop time exceeded, rescheduling immediately.
2020-09-22 13:34:25,584 INFO: Lock owner: acid-minimal-cluster-1; I am acid-minimal-cluster-1
2020-09-22 13:34:25,634 INFO: updated leader lock during promote
server promoting
2020-09-22 13:34:25,671 INFO: cleared rewind state after becoming the leader


As you can see, I disconnected the network at 13:33:01, and there were no more logs for almost a minute.
Then a timeout was reached (I don't know how to change this timeout to a shorter value).
After 25 more seconds, the node started to promote itself.
Is there a way I can reduce this amount of time? 
What I basically want is the former standby to promote itself to the master as soon as the master does not renew his lock. 
CyberDem0n commented 4 years ago

After 25 more seconds, the node started to promote itself. Is there a way I can reduce this amount of time? What I basically want is the former standby to promote itself to the master as soon as the master does not renew his lock.

The promotion happens when the leader doesn't update the lock for 30 seconds (ttl). It is possible to reduce ttl, but I would strongly advise you not to do that. What if it is a temporary network glitch that will resolve itself soon?

BenchmarkingBuffalo commented 4 years ago

Hi, I changed the ttl to two seconds (see the manifest above), so that can't be it, or am I wrong?

CyberDem0n commented 4 years ago

Please don't do it! No one can beat the laws of physics!

If one changes ttl, loop_wait and retry_timeout should also be adjusted. There is a constraint that must hold: loop_wait + 2*retry_timeout <= ttl (see https://patroni.readthedocs.io/en/latest/SETTINGS.html).

Besides that, there are some hardcoded timeouts, so it is absolutely unsafe to go below 20 seconds!
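
For readers who want to sanity-check a set of values before putting them into a manifest, the constraint quoted in this comment can be expressed as a small check. This is only a sketch: the names PatroniTimings, check_timings and MIN_SAFE_TTL are hypothetical helpers and not part of Patroni or the operator, and the 20-second floor is simply the advice given in this thread, applied here to ttl as an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PatroniTimings:
    """Hypothetical container for the three timing settings, in seconds."""
    ttl: int
    loop_wait: int
    retry_timeout: int


# Floor taken from the advice in this thread, applied to ttl as an assumption.
# It is not a constant exported by Patroni.
MIN_SAFE_TTL = 20


def check_timings(t: PatroniTimings) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    required = t.loop_wait + 2 * t.retry_timeout
    # Constraint from the thread: loop_wait + 2*retry_timeout <= ttl
    if required > t.ttl:
        problems.append(
            f"loop_wait + 2*retry_timeout = {required} exceeds ttl = {t.ttl}"
        )
    if t.ttl < MIN_SAFE_TTL:
        problems.append(
            f"ttl = {t.ttl} is below the {MIN_SAFE_TTL} s floor advised in the thread"
        )
    return problems


if __name__ == "__main__":
    candidates = [
        PatroniTimings(ttl=30, loop_wait=10, retry_timeout=10),  # Patroni defaults
        PatroniTimings(ttl=2, loop_wait=1, retry_timeout=1),     # illustrative aggressive values
    ]
    for timings in candidates:
        issues = check_timings(timings)
        print(timings, "->", "OK" if not issues else "; ".join(issues))
```

With the Patroni defaults (ttl 30, loop_wait 10, retry_timeout 10) the check passes, since 10 + 2*10 equals 30. A ttl of two seconds fails both tests regardless of how the other two values are chosen, because it is below the advised floor.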

BenchmarkingBuffalo commented 4 years ago

Ok, thank you for your reply. I had already adjusted the other values as well, but I did not know about the hardcoded timeouts.