weaveworks / service

☁️ Images for Weave Cloud (R) (TM) (C) ☁️
https://cloud.weave.works

authfe: Don't stop load balancing just because pods restart #2726

Closed ozamosi closed 3 years ago

ozamosi commented 3 years ago

When a pod restarts in place (the process is killed but the pod itself survives, as opposed to being deleted and recreated), `consistent` would delete the node from `algo_consistent`. Then, when `consistent` discovered the pod was up again, it would not tell `algo_consistent`, because the entry was still in its cache.

This cleans up the cache, so we actually continue to load balance even when the cluster is struggling.
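A minimal sketch of that interplay, with hypothetical names (`balancer`, `ring`, `endpointUp`, `endpointDown`) standing in for the real `consistent`/`algo_consistent` code; only the cache handling is the point here:

```go
package main

import "fmt"

// ring is a stand-in for the consistent-hash ring (algo_consistent in the PR);
// only Add/Remove matter for this sketch.
type ring struct{ members map[string]bool }

func newRing() *ring              { return &ring{members: map[string]bool{}} }
func (r *ring) Add(addr string)   { r.members[addr] = true }
func (r *ring) Remove(addr string) { delete(r.members, addr) }

// balancer mirrors the caching wrapper described above: it remembers which
// endpoints it has already told the ring about.
type balancer struct {
	ring  *ring
	cache map[string]bool // endpoints already added to the ring
}

func newBalancer() *balancer {
	return &balancer{ring: newRing(), cache: map[string]bool{}}
}

// endpointUp is called when an endpoint is (re)discovered as healthy.
func (b *balancer) endpointUp(addr string) {
	if b.cache[addr] {
		// Buggy behaviour: the cache still holds the entry from before the
		// restart, so the ring is never told the endpoint is back.
		return
	}
	b.cache[addr] = true
	b.ring.Add(addr)
}

// endpointDown is called when an endpoint fails its health check.
func (b *balancer) endpointDown(addr string) {
	b.ring.Remove(addr)
	// The cleanup this PR describes: also drop the cache entry, so a later
	// endpointUp for the same address re-adds it to the ring.
	delete(b.cache, addr)
}

func main() {
	b := newBalancer()
	b.endpointUp("10.0.0.5:80")   // pod healthy, added to the ring
	b.endpointDown("10.0.0.5:80") // process killed, removed from the ring
	b.endpointUp("10.0.0.5:80")   // pod back with the same IP
	fmt.Println(b.ring.members)   // with the cache cleanup: map[10.0.0.5:80:true]
}
```

Without the `delete(b.cache, addr)` line, the final `endpointUp` is a no-op and the restarted pod never receives traffic again.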

ozamosi commented 3 years ago

When searching for messages about a specific pod having started, I found that it did start Saturday morning:

(Screenshot from 2021-09-20 13-13-55)

Why? It clearly wasn't a new pod - searching for the same container, but without the "starting" filter, shows it had been started long before that:

(Screenshot from 2021-09-20 13-15-17)

In fact, every scope restarted at that time (prom graphs of `sum by (pod) (rate(container_cpu_usage_seconds_total{container="query"}[1m]))` over the 10-minute range ending 2021-09-19 04:39). Why? I don't know - possibly whatever caused the first one to crash was retried on the others, one by one. That led to this staircase:

(Screenshot from 2021-09-20 13-17-58)

Looking at the same 10 minutes as above, you see one set of IDs (container IDs?) being rotated out, and the new ones never actually being "turned on" or receiving traffic - curious!

(Screenshot from 2021-09-20 13-21-58)

Yet when you replace that long, full ID with the pod ID, you'll notice it's the same pod, which just seemingly drops out.

(Screenshot from 2021-09-20 13-22-12)

I believe this means that, since it's the same pod, the queryh service keeps returning the same IP, so to the load balancer it just looks like the endpoint went unhealthy and then back to healthy - which means the same thing should happen for any endpoint that goes unhealthy and then recovers!

So I went to dev, killed the query process in one container, waited a couple of minutes for the graphs to update, then did the same to the other. I had exactly a 50% success rate at reproducing the issue:

(Screenshot from 2021-09-20 13-25-58)

So it looks like this is a race condition: if the pod restarted within the 5 seconds between DNS polls, it remained healthy; if it happened to be unhealthy when a poll ran, it stayed unhealthy forever.
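A toy simulation of that race, assuming the poll path short-circuits on a cached entry as sketched above (the names are illustrative, not the actual authfe code):

```go
package main

import "fmt"

// state tracks one endpoint: whether it is currently in the hash ring, and
// whether the (buggy) cache still remembers having added it.
type state struct {
	inRing  bool
	inCache bool
}

// poll is one iteration of the periodic DNS/health poll (every 5 seconds).
func (s *state) poll(healthy bool) {
	if healthy {
		if !s.inCache { // buggy cache check: known addresses are never re-added
			s.inCache = true
			s.inRing = true
		}
	} else {
		s.inRing = false // removed from the ring, but the cache entry remains
	}
}

func main() {
	// Case 1: the process restarts entirely between two polls.
	a := &state{inRing: true, inCache: true}
	a.poll(true)                                              // next poll already sees it healthy again
	fmt.Println("restart between polls, in ring:", a.inRing)  // true

	// Case 2: a poll happens while the process is down.
	b := &state{inRing: true, inCache: true}
	b.poll(false)                                             // caught unhealthy: removed from the ring
	b.poll(true)                                              // healthy again, but the stale cache entry blocks re-adding
	fmt.Println("restart caught by a poll, in ring:", b.inRing) // false
}
```

Case 1 models the pod restarting entirely between two polls, so nothing changes; case 2 models a poll catching it while it's down, after which the stale cache entry keeps it out of the ring for good.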

ozamosi commented 3 years ago

Afterwards, with the fix in dev - it seems to work:

(Screenshot from 2021-09-20 14-20-40)