When searching for log messages about a specific pod having started, I found that it did so on Saturday morning: Why? Well, it clearly wasn't a new pod - searching for the same container, but without the "starting" filter, shows it had been started long before that:
In fact, every scope restarted at that time (prom graphs: `sum by (pod) (rate(container_cpu_usage_seconds_total{container="query"}[1m]))`, 10 minute range ending 2021-09-19 04:39) (why? I don't know - possibly whatever caused the first one to crash was retried on the others, one by one), and it led to this staircase:
Looking at the same 10 minutes as above, you see one set of IDs (container IDs?) being rotated out, while the new ones never actually get "turned on" or receive traffic - curious!
Yet when you replace that long, full ID with the pod ID, you'll notice it's the same pod ID, which just seemingly drops out.
I believe this means that, since it's the same pod, it's the same IP being returned from the queryh service, so from our point of view it simply went unhealthy and then back to healthy - which means this should also happen for any backend that goes unhealthy and then recovers!
So I went to dev, killed the query process in one container, waited a couple of minutes for the graphs to update, then did the same to the other. I had exactly a 50% success rate at reproducing the issue:
So it looks like this is a race condition - if the process restart completed within the 5 seconds between two DNS polls, the pod remained healthy; if it happened to be unhealthy when a poll ran, it stayed unhealthy forever.
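For context, here is a minimal sketch of what a 5-second DNS poll like this looks like, assuming (as the behaviour above suggests) that backend health is derived from which addresses the lookup returns. The function names and the service name are my own illustration, not the actual code:

```go
package main

import (
	"context"
	"log"
	"net"
	"time"
)

// pollAddrs resolves the service every 5 seconds and hands the result to
// update. Illustrative only - not the real poller.
func pollAddrs(ctx context.Context, service string, update func([]string)) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// If the process restart finishes entirely between two ticks,
			// the address never disappears from this lookup, so the backend
			// stays healthy. If a tick lands while the pod is not ready,
			// the address is missing here and gets removed downstream.
			addrs, err := net.DefaultResolver.LookupHost(ctx, service)
			if err != nil {
				log.Printf("lookup %s: %v", service, err)
				continue
			}
			update(addrs)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	// Hypothetical headless-service name for the query pods.
	pollAddrs(ctx, "query.default.svc.cluster.local", func(addrs []string) {
		log.Printf("resolved: %v", addrs)
	})
}
```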
Afterwards, in dev, it seems to work:
When a pod is restarted (not as in deleted and recreated, but as in the process is killed while the pod continues), consistent would delete the node from algo_consistent. Then, when consistent discovered it had started again, it would not tell algo_consistent about it, because the entry was already in its cache.
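The shape of the bug, as I understand it, sketched in Go - the `balancer` interface, the struct, and the `update` method below are stand-ins I invented for consistent and algo_consistent, not the real code:

```go
package lbsketch

// balancer is what algo_consistent looks like from consistent's side.
type balancer interface {
	Add(addr string)
	Remove(addr string)
}

type consistent struct {
	cache map[string]struct{} // addresses consistent has already announced
	algo  balancer            // stand-in for algo_consistent
}

// update is called with the addresses returned by each DNS poll.
func (c *consistent) update(resolved []string) {
	seen := make(map[string]struct{}, len(resolved))

	for _, addr := range resolved {
		seen[addr] = struct{}{}
		if _, ok := c.cache[addr]; ok {
			// BUG: the entry is still cached from before the restart,
			// so algo_consistent is never told the node is back.
			continue
		}
		c.cache[addr] = struct{}{}
		c.algo.Add(addr)
	}

	for addr := range c.cache {
		if _, ok := seen[addr]; !ok {
			// The node is removed from the balancer when it drops out of DNS...
			c.algo.Remove(addr)
			// ...but its cache entry is left behind.
		}
	}
}
```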
This change cleans up the cache, so we actually continue to load balance even when the cluster is struggling.
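Against the same invented types as above, the fix amounts to evicting the cache entry at the same time the node is removed from the balancer, so the next poll that sees the address announces it again:

```go
// updateFixed mirrors update above, but keeps the cache and the balancer
// in sync; again a sketch against invented types, not the real change.
func (c *consistent) updateFixed(resolved []string) {
	seen := make(map[string]struct{}, len(resolved))

	for _, addr := range resolved {
		seen[addr] = struct{}{}
		if _, ok := c.cache[addr]; !ok {
			c.cache[addr] = struct{}{}
			c.algo.Add(addr)
		}
	}

	for addr := range c.cache {
		if _, ok := seen[addr]; !ok {
			c.algo.Remove(addr)
			delete(c.cache, addr) // the fix: forget the node, so a recovered pod is re-added
		}
	}
}
```

With the entry gone, a pod whose restart is caught by an unlucky poll is treated like any newly discovered address on the next successful lookup, instead of staying out of rotation forever.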