Open bboreham opened 7 years ago
From a Slack conversation:
If the time at when a peer was last seen was tracked as part of the peers list, then dead peers could be identified with addition of a heartbeat threshold (e.g. has not reported for 30-60 seconds, and is also either known dead in EC2 response or otherwise not present in ASG response) and they can then be safely removed.
[the trick here is to reclaim the addresses on just one host]
In addition, I am not sure how this could work efficiently but would it be possible to run a script to remove the peer in question automatically when the host is being shut down in an orderly fashion :thinking_face:
[the trick here is to distinguish "shut down forever" from "shut down temporarily"]
We did this for the Kubernetes integration; it was rather a lot of work to coordinate across nodes. See https://github.com/weaveworks/weave/issues/2797 and subsequent fixes.
E.g. when AWS takes away a VM because the auto-scale group is shrinking, or because it was a spot instance.