Closed CyberDem0n closed 8 years ago
Manual regression test for the Network Partition failure scenario:
3-node cluster, status:
cluster is healthy
member af90e28ea0bc9d03 is healthy
member e32468c722a28bd4 is healthy
member ff4e0f5a2f7eb641 is unhealthy
Add a new node by bumping the Auto Scaling Group from 3 to 4:
The cluster adds a member and becomes unhealthy for a short while; after that, the cluster is healthy again:
cluster is healthy
member 231a86654ad1f632 is healthy
member af90e28ea0bc9d03 is healthy
member e32468c722a28bd4 is healthy
member ff4e0f5a2f7eb641 is unhealthy
My current hypothesis for why this happens is:
Tested again by doing the following:
+1
Before sending the add-member command to an existing cluster, we should check that nobody else is already in the process of adding itself to the cluster.
Basically, this is just a workaround for the following problem: https://github.com/zalando/stups-etcd-cluster/issues/1
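A minimal sketch of such a check, not the project's actual implementation. It assumes the member list has already been fetched and parsed into dicts shaped like the entries returned by etcd's v2 members API, and that a member which was added but has not yet started shows up with an empty `name`. The function names `unstarted_members` and `safe_to_add_member` are hypothetical.

```python
def unstarted_members(members):
    """Return members that were added to the cluster but have not started yet.

    Assumption: each member is a dict with at least "id" and "name" keys,
    and a member that has not joined yet has an empty or missing "name".
    """
    return [m for m in members if not m.get("name")]


def safe_to_add_member(members):
    """True only if no other member is still in the process of joining."""
    return not unstarted_members(members)


if __name__ == "__main__":
    # Hypothetical member list: one running member, one still joining.
    members = [
        {"id": "af90e28ea0bc9d03", "name": "node-a"},
        {"id": "231a86654ad1f632", "name": ""},
    ]
    if safe_to_add_member(members):
        print("no pending joins, safe to add a member")
    else:
        pending = [m["id"] for m in unstarted_members(members)]
        print("wait, members still joining:", pending)
```

A node would run this check against the current member list and only issue the add-member request when it returns true, retrying later otherwise.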
New members were added successfully, but etcd failed to start due to a version incompatibility and we lost quorum.
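The quorum loss follows from majority arithmetic: a cluster of N members needs floor(N/2) + 1 of them healthy. The sketch below is only an illustration of that rule applied to the 3-to-4 scenario from this issue; the helper names `quorum` and `has_quorum` are hypothetical.

```python
def quorum(cluster_size):
    """Smallest number of healthy members that forms a majority."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size, healthy):
    """True if the healthy members form a majority of the cluster."""
    return healthy >= quorum(cluster_size)


if __name__ == "__main__":
    # 3 members, 1 unhealthy: 2 healthy is still a majority of 3.
    print(has_quorum(3, 2))  # True
    # A 4th member is registered but its etcd never starts:
    # the majority of 4 is 3, and only 2 members are healthy.
    print(has_quorum(4, 2))  # False
    # Once the new member actually starts, quorum is restored.
    print(has_quorum(4, 3))  # True
```

This is why registering a member that then fails to start is dangerous when one existing member is already unhealthy: the member count, and with it the required majority, grows before any healthy member is added.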