yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

Cannot add or remove masters when master is dead #5211

Open dobesv opened 4 years ago

dobesv commented 4 years ago

I accidentally put our cluster into a state where a defunct master cannot be removed and new masters cannot be added. This is probably a niche case, but it would be nice to know whether there is a workaround.

The issue arises with these steps (in theory):

  1. Create a cluster with 3 masters
  2. Shut down one of the masters and delete all of its data
  3. Start up a new master with a new name
  4. Add the new master using change_master_config ADD_SERVER
  5. Attempt to remove the master that was shut down using change_master_config REMOVE_SERVER

At this point, it refuses to allow adding or removing masters from the configuration, giving the error shown in the transcript below.
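
Steps 4 and 5 can be sketched as a small wrapper around the same yb-admin invocation that appears in the transcript below. This is only an illustration: the function name change_master and the DRY_RUN switch are hypothetical additions, and bin/yb-admin and $MASTERS are assumed to be set up as in the transcript. With DRY_RUN=1 (the default here) the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Hypothetical helper for steps 4 and 5; not part of yb-admin itself.
# YB_ADMIN and MASTERS mirror the transcript; DRY_RUN is an illustrative switch.
YB_ADMIN="${YB_ADMIN:-bin/yb-admin}"
DRY_RUN="${DRY_RUN:-1}"

# change_master <ADD_SERVER|REMOVE_SERVER> <host> <port>
change_master() {
  case "$1" in
    ADD_SERVER|REMOVE_SERVER) ;;
    *) echo "unknown operation: $1" >&2; return 2 ;;
  esac
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command only, so it can be reviewed before running it.
    echo "$YB_ADMIN -master_addresses ${MASTERS:-} change_master_config $1 $2 $3"
  else
    "$YB_ADMIN" -master_addresses "${MASTERS:-}" change_master_config "$1" "$2" "$3"
  fi
}
```

For example, step 4 would be change_master ADD_SERVER followed by the new master's host and port 7100, and step 5 would be change_master REMOVE_SERVER followed by the dead master's host and port 7100.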

I wonder whether the Raft logic needs a special case for REMOVE_SERVER, so that the operation does not require the participation of the server being removed?

Perhaps a way to temporarily reduce quorum requirements for the cluster or force reset the whole master config to contain only the current leader would allow me to escape this situation.
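
For reference, the usual Raft majority rule can be written down as a quick check. This is a sketch of the generic rule only. It assumes that only VOTER peers count toward the majority, which is an assumption about YugabyteDB that nothing in this thread confirms. The function names are made up for illustration.

```shell
# majority <voters>: smallest number of votes that is more than half.
majority() {
  echo $(( $1 / 2 + 1 ))
}

# can_commit <voters> <alive_voters>: succeeds when the live voters
# reach a majority under the generic Raft rule.
can_commit() {
  [ "$2" -ge "$(majority "$1")" ]
}
```

With the committed config shown in the transcript below (three VOTER peers, one of them unreachable), can_commit 3 2 succeeds. So under these assumptions the majority arithmetic alone does not explain the refusal, and the "Num peers in transit: 1" part of the error may be the more relevant detail.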

[root@yb-master-c-0 yugabyte]# bin/yb-admin -master_addresses $MASTERS list_all_masters
Master UUID                             RPC Host/Port           State           Role
d88307589dd24d67b0e8877ac546efce        yb-master-a-0.yb-masters.yugabyte.svc.cluster.local:7100        NETWORK_ERROR   UNKNOWN
7a4ea8f39d514ccab694bfd35964c595        yb-master-b-0.yb-masters.yugabyte.svc.cluster.local:7100        ALIVE           LEADER
bdb0b20cfa6f42208e0b56455ca8e7e1        yb-master-c-0.yb-masters.yugabyte.svc.cluster.local:7100        ALIVE           FOLLOWER
2f026a5905d74c1694266853e5f81f3b        yb-master-gp2-a-0.yb-masters.yugabyte.svc.cluster.local:7100    ALIVE           FOLLOWER
[root@yb-master-c-0 yugabyte]# MASTERS=yb-master-gp2-a-0.yb-masters.yugabyte.svc.cluster.local:7100,yb-master-c-0.yb-masters.yugabyte.svc.cluster.local:7100,yb-master-b-0.yb-masters.yugabyte.svc.cluster.local:7100,yb-master-a-0.yb-masters.yugabyte.svc.cluster.local:7100
[root@yb-master-c-0 yugabyte]# bin/yb-admin -master_addresses $MASTERS change_master_config REMOVE_SERVER yb-master-a-0.yb-masters.yugabyte.svc.cluster.local 7100
Error: Illegal state (yb/consensus/raft_consensus.cc:2109): Unable to change master config: Leader is not ready for Config Change, can try again. Num peers in transit: 1. Type: REMOVE_SERVER. Has opid: 1. Committed config: opid_index: 224493 peers { permanent_uuid: "d88307589dd24d67b0e8877ac546efce" member_type: VOTER last_known_private_addr { host: "yb-master-a-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } last_known_broadcast_addr { host: "yb-master-a-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1a" } } peers { permanent_uuid: "7a4ea8f39d514ccab694bfd35964c595" member_type: VOTER last_known_private_addr { host: "yb-master-b-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } last_known_broadcast_addr { host: "yb-master-b-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1b" } } peers { permanent_uuid: "bdb0b20cfa6f42208e0b56455ca8e7e1" member_type: VOTER last_known_private_addr { host: "yb-master-c-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } last_known_broadcast_addr { host: "yb-master-c-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1c" } } peers { permanent_uuid: "2f026a5905d74c1694266853e5f81f3b" member_type: PRE_VOTER last_known_private_addr { host: "yb-master-gp2-a-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } }. Pending config: . Current term: 129. Committed op id: 129.253070.
[root@yb-master-c-0 yugabyte]# bin/yb-admin -master_addresses $MASTERS change_master_config ADD_SERVER yb-master-gp2-b-0.yb-masters.yugabyte.svc.cluster.local 7100
Error: Illegal state (yb/consensus/raft_consensus.cc:2109): Unable to change master config: Leader is not ready for Config Change, can try again. Num peers in transit: 1. Type: ADD_SERVER. Has opid: 1. Committed config: opid_index: 224493 peers { permanent_uuid: "d88307589dd24d67b0e8877ac546efce" member_type: VOTER last_known_private_addr { host: "yb-master-a-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } last_known_broadcast_addr { host: "yb-master-a-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1a" } } peers { permanent_uuid: "7a4ea8f39d514ccab694bfd35964c595" member_type: VOTER last_known_private_addr { host: "yb-master-b-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } last_known_broadcast_addr { host: "yb-master-b-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1b" } } peers { permanent_uuid: "bdb0b20cfa6f42208e0b56455ca8e7e1" member_type: VOTER last_known_private_addr { host: "yb-master-c-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } last_known_broadcast_addr { host: "yb-master-c-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1c" } } peers { permanent_uuid: "2f026a5905d74c1694266853e5f81f3b" member_type: PRE_VOTER last_known_private_addr { host: "yb-master-gp2-a-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } }. Pending config: . Current term: 129. Committed op id: 129.253070.
[root@yb-master-c-0 yugabyte]# bin/yb-admin -master_addresses $MASTERS change_master_config REMOVE_SERVER yb-master-gp2-a-0.yb-masters.yugabyte.svc.cluster.local 7100
Error: Illegal state (yb/consensus/raft_consensus.cc:2109): Unable to change master config: Leader is not ready for Config Change, can try again. Num peers in transit: 1. Type: REMOVE_SERVER. Has opid: 1. Committed config: opid_index: 224493 peers { permanent_uuid: "d88307589dd24d67b0e8877ac546efce" member_type: VOTER last_known_private_addr { host: "yb-master-a-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } last_known_broadcast_addr { host: "yb-master-a-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1a" } } peers { permanent_uuid: "7a4ea8f39d514ccab694bfd35964c595" member_type: VOTER last_known_private_addr { host: "yb-master-b-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } last_known_broadcast_addr { host: "yb-master-b-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1b" } } peers { permanent_uuid: "bdb0b20cfa6f42208e0b56455ca8e7e1" member_type: VOTER last_known_private_addr { host: "yb-master-c-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } last_known_broadcast_addr { host: "yb-master-c-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } cloud_info { placement_cloud: "aws" placement_region: "us-east-1" placement_zone: "us-east-1c" } } peers { permanent_uuid: "2f026a5905d74c1694266853e5f81f3b" member_type: PRE_VOTER last_known_private_addr { host: "yb-master-gp2-a-0.yb-masters.yugabyte.svc.cluster.local" port: 7100 } }. Pending config: . Current term: 129. Committed op id: 129.253070.
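
The output above is hard to scan by eye, so here is a minimal sketch of two filters for it. Both function names are hypothetical, and both assume the exact text layout shown in this transcript: a header line followed by whitespace-separated columns for list_all_masters, and permanent_uuid immediately followed by member_type in the error message. Other yb-admin versions may print something different.

```shell
# dead_masters: read `list_all_masters` output on stdin and print the
# RPC host:port of every master whose State column is not ALIVE.
# Assumes the first line is the header and each row has four columns.
dead_masters() {
  awk 'NR > 1 && NF >= 4 && $3 != "ALIVE" { print $2 }'
}

# config_peers: read a change_master_config error on stdin and print
# "<uuid> <member_type>" for each peer listed in the committed config.
config_peers() {
  grep -o 'permanent_uuid: "[0-9a-f]*" member_type: [A-Z_]*' |
    sed -E 's/permanent_uuid: "([0-9a-f]*)" member_type: ([A-Z_]*)/\1 \2/'
}
```

Piping the list_all_masters output through dead_masters would print only the yb-master-a-0 address here, and piping one of the errors through config_peers (adding 2>&1 in case the error is written to stderr) would list three VOTER peers and one PRE_VOTER.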

pruiz commented 3 years ago

@rahuldesirazu I faced the same situation yesterday and arrived at this workaround:

This way the new master will join the others (normally as NOT_PARTICIPANT), which allows Raft consensus to complete. From then on, you can remove this 'incompleter' using yb-admin ... change_master_config REMOVE_SERVER $bad-master$, and then initialize it properly (i.e. starting it in shell mode, etc.).

Best regards, Pablo