yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

Tablet server stuck in a loop processing a peer failure when there is a config change in progress #1345

Open mbautin opened 5 years ago

mbautin commented 5 years ago

Jira Link: DB-1665

Seeing this message in the log on a tablet server:

raft_consensus.cc:1055] T 4de597fb20a3433f91f403e821b71f5b P b528edb8fdee46d9bad546e8104b046d [term 113634 LEADER]: Processing failure of peer d541c80109f04a3aa6caa2c4d5275d73 in term 113634 (The logs necessary to catch up peer d541c80109f04a3aa6caa2c4d5275d73 have been garbage collected. The follower will never be able to catch up (Not found (yb/consensus/log_reader.cc:303): Failed to read ops 31625755..31637725: Segment 180 which contained index 31625755 has been GCed)): There is already a config change operation in progress. Unable to evict follower until it completes. Doing nothing.

The code version is b772f48df38e2892085a64e8e6933904790db2ec

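mbautin commented 5 years ago

The message is logged from the branch of RaftConsensus::NotifyFailedFollower marked "LOGGING HERE":
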
void RaftConsensus::NotifyFailedFollower(const string& uuid,
                                         int64_t term,
                                         const std::string& reason) {
  // Common info used in all of the log messages within this method.
  string fail_msg = Substitute("Processing failure of peer $0 in term $1 ($2): ",
                               uuid, term, reason);

  if (!FLAGS_evict_failed_followers) {
    LOG_WITH_PREFIX(INFO) << fail_msg << "Eviction of failed followers is disabled. Doing nothing.";
    return;
  }

  RaftConfigPB committed_config;
  {
    auto lock = state_->LockForRead();

    int64_t current_term = state_->GetCurrentTermUnlocked();
    if (current_term != term) {
      LOG_WITH_PREFIX(INFO) << fail_msg << "Notified about a follower failure in "
                            << "previous term " << term << ", but a leader election "
                            << "likely occurred since the failure was detected. "
                            << "Doing nothing.";
      return;
    }

    if (state_->IsConfigChangePendingUnlocked()) {
      LOG_WITH_PREFIX(INFO) << fail_msg << "There is already a config change operation "  // <-- LOGGING HERE
                            << "in progress. Unable to evict follower until it completes. "
                            << "Doing nothing.";
      return;
    }
    committed_config = state_->GetCommittedConfigUnlocked();
  }

  // Run config change on thread pool after dropping ReplicaState lock.
  WARN_NOT_OK(raft_pool_token_->SubmitFunc(std::bind(&RaftConsensus::TryRemoveFollowerTask,
                                               shared_from_this(), uuid, committed_config, reason)),
              state_->LogPrefix() + "Unable to start RemoteFollowerTask");
}
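
To make the failure mode easier to reason about, here is a minimal, self-contained sketch of the decision order in the function above. It is not YugabyteDB code: `ToyReplicaState`, `FailureAction`, `DecideOnFailedFollower` and `CountIgnoredNotifications` are hypothetical names introduced only for illustration. It also assumes that the failure notification is delivered repeatedly while the pending config change never completes, which matches the issue title but is not shown in the excerpt.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the state that NotifyFailedFollower
// consults under the ReplicaState lock. Not a YugabyteDB type.
struct ToyReplicaState {
  int64_t current_term = 0;
  bool config_change_pending = false;
};

// Possible outcomes of one failure notification (illustrative names).
enum class FailureAction {
  kEvictionDisabled,
  kStaleTerm,
  kConfigChangePending,
  kEvictionScheduled,
};

// Applies the same checks, in the same order, as the quoted function:
// eviction flag, then term match, then pending config change.
FailureAction DecideOnFailedFollower(const ToyReplicaState& state,
                                     int64_t term,
                                     bool evict_failed_followers) {
  if (!evict_failed_followers) return FailureAction::kEvictionDisabled;
  if (state.current_term != term) return FailureAction::kStaleTerm;
  if (state.config_change_pending) return FailureAction::kConfigChangePending;
  return FailureAction::kEvictionScheduled;
}

// Counts how many of `notifications` identical failure reports end in the
// "Doing nothing" branch when the state does not change between reports.
int CountIgnoredNotifications(const ToyReplicaState& state,
                              int64_t term,
                              int notifications) {
  assert(notifications >= 0);
  int ignored = 0;
  for (int i = 0; i < notifications; ++i) {
    if (DecideOnFailedFollower(state, term, /*evict_failed_followers=*/true) ==
        FailureAction::kConfigChangePending) {
      ++ignored;
    }
  }
  return ignored;
}
```

Under these assumptions, every notification received while `config_change_pending` stays true is dropped, so the follower is never evicted until the pending config change completes or the term changes.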