OlegLoginov opened this issue:
Log:
W1228 18:27:11.434542 7247 remote_bootstrap_service.cc:117] Error handling RemoteBootstrapService RPC request from 10.20.67.46:60189: Not found (yb/tserver/remote_bootstrap_session.cc:488): Unable to get piece of log segment: Already sent active log segment, don't send 2
W1228 18:27:11.449787 25344 consensus_peers.cc:424] T 30ef75e22b864fdd8bcad80fbf63e768 P b2ac21dfaa474840b5b3e57bc64cea16 -> Peer c9d47b5f21bd46e186d3cb236b014da3 ([host: "10.20.67.46" port: 9101], []): Couldn't send request. Status: Illegal state (yb/tserver/service_util.h:125): Tablet not RUNNING: BOOTSTRAPPING. Retrying in the next heartbeat period. Already tried 1 times. State: 2
W1228 18:27:13.464529 25347 remote_bootstrap_service.cc:117] Error handling RemoteBootstrapService RPC request from 10.20.67.50:42331: Not found (yb/tserver/remote_bootstrap_session.cc:488): Unable to get piece of log segment: Already sent active log segment, don't send 2
W1228 18:27:15.523027 7404 log_util.cc:211] Could not read footer for segment: /usr/scratch/yugabyte/2.0.8.0/data/yb-data/tserver/wals/table-0000400100003000800000000000401d/tablet-5688f5b6aa13400b915482ef28512cd1/wal-000000001: Not found (yb/consensus/log_util.cc:446): Footer not found. Footer magic doesn't match
W1228 18:27:15.523083 7404 log_reader.cc:182] T 5688f5b6aa13400b915482ef28512cd1 P b2ac21dfaa474840b5b3e57bc64cea16: Log segment /usr/scratch/yugabyte/2.0.8.0/data/yb-data/tserver/wals/table-0000400100003000800000000000401d/tablet-5688f5b6aa13400b915482ef28512cd1/wal-000000001 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.
W1228 18:27:18.961344 7149 reactor.cc:380] TabletServer_R001: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 10.20.67.48:9101 idx: 2 protocol: 0x00007fa783506d98 -> tcp }
W1228 18:27:18.961792 7150 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 10.20.67.50:9101 idx: 0 protocol: 0x00007fa783506d98 -> tcp }
W1228 18:27:18.961882 7150 reactor.cc:380] TabletServer_R002: Client call yb.consensus.ConsensusService.RequestConsensusVote has no timeout set for connection id: { remote: 10.20.67.49:9101 idx: 6 protocol: 0x00007fa783506d98 -> tcp }
W1228 18:28:33.910887 7250 cql_processor.cc:263] Unsupported driver option DRIVER_VERSION = 3.19.0
F1228 18:29:00.744967 25713 meta_cache.cc:722] Check failed: tablets_by_key.emplace(partition.partition_key_start(), remote).second
The failure is in yb::client code (which runs inside the TServer, in the CQL/SQL service, to prepare and process RPC requests from the TS to the Master).
// Manager of RemoteTablets and RemoteTabletServers. The client consults
// this class to look up a given tablet or server.
//
// This class will also be responsible for cache eviction policies, etc.
class MetaCache : public RefCountedThreadSafe<MetaCache> {
  ...
  std::unordered_map<PartitionKey, RemoteTabletPtr> tablets_by_partition;
RemoteTabletPtr MetaCache::ProcessTabletLocations(
    const google::protobuf::RepeatedPtrField<master::TabletLocationsPB>& locations,
    const std::string* partition_group_start) {
  VLOG(2) << "Processing master response " << ToString(locations);
  ...
  std::lock_guard<decltype(mutex_)> l(mutex_);
  for (const TabletLocationsPB& loc : locations) {
    for (const std::string& table_id : loc.table_ids()) {
      auto& table_data = tables_[table_id];
      auto& tablets_by_key = table_data.tablets_by_partition;  // <<<<<<<<<<<<
      ...
      // Next, update the tablet caches.
      const std::string& tablet_id = loc.tablet_id();
      RemoteTabletPtr remote = FindPtrOrNull(tablets_by_id_, tablet_id);
      if (remote) {  // <<<<<<<<<<<<<< remote == NULL here - 'tablet_id' was NOT found
        // Partition should not have changed.
        DCHECK_EQ(loc.partition().partition_key_start(),
                  remote->partition().partition_key_start());
        DCHECK_EQ(loc.partition().partition_key_end(),
                  remote->partition().partition_key_end());
        VLOG(3) << "Refreshing tablet " << tablet_id << ": " << loc.ShortDebugString();
      } else {
        VLOG(3) << "Caching tablet " << tablet_id << ": " << loc.ShortDebugString();
        Partition partition;
        Partition::FromPB(loc.partition(), &partition);
        remote = new RemoteTablet(tablet_id, partition);
        CHECK(tablets_by_id_.emplace(tablet_id, remote).second);
        // <<<<<<< OK. Successfully added by ID.
        CHECK(tablets_by_key.emplace(partition.partition_key_start(), remote).second);
        // <<<<<<<<< FALSE - the pair was not inserted: the map already contains this
        //           key (= partition.partition_key_start()).
      }
The TS gets its info (partitions for all tablets) from the Master - partition info per table_id and per tablet_id. The failed tablet is new on this TS (due to the RemoteBootstrap seen above in the log): it was not found in the by-id map, and was then successfully added into that map by id. Next the tablet must be added, for this table, into the second map - the map keyed by 'partition_key_start' - and that is the line that failed. So there is already another tablet in this table with the same 'partition_key_start'. That is an unexpected case: two tablets (for the same table) with the same 'partition_key_start'.

I cannot say right now how this can happen. Maybe the first tablet was deleted and a new one was added instead, but the meta cache on the TS was not updated in time to remove the first tablet. In any case it needs more investigation, and a reproducer would help very much. My guess is that these tablet manipulations are related to/originated by the RemoteBootstrap service, possibly in the scenario where the RemoteBootstrap failed (as we see in the log).
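To make the failing check concrete, here is a minimal standalone C++ sketch (my illustration, not YugabyteDB code; RemoteTabletStub and the local map are simplified stand-ins for RemoteTablet and tablets_by_partition) of why the second emplace() returns false when a stale entry with the same 'partition_key_start' is still cached for the table:

#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>

struct RemoteTabletStub {
  std::string tablet_id;
  std::string partition_key_start;
};

int main() {
  std::unordered_map<std::string, std::shared_ptr<RemoteTabletStub>> tablets_by_key;

  // Old tablet that was removed on the Master but never evicted from this cache.
  auto stale = std::make_shared<RemoteTabletStub>(RemoteTabletStub{"old-tablet", ""});
  bool inserted_stale = tablets_by_key.emplace(stale->partition_key_start, stale).second;

  // New tablet reported by the Master for the same partition range.
  auto fresh = std::make_shared<RemoteTabletStub>(RemoteTabletStub{"new-tablet", ""});
  bool inserted_fresh = tablets_by_key.emplace(fresh->partition_key_start, fresh).second;

  // Prints "stale inserted: true, fresh inserted: false" - the 'false' is exactly
  // the condition that CHECK(tablets_by_key.emplace(...).second) trips on.
  std::cout << std::boolalpha << "stale inserted: " << inserted_stale
            << ", fresh inserted: " << inserted_fresh << std::endl;
  return 0;
}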
A few other log warnings (which may or may not be related to the issue): 1) Bootstrapping tablet in the wrong (not RUNNING) state:
W1228 18:27:11.449787 25344 consensus_peers.cc:424] T 30ef75e22b864fdd8bcad80fbf63e768 P b2ac21dfaa474840b5b3e57bc64cea16 -> Peer c9d47b5f21bd46e186d3cb236b014da3 ([host: "10.20.67.46" port: 9101], []): Couldn't send request. Status: Illegal state (yb/tserver/service_util.h:125): Tablet not RUNNING: BOOTSTRAPPING. Retrying in the next heartbeat period. Already tried 1 times. State: 2
W1228 18:27:13.464529 25347 remote_bootstrap_service.cc:117] Error handling RemoteBootstrapService RPC request from 10.20.67.50:42331: Not found (yb/tserver/remote_bootstrap_session.cc:488): Unable to get piece of log segment: Already sent active log segment, don't send 2
2) Broken WAL log - due to a previous TS crash, I guess - we can find other crash logs on the disk:
W1228 18:27:15.523027 7404 log_util.cc:211] Could not read footer for segment: /usr/scratch/yugabyte/2.0.8.0/data/yb-data/tserver/wals/table-0000400100003000800000000000401d/tablet-5688f5b6aa13400b915482ef28512cd1/wal-000000001: Not found (yb/consensus/log_util.cc:446): Footer not found. Footer magic doesn't match
W1228 18:27:15.523083 7404 log_reader.cc:182] T 5688f5b6aa13400b915482ef28512cd1 P b2ac21dfaa474840b5b3e57bc64cea16: Log segment /usr/scratch/yugabyte/2.0.8.0/data/yb-data/tserver/wals/table-0000400100003000800000000000401d/tablet-5688f5b6aa13400b915482ef28512cd1/wal-000000001 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.
3) The driver version is not an expected one:
W1228 18:28:33.910887 7250 cql_processor.cc:263] Unsupported driver option DRIVER_VERSION = 3.19.0
Note: the default Cassandra (not YB) driver was used.
Related discussion in Slack: https://yugabyte-db.slack.com/archives/GRY7P7LTG/p1581602620108600
Additional logs:
I0212 22:37:54.056057 32508 memtable_list.cc:374] T 5802b7fa1aa94df79be06d6cd88e9ab9 P 0f8027232bba4785903285de35749b04 [R]: [default] Level-0 commit table #27 started
I0212 22:37:54.056080 32508 memtable_list.cc:390] T 5802b7fa1aa94df79be06d6cd88e9ab9 P 0f8027232bba4785903285de35749b04 [R]: [default] Level-0 commit table #27: memtable #1 done
I0212 22:37:54.056084 32508 event_logger.cc:77] T 5802b7fa1aa94df79be06d6cd88e9ab9 P 0f8027232bba4785903285de35749b04 [R]: EVENT_LOG_v1 {"time_micros": 1581547074055590, "job": 18, "event": "flush_finished", "lsm_state": [3]}
I0212 22:37:54.196646 31959 log.cc:861] T 5802b7fa1aa94df79be06d6cd88e9ab9 P 0f8027232bba4785903285de35749b04: Running Log GC on /usr/scratch/yugabyte/persistent_data/tserver/yb-data/tserver/wals/table-a0a03ddea7ea4120a589bacd62221af8/tablet-5802b7fa1aa94df79be06d6cd88e9ab9: retaining ops >= 157786, log segment size = 67108864
I0212 22:37:54.196723 31959 log_reader.cc:521] T 5802b7fa1aa94df79be06d6cd88e9ab9 P 0f8027232bba4785903285de35749b04: Removed log segment sequence numbers from log reader: [28]
I0212 22:37:54.196746 31959 log.cc:886] T 5802b7fa1aa94df79be06d6cd88e9ab9 P 0f8027232bba4785903285de35749b04: Deleting log segment in path: /usr/scratch/yugabyte/persistent_data/tserver/yb-data/tserver/wals/table-a0a03ddea7ea4120a589bacd62221af8/tablet-5802b7fa1aa94df79be06d6cd88e9ab9/wal-000000028 (GCed ops < 157786)
I0212 22:37:54.208775 31959 maintenance_manager.cc:373] Time spent running LogGCOp(5802b7fa1aa94df79be06d6cd88e9ab9): real 0.012s user 0.000s sys 0.011s
I0212 22:49:59.318825 13275 log.cc:561] T 18de408a722745f891ee72a816531bd6 P 0f8027232bba4785903285de35749b04: Max segment size 67108864 reached. Starting new segment allocation.
I0212 22:49:59.324448 13275 log.cc:469] T 18de408a722745f891ee72a816531bd6 P 0f8027232bba4785903285de35749b04: Rolled over to a new segment: /usr/scratch/yugabyte/persistent_data/tserver/yb-data/tserver/wals/table-a0a03ddea7ea4120a589bacd62221af8/tablet-18de408a722745f891ee72a816531bd6/wal-000000083
W0212 22:50:10.751219 11541 cql_processor.cc:263] Unsupported driver option DRIVER_VERSION = 3.19.0
F0212 22:50:12.256759 11926 meta_cache.cc:722] Check failed: tablets_by_key.emplace(partition.partition_key_start(), remote).second
I0212 22:35:53.869737 20780 call_home.cc:392] Done with collector RpcsCollector
I0212 22:35:53.874092 20780 call_home.cc:392] Done with collector GFlagsCollector
I0212 22:36:24.353956 20780 call_home.cc:436] Unable to send diagnostics data to http://diagnostics.yugabyte.com
W0212 22:50:10.778905 22453 cql_processor.cc:263] Unsupported driver option DRIVER_VERSION = 3.19.0
F0212 22:50:13.353415 20892 meta_cache.cc:722] Check failed: tablets_by_key.emplace(partition.partition_key_start(), remote).second
I0212 22:51:21.102425 16657 server_base.cc:484] Using private ip address 10.20.67.45
I0212 22:51:21.102555 16657 ts_tablet_manager.cc:1070] Loading metadata for tablet 74501e39774d4024acc2dc40a210fca1
I0212 22:51:21.105792 16657 ts_tablet_manager.cc:1611] T 74501e39774d4024acc2dc40a210fca1 P 0f8027232bba4785903285de35749b04: Tablet Manager startup: Rolling forward tablet deletion of type TABLET_DATA_DELETED
I0212 22:51:21.105820 16657 ts_tablet_manager.cc:1854] T 74501e39774d4024acc2dc40a210fca1 P 0f8027232bba4785903285de35749b04: Deleting tablet data with delete state TABLET_DATA_DELETED
I0212 22:51:21.105932 16657 docdb_rocksdb_util.cc:414] Auto setting FLAGS_rocksdb_max_background_flushes to 3
I0212 22:51:21.105960 16657 docdb_rocksdb_util.cc:434] Auto setting FLAGS_rocksdb_max_background_compactions to 3
I0212 22:51:21.105965 16657 docdb_rocksdb_util.cc:443] Auto setting FLAGS_rocksdb_base_background_compactions to 3
I0212 22:51:21.105971 16657 docdb_rocksdb_util.cc:454] Auto setting FLAGS_priority_thread_pool_size to 4
I0212 22:51:21.106109 16657 tablet_metadata.cc:391] Destroying regular db at: /usr/scratch/yugabyte/persistent_data/tserver/yb-data/tserver/data/rocksdb/table-00004001000030008000000000004007/tablet-74501e39774d4024acc2dc40a210fca1
E0212 22:51:21.107319 16657 tablet_metadata.cc:395] Failed to destroy regular DB at: /usr/scratch/yugabyte/persistent_data/tserver/yb-data/tserver/data/rocksdb/table-00004001000030008000000000004007/tablet-74501e39774d4024acc2dc40a210fca1: IO error (yb/rocksdb/util/env_posix.cc:326): /usr/scratch/yugabyte/persistent_data/tserver/yb-data/tserver/data/rocksdb/table-00004001000030008000000000004007/tablet-74501e39774d4024acc2dc40a210fca1/LOCK: No such file or directory
I0212 22:51:21.122887 16657 ts_tablet_manager.cc:1864] T 74501e39774d4024acc2dc40a210fca1 P 0f8027232bba4785903285de35749b04: Tablet deleted. Last logged OpId: { term: 1 index: 4306 }
I0212 22:51:21.123950 16657 ts_tablet_manager.cc:1620] T 74501e39774d4024acc2dc40a210fca1 P 0f8027232bba4785903285de35749b04: Deleting tablet superblock
The failure can happen when a new table is created.
WORKAROUND: Set the Master flag tablet_creation_timeout_ms=120000
(This is the timeout value for tablet creation. By default it is 30000 ms (30 sec). Try increasing the value to a few minutes.)
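For example (assuming the usual gflags-style configuration), the flag can be passed on the yb-master command line as --tablet_creation_timeout_ms=120000 or added to the master's flag file; it would take effect after the masters are restarted with the new value.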
ROOT CAUSE: It's possible to get a race on the tablet state during CreateTable, because the state change CREATING->RUNNING happens in one thread, while the state checking (if the tablet is still in CREATING after the timeout, it must be replaced) happens in the CatalogManager background-tasks thread. So, with an unfortunate ordering of events, the background task can still see the tablet as not running:
Status CatalogManager::BuildLocationsForTablet(const scoped_refptr<TabletInfo>& tablet,
    ...
  if (PREDICT_FALSE(!l_tablet->data().is_running())) {
    return STATUS(ServiceUnavailable, "Tablet not running");
  }
So, the race on steps [3] & [4] can cause such a crash in MetaCache. But the probability of this scenario is quite small, because the tablet creation time on the TS must be very close to the timeout (30 sec by default).
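Below is a minimal, hypothetical sketch of that ordering in plain C++ threads (TabletState, kTimeout and the other names are illustrative, not YugabyteDB internals): the background task observes the tablet still in CREATING right when the timeout expires and schedules a replacement, while the original tablet transitions to RUNNING a moment later - one way two tablets for the same partition range could end up being reported.

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

enum class TabletState { kCreating, kRunning };

int main() {
  std::atomic<TabletState> state{TabletState::kCreating};
  const auto kTimeout = std::chrono::milliseconds(100);  // stands in for tablet_creation_timeout_ms

  // Thread 1: the tablet finishes creation just *after* the timeout.
  std::thread creator([&] {
    std::this_thread::sleep_for(kTimeout + std::chrono::milliseconds(20));
    state = TabletState::kRunning;  // CREATING -> RUNNING
  });

  // Thread 2: background task checks the state once the timeout has expired.
  std::thread background([&] {
    std::this_thread::sleep_for(kTimeout);
    if (state.load() == TabletState::kCreating) {
      // The tablet looks stuck, so a replacement tablet would be scheduled for the
      // same partition range; the original one still becomes RUNNING a moment later.
      std::cout << "timeout expired in CREATING: scheduling replacement tablet\n";
    }
  });

  creator.join();
  background.join();
  return 0;
}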
Jira Link: DB-1920 - TS crashed in MetaCache::ProcessTabletLocations() (meta_cache.cc:722) with the fatal log shown above.
The issue can affect both CQL & SQL, because the failing MetaCache is used in the yb::client code to get tablet location info in any tablet read/write operation (in AsyncRpc via TabletInvoker via YBClient::LookupTabletById()).