Open mdbridge opened 7 months ago
please add a test that catches this problem via TSAN or the like and verify that it fails before the fix and passes afterwards
@mdbridge As I understand it, these calls are only supposed to run on the reactor thread.
// Dump information about the status of this connection context to protobuf.
virtual void DumpPB(const DumpRunningRpcsRequestPB& req, RpcConnectionPB* resp)
    ON_REACTOR_THREAD = 0;
The reactor thread should be the only thread that accesses these.
// Takes ownership of call_data content.
Status HandleCall(const ConnectionPtr& connection, CallData* call_data)
    ON_REACTOR_THREAD override;
This calls Store(..).
If this is indeed the cause, though, the likelihood of running into it would increase drastically once we enable ASH, since ASH relies on the DumpPB mechanism to pull the desired wait states.
This would be the equivalent of curling /rpcz every second or so.
As I posted elsewhere, there is more than one reactor thread in the system.
Jira Link: DB-9780
Description
[DocDB] Fix race condition with ConnectionContextWithCallId::calls_being_handled
This variable holds information about RPCs received on the server that are currently being handled:
Unfortunately, it is not protected by any lock and can be simultaneously read and written, which is a data race. Worse, this data race can cause the TServer to crash with a segmentation fault because the iterator can point to garbage when the data structure is changed out from under the iterator.
We saw a crash where the reading code died because entry.second was -1, due to the data structure moving to a different part of the heap in order to grow.
In terms of urgency of fixing, we suspect the read path only runs when the /rpcz endpoint is hit. We have not seen a crash due to two simultaneous writes to this data structure. We are not sure whether writes are serialized somewhere upstream, or whether concurrent writes merely produce incorrect results instead of crashing.
The crash was with 2.16.6, so we probably want to backport at least as far back as 2.18.
This bug was introduced by https://phorge.dev.yugabyte.com/D2353.
UPDATE:
Amit informs me that there is concurrency control here: although it is not documented, this field is supposed to be accessed only from the reactor thread corresponding to the connection containing the RPC calls in question. If true and consistently applied, that should have prevented the crash we saw.
Given that we saw a crash, presumably there is a bug somewhere where we violate this concurrency control. Possibly fixed after 2.16.6.
In terms of trying to narrow down where the problem is, in the crash there were both YSQL and CQL calls ongoing. The crash occurred on a TServer.
At a minimum, for this task add comments where needed describing the concurrency preconditions. If possible, add DCHECKs or the like to make sure it is being followed.
Issue Type
kind/bug