[YSQL] Deadlock involving WaitForBackendsCatalogVersion #18711

Closed myang2021 closed 12 months ago

myang2021 commented 1 year ago

Jira Link: DB-7616

Description

When I ran the test PgIndexBackfillTest.PgStatProgressCreateIndexPhase via the command

./yb_build.sh debug --sjb --cxx-test pgwrapper_pg_index_backfill-test --gtest_filter PgIndexBackfillTest.PgStatProgressCreateIndexPhase --clang16

The test passed on the master branch (commit dfc08e86469eca5a4cb9641635770f977641e9db), which does not have per-database catalog version mode. However, when I ran the test with per-database catalog version mode turned on by default (by changing the default value of --FLAGS_TEST_enable_db_catalog_version_mode to true), the test failed.

After debugging, I found a potential deadlock scenario. The following are the detailed steps that lead to the deadlock:

(1)

The tablet server (ts-1) runs WaitForBackendsCatalogVersion (called by the PG DefineIndex function as part of the CREATE INDEX workflow). There is a macro defined for this method:

  #define PG_CLIENT_SESSION_METHOD_FORWARD(r, data, method) \
  Status method( \
      const BOOST_PP_CAT(BOOST_PP_CAT(Pg, method), RequestPB)& req, \
      BOOST_PP_CAT(BOOST_PP_CAT(Pg, method), ResponsePB)* resp, \
      rpc::RpcContext* context) { \
    return VERIFY_RESULT(GetSession(req))->method(req, resp, context); \
  }
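Expanded by hand for WaitForBackendsCatalogVersion (BOOST_PP_CAT simply concatenates tokens), the generated forwarding method looks like this:

  // Manual expansion of PG_CLIENT_SESSION_METHOD_FORWARD for
  // method = WaitForBackendsCatalogVersion.
  Status WaitForBackendsCatalogVersion(
      const PgWaitForBackendsCatalogVersionRequestPB& req,
      PgWaitForBackendsCatalogVersionResponsePB* resp,
      rpc::RpcContext* context) {
    // GetSession(req) returns a PgClientSessionLocker whose lock is held for
    // the entire duration of the forwarded call below.
    return VERIFY_RESULT(GetSession(req))->WaitForBackendsCatalogVersion(req, resp, context);
  }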

Note that this macro has GetSession(req), which is defined as

  Result<PgClientSessionLocker> GetSession(uint64_t session_id) {
    return PgClientSessionLocker(VERIFY_RESULT(DoGetSession(session_id)));
  }

In particular, it returns a PgClientSessionLocker object.

  explicit PgClientSessionLocker(const LockablePtr& lockable)
      : lockable_(lockable), lock_(lockable->mutex_) {
  }

The constructor of PgClientSessionLocker automatically locks the lockable, which in this case is the session object.

Second, the session object remains locked until method(req, resp, context) completes. method in this case is:

Status PgClientSession::WaitForBackendsCatalogVersion(
    const PgWaitForBackendsCatalogVersionRequestPB& req,
    PgWaitForBackendsCatalogVersionResponsePB* resp,
    rpc::RpcContext* context) {
  // TODO(jason): send deadline to client.
  const int num_lagging_backends = VERIFY_RESULT(client().WaitForYsqlBackendsCatalogVersion(
      req.database_oid(), req.catalog_version(), context->GetClientDeadline()));
  resp->set_num_lagging_backends(num_lagging_backends);
  return Status::OK();
}

So we know that the PgClientSession is now locked, and the lock will not be released until its method WaitForBackendsCatalogVersion completes.
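To make the RAII behavior explicit, here is a simplified model of the locker (a sketch with stand-in types, not the actual classes): the mutex is acquired in the constructor and released only when the locker is destroyed, i.e. after the forwarded method returns.

#include <memory>
#include <mutex>

// Stand-ins for LockablePgClientSession / PgClientSessionLocker.
struct LockableSession {
  std::mutex mutex_;
  // ... session state ...
};

class SessionLocker {
 public:
  explicit SessionLocker(const std::shared_ptr<LockableSession>& lockable)
      : lockable_(lockable), lock_(lockable->mutex_) {}  // mutex_ locked here

  LockableSession* operator->() const { return lockable_.get(); }

 private:
  std::shared_ptr<LockableSession> lockable_;
  std::unique_lock<std::mutex> lock_;  // mutex_ released in the destructor
};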

(2) Now let's consider how client().WaitForYsqlBackendsCatalogVersion is implemented, and we will see where the deadlock comes from.

Result<int> YBClient::WaitForYsqlBackendsCatalogVersion(
    PgOid database_oid, uint64_t version, const CoarseTimePoint& deadline) {
  WaitForYsqlBackendsCatalogVersionRequestPB req;
  WaitForYsqlBackendsCatalogVersionResponsePB resp;

  req.set_database_oid(database_oid);
  req.set_catalog_version(version);

  DCHECK(deadline != CoarseTimePoint()) << ToString(deadline);

  CALL_SYNC_LEADER_MASTER_RPC_WITH_DEADLINE(Admin, req, resp, deadline,
                                            WaitForYsqlBackendsCatalogVersion);

So it sends a WaitForYsqlBackendsCatalogVersion RPC to the master. On the master side:

  MASTER_SERVICE_IMPL_ON_LEADER_WITHOUT_LOCK(
      YsqlBackendsManager,
      (WaitForYsqlBackendsCatalogVersion)
  )

It basically calls YsqlBackendsManager::WaitForYsqlBackendsCatalogVersion, where a BackendsCatalogVersionJob is created:

  std::shared_ptr<BackendsCatalogVersionJob> job;

(3) The BackendsCatalogVersionJob is started via BackendsCatalogVersionJob::Launch, which in turn launches a task for each tablet server:

  for (const auto& ts_uuid : ts_uuids) {
    RETURN_NOT_OK(LaunchTS(ts_uuid, -1 /* num_lagging_backends */, epoch_));
  }

BackendsCatalogVersionJob::LaunchTS starts a BackendsCatalogVersionTS task for every tablet server. This, of course, includes the ts-1 mentioned in step (1).

BackendsCatalogVersionTS follows the RetryingTSRpcTask workflow:

class BackendsCatalogVersionTS : public RetryingTSRpcTask {

It sends an RPC to a tablet server:

bool BackendsCatalogVersionTS::SendRequest(int attempt) {
  tserver::WaitForYsqlBackendsCatalogVersionRequestPB req;
  {
    auto job = job_.lock();
    req.set_database_oid(job->database_oid());
    req.set_catalog_version(job->target_version());
    req.set_prev_num_lagging_backends(prev_num_lagging_backends_);
  }

  ts_admin_proxy_->WaitForYsqlBackendsCatalogVersionAsync(req, &resp_, &rpc_, BindRpcCallback());
  VLOG(0) << "Send " << description() << " to " << permanent_uuid()
          << " (attempt " << attempt << "): " << req.ShortDebugString();
  return true;
}

Note that WaitForYsqlBackendsCatalogVersionAsync is called to send a WaitForYsqlBackendsCatalogVersion RPC to ts-1.

(4) Now back on ts-1: how does ts-1 handle WaitForYsqlBackendsCatalogVersion? The handler is

TabletServiceAdminImpl::WaitForYsqlBackendsCatalogVersion

It builds a query:

  const std::string num_lagging_backends_query = Format(
      "SELECT count(*) FROM pg_stat_activity WHERE catalog_version < $0 AND datid = $1",
      catalog_version, database_oid);

Then a connection is made to the local PostgreSQL postmaster to spawn a new PG backend, which executes this query.
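Conceptually, what ts-1 does at this point amounts to the following standalone libpq sketch (the real code goes through the tserver's internal PG connection machinery; the port, catalog_version, and datid values here are made-up examples):

#include <libpq-fe.h>
#include <cstdio>

int main() {
  // Connect to the local PG backend (5433 is the default YSQL port).
  PGconn* conn = PQconnectdb("host=localhost port=5433 dbname=yugabyte");
  if (PQstatus(conn) != CONNECTION_OK) {
    std::fprintf(stderr, "%s", PQerrorMessage(conn));
    return 1;
  }
  // Count backends still on an older catalog version (example values).
  PGresult* res = PQexec(conn,
      "SELECT count(*) FROM pg_stat_activity "
      "WHERE catalog_version < 3 AND datid = 13515");
  if (PQresultStatus(res) == PGRES_TUPLES_OK) {
    std::printf("num lagging backends: %s\n", PQgetvalue(res, 0, 0));
  }
  PQclear(res);
  PQfinish(conn);
  return 0;
}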

Next, pg_stat_activity is defined as a view that involves the function pg_stat_get_activity:

CREATE VIEW pg_stat_activity AS
    SELECT
......
    FROM pg_stat_get_activity(NULL) AS S
        LEFT JOIN pg_database AS D ON (S.datid = D.oid)
        LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid)
        LEFT JOIN (pg_stat_get_backend_idset() beid CROSS JOIN
                   pg_stat_get_backend_pid(beid) pid) B ON B.pid = S.pid;

It was found that the function pg_stat_get_activity is not always called, depending on the other JOIN clauses defined above. But if pg_stat_get_activity is called, then we have a problem. Inside pg_stat_get_activity it calls:

        HandleYBStatus(YBCPgActiveTransactions(txn_infos, num_backends));

which goes through the pggate layer:

YBCStatus YBCPgActiveTransactions(YBCPgSessionTxnInfo *infos, size_t num_infos) {
  return ToYBCStatus(pgapi->GetActiveTransactions(infos, num_infos));
}

Status PgApiImpl::GetActiveTransactions(YBCPgSessionTxnInfo* infos, size_t num_infos) {
  std::unordered_map<uint64_t, Slice> txns;
  txns.reserve(num_infos);
  return pg_client_.EnumerateActiveTransactions(make_lw_function(
......

and then into the PG client, which issues the RPC:

  Status EnumerateActiveTransactions(
      const ActiveTransactionCallback& callback, bool for_current_session_only) {
......
    RETURN_NOT_OK(proxy_->GetActiveTransactionList(req, &resp, PrepareController()));

We can see that it eventually makes a GetActiveTransactionList RPC back to ts-1 itself.

(5) In GetActiveTransactionList

    for (const auto& session : sessions_snapshot) {
      AddTransactionInfo(resp, PgClientSessionLocker(session));
    }

The call PgClientSessionLocker(session) constructs a PgClientSessionLocker object, whose constructor automatically locks the lockable, in this case the same session object that was already locked in step (1). This is the deadlock!
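Stripped of the RPC machinery, the cycle reduces to a lock held across a synchronous wait whose completion needs that same lock (a standalone sketch, not YB code):

#include <future>
#include <mutex>

std::mutex session_mutex;  // stands in for the PgClientSession mutex

int main() {
  // Step (1): the RPC handler locks the session for the whole call.
  std::unique_lock<std::mutex> session_lock(session_mutex);

  // Steps (2)-(4): still holding the lock, the handler synchronously waits
  // on work that eventually lands back on the same tablet server...
  auto fut = std::async(std::launch::async, [] {
    // Step (5): GetActiveTransactionList tries to lock every session,
    // including the one locked in step (1), and blocks forever.
    std::lock_guard<std::mutex> lock(session_mutex);
  });

  fut.get();  // never returns
}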


myang2021 commented 1 year ago

The function GetActiveTransactionList was newly introduced by the following commit:

d2a700a3a35edc3a0e74a5ee72d36e9cac7319c1
Author: Karthik Ramanathan <kramanathan@yugabyte.com>
Date:   Wed Aug 2 17:30:08 2023 -0700

    [#17087, #18054] YSQL: Exposes YB Transaction IDs in Postgres.

It is mentioned in the commit summary:

    This change introduces a new RPC `GetActiveTransactionsList` for this purpose.

myang2021 commented 1 year ago

    decltype(sessions_) sessions_snapshot;
    {
      std::lock_guard lock(mutex_);
      sessions_snapshot = sessions_;
    }

The intent appears to be to make a snapshot of the current sessions, but this ends up making a copy of references (shared pointers) to the sessions in sessions_. If we could truly make a copy of clones of those sessions, we would not have the deadlock.
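A standalone sketch of why the snapshot is shallow (hypothetical Session type):

#include <cassert>
#include <memory>
#include <vector>

struct Session { int value = 0; };

int main() {
  std::vector<std::shared_ptr<Session>> sessions{std::make_shared<Session>()};

  auto snapshot = sessions;  // copies only the shared_ptrs, not the Sessions
  sessions[0]->value = 42;   // mutate through the original container

  // The "snapshot" sees the mutation: both containers refer to the same live
  // objects, so locking through either one contends on the same mutex.
  assert(snapshot[0]->value == 42);
}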

myang2021 commented 1 year ago

  boost::multi_index_container<
      LockablePgClientSessionPtr,
      boost::multi_index::indexed_by<
          boost::multi_index::hashed_unique<
              boost::multi_index::const_mem_fun<PgClientSession, uint64_t, &PgClientSession::id>
          >
      >
  > sessions_ GUARDED_BY(mutex_);

Note LockablePgClientSessionPtr is a pointer:

using LockablePgClientSessionPtr = std::shared_ptr<LockablePgClientSession>;

myang2021 commented 1 year ago

But how can we safely make a clone of LockablePgClientSession when its mutex_ is already locked? By lock semantics, if an object is locked it is not safely readable, because it may be undergoing a critical change that must appear atomic; that is exactly why the lock exists. So sessions_snapshot does not work as expected, and I don't see how to make it work as expected.

I think we can fix the bug by changing the loop to:

    for (const auto& session : sessions_snapshot) {
      AddTransactionInfo(resp, session);
    }

This assumes that AddTransactionInfo only accesses the immutable (read-only) parts of the session. It does appear so:

void AddTransactionInfo(
    PgGetActiveTransactionListResponsePB* out, const PgClientSessionLocker& locker) {
  auto& session = *locker;
  const auto* txn_id = session.GetTransactionId();
  if (!txn_id) {
    return;
  }

  auto& entry = *out->add_entries();
  entry.set_session_id(session.id());
  txn_id->AsSlice().CopyToBuffer(entry.mutable_txn_id());
}

We can see that session.GetTransactionId() just reads a pointer back; I think that is safe without locking the session's mutex_.
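A hypothetical overload to go with that loop (the caller would pass *session); this is only a sketch of the idea, not the actual patch:

void AddTransactionInfo(
    PgGetActiveTransactionListResponsePB* out, const PgClientSession& session) {
  // Assumption: GetTransactionId() only reads a pointer back and never
  // observes partially updated state, so no mutex_ is acquired here.
  const auto* txn_id = session.GetTransactionId();
  if (!txn_id) {
    return;
  }

  auto& entry = *out->add_entries();
  entry.set_session_id(session.id());
  txn_id->AsSlice().CopyToBuffer(entry.mutable_txn_id());
}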

myang2021 commented 1 year ago

Thinking about it more, it is tricky to guarantee safety without locking mutex_. We could consider adding an active_txns_ map: each time we associate a new txn with a session, we also insert it there, and any time we disassociate a txn from a session, we remove it from there. active_txns_ would have its own mutex, and in this way we avoid the deadlock.
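A minimal sketch of that idea, with the serialized transaction id stored as a std::string (all names are hypothetical):

#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// The registry has its own mutex, independent of every per-session mutex,
// so the GetActiveTransactionList path never needs to lock a session.
class ActiveTxnRegistry {
 public:
  // Called wherever a transaction becomes associated with a session.
  void Associate(uint64_t session_id, std::string txn_id_bytes) {
    std::lock_guard<std::mutex> lock(mutex_);
    active_txns_[session_id] = std::move(txn_id_bytes);
  }

  // Called wherever the transaction is disassociated from the session.
  void Disassociate(uint64_t session_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    active_txns_.erase(session_id);
  }

  // GetActiveTransactionList would read this; no session mutex involved.
  std::unordered_map<uint64_t, std::string> Snapshot() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return active_txns_;
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<uint64_t, std::string> active_txns_;
};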

jasonyb commented 1 year ago

Forget the deadlock: GetActiveTransactionList taking locks makes pg_stat_activity unusably slow. This kind of information could be pre-populated close to the pgstat code instead of pgstat having to go all the way down to the pggate layer to fetch it on the fly. The worst case is when a CREATE INDEX is running and someone selects from pg_stat_activity: that query is going to hang for a while.
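One way to picture "pre-populated close to pgstat" (a sketch under hypothetical names; a real version would live in PG shared memory and carry a full 16-byte txn id, e.g. behind a seqlock):

#include <array>
#include <atomic>
#include <cstdint>

constexpr int kMaxBackends = 1024;

// One slot per backend; 0 means "no active transaction".
std::array<std::atomic<uint64_t>, kMaxBackends> txn_slots;

// Writer side: the backend updates its own slot when a txn is assigned or
// cleared, so the cost is paid once, at assignment time.
void PublishTxn(int backend_id, uint64_t txn_marker) {
  txn_slots[backend_id].store(txn_marker, std::memory_order_release);
}

// Reader side: what pg_stat_get_activity would do instead of issuing an RPC
// down through pggate: no locks and no waiting.
uint64_t ReadTxn(int backend_id) {
  return txn_slots[backend_id].load(std::memory_order_acquire);
}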

jasonyb commented 1 year ago

For instance, with the master flag --TEST_simulate_slow_table_create_secs=20, running CREATE TABLE t (i int); simultaneously with select * from pg_stat_activity makes the latter hang for up to 20s. A more realistic example, not using that flag, is multiple sessions running lock-taking DDLs, where each causes around a 0.5-1s wait due to RPC latency. Multiply that by, say, 10 sessions doing DDLs out of a total of 300 sessions, and pg_stat_activity takes 5-10s, which is too long.

rthallamko3 commented 12 months ago

No other backports pending, right?

karthik-ramanathan-3006 commented 12 months ago

No more backports left, all done! 👍