yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
9.01k stars 1.07k forks source link

[YCQL] Change the execution model of ExecContext::PrepareChildTransaction #11258

Open bmatican opened 2 years ago

bmatican commented 2 years ago

Jira Link: DB-975

Description

This code path is prone to getting all of the worker threads stuck. It currently creates a future, which is supposed to execute work, but then inline calls get on it, in an effort to turn an async call into a sync one.

However, the work that the future is supposed to do needs a thread, which currently needs to come from the same threadpool that executes the main function. This creates the potential for ALL of the threads to be busy executing ExecContext::PrepareChildTransaction, thus not having any threads for doing the future work.

A first version to at least prevent deadlocks would be to change from future.get to some future.wait_until + error reporting to the user, eg: #4375 (already done). A big problem here though is that this could still get all the thread stuck for up to whatever timeout is specified (configurable via the cql_prepare_child_threshold_ms TServer gflag).

There are 2 alternatives for a longer, more proper fix:

  1. turning the execution model truly async, similar to what @spolitov did for the conflict resolution path (ie: 26f58dd8d2aea4990c20c15eecd418e8102cb564)
  2. build on top of the first version by retrying the future.wait_until until the statement timeout with perhaps some delay between the retries. This will help preempt the threads to give a chance for the work that the future is supposed to do.

cc @m-iancu

--------------------- Detailed explanation of issue -----------------------

In YCQL, assume there is a table test (h1 int, r1 int, v int, primary key (h1, r1)) and 2 indexes - idx1 (r1, h1) and idx2 (v1, h1). Note that all columns indexed by idx1 are pk columns in the main table test. But idx2 has a column v that isn't part of the pk set. Let us call indexes like idx1 that index only pk columns as pk-only-indexes. And let us call all other indexes non-pk-only-indexes.

When a row insertion arrives at the YCQL query layer, writes for the main table and pk-only-indexes can be directed to the correct tablet/ shard on possibly another node since all primary key columns are known from the query. However, for non-pk-only-indexes, there are a few intricacies -

Assume an INSERT (1, 2, 3). If row with pk (1, 2) doesn't actually exist, we only have to insert the row (3, 1) into the secondary index. If row with pk (1, 2) already exists, say with v=5, then the INSERT follows upsert semantics. In this case we need to delete the row with index primary key (5, 1) and insert the row with index primary key (3, 1).

Since the query layer doesn't know the old value of v for the row, it will first have to read the value of v at row (1, 2). Instead of reading the row, the YCQL query layer instead issues the write with (1, 2, 3) to the correct main table shard and that shard's tserver, having the old value of v, in turn issues the correct set of writes to the corresponding index shards (an index row insertion possibly preceeded by an index row deletion if an older value of v existed).

Allowing the main table's shard's tserver node to issue the writes for non-pk-only-indexes saves us an extra round trip that would have been incurred in case the YCQL query layer read the main table row first and then issued the index write(s).

To allow these index write(s) from possibly another node other than were the query landed, the distributed transaction's metadata, read time and some other misc information need to be sent to the main table's shard, so that it can use the same transaction for performing write(s) to the index table. Currently, all this information is sent in the ChildTransactionDataPB. It essentially has the parent TransactionMetadataPB, the read time & associated information.

The query layer populates this child transaction metadata via the ExecContext::PrepareChildTransaction code path. This might have to wait in case the distributed transaction hasn't yet been initialized (initialization of a transaction is something that happens in the background wherein the query layer receives a usable transaction id via a heartbeat from a transaction status tablet). This wait is performed by issuing the metadata population via PrepareChildFuture() which returns an std::future and then waiting on the future for 2 seconds. In case the distributed transaction is already initialized, the future's result is available immediately. Otherwise, the future's result is available only after the transaction has been initialized fully (i.e., if ready_ is true in client/transaction.cc). If the transaction initialization doesn't complete within the configurable cql_prepare_child_threshold_ms (default = 2 seconds), then an error is thrown to the external client for that INSERT.

Given this, if there is a significant load on the system, it is possible that some transaction initializations don't complete within the cql_prepare_child_threshold_ms limit and this can lead to failure of those INSERTs.

randersenYB commented 2 years ago

Could this be related to https://github.com/yugabyte/yugabyte-db/issues/10999?

bmatican commented 2 years ago

Lowering priority as we got the short term fix done.

goranc commented 2 years ago

I have an issue with this and tested with 2.12 and 2.13 but still get the same result, when inserting records all client workers are stuck and waiting for server response. It is repeatable with my table data, which have large records. What can be done with this to try to resolve the issue?

lnguyen-yugabyte commented 2 years ago

@goranc : is this possible to have a specific example to reproduce on our side?