Open robertsami opened 1 year ago
For example, consider the following transaction:
begin;
select * from foo where k = 1 for update;
commit;
If we run this in 8 parallel threads against a single-node local cluster on a 6-core machine, roughly 1 in 1000 requests sees about 100x the average latency.
A tentative root cause has been identified, with a POC fix seemingly solving the issue. We have the following two code paths racing with each other:
Path 1 --
Path 2 --
Step 3 of path 2 and step 4 of path 1 need to be synchronized. Otherwise, T2 may miss the signal from the participant and remain stuck in the wait queue until Poll() is called.
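To make the missed-signal hazard concrete, here is a minimal Python sketch of a generic lost wakeup. The individual steps of the two paths are not listed above, so this is not a reconstruction of them. The class and method names (WaitQueue, check_and_register, signal_resolved) are hypothetical and do not correspond to the actual wait queue code. It only shows why "check whether the blocker is still active, then register as a waiter" must be atomic with respect to "mark resolved, then notify registered waiters".

```python
import threading


class WaitQueue:
    """Toy model of a wait queue with a single blocking transaction (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._blocker_resolved = False
        self._waiters = []  # events of waiters parked behind the blocker

    # Unsynchronized building blocks, used only to show the race.
    def blocker_resolved_unsafe(self):
        return self._blocker_resolved

    def register_unsafe(self, waiter_event):
        self._waiters.append(waiter_event)

    def signal_resolved(self):
        """Mark the blocker resolved and wake every waiter registered so far."""
        with self._lock:
            self._blocker_resolved = True
            waiters, self._waiters = self._waiters, []
        for event in waiters:
            event.set()

    def check_and_register(self, waiter_event):
        """Atomically check the blocker and park the waiter.

        Returns True if the waiter was parked, False if the blocker had
        already resolved and the waiter can proceed immediately.
        """
        with self._lock:
            if self._blocker_resolved:
                return False
            self._waiters.append(waiter_event)
            return True


def racy_interleaving():
    """Replay the bad ordering deterministically; returns whether T2 was woken."""
    queue, t2 = WaitQueue(), threading.Event()
    still_blocked = not queue.blocker_resolved_unsafe()  # waiter: check
    queue.signal_resolved()            # signal arrives; nobody is registered yet
    if still_blocked:
        queue.register_unsafe(t2)      # waiter: registers too late
    return t2.is_set()                 # False: only a later poll would rescue T2


def synchronized_interleaving():
    """Same ordering with the atomic check; returns whether T2 got parked."""
    queue, t2 = WaitQueue(), threading.Event()
    queue.signal_resolved()
    return queue.check_and_register(t2)  # False: T2 sees the resolution and proceeds
```

In the racy ordering T2 ends up parked with no pending wakeup, which matches the symptom of waiting for the next poll. With the check and the registration under one lock, the signal either finds T2 already registered or T2 observes the resolved state and never parks.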
Depends on #21404
Jira Link: DB-5848
Description
We currently depend on polling to resolve waiting transactions in the wait queue in order to achieve fairness under highly contentious workloads, e.g. a workload where tens of sessions concurrently lock the same row.
Without aggressive polling (e.g. setting wait_queue_poll_interval_ms=5), such highly contentious workloads will suffer from high p99 latencies.
Once we have https://github.com/yugabyte/yugabyte-db/issues/13578, we should ensure that highly contentious workloads can function with predictable p99 performance even with wait_queue_poll_interval_ms=100 or larger. Otherwise, we are trading off significant CPU overhead for fairness.
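The tradeoff can be illustrated with a toy model in Python. It assumes, purely for illustration, that the wait queue polls at fixed multiples of the interval and that a waiter which is not woken by a signal resumes at the next poll tick. The function names are hypothetical and this is not a measurement of the real system.

```python
def extra_wait_ms(stalled_at_ms, poll_interval_ms):
    """Delay until the next poll tick for a waiter stalled at stalled_at_ms.

    Assumes poll ticks fall at exact multiples of poll_interval_ms.
    """
    return (-stalled_at_ms) % poll_interval_ms


def polls_per_second(poll_interval_ms):
    """Number of poll passes per second, a rough proxy for polling CPU cost."""
    return 1000 / poll_interval_ms


# A waiter stalled 3 ms after a tick:
print(extra_wait_ms(3, 5), polls_per_second(5))      # 2 200.0
print(extra_wait_ms(3, 100), polls_per_second(100))  # 97 10.0
```

Under these assumptions a 5 ms interval bounds the stall at under 5 ms but polls 20 times as often as a 100 ms interval, where the same stall can cost almost 100 ms.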