[YSQL] Large time gaps between kConflict errors causing statement timeout in RC with wait queues

pkj415 commented 1 year ago

Jira Link: DB-7755

Description

As seen in this run - https://jenkins.dev.yugabyte.com/view/Test%20Jobs/job/itest-system-developer/8273/, an UPDATE statement repeatedly faced kConflict errors until the statement times out.

2023-08-29 07:56:44.915 UTC [48606] LOG:  Restarting statement due to kReadRestart/kConflict error:
    Query: UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'
    Error: could not serialize access due to concurrent update
    Attempt No: 0
2023-08-29 07:56:44.915 UTC [48606] STATEMENT:  UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'

2023-08-29 07:57:57.658 UTC [48606] LOG:  Restarting statement due to kReadRestart/kConflict error:
    Query: UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'
    Error: could not serialize access due to concurrent update
    Attempt No: 1
2023-08-29 07:57:57.658 UTC [48606] STATEMENT:  UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'

2023-08-29 07:58:04.953 UTC [48606] LOG:  Restarting statement due to kReadRestart/kConflict error:
    Query: UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'
    Error: could not serialize access due to concurrent update
    Attempt No: 2
2023-08-29 07:58:04.953 UTC [48606] STATEMENT:  UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'

2023-08-29 07:58:14.013 UTC [48606] LOG:  Restarting statement due to kReadRestart/kConflict error:
    Query: UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'
    Error: could not serialize access due to concurrent update
    Attempt No: 3
2023-08-29 07:58:14.013 UTC [48606] STATEMENT:  UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'

2023-08-29 07:58:21.429 UTC [48606] LOG:  Restarting statement due to kReadRestart/kConflict error:
    Query: UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'
    Error: could not serialize access due to concurrent update
    Attempt No: 4
2023-08-29 07:58:21.429 UTC [48606] STATEMENT:  UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'

2023-08-29 07:58:34.040 UTC [48606] LOG:  Restarting statement due to kReadRestart/kConflict error:
    Query: UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'
    Error: could not serialize access due to concurrent update
    Attempt No: 5
2023-08-29 07:58:34.040 UTC [48606] STATEMENT:  UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'

2023-08-29 07:58:41.209 UTC [48606] LOG:  Restarting statement due to kReadRestart/kConflict error:
    Query: UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'
    Error: could not serialize access due to concurrent update
    Attempt No: 6
2023-08-29 07:58:41.209 UTC [48606] STATEMENT:  UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'

2023-08-29 08:00:39.244 UTC [48606] LOG:  Restarting statement due to kReadRestart/kConflict error:
    Query: UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'
    Error: could not serialize access due to concurrent update
    Attempt No: 7
2023-08-29 08:00:39.244 UTC [48606] STATEMENT:  UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'

2023-08-29 08:00:45.952 UTC [48606] LOG:  Restarting statement due to kReadRestart/kConflict error:
    Query: UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'
    Error: could not serialize access due to concurrent update
    Attempt No: 8
2023-08-29 08:00:45.952 UTC [48606] STATEMENT:  UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'

2023-08-29 08:00:58.331 UTC [48606] LOG:  Restarting statement due to kReadRestart/kConflict error:
    Query: UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'
    Error: could not serialize access due to concurrent update
    Attempt No: 9
2023-08-29 08:00:58.331 UTC [48606] STATEMENT:  UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'

2023-08-29 08:01:06.155 UTC [48606] LOG:  Restarting statement due to kReadRestart/kConflict error:
    Query: UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'
    Error: could not serialize access due to concurrent update
    Attempt No: 10
2023-08-29 08:01:06.155 UTC [48606] STATEMENT:  UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'

2023-08-29 08:01:15.591 UTC [48606] LOG:  Restarting statement due to kReadRestart/kConflict error:
    Query: UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'
    Error: could not serialize access due to concurrent update
    Attempt No: 11
2023-08-29 08:01:15.591 UTC [48606] STATEMENT:  UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'

2023-08-29 08:01:26.312 UTC [48606] LOG:  Restart isn't possible, code 0 isn't a read restart/conflict/deadlock error
2023-08-29 08:01:26.312 UTC [48606] STATEMENT:  UPDATE test_read_committed_table SET name='1001_UPDATED_' WHERE ID LIKE '2%' or ID LIKE '3%'
2023-08-29 08:01:26.318 UTC [48606] ERROR:  canceling statement due to statement timeout

Warning: Please confirm that this issue does not contain any sensitive information

[X] I confirm this issue does not contain any sensitive information.

shishir2001-yb commented 1 year ago

Steps to repro:

1. Create a cluster with enable_wait_queues, enable_deadlock_detection and yb_enable_read_committed_isolation enabled
2. Insert 10k rows
3. Start 4 parallel Threads with different connection:
     i.   BEGIN TRANSACTION;
     ii. SELECT * FROM table WHERE ID LIKE '{thread_id+1}%' or ID LIKE '{thread_id+2}%' {X}; X Can be any one of the following-> "FOR UPDATE", "FOR NO KEY UPDATE", "", "FOR KEYSHARE", "FOR SHARE"
     iv.  UPDATE table SET name="UPDATED_" WHERE ID LIKE '{4-thread_id+1}%' or ID LIKE '{4-thread_id+2}% '
     v.   COMMIT;
4. Sleep for 30 mins and let the threads run.

pkj415 commented 1 year ago

I tried some experiments locally, and this behaviour (i.e., UPDATE retries for a statement being 6 secs to ~1/1.5 min apart and the retry count on query layer going in double digits [aka starvation]) seems expected. I tried a rudimentary ysql_bench that mimics something like this with ~10k-20k rows (row count as per slack conversation with @shishir2001-yb). It seems like SELECT FOR can take ~30-40 seconds since each row is locked separately and the UPDATE can take ~6 seconds. So, if a transaction unluckily has to wait for 2-3 other transactions, it might have to wait for more than a minute.

Moreover, while the SELECT/UPDATE is being retries by the query layer, other transactions might squeeze in and take locks, hence repeatedly starving a transaction.

yugabyte / yugabyte-db

[YSQL] Large time gaps between kConflict errors causing statement timeout in RC with wait queues #18896

Description

Warning: Please confirm that this issue does not contain any sensitive information