shishir2001-yb opened this issue 1 year ago
This issue seems similar to the one raised in https://github.com/yugabyte/yugabyte-db/issues/20208.
From the logs, consider the status tablet 1c48dd1cafe34e6fa07668303f1ec052: there is constant leadership churn between peers 9920540afca441058f09483dfe236dda and aef182d8d26f4083b78a71c949bb25ff starting at 1003 12:47:43.587924.
```
I1003 12:47:43.587898 8376 raft_consensus.cc:634] T 1c48dd1cafe34e6fa07668303f1ec052 P 9920540afca441058f09483dfe236dda [term 1 FOLLOWER]: Fail or stepdown of leader aef182d8d26f4083b78a71c949bb25ff detected. Triggering leader pre-election, mode=ELECT_EVEN_IF_LEADER_IS_ALIVE
I1003 12:47:45.144307 951442 raft_consensus.cc:1070] T 1c48dd1cafe34e6fa07668303f1ec052 P 9920540afca441058f09483dfe236dda [term 2 LEADER]: Becoming Leader. State: Replica: 9920540afca441058f09483dfe236dda, State: 1, Role: LEADER, Watermarks: {Received: 1.132234 Committed: 1.132234} Leader: 0.0
...
I1003 12:48:01.276885 13919 raft_consensus.cc:634] T 1c48dd1cafe34e6fa07668303f1ec052 P aef182d8d26f4083b78a71c949bb25ff [term 2 FOLLOWER]: Fail or stepdown of leader 9920540afca441058f09483dfe236dda detected. Triggering leader election, mode=NORMAL_ELECTION
I1003 12:48:01.286674 8236 raft_consensus.cc:1070] T 1c48dd1cafe34e6fa07668303f1ec052 P aef182d8d26f4083b78a71c949bb25ff [term 3 LEADER]: Becoming Leader. State: Replica: aef182d8d26f4083b78a71c949bb25ff, State: 1, Role: LEADER, Watermarks: {Received: 2.132407 Committed: 2.132407} Leader: 0.0
...
I1003 12:48:08.470780 952065 raft_consensus.cc:634] T 1c48dd1cafe34e6fa07668303f1ec052 P 9920540afca441058f09483dfe236dda [term 3 FOLLOWER]: Fail or stepdown of leader aef182d8d26f4083b78a71c949bb25ff detected. Triggering leader election, mode=ELECT_EVEN_IF_LEADER_IS_ALIVE
I1003 12:48:08.608135 952080 raft_consensus.cc:1070] T 1c48dd1cafe34e6fa07668303f1ec052 P 9920540afca441058f09483dfe236dda [term 4 LEADER]: Becoming Leader. State: Replica: 9920540afca441058f09483dfe236dda, State: 1, Role: LEADER, Watermarks: {Received: 3.132423 Committed: 3.132423} Leader: 0.0
```
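To quantify the churn in excerpts like the one above, here is a minimal Python sketch (hypothetical helper, not part of YugabyteDB) that counts `Becoming Leader` events per tablet in glog-format tserver output:

```python
import re
from collections import Counter

# Matches glog lines like:
#   "... T <tablet> P <peer> [term N LEADER]: Becoming Leader ..."
BECOMING_LEADER = re.compile(
    r"T (?P<tablet>\w+) P (?P<peer>\w+) \[term (?P<term>\d+) LEADER\]: Becoming Leader"
)

def count_leader_changes(lines):
    """Return a Counter mapping tablet id -> number of leader elections seen."""
    churn = Counter()
    for line in lines:
        m = BECOMING_LEADER.search(line)
        if m:
            churn[m.group("tablet")] += 1
    return churn

# Sample lines taken from the log excerpt above (truncated after "Becoming Leader").
log = [
    "I1003 12:47:45.144307 951442 raft_consensus.cc:1070] T 1c48dd1cafe34e6fa07668303f1ec052 P 9920540afca441058f09483dfe236dda [term 2 LEADER]: Becoming Leader.",
    "I1003 12:48:01.286674 8236 raft_consensus.cc:1070] T 1c48dd1cafe34e6fa07668303f1ec052 P aef182d8d26f4083b78a71c949bb25ff [term 3 LEADER]: Becoming Leader.",
    "I1003 12:48:08.608135 952080 raft_consensus.cc:1070] T 1c48dd1cafe34e6fa07668303f1ec052 P 9920540afca441058f09483dfe236dda [term 4 LEADER]: Becoming Leader.",
]
print(count_leader_changes(log))
```

Three leader changes for the same tablet in ~25 seconds, as in the excerpt, is enough to keep heartbeats from landing reliably.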
Running transactions heartbeat to the status tablet every 0.5 secs, and a coordinator aborts a transaction if the last heartbeat was > 5 secs ago, with the error message `Commit of expired transaction`. So if there is leadership churn on a status tablet, and the in-progress replicating ops are aborted due to a leadership switch (term advancement), there is a good chance the new leader expires active transactions.
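To make the timing concrete, here is a minimal model of the coordinator-side check (the constants come from the comment above; the function name is hypothetical, not YugabyteDB's actual code):

```python
HEARTBEAT_INTERVAL_S = 0.5   # clients heartbeat to the status tablet every 0.5s
EXPIRY_THRESHOLD_S = 5.0     # coordinator aborts if the last heartbeat is older

def is_expired(now_s, last_heartbeat_s):
    """Coordinator-side check: was the last heartbeat more than 5s ago?"""
    return now_s - last_heartbeat_s > EXPIRY_THRESHOLD_S

# Under a stable leader, heartbeats land every 0.5s, so the gap stays
# far below the 5s threshold:
assert not is_expired(now_s=100.4, last_heartbeat_s=100.0)

# But if leadership churn drops heartbeats for more than 5s (e.g. the
# in-progress replicating ops are aborted on term advancement), the new
# leader sees the transaction as expired and rejects the commit:
assert is_expired(now_s=106.0, last_heartbeat_s=100.0)
print("expiry model ok")
```

This is why repeated elections a few seconds apart, like those in the log above, can push an otherwise healthy transaction past the expiry threshold.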
@shishir2001-yb, what's the impact of the error/failure on the test? Are you blocked on this bug? How often does it repro? cc @basavaraj29, @robertsami
@rthallamko3, my test is not blocked by this issue; I have added it as a known exception. From what I can recall, this doesn't repro very frequently.
It seems this happens very infrequently, and there is a suspicion that tablet splits can cause it. @robertsami, @basavaraj29, can we move this to the backlog?
Additional note: This issue isn't WoC specific.
There are multiple issues with this message: https://github.com/yugabyte/yugabyte-db/issues?q=is%3Aissue+is%3Aopen+Commit+of+expired+transaction
On AlmaLinux 8, recent master (ba3761c73a7db1473141cc86c7a1fd033755102d), I got the same message running:
```
./yb_build.sh fastdebug --gcc11 --java-test TestPgRegressTypesNumeric -n 1000 --tp 1
```
On iteration 27 of 42:
```
*** ${TEST_TMPDIR}/pgregress_output/yb_pg_types_numeric_serial_schedule/expected/yb_pg_numeric_big.out 2024-01-25 12:28:59.132799817 -0800
--- ${TEST_TMPDIR}/pgregress_output/yb_pg_types_numeric_serial_schedule/results/yb_pg_numeric_big.out 2024-01-25 12:28:59.035793512 -0800
***************
*** 9,14 ****
--- 9,15 ----
  CREATE TABLE num_big_exp_sqrt (id int4, expected numeric(1000,800));
  CREATE TABLE num_big_exp_ln (id int4, expected numeric(1000,800));
  CREATE TABLE num_big_exp_log10 (id int4, expected numeric(1000,800));
+ ERROR:  Commit of expired transaction
  CREATE TABLE num_big_exp_power_10_ln (id int4, expected numeric(1000,800));
  CREATE TABLE num_big_result (id1 int4, id2 int4, result numeric(1000,800));
-- ******************************
***************
```
I did not change any flags; they are the same as in the original issue report except for ysql_output_buffer_size:
- `enable_wait_queues`: true by default
- `enable_deadlock_detection`: deprecated (`disable_deadlock_detection` defaults to false)
- `yb_enable_read_committed_isolation`: true by default in fastdebug
- `ysql_output_buffer_size`: 262144 by default
This can happen even with fail on conflict.
Suggested fix: allow a longer time for transactions to expire, rather than tying expiry to the current 0.5-second heartbeat interval between ysql and the transaction coordinator.
Jira Link: DB-8189
Description
Tried on version: 2.21.0.0-b29
While running the Read Committed sample app with 300 write threads and 20 read threads (~320 parallel transactions), 2-3 transactions were aborted with the following error:
ERROR: Commit of expired transaction
Instance type: c5.2xlarge (8 cores / 16 GB RAM)
Sample app details (check Jira to view the code and download the sample app):
G-flags used:
Logs: Added in JIRA
Issue Type
kind/bug