shishir2001-yb opened this issue 1 year ago
This issue seems similar to the one raised in https://github.com/yugabyte/yugabyte-db/issues/20208.
From the logs, consider the status tablet 1c48dd1cafe34e6fa07668303f1ec052: there is constant leadership churn between peers 9920540afca441058f09483dfe236dda and aef182d8d26f4083b78a71c949bb25ff starting at 1003 12:47:43.587924.
```
I1003 12:47:43.587898 8376 raft_consensus.cc:634] T 1c48dd1cafe34e6fa07668303f1ec052 P 9920540afca441058f09483dfe236dda [term 1 FOLLOWER]: Fail or stepdown of leader aef182d8d26f4083b78a71c949bb25ff detected. Triggering leader pre-election, mode=ELECT_EVEN_IF_LEADER_IS_ALIVE
I1003 12:47:45.144307 951442 raft_consensus.cc:1070] T 1c48dd1cafe34e6fa07668303f1ec052 P 9920540afca441058f09483dfe236dda [term 2 LEADER]: Becoming Leader. State: Replica: 9920540afca441058f09483dfe236dda, State: 1, Role: LEADER, Watermarks: {Received: 1.132234 Committed: 1.132234} Leader: 0.0
...
I1003 12:48:01.276885 13919 raft_consensus.cc:634] T 1c48dd1cafe34e6fa07668303f1ec052 P aef182d8d26f4083b78a71c949bb25ff [term 2 FOLLOWER]: Fail or stepdown of leader 9920540afca441058f09483dfe236dda detected. Triggering leader election, mode=NORMAL_ELECTION
I1003 12:48:01.286674 8236 raft_consensus.cc:1070] T 1c48dd1cafe34e6fa07668303f1ec052 P aef182d8d26f4083b78a71c949bb25ff [term 3 LEADER]: Becoming Leader. State: Replica: aef182d8d26f4083b78a71c949bb25ff, State: 1, Role: LEADER, Watermarks: {Received: 2.132407 Committed: 2.132407} Leader: 0.0
...
I1003 12:48:08.470780 952065 raft_consensus.cc:634] T 1c48dd1cafe34e6fa07668303f1ec052 P 9920540afca441058f09483dfe236dda [term 3 FOLLOWER]: Fail or stepdown of leader aef182d8d26f4083b78a71c949bb25ff detected. Triggering leader election, mode=ELECT_EVEN_IF_LEADER_IS_ALIVE
I1003 12:48:08.608135 952080 raft_consensus.cc:1070] T 1c48dd1cafe34e6fa07668303f1ec052 P 9920540afca441058f09483dfe236dda [term 4 LEADER]: Becoming Leader. State: Replica: 9920540afca441058f09483dfe236dda, State: 1, Role: LEADER, Watermarks: {Received: 3.132423 Committed: 3.132423} Leader: 0.0
```
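To quantify the churn in excerpts like the one above, here is a minimal Python sketch (hypothetical helper, not part of YugabyteDB) that counts `Becoming Leader` events per tablet in glog-format tserver output:

```python
import re
from collections import Counter

# Matches glog lines like:
#   "... T <tablet> P <peer> [term N LEADER]: Becoming Leader ..."
BECOMING_LEADER = re.compile(
    r"T (?P<tablet>\w+) P (?P<peer>\w+) \[term (?P<term>\d+) LEADER\]: Becoming Leader"
)

def count_leader_changes(lines):
    """Return a Counter mapping tablet id -> number of leader elections seen."""
    churn = Counter()
    for line in lines:
        m = BECOMING_LEADER.search(line)
        if m:
            churn[m.group("tablet")] += 1
    return churn

# Sample lines taken from the log excerpt above (truncated after "Becoming Leader").
log = [
    "I1003 12:47:45.144307 951442 raft_consensus.cc:1070] T 1c48dd1cafe34e6fa07668303f1ec052 P 9920540afca441058f09483dfe236dda [term 2 LEADER]: Becoming Leader.",
    "I1003 12:48:01.286674 8236 raft_consensus.cc:1070] T 1c48dd1cafe34e6fa07668303f1ec052 P aef182d8d26f4083b78a71c949bb25ff [term 3 LEADER]: Becoming Leader.",
    "I1003 12:48:08.608135 952080 raft_consensus.cc:1070] T 1c48dd1cafe34e6fa07668303f1ec052 P 9920540afca441058f09483dfe236dda [term 4 LEADER]: Becoming Leader.",
]
print(count_leader_changes(log))
```

Three leader changes for the same tablet in ~25 seconds, as in the excerpt, is enough to keep heartbeats from landing reliably.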
Running transactions heartbeat to the status tablet every 0.5 secs, and a coordinator aborts a transaction if the last heartbeat was > 5 secs ago, with the error message `Commit of expired transaction`. So if there is leadership churn on a status tablet, and the in-progress replicating ops are aborted due to a leadership switch (term advancement), there is a good chance the new leader expires active transactions.
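To make the timing concrete, here is a minimal model of the coordinator-side check (the constants come from the comment above; the function name is hypothetical, not YugabyteDB's actual code):

```python
HEARTBEAT_INTERVAL_S = 0.5   # clients heartbeat to the status tablet every 0.5s
EXPIRY_THRESHOLD_S = 5.0     # coordinator aborts if the last heartbeat is older

def is_expired(now_s, last_heartbeat_s):
    """Coordinator-side check: was the last heartbeat more than 5s ago?"""
    return now_s - last_heartbeat_s > EXPIRY_THRESHOLD_S

# Under a stable leader, heartbeats land every 0.5s, so the gap stays
# far below the 5s threshold:
assert not is_expired(now_s=100.4, last_heartbeat_s=100.0)

# But if leadership churn drops heartbeats for more than 5s (e.g. the
# in-progress replicating ops are aborted on term advancement), the new
# leader sees the transaction as expired and rejects the commit:
assert is_expired(now_s=106.0, last_heartbeat_s=100.0)
print("expiry model ok")
```

This is why repeated elections a few seconds apart, like those in the log above, can push an otherwise healthy transaction past the expiry threshold.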
@shishir2001-yb, what's the impact of the error/failure on the test? Are you blocked on this bug? How often does it repro? cc @basavaraj29, @robertsami
@rthallamko3, my test is not blocked by this issue; I have added it as a known exception. From what I can recall, this doesn't repro very frequently.
It seems this happens very infrequently, and there is a suspicion that tablet splits can cause it. @robertsami, @basavaraj29, can we move this to the backlog?
Additional note: This issue isn't WoC specific.
There are multiple issues with this message: https://github.com/yugabyte/yugabyte-db/issues?q=is%3Aissue+is%3Aopen+Commit+of+expired+transaction
On AlmaLinux 8, recent master (ba3761c73a7db1473141cc86c7a1fd033755102d), I got the same message running:
```
./yb_build.sh fastdebug --gcc11 --java-test TestPgRegressTypesNumeric -n 1000 --tp 1
```
On iteration 27 of 42:
```
*** ${TEST_TMPDIR}/pgregress_output/yb_pg_types_numeric_serial_schedule/expected/yb_pg_numeric_big.out 2024-01-25 12:28:59.132799817 -0800
--- ${TEST_TMPDIR}/pgregress_output/yb_pg_types_numeric_serial_schedule/results/yb_pg_numeric_big.out 2024-01-25 12:28:59.035793512 -0800
***************
*** 9,14 ****
--- 9,15 ----
  CREATE TABLE num_big_exp_sqrt (id int4, expected numeric(1000,800));
  CREATE TABLE num_big_exp_ln (id int4, expected numeric(1000,800));
  CREATE TABLE num_big_exp_log10 (id int4, expected numeric(1000,800));
+ ERROR:  Commit of expired transaction
  CREATE TABLE num_big_exp_power_10_ln (id int4, expected numeric(1000,800));
  CREATE TABLE num_big_result (id1 int4, id2 int4, result numeric(1000,800));
-- ******************************
***************
```
I did not change any flags; they are the same as in the original issue report except for ysql_output_buffer_size:
- `enable_wait_queues`: true by default
- `enable_deadlock_detection`: deprecated (`disable_deadlock_detection` defaults to false)
- `yb_enable_read_committed_isolation`: true by default in fastdebug
- `ysql_output_buffer_size`: 262144 by default
This can happen even with fail on conflict.
Suggested fix: allow a longer time for transactions to expire, rather than tying expiry to the current 0.5-second heartbeat interval between ysql and the transaction coordinator.
Jira Link: DB-8189
Description
Tried on version: 2.21.0.0-b29
While running the Read Committed sample app with 300 write threads and 20 read threads (~320 parallel transactions), 2-3 transactions were aborted with the following error:
ERROR: Commit of expired transaction
Instance type: c5.2xlarge (8 cores / 16 GB RAM)
Sample app details (check Jira to view the code and download the sample app):
G-flags used:
Logs: Added in JIRA
Issue Type
kind/bug