yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB] Potential incorrect intent cleanup on shutdown path (in tests) #11358

Open mbautin opened 2 years ago

mbautin commented 2 years ago

Jira Link: DB-809

Found when running data-patcher-test in a loop. On MiniCluster shutdown, we abort a transaction apply operation on one of the participant's replicas while the other two replicas have already applied the intents.

: Aborting operation because of shutdown: id { term: 1 index: 11 } hybrid_time: 17067112918617952256 op_type: UPDATE_TRANSACTION_OP committed_op_id { term: 1 index: 10 } monotonic_counter: 0 transaction_state { transaction_id: "\206\330\272\353\363?N=\226,h\030w\303\021F" status: APPLYING tablets: "bef69bfb38ba413382c50f7bbb47041a" commit_hybrid_time: 17067112918602833920 sealed: false aborted { } }

This apparently allows the safe time to move past the hybrid time of the apply operation, and CleanupAbortsTask then removes the intents from this replica.

I0203 13:40:25.086756 1715096 tablet.cc:1274] T 75c304ec277f4a9b92d4c716be17f377 P f3e1d6f1d26b49da9ffc38a08f9f781a: Wrote 9 key/value pairs to kIntents RocksDB:
  T 75c304ec277f4a9b92d4c716be17f377 P f3e1d6f1d26b49da9ffc38a08f9f781a [I]: Frontiers: { smallest: { op_id: 1.10 hybrid_time: { days: 48226 time: 13:40:14.896108 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <invalid> } largest: { op_id: 1.10 hybrid_time: { days: 48226 time: 13:40:14.896108 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <invalid> } }
  T 75c304ec277f4a9b92d4c716be17f377 P f3e1d6f1d26b49da9ffc38a08f9f781a [I]: 1. SingleDeleteCF: TXN META 86d8baeb-f33f-4e3d-962c-681877c31146
  T 75c304ec277f4a9b92d4c716be17f377 P f3e1d6f1d26b49da9ffc38a08f9f781a [I]: 2. SingleDeleteCF: SubDocKey(DocKey(0x7bf2, [19], []), [SystemColumnId(0)]) [kStrongRead, kStrongWrite] HT{ days: 48226 time: 13:40:14.891902 }
  T 75c304ec277f4a9b92d4c716be17f377 P f3e1d6f1d26b49da9ffc38a08f9f781a [I]: 3. SingleDeleteCF: TXN REV 86d8baeb-f33f-4e3d-962c-681877c31146 HT{ days: 48226 time: 13:40:14.891902 }
  T 75c304ec277f4a9b92d4c716be17f377 P f3e1d6f1d26b49da9ffc38a08f9f781a [I]: 4. SingleDeleteCF: SubDocKey(DocKey(0x7bf2, [19], []), [ColumnId(1)]) [kStrongRead, kStrongWrite] HT{ days: 48226 time: 13:40:14.891902 w: 1 }
  T 75c304ec277f4a9b92d4c716be17f377 P f3e1d6f1d26b49da9ffc38a08f9f781a [I]: 5. SingleDeleteCF: TXN REV 86d8baeb-f33f-4e3d-962c-681877c31146 HT{ days: 48226 time: 13:40:14.891902 w: 1 }
  T 75c304ec277f4a9b92d4c716be17f377 P f3e1d6f1d26b49da9ffc38a08f9f781a [I]: 6. SingleDeleteCF: SubDocKey(DocKey(0x7bf2, [19], []), []) [kWeakRead, kWeakWrite] HT{ days: 48226 time: 13:40:14.891902 w: 2 }
  T 75c304ec277f4a9b92d4c716be17f377 P f3e1d6f1d26b49da9ffc38a08f9f781a [I]: 7. SingleDeleteCF: TXN REV 86d8baeb-f33f-4e3d-962c-681877c31146 HT{ days: 48226 time: 13:40:14.891902 w: 2 }
  T 75c304ec277f4a9b92d4c716be17f377 P f3e1d6f1d26b49da9ffc38a08f9f781a [I]: 8. SingleDeleteCF: SubDocKey(DocKey([], []), []) [kWeakRead, kWeakWrite] HT{ days: 48226 time: 13:40:14.891902 w: 3 }
  T 75c304ec277f4a9b92d4c716be17f377 P f3e1d6f1d26b49da9ffc38a08f9f781a [I]: 9. SingleDeleteCF: TXN REV 86d8baeb-f33f-4e3d-962c-681877c31146 HT{ days: 48226 time: 13:40:14.891902 w: 3 }
I0203 13:40:25.086881 1715096 cleanup_aborts_task.cc:83] T 75c304ec277f4a9b92d4c716be17f377 P f3e1d6f1d26b49da9ffc38a08f9f781a: Number of aborted transactions cleaned up: 1 of 4

One way to fix this is to "freeze" safe time (prevent it from advancing) before aborting operations on the shutdown path. We also need a better test to reproduce this, one that does not involve the time changes done in data-patcher-test.
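For illustration only, here is a minimal sketch of the "freeze safe time" idea. It is not the actual YugabyteDB shutdown or MVCC code; SafeTimeTracker, Freeze(), AbortPendingOperations(), and ShutdownTablet() are hypothetical names standing in for the real tablet shutdown path.

```cpp
#include <atomic>
#include <cstdint>

class SafeTimeTracker {
 public:
  // Capture the current safe time; after this, SafeTime() stops advancing
  // even if in-flight operations are later aborted on the shutdown path.
  void Freeze() {
    frozen_value_ = ComputeSafeTime();
    frozen_.store(true, std::memory_order_release);
  }

  uint64_t SafeTime() const {
    return frozen_.load(std::memory_order_acquire) ? frozen_value_
                                                   : ComputeSafeTime();
  }

 private:
  // Derived from the set of replicated (not aborted) operations.
  uint64_t ComputeSafeTime() const;

  std::atomic<bool> frozen_{false};
  uint64_t frozen_value_ = 0;
};

void AbortPendingOperations();  // hypothetical: aborts in-flight apply operations

void ShutdownTablet(SafeTimeTracker* safe_time) {
  // Freeze safe time *before* aborting, so an aborted UPDATE_TRANSACTION_OP
  // cannot let safe time move past its hybrid time and mislead CleanupAbortsTask.
  safe_time->Freeze();
  AbortPendingOperations();
}
```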

mbautin commented 2 years ago

Here is a log reproducing the issue (obtained with custom test changes, though): https://gist.githubusercontent.com/mbautin/26ea042c1bd13e872d9c4098c3fdff88/raw

mbautin commented 2 years ago

In this test the missing value is 11:

W0206 15:55:34.104007 786584 data-patcher-test.cc:95] Missing value (after using data-patcher to undo clock skew): 11
../../src/yb/tools/data-patcher-test.cc:101: Failure
      Expected: AsString(values)
      Which is: "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19]"
To be equal to: AsString(Range(total_values))
      Which is: "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]"
I0206 15:55:34.105664 790989 transaction_coordinator.cc:1460] T e3cf6e34c7f84d349d1fc3000e0df9a6 P b0b385e6e4b74efba9d4cc6ee6b8028b: Poll, now: { days: 19029 time: 15:55:34.105661 }, first: { id: abfd72fe-da95-4bde-99e0-91a891bd6b64 last_touch: { days: 19029 time: 15:55:17.967073 } status: PENDING involved_tablets: [] replicating: <NULL>  request_queue: [] first_entry_raft_index: 4 }, expired: 1, timeout: 5.000s, passed: 16.139s

It appears that the cleanup task might have originated in a compaction:

I0206 15:55:17.859093 788978 transaction_participant.cc:489] DEBUG myusername: Cleanup:     @     0x7f05ba3bf31e  yb::docdb::(anonymous namespace)::DocDBIntentsCompactionFilter::CompactionFinished()
    @     0x7f05b91e150f  rocksdb::CompactionJob::ProcessKeyValueCompaction()
    @     0x7f05b91df75e  rocksdb::CompactionJob::Run()
    @     0x7f05b920dc9b  rocksdb::DBImpl::BackgroundCompaction()
    @     0x7f05b920c376  rocksdb::DBImpl::BackgroundCallCompaction()
    @     0x7f05b921c08a  rocksdb::DBImpl::CompactionTask::DoRun()
    @     0x7f05b796c682  yb::(anonymous namespace)::PriorityThreadPoolWorker::Run()
    @     0x7f05b79911d5  yb::Thread::SuperviseThread()
    @     0x7f05b5824694  start_thread
    @     0x7f05b556541d  __clone

Shortly after that, we see the cluster shutdown:

I0206 15:55:17.984150 786584 data-patcher-test.cc:244] Shutting down the cluster
1644162917.986 [WARN] (src/control_connection.cpp:273:virtual void cass::ControlConnection::on_close(cass::Connection *)): Lost control connection to host 127.0.0.2 with the following error:
1644162917.986 [ERROR] (src/pool.cpp:412:virtual void cass::Pool::on_close(cass::Connection *)): Closing established connection pool to host 127.0.0.2 because of the following error:
1644162917.986 [WARN] (src/control_connection.cpp:273:virtual void cass::ControlConnection::on_close(cass::Connection *)): Lost control connection to host 127.0.0.2 with the following error: Connect error 'connection refused'
I0206 15:55:17.996924 786648 tunnel.cc:173] Shutdown tunnel: 127.0.0.3:20552 => 127.0.0.2:20552
I0206 15:55:17.996937 786584 tablet_server.cc:419] TabletServer shutting down...
I0206 15:55:17.997350 786584 ts_tablet_manager.cc:1485] P 42d9e8f756a744e9ad38691ab9ee47f6: Shutting down tablet manager...
I0206 15:55:17.997390 786648 poller.cc:66] P 42d9e8f756a744e9ad38691ab9ee47f6: Poll stopped: Aborted (yb/rpc/scheduler.cc:65): Task aborted
I0206 15:55:17.997548 786584 tablet_peer.cc:431] T 863c74f5331846e2bb99c26257873b26 P 42d9e8f756a744e9ad38691ab9ee47f6 [state=RUNNING]: Initiating TabletPeer shutdown
I0206 15:55:17.997578 786584 tablet.cc:919] T 863c74f5331846e2bb99c26257873b26 P 42d9e8f756a744e9ad38691ab9ee47f6: StartShutdown
I0206 15:55:17.997607 786651 poller.cc:66] T 863c74f5331846e2bb99c26257873b26 P 42d9e8f756a744e9ad38691ab9ee47f6: Poll stopped: Aborted (yb/rpc/scheduler.cc:65): Task aborted

However, in the middle of shutdown there is this:

I0206 15:55:27.855688 787659 tablet.cc:1277] T 8febade926924163aff559c26b992fcd P 020fafab8dd34734aeb9c56743f954c1: Wrote 9 key/value pairs to kIntents RocksDB:
  T 8febade926924163aff559c26b992fcd P 020fafab8dd34734aeb9c56743f954c1 [I]: Frontiers: { smallest: { op_id: 2.20 hybrid_time: { days: 48229 time: 15:55:17.469230 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <invalid> } largest: { op_id: 2.20 hybrid_time: { days: 48229 time: 15:55:17.469230 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <invalid> } }
  T 8febade926924163aff559c26b992fcd P 020fafab8dd34734aeb9c56743f954c1 [I]: 1. SingleDeleteCF: TXN META 36e6aaa3-7f6f-4cc3-a936-cb98b1610dc8
  T 8febade926924163aff559c26b992fcd P 020fafab8dd34734aeb9c56743f954c1 [I]: 2. SingleDeleteCF: SubDocKey(DocKey(0x1c99, [11], []), [SystemColumnId(0)]) [kStrongRead, kStrongWrite] HT{ days: 48229 time: 15:55:17.466980 }
  T 8febade926924163aff559c26b992fcd P 020fafab8dd34734aeb9c56743f954c1 [I]: 3. SingleDeleteCF: TXN REV 36e6aaa3-7f6f-4cc3-a936-cb98b1610dc8 HT{ days: 48229 time: 15:55:17.466980 }
  T 8febade926924163aff559c26b992fcd P 020fafab8dd34734aeb9c56743f954c1 [I]: 4. SingleDeleteCF: SubDocKey(DocKey(0x1c99, [11], []), [ColumnId(1)]) [kStrongRead, kStrongWrite] HT{ days: 48229 time: 15:55:17.466980 w: 1 }
  T 8febade926924163aff559c26b992fcd P 020fafab8dd34734aeb9c56743f954c1 [I]: 5. SingleDeleteCF: TXN REV 36e6aaa3-7f6f-4cc3-a936-cb98b1610dc8 HT{ days: 48229 time: 15:55:17.466980 w: 1 }
  T 8febade926924163aff559c26b992fcd P 020fafab8dd34734aeb9c56743f954c1 [I]: 6. SingleDeleteCF: SubDocKey(DocKey(0x1c99, [11], []), []) [kWeakRead, kWeakWrite] HT{ days: 48229 time: 15:55:17.466980 w: 2 }
  T 8febade926924163aff559c26b992fcd P 020fafab8dd34734aeb9c56743f954c1 [I]: 7. SingleDeleteCF: TXN REV 36e6aaa3-7f6f-4cc3-a936-cb98b1610dc8 HT{ days: 48229 time: 15:55:17.466980 w: 2 }
  T 8febade926924163aff559c26b992fcd P 020fafab8dd34734aeb9c56743f954c1 [I]: 8. SingleDeleteCF: SubDocKey(DocKey([], []), []) [kWeakRead, kWeakWrite] HT{ days: 48229 time: 15:55:17.466980 w: 3 }
  T 8febade926924163aff559c26b992fcd P 020fafab8dd34734aeb9c56743f954c1 [I]: 9. SingleDeleteCF: TXN REV 36e6aaa3-7f6f-4cc3-a936-cb98b1610dc8 HT{ days: 48229 time: 15:55:17.466980 w: 3 }
I0206 15:55:27.855823 787659 cleanup_aborts_task.cc:87] T 8febade926924163aff559c26b992fcd P 020fafab8dd34734aeb9c56743f954c1: Number of aborted transactions cleaned up: 1 of 7
I0206 15:55:27.855834 787659 cleanup_aborts_task.cc:91] T 8febade926924163aff559c26b992fcd P 020fafab8dd34734aeb9c56743f954c1: DEBUG myusername: stack trace for CleanupAbortsTask:     @     0x7f05b7d60e39  yb::rpc::Strand::Done()
    @     0x7f05b7d6a10a  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x7f05b79911d5  yb::Thread::SuperviseThread()
    @     0x7f05b5824694  start_thread
    @     0x7f05b556541d  __clone
I0206 15:55:27.860915 787659 cleanup_aborts_task.cc:102] T 8febade926924163aff559c26b992fcd P 020fafab8dd34734aeb9c56743f954c1: Cleanup Aborts Task finished.
I0206 15:55:27.948415 786584 scheduler.cc:223] Waiting 1 tasks to complete
mbautin commented 2 years ago

So clearly, the CleanupAbortsTask decided that safe time had reached a point past the hybrid time returned by a transaction status request, and the status request did not find that transaction. The logic here is as follows: suppose the transaction was committed, applied on all involved tablets (but not yet on this particular replica of an involved tablet), and then removed from the coordinator. The hybrid time returned by the coordinator is at least the hybrid time of any Raft record on the coordinator, including that of the APPLIED_IN_ALL_INVOLVED_TABLETS record. So by making sure our local safe time reaches that returned hybrid time, we make sure that we would also have seen the APPLIED record in this transaction participant replica if that record existed.
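To make that reasoning concrete, here is a hedged sketch of the cleanup decision, not the actual transaction_participant / cleanup_aborts_task code; all identifiers (StatusResponse, RequestTransactionStatus, WaitForLocalSafeTime, IntentsApplied, RemoveIntents) are illustrative placeholders.

```cpp
#include <cstdint>
#include <string>

struct StatusResponse {
  bool found = false;      // Coordinator still knows about the transaction.
  uint64_t status_ht = 0;  // Hybrid time of the status response; at least the
                           // hybrid time of any Raft record on the coordinator,
                           // including APPLIED_IN_ALL_INVOLVED_TABLETS.
};

StatusResponse RequestTransactionStatus(const std::string& txn_id);
void WaitForLocalSafeTime(uint64_t hybrid_time);
bool IntentsApplied(const std::string& txn_id);
void RemoveIntents(const std::string& txn_id);

void CleanupIfAborted(const std::string& txn_id) {
  StatusResponse resp = RequestTransactionStatus(txn_id);
  if (resp.found) {
    return;  // The transaction is still known to the coordinator; leave it alone.
  }
  // The coordinator no longer knows the transaction. If it had committed and been
  // applied on all involved tablets, the local APPLY record's hybrid time would be
  // <= resp.status_ht, so once local safe time reaches status_ht we must have
  // replayed that record.
  WaitForLocalSafeTime(resp.status_ht);
  if (!IntentsApplied(txn_id)) {
    // No APPLY record seen even though safe time passed status_ht: conclude the
    // transaction was aborted and remove its intents. The bug: an apply operation
    // aborted during shutdown lets safe time advance past its hybrid time, so this
    // branch fires even for a committed transaction.
    RemoveIntents(txn_id);
  }
}
```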

mbautin commented 2 years ago

Here is a shortened test log for convenience (ending at the incorrect cleanup mentioned above): https://gist.github.com/mbautin/eaeb93c49d90ecd0d91f8f149ff212e8/raw