yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB] Fatal: Failed to write a batch with 0 operations into RocksDB: Not found (yb/tablet/tablet_metadata.cc:1812): Cannot find table info for: b8440000-0000-0080-0030-000075400000, raft group id: bd20a9a262ac4644ac4287d72e0b9706 #21891

Open shishir2001-yb opened 7 months ago

shishir2001-yb commented 7 months ago

Jira Link: DB-10791

Description

Version: 2024.1.0.0-b54
Logs: https://drive.google.com/file/d/1RBSDOfq6znFyOO8L6mCu4ysINzPLLrOu/view?usp=sharing
Encountered the following FATAL while running the cross DB DDL test with PITR and backup/restore.

F20240408 21:23:20 ../../src/yb/tablet/tablet.cc:1517] T bd20a9a262ac4644ac4287d72e0b9706 P d3284631a4df4c678b60418cdb67802d: Failed to write a batch with 0 operations into RocksDB: Not found (yb/tablet/tablet_metadata.cc:1812): Cannot find table info for: b8440000-0000-0080-0030-000075400000, raft group id: bd20a9a262ac4644ac4287d72e0b9706
    @     0xaaaacb7b5d7c  google::LogMessage::SendToLog()
    @     0xaaaacb7b6c20  google::LogMessage::Flush()
    @     0xaaaacb7b72bc  google::LogMessageFatal::~LogMessageFatal()
    @     0xaaaaccb1d470  yb::tablet::Tablet::WriteToRocksDB()
    @     0xaaaaccb19798  yb::tablet::Tablet::ApplyIntents()
    @     0xaaaaccbc4154  yb::tablet::TransactionParticipant::Impl::ProcessReplicated()
    @     0xaaaaccaf66a4  yb::tablet::UpdateTxnOperation::DoReplicated()
    @     0xaaaaccae9b84  yb::tablet::Operation::Replicated()
    @     0xaaaaccaec180  yb::tablet::OperationDriver::ReplicationFinished()
    @     0xaaaacbc9135c  yb::consensus::ConsensusRound::NotifyReplicationFinished()
    @     0xaaaacbcdc434  yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked()
    @     0xaaaacbcdb7b0  yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
    @     0xaaaacbcb89a8  yb::consensus::RaftConsensus::UpdateMajorityReplicated()
    @     0xaaaacbc85adc  yb::rpc::StrandTaskWithErrorFunc<>::Run()
    @     0xaaaacca4d210  yb::rpc::Strand::Done()
    @     0xaaaacca56764  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0xaaaacd2400f8  yb::Thread::SuperviseThread()
    @     0xffffb2ee78b8  start_thread
    @     0xffffb2f43afc  thread_start
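When comparing occurrences of this crash across runs, it can help to pull the IDs out of the FATAL line. The snippet below is a minimal sketch, not part of the test: `FATAL_RE` and `parse_fatal` are hypothetical names, and the pattern is written only against the message format shown in this issue.

```python
import re

# Pattern for the "Cannot find table info" FATAL as it appears in this issue.
# The field names (tablet_id, peer_id, ...) are labels chosen for this sketch.
FATAL_RE = re.compile(
    r"T (?P<tablet_id>[0-9a-f]+) P (?P<peer_id>[0-9a-f]+): "
    r"Failed to write a batch with (?P<ops>\d+) operations into RocksDB: "
    r"Not found \((?P<source>[^)]+)\): "
    r"Cannot find table info for: (?P<table_id>[0-9a-f-]+), "
    r"raft group id: (?P<raft_group_id>[0-9a-f]+)"
)


def parse_fatal(line):
    """Return the IDs from a matching FATAL line, or None if it does not match."""
    match = FATAL_RE.search(line)
    return match.groupdict() if match else None


# The FATAL line from the description above, split only for readability.
line = (
    "F20240408 21:23:20 ../../src/yb/tablet/tablet.cc:1517] "
    "T bd20a9a262ac4644ac4287d72e0b9706 P d3284631a4df4c678b60418cdb67802d: "
    "Failed to write a batch with 0 operations into RocksDB: "
    "Not found (yb/tablet/tablet_metadata.cc:1812): "
    "Cannot find table info for: b8440000-0000-0080-0030-000075400000, "
    "raft group id: bd20a9a262ac4644ac4287d72e0b9706"
)

info = parse_fatal(line)
print(info["table_id"], info["tablet_id"], info["raft_group_id"])
```

In this line the tablet ID after `T` and the raft group ID are the same value, so the missing table ID is the field that distinguishes one occurrence from another.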

Test details:

Test Description:
        1. Create a cluster with required g-flags
        2. Start the cross DB DDL workload, which executes DDLs and DMLs across databases concurrently (50 colocated
           databases and 100 non-colocated databases), and run it for 20-30 mins
        3. Create a PITR schedule on 10 random databases
        4. Start a while loop and run it for 120 mins
          a. Note down the time for PITR(0)
          b. Create a backup of 1 random database
          c. Start the cross DB DDL workload and stop it after 10 mins
          d. Note down the time for PITR(1)
          e. Start the cross DB DDL workload and run it for 10 mins
          f. Execute PITR on all 10 databases at random times (between 1 and 9 sec ago).
          g. Restore to PITR(1)
          h. Validate data
          i. Restore to PITR(0) with a probability of 0.6 and validate data
          j. Delete the PITR schedule for the backup db 
          k. Drop the database 
          l. Restore the backup
          m. Create the snapshot schedule for this new DB
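The loop body (steps a to m) can be sketched as a dry-run planner. This is an illustration only: `plan_iteration` is a hypothetical name, it touches no cluster, and it assumes the "10 databases" in step f are the ones given a PITR schedule in step 3. It only makes the ordering and the two random choices (the 1-9 sec offsets and the 0.6 probability) explicit.

```python
import random


def plan_iteration(rng, restore_pitr0_probability=0.6):
    """Return the ordered step descriptions for one loop iteration (steps a-m)."""
    steps = [
        "note PITR(0) time",                      # a
        "back up 1 random database",              # b
        "run cross DB DDL workload for 10 min",   # c
        "note PITR(1) time",                      # d
        "run cross DB DDL workload for 10 min",   # e
    ]
    # f: each of the 10 databases gets its own random restore point, 1-9 sec ago.
    offsets = [rng.randint(1, 9) for _ in range(10)]
    steps.append(f"PITR 10 databases to offsets (sec ago): {offsets}")
    steps += ["restore to PITR(1)", "validate data"]  # g, h
    # i: the restore to PITR(0) only happens with the given probability.
    if rng.random() < restore_pitr0_probability:
        steps += ["restore to PITR(0)", "validate data"]
    steps += [
        "delete PITR schedule for the backup db",        # j
        "drop the database",                             # k
        "restore the backup",                            # l
        "create snapshot schedule for the restored db",  # m
    ]
    return steps


for step in plan_iteration(random.Random(0)):
    print(step)
```

Steps j to m drop a database that had a PITR schedule and then restore it from backup, which is the part of the loop that overlaps with PITR and dropped objects.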

G-flags:

tserver_gflags={
    "ysql_enable_packed_row": "true",
    "ysql_enable_packed_row_for_colocated_table": "true",
    "enable_automatic_tablet_splitting": "true",
    "ysql_max_connections": "500",
    "client_read_write_timeout_ms": str(30 * 60 * 1000),
    "yb_client_admin_operation_timeout_sec": str(30 * 60),
    "consistent_restore": "true",
    "ysql_enable_db_catalog_version_mode": "true",
    "tablet_replicas_per_gib_limit": 0,
    "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true",
    "log_ysql_catalog_versions": "true"
},
master_gflags={
    "ysql_enable_packed_row": "true",
    "ysql_enable_packed_row_for_colocated_table": "true",
    "enable_automatic_tablet_splitting": "true",
    "consistent_restore": "true",
    "ysql_enable_db_catalog_version_mode": "true",
    "tablet_replicas_per_gib_limit": 0,
    "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true",
    "log_ysql_catalog_versions": "true"
}
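For anyone reproducing this outside the test framework, the two computed timeouts both evaluate to 30 minutes. The sketch below shows the values and one way to turn such a dict into `--name=value` arguments. How the framework actually passes these flags is not shown in this issue, so `to_flag_args` is a hypothetical helper and the `--name=value` form is an assumption.

```python
def to_flag_args(gflags):
    """Render a gflags dict as sorted --name=value strings (assumed format)."""
    return [f"--{name}={value}" for name, value in sorted(gflags.items())]


# Subset of the tserver flags above, limited to the computed or non-string values.
tserver_gflags = {
    "client_read_write_timeout_ms": str(30 * 60 * 1000),    # 30 min in ms
    "yb_client_admin_operation_timeout_sec": str(30 * 60),  # 30 min in sec
    "tablet_replicas_per_gib_limit": 0,
}

args = to_flag_args(tserver_gflags)
print("\n".join(args))
```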

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

rthallamko3 commented 6 months ago

@shishir2001-yb, does this still repro?

shishir2001-yb commented 6 months ago

@rthallamko3, yes, I have seen this issue in 2024.1.0.0-b102 just this morning.

myang2021 commented 6 months ago

@rthallamko3 http://stress.dev.yugabyte.com/files/get?name=e74903e5-9ea6-4e73-9436-db87fa4a0e58-yb-tserver.ip-172-151-17-169.us-west-2.compute.internal.yugabyte.log.FATAL.20240503-095329.1283707:

Log file created at: 2024/05/03 09:53:29
Current UTC time: 2024/05/03 09:53:29
Running on machine: ip-172-151-17-169.us-west-2.compute.internal
Application fingerprint: version 2024.1.0.0 build 123 revision 5d4bbda924b78ed59f8cb87c0aabaa54dab486ee build_type RELEASE built at 01 May 2024 16:50:03 UTC
Node information: { hostname: 'ip-172-151-17-169.us-west-2.compute.internal', rpc_ip: '172.151.17.169', webserver_ip: '172.151.17.169', uuid: '868cf183c82643ad9bbc14258ad92104' }
Running duration (h:mm:ss): 0:55:29
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0503 09:53:29.552929 1290024 tablet.cc:1517] T b571cc4dc0d74a7e8acdb80ee723ca60 P 868cf183c82643ad9bbc14258ad92104: Failed to write a batch with 0 operations into RocksDB: Not found (yb/tablet/tablet_metadata.cc:1816): Cannot find table info for: 43460000-0000-0080-0030-000090400000, raft group id: b571cc4dc0d74a7e8acdb80ee723ca60

rthallamko3 commented 6 months ago

cc @lingamsandeep as well. There are known limitations around PITR and drop tables. Not sure if this test is running into similar problems.