yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB] Tserver error: Duplicate memory Tracker(id LogCache-228736c6d3e4459f921ab53bb938bca1) on parent LogCache->server->root #22401

Open shishir2001-yb opened 3 months ago

shishir2001-yb commented 3 months ago

Jira Link: DB-11302

Description

Version: 2024.1.0.0-b123

Logs: https://drive.google.com/file/d/1GPXCLUnbwOLDvPhFtI-BrIBUUjFDhsgS/view?usp=sharing (4.4 GB). Check Jira to view the logs directly.

Encountered the following TServer error while running the cross-DB DDL test with PITR and backup/restore.

(Universe logs -> 172.151.18.212 -> yb-tserver.ip-172-151-18-212.us-west-2.compute.internal.yugabyte.log.ERROR.20240502-214518.1267781)

E0502 21:45:18.905700 1268060 mem_tracker.cc:309] Duplicate memory tracker (id LogCache-228736c6d3e4459f921ab53bb938bca1) on parent LogCache->server->root
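Since the attached logs are large, it may help to pull these lines out programmatically. The following is a minimal sketch, assuming the usual glog line prefix (severity letter, MMDD, timestamp, thread ID, `file:line]`) and the exact message wording shown above. The function name `parse_duplicate_tracker` and the `suffix` field are illustrative choices, not part of any YugabyteDB tooling, and treating the suffix after `LogCache-` as a tablet identifier is an assumption.

```python
import re

# glog-style prefix: severity, MMDD, time with microseconds, thread id, file:line]
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)

# Message wording copied from the error line in this issue.
DUPLICATE_TRACKER = re.compile(
    r"Duplicate memory tracker \(id (?P<tracker_id>[^)]+)\) "
    r"on parent (?P<parent>\S+)"
)


def parse_duplicate_tracker(line):
    """Return a dict describing a duplicate-tracker error line, or None."""
    header = GLOG_LINE.match(line.strip())
    if header is None:
        return None
    body = DUPLICATE_TRACKER.search(header["message"])
    if body is None:
        return None
    tracker_id = body["tracker_id"]
    return {
        "severity": header["severity"],
        "time": header["time"],
        "thread": int(header["thread"]),
        "source": f'{header["file"]}:{header["line"]}',
        "tracker_id": tracker_id,
        # Assumption: the part after "LogCache-" identifies the tablet.
        "suffix": tracker_id.partition("-")[2],
        "parent": body["parent"],
    }


sample = (
    "E0502 21:45:18.905700 1268060 mem_tracker.cc:309] Duplicate memory tracker "
    "(id LogCache-228736c6d3e4459f921ab53bb938bca1) on parent LogCache->server->root"
)
info = parse_duplicate_tracker(sample)
```

Applying this to every line of a `.log.ERROR` file and grouping by `suffix` would show whether the duplicates cluster on a few IDs or are spread across many.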

Test details:

Test Description:
        1. Create a cluster with the required g-flags
        2. Start the cross-DB DDL workload, which executes DDLs and DMLs across databases concurrently (50 colocated
           databases and 100 non-colocated databases); run this for 20-30 mins
        3. Create a PITR schedule on 10 random databases
        4. Start a while loop and run it for 120 mins
          a. Note down the time for PITR(0)
          b. Create a backup of 1 random database
          c. Start the cross-DB DDL workload and stop it after 10 mins
          d. Note down the time for PITR(1)
          e. Start the cross-DB DDL workload and run it for 10 mins
          f. Execute PITR on all 10 databases at random times (between 1 and 9 sec ago)
          g. Restore to PITR(1)
          h. Validate data
          i. Restore to PITR(0) with a probability of 0.6 and validate data
          j. Delete the PITR schedule for the backup DB
          k. Drop the database
          l. Restore the backup
          m. Create the snapshot schedule for this new DB

G-flags:

    tserver_gflags={
        "ysql_enable_packed_row": "true",
        "ysql_enable_packed_row_for_colocated_table": "true",
        "enable_automatic_tablet_splitting": "true",
        "ysql_max_connections": "500",
        "client_read_write_timeout_ms": str(30 * 60 * 1000),
        "yb_client_admin_operation_timeout_sec": str(30 * 60),
        "consistent_restore": "true",
        "ysql_enable_db_catalog_version_mode": "true",
        "tablet_replicas_per_gib_limit": 0,
        "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true",
        "log_ysql_catalog_versions": "true"
    },
    master_gflags={
        "ysql_enable_packed_row": "true",
        "ysql_enable_packed_row_for_colocated_table": "true",
        "enable_automatic_tablet_splitting": "true",
        "consistent_restore": "true",
        "ysql_enable_db_catalog_version_mode": "true",
        "tablet_replicas_per_gib_limit": 0,
        "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true",
        "log_ysql_catalog_versions": "true"
    }

Issue Type

kind/bug


rthallamko3 commented 3 months ago

Per @yusong-yan: Customer impact: No impact on the workload. The error might be logged repeatedly.

Engineering impact: We might lose memory tracking for the LogCache structures of certain tablets (mostly the ones that were shut down and bootstrapped again very quickly). Also, the per-tablet MemTracker might not account for that memory, which can look like a memory leak.
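To make the failure mode concrete, here is a toy model of a hierarchical tracker registry. It is only a sketch and not YugabyteDB's `MemTracker`: the `ToyTracker` class, the `track` helper, and the choice to return `None` on a duplicate ID are all assumptions made for illustration. It shows how a tablet that re-registers before its old tracker is removed could end up with memory that no tracker accounts for.

```python
class ToyTracker:
    """Toy hierarchical memory tracker; NOT YugabyteDB's MemTracker."""

    def __init__(self, tracker_id, parent=None):
        self.id = tracker_id
        self.parent = parent
        self.children = {}
        self.consumption = 0

    def path(self):
        # Render the chain as child->...->root, like the log line does.
        node, parts = self, []
        while node is not None:
            parts.append(node.id)
            node = node.parent
        return "->".join(parts)

    def add_child(self, tracker_id):
        """Register a child; return None if the ID is still taken (assumed)."""
        if tracker_id in self.children:
            print(f"Duplicate memory tracker (id {tracker_id}) on parent {self.path()}")
            return None
        child = ToyTracker(tracker_id, parent=self)
        self.children[tracker_id] = child
        return child

    def consume(self, nbytes):
        # Propagate consumption from this node up to the root.
        node = self
        while node is not None:
            node.consumption += nbytes
            node = node.parent


def track(tracker, nbytes):
    """Account for nbytes if a tracker exists; return the bytes left untracked."""
    if tracker is None:
        return nbytes
    tracker.consume(nbytes)
    return 0


root = ToyTracker("root")
server = root.add_child("server")
log_cache = server.add_child("LogCache")

tablet_tracker_id = "LogCache-228736c6d3e4459f921ab53bb938bca1"
old = log_cache.add_child(tablet_tracker_id)
track(old, 1024)

# The tablet is bootstrapped again before the old tracker was unregistered.
new = log_cache.add_child(tablet_tracker_id)  # prints the duplicate message
untracked = track(new, 4096)                  # these bytes reach no tracker
```

In this model the root only ever sees the first 1024 bytes, while the 4096 bytes allocated after the restart are invisible to the tracker tree, which is the kind of gap that can be mistaken for a leak.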