Open shishir2001-yb opened 4 months ago
Related logs:
./Universe_logs/172.151.30.222/tserver/yb-tserver.ip-172-151-30-222.us-west-2.compute.internal.yugabyte.log.INFO.20240425-162955.31769:I0425 17:00:42.809218 38746 client_master_rpc.cc:77] 0x000017f674a51920 -> IsCreateNamespaceDone: Failed, got resp error: Internal error (yb/master/catalog_manager.cc:9366): Namespace Create Failed: not onlined.
./Universe_logs/172.151.30.222/tserver/postgresql-2024-04-25_164530.log:2024-04-25 17:00:42.818 UTC [911524] ERROR: Namespace Create Failed: not onlined.
The error appears to be passed back from the tserver to PG: the PostgreSQL log shows the same message about 9 ms after the tserver log.
./Universe_logs/172.151.30.222/master/yb-master.ip-172-151-30-222.us-west-2.compute.internal.yugabyte.log.INFO.20240425-165150.31325:W0425 17:00:41.807782 911535 catalog_manager.cc:9230] Service unavailable (yb/tablet/operations/operation_tracker.cc:190): Error copying PGSQL system tables for pending namespace: Operation of type kWrite failed: tablet 00000000000000000000000000000000 hit the limit 1622684467 of memory tracker 0x00002797bfda4520 -> root while trying to consume an additional 393916 bytes; the memory tracker had already given out 1735573504 bytes.
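The master log appears to show the root cause. The master's root memory tracker had reached its limit (1622684467 bytes), so copying the PGSQL system tables for the pending namespace failed, and with it the create namespace operation.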
./Universe_logs/172.151.30.222/master/yb-master.ip-172-151-30-222.us-west-2.compute.internal.yugabyte.log.INFO.20240425-132437.31325.gz:I0425 13:24:37.393837 31325 mem_tracker.cc:268] Root memory limit is 1622684467
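To make the numbers in the master warning easier to compare, here is a minimal Python sketch that pulls the limit, the requested bytes, and the already-consumed bytes out of the message. The regular expression is written against the wording of this one log line only, and the helper name `parse_mem_limit_error` is hypothetical, not part of any YugabyteDB tooling.

```python
import re

# Message portion of the master warning quoted above.
MASTER_LOG_LINE = (
    "Operation of type kWrite failed: tablet 00000000000000000000000000000000 "
    "hit the limit 1622684467 of memory tracker 0x00002797bfda4520 -> root "
    "while trying to consume an additional 393916 bytes; "
    "the memory tracker had already given out 1735573504 bytes."
)

# Assumption: the message wording matches the line above exactly.
# Other versions may phrase it differently and would need another pattern.
PATTERN = re.compile(
    r"hit the limit (?P<limit>\d+) of memory tracker \S+ -> (?P<tracker>\S+) "
    r"while trying to consume an additional (?P<requested>\d+) bytes; "
    r"the memory tracker had already given out (?P<consumed>\d+) bytes"
)


def parse_mem_limit_error(line):
    """Return the byte counts and tracker name from the message, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    info = {key: int(match[key]) for key in ("limit", "requested", "consumed")}
    info["tracker"] = match["tracker"]
    # A positive value means the tracker was already past its limit
    # before this allocation was attempted.
    info["over_limit"] = info["consumed"] - info["limit"]
    return info


info = parse_mem_limit_error(MASTER_LOG_LINE)
print(info)
```

For the line above, `over_limit` comes out to 112889037 bytes (roughly 108 MiB), so the root tracker was already past its limit before the 393916-byte write was attempted.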
In this test, we create 150 databases (100 normal and 50 colocated) and continuously run 20-25 DDLs in parallel. At the end of step 3, we drop the database and try to restore it from the backup created at the start of step 3.
Instance type: c6g.2xlarge
When running the test on a larger instance (c6g.4xlarge), we didn't see this issue. On the smaller instance with the PITR steps included (the test originally also took PITR snapshots and performed restores), we didn't see it either.
One possible explanation: DDL-related data may have been removed from the master at PITR restore time. With the PITR steps removed, that data is no longer cleaned up, which might explain why the master now runs out of memory.
Jira Link: DB-11104
Description
Tried on version 2024.1.0.0-b102 Logs:
A CREATE DATABASE query issued during a backup restore failed with the error shown in the related logs. A database with the same name had been dropped about 14 minutes earlier. DB name: postgres_99
Test details:
G-flags:
Issue Type
kind/bug