E1118 12:07:39.984506 263073 ts_tablet_manager.cc:2086] T b7386e5049624e75aafc390011eeac8b P fd339b15e1e7407aae1ee7587357f450: Tablet failed to bootstrap: Illegal state (yb/tablet/tablet_bootstrap.cc:1684): Failed log replay. Reason: WAL files missing, or committed op id is incorrect. Expected both term and index of prev_op_id to be greater than or equal to the corresponding components of committed_op_id. prev_op_id=0.0, committed_op_id=1.14
2024-11-18 12:00:42.247 UTC [86199] ERROR: Timed out waiting kResponseSent, state: kRequestSent
2024-11-18 12:00:42.247 UTC [86199] STATEMENT: CREATE TABLE public.noncolocateddbnoncolocatedtable1_p7000 PARTITION OF public.noncolocateddbnoncolocatedtable1
FOR VALUES FROM (7000) TO (8000)
PARTITION BY RANGE (age)
SPLIT INTO 2 TABLETS;
Test details:
1. Create a cluster with required g-flags
2. Schema creation:
a. Create 2 non-colocated DBs
b. Create 2 tables in the non-colocated DBs
(one partitioned by time and the other by an integer column)
c. Create materialized views on all the tables
d. Create indexes on all these tables
3. Create pg_partman extension with/without schema (TBD) and create pg_cron extension
4. Create PITR on all the databases and note down time T0
5. Try creating a partition set using create_parent(), passing 'partman' as p_type, and
verify it fails with the error below:
ERROR: partman is not a valid partitioning type for pg_partman
6. Try creating a partition set on a table which doesn't exist and verify it fails
with the expected error
7. Try creating a sub-partition set on a parent table which doesn't have a partition
set and verify it fails with:
ERROR: Cannot subpartition a table that is not managed by pg_partman already.
Given top parent table not found in public.part_config: public.t2
8. Try creating a partition set on a column which is not part of the partition key and
verify it fails with the expected error.
9. TBA: Create a partition set on a table which already has overlapping partitioned
tables and verify it fails with the expected error
10. Drop the extension and tables and re-create them
11. Create a partition set (create_parent()) for each table and validate that it works
by adding a few rows to each partition, keeping the interval as low as possible
(see the partition-set sketch after this list)
12. Create a sub-partition set (create_sub_parent()) for each partition set, validate that
old data is not present, and validate that new rows are added to both the parent
partition and the sub-partition
13. Schedule a pg_cron job which will insert rows continuously into all the tables.
14. Schedule a pg_cron job which will run run_maintenance()/run_maintenance_proc()
every 5 minutes (see the pg_cron sketch after this list)
15. Start a thread to verify that run_maintenance()/run_maintenance_proc() is dropping
old partition tables and creating new partition tables
16. Start a thread to update and delete data
17. Let this run for 20 minutes; this should verify the functionality
18. Stop the DML ops threads and cron jobs.
19. Note down the time (T1) and the data.
20. Resume the DML ops threads and cron jobs.
21. Sleep for 10 minutes
22. Stop the DML ops threads and cron jobs.
23. Restore to time T1 and validate the data
24. Refresh all the Materialized views and note down the row count for each table
25. Manually delete one of the existing partitions (say, P1)
26. Run partition_gap_fill() and verify P1 is re-created and has data, and verify whether
any data was deleted (see the gap-fill/retention sketch after this list)
27. Refresh all the Materialized views and validate row count is the same as step 24.
28. Alter a few tables and rename a column; verify the column name is also renamed
in the partitioned tables (ALTER, TRUNCATE, etc.)
29. Resume the DML ops threads and cron jobs.
30. Sleep for 10 minutes
31. Stop the DML ops threads and cron jobs.
32. Run drop_partition_time() and drop_partition_id() and verify the partitioned tables
are detached from the parent table (see the gap-fill/retention sketch after this list).
33. Create a backup of some database (COPY TO/COPY FROM, etc.)
34. Drop those databases
35. Restore all the databases and verify the schema and data are intact. ---------------------------------------->>>>>>>ISSUE OCCURRED HERE
36. Resume the DML ops threads and cron jobs.
37. Verify everything is working.
38. Stop the DML ops threads and cron jobs.
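For reference, here is a minimal partition-set sketch covering steps 3, 5, 11, and 12. It assumes pg_partman 5.x function signatures (p_parent_table, p_control, p_interval, p_type, p_declarative_check); the table name public.t1, its columns, and the intervals are illustrative placeholders, not the actual test schema.

CREATE SCHEMA IF NOT EXISTS partman;
CREATE EXTENSION IF NOT EXISTS pg_partman SCHEMA partman;
CREATE EXTENSION IF NOT EXISTS pg_cron;  -- assumes pg_cron is in shared_preload_libraries

-- Hypothetical integer-partitioned table standing in for the test tables.
CREATE TABLE public.t1 (
    id         int NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
) PARTITION BY RANGE (id);

-- Step 5: 'partman' is not a valid p_type in pg_partman 5.x, so this call is expected to fail.
SELECT partman.create_parent(
    p_parent_table := 'public.t1',
    p_control      := 'id',
    p_interval     := '1000',
    p_type         := 'partman'
);

-- Step 11: register the partition set with the default declarative range type.
SELECT partman.create_parent(
    p_parent_table := 'public.t1',
    p_control      := 'id',
    p_interval     := '1000'
);

-- Step 12: sub-partition every child of the set by time; p_declarative_check
-- acknowledges that existing child data is destroyed when children become
-- partitioned parents (which is why old data should no longer be present).
SELECT partman.create_sub_parent(
    p_top_parent        := 'public.t1',
    p_control           := 'created_at',
    p_interval          := '1 hour',
    p_declarative_check := 'yes'
);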
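The pg_cron sketch for steps 13 and 14, plus the stop points in steps 18/22/31, assuming the cron.schedule(jobname, schedule, command) and cron.unschedule(jobname) forms; the job names and the insert statement are illustrative.

-- Step 14: run pg_partman maintenance every 5 minutes.
SELECT cron.schedule(
    'partman-maintenance',
    '*/5 * * * *',
    $$CALL partman.run_maintenance_proc()$$
);

-- Step 13: keep inserting rows; pg_cron granularity is one minute, so
-- "continuously" here means a batch insert into the hypothetical table every minute.
SELECT cron.schedule(
    'insert-load',
    '* * * * *',
    $$INSERT INTO public.t1 (id, created_at)
      SELECT i, now() FROM generate_series(1, 100) AS i$$
);

-- Steps 18/22/31: stop the cron jobs before noting T1 or restoring.
SELECT cron.unschedule('partman-maintenance');
SELECT cron.unschedule('insert-load');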
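And a gap-fill/retention sketch for steps 25-26 and 32, assuming pg_partman's partition_gap_fill(), drop_partition_time(), and drop_partition_id() functions driven by partman.part_config; the child partition name, the parent names (t1 for the hypothetical id-partitioned set, t_time for a hypothetical time-partitioned set), and the retention values are illustrative.

-- Steps 25/26: drop one child partition, then let pg_partman fill the gap.
DROP TABLE public.t1_p1000;                      -- hypothetical child partition "P1"
SELECT partman.partition_gap_fill('public.t1');  -- returns the number of partitions created

-- Step 32: set retention so old partitions are detached but kept, then run the
-- retention functions for the time-based and id-based partition sets.
UPDATE partman.part_config
SET    retention = '1 hour',
       retention_keep_table = true
WHERE  parent_table = 'public.t_time';
SELECT partman.drop_partition_time('public.t_time');

UPDATE partman.part_config
SET    retention = '5000',
       retention_keep_table = true
WHERE  parent_table = 'public.t1';
SELECT partman.drop_partition_id('public.t1');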
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
[X] I confirm this issue does not contain any sensitive information.
It is actually related to PITR + RBS + DeleteTable + a recently fixed bug, so multiple stars do need to align properly for this issue to happen :slightly_smiling_face:.
A PITR was performed on the database at: 1119 18:25:32
Subsequently the table was dropped, I am assuming as a result of the drop database done in order to perform a restore. The table was dropped at: I1119 18:27:00.931933
Unfortunately, this drop of the table/database ran into the RBS race from the bug, triggering an RBS on a tablet that was anyway being dropped as part of the drop table.
Once the RBS completed, there was only 1 WAL segment left to replay, and it contained the RESTORE_ON_TABLET op from the earlier PITR operation.
Applying this RESTORE operation failed an assertion check, because the op id obtained via RBS is ahead of the op id that RESTORE_ON_TABLET was attempting to write, causing the tserver to crash.
Jira Link: DB-14113
Description
Version: 2024.2.0.0-b127
Logs: Added in Jira comments
Encountered the following FATAL during the restore of a backup in a pg_partman test.
Also saw the following tserver error
Note: We also saw the coredump mentioned in https://github.com/yugabyte/yugabyte-db/issues/24929
YBC logs indicate the restore failed due to -
In the postgres logs we see that CREATE TABLE timed out, so it could be because of the master crash (same as https://github.com/yugabyte/yugabyte-db/issues/24929).