yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB][Tablet Splitting][PITR] Queries and PITR are timing out. #16268

Closed Arjun-yb closed 1 year ago

Arjun-yb commented 1 year ago

Jira Link: DB-5693

Description

Version: 2.17.2.0-b151

Steps:

  1. Create a universe with RPC compression, packed columns and automatic tablet splitting enabled
    m_flags = {
        "enable_automatic_tablet_splitting": "true",
        "tablet_split_high_phase_shard_count_per_node": 200,
        "tablet_split_high_phase_size_threshold_bytes": 102400,  # 100 KB
        "tablet_split_low_phase_size_threshold_bytes": 10240,  # 10 KB
        "tablet_split_low_phase_shard_count_per_node": 16,
        "tablet_split_limit_per_table": 512,
        "enable_stream_compression": "true",
        "stream_compression_algo": 3,
        "ysql_enable_packed_row": "true",
    }
    t_flags = {
        "enable_automatic_tablet_splitting": "true",
        "yb_num_shards_per_tserver": 1,
        "ysql_num_shards_per_tserver": 1,
        "enable_stream_compression": "true",
        "stream_compression_algo": 3,
        "ysql_enable_packed_row": "true",
    }
  2. Create a colocated database and a non-colocated database
  3. Create tables, including opt-out tables
  4. Enable PITR: create a snapshot schedule (on the non-colocated DB)
  5. Record current timestamp t1
  6. Perform the steps below for both colocated and non-colocated tables
  7. Insert some data (20K) into the tables and validate the data and tablets
  8. Take backup(b1)
  9. Delete some data from tables and observe the tablet counts
  10. Restore to t1 and observe the tablet counts, load some more data
  11. Copy data from table into a file 'f1'
  12. Delete all rows from table
  13. Copy data from 'f1' into table and validate
  14. Record current timestamp t2
  15. Start a workload and do the following while splitting is in progress
    1. Alter table add a column
    2. Alter table drop a column
  16. Validate data and observe the tablet counts
  17. Restore to t2
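
For reference, the PITR parts of the steps above (4, 5, 14 and 17) can be sketched roughly as follows. This is a minimal sketch, not the actual test code: it assumes the `yb-admin` CLI with its `create_snapshot_schedule` and `restore_snapshot_schedule` commands and a restore target given as a Unix timestamp in microseconds. The master address, database name, interval/retention values and all helper names are hypothetical placeholders.

```python
import time

# Hypothetical placeholder; replace with the universe's master addresses.
MASTER_ADDRESSES = "127.0.0.1:7100"


def now_micros() -> int:
    """Return the current Unix time in microseconds (used for t1 and t2)."""
    return time.time_ns() // 1_000


def create_schedule_cmd(db_name: str, interval_min: int = 1,
                        retention_min: int = 10) -> list[str]:
    """Build the yb-admin command that creates a snapshot schedule (step 4).

    The interval and retention defaults are illustrative, not the values
    used in the original run.
    """
    return [
        "yb-admin", "-master_addresses", MASTER_ADDRESSES,
        "create_snapshot_schedule",
        str(interval_min), str(retention_min),
        f"ysql.{db_name}",
    ]


def restore_cmd(schedule_id: str, restore_ts_micros: int) -> list[str]:
    """Build the yb-admin command that restores to a recorded time (steps 10, 17)."""
    return [
        "yb-admin", "-master_addresses", MASTER_ADDRESSES,
        "restore_snapshot_schedule",
        schedule_id, str(restore_ts_micros),
    ]


if __name__ == "__main__":
    t1 = now_micros()  # step 5: record t1
    print(" ".join(create_schedule_cmd("non_colocated_db")))
    # "<schedule_id>" stands for the ID printed by create_snapshot_schedule.
    print(" ".join(restore_cmd("<schedule_id>", t1)))
```

Each command list can be passed to `subprocess.run(cmd, check=True, timeout=...)`; a timeout there is one way to detect the hang seen at step 17 instead of waiting indefinitely.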

Observations:

  1. At step 17, the PITR restore times out.
  2. While running this scenario manually, some basic SELECT queries time out after step 10. After some time, those queries work fine.

Arjun-yb commented 1 year ago

I observed the issue with 2.17.3.0-b123 (in which the fix has landed) as well.

sanketkedia commented 1 year ago

@Arjun-yb can you share the master/tserver logs for this run?

rthallamko3 commented 1 year ago

https://github.com/yugabyte/yugabyte-db/issues/16669 tracks the deadlock that causes the recent failure.

rthallamko3 commented 1 year ago

@Arjun-yb, I am closing this issue as the follow-up is tracked as part of https://github.com/yugabyte/yugabyte-db/issues/16669. This keeps the backport tracking simple.