yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
8.88k stars 1.05k forks source link

[YSQL] [Parallel Query] Parallel query feature doesn't scale well with number of workers. #21667

Open shantanugupta-yb opened 6 months ago

shantanugupta-yb commented 6 months ago

Jira Link: DB-10557

Description

The ql_read_latency is close to half when parallel workers are doubled but the overall latencies are not reducing in same proportion. Colocated tables + Analyse executed on all tables before executing the query

yb_colocated=# explain (analyse,dist,debug) /*+ set(max_parallel_workers_per_gather 2) */ select count(*) from pkeyBigint1M_1 as a join pkeyBigint1M_2 as b on a.col_bigint_card1_1=b.col_bigint_card1_1 and a.col_bigint_card3_1=b.col_bigint_card3_1 where b.col_bigint_card1_1=1000000 and  b.col_bigint_card2_1 in (1000058);
                                                                               QUERY PLAN                                                                                
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1043885903.87..1043885903.88 rows=1 width=8) (actual time=264.310..264.310 rows=1 loops=1)
   ->  Gather  (cost=1043885903.65..1043885903.86 rows=2 width=8) (actual time=264.204..268.543 rows=3 loops=1)
         **Workers Planned: 2
         Workers Launched: 2**
         ->  Partial Aggregate  (cost=1043884903.65..1043884903.66 rows=1 width=8) (actual time=244.688..244.688 rows=1 loops=3)
               ->  Parallel Hash Join  (cost=271491.62..2218236.98 rows=416666666667 width=0) (actual time=244.682..244.683 rows=0 loops=3)
                     Hash Cond: (b.col_bigint_card3_1 = a.col_bigint_card3_1)
                     ->  Parallel Index Scan using pkeybigint1m_2_col_bigint_card2_1_idx on pkeybigint1m_2 b  (cost=16.89..406.52 rows=416667 width=16) (never executed)
                           Index Cond: (col_bigint_card2_1 = 1000058)
                           Remote Filter: (col_bigint_card1_1 = 1000000)
                           Estimated Seeks: 101
                           Estimated Nexts: 100
                     ->  Parallel Hash  (cost=271469.55..271469.55 rows=415 width=16) (actual time=243.408..243.408 rows=0 loops=3)
                           Buckets: 1024  Batches: 1  Memory Usage: 0kB
                           ->  Parallel Seq Scan on pkeybigint1m_1 a  (cost=8.44..271469.55 rows=415 width=16) (actual time=240.007..240.007 rows=0 loops=3)
                                 Remote Filter: (col_bigint_card1_1 = 1000000)
                                 Storage Table Read Requests: 97
                                 Storage Table Read Execution Time: 82.748 ms
                                 Metric rocksdb_block_cache_hit: 3686.000
                                 Metric rocksdb_block_cache_index_hit: 291.000
                                 Metric rocksdb_block_cache_data_hit: 3395.000
                                 Metric rocksdb_block_cache_bytes_read: 110167611.000
                                 Metric rocksdb_number_db_seek: 97.000
                                 Metric rocksdb_number_db_next: 117950.667
                                 Metric rocksdb_number_db_seek_found: 97.000
                                 Metric rocksdb_number_db_next_found: 117950.667
                                 Metric rocksdb_iter_bytes_read: 103052994.333
                                 Metric rocksdb_block_cache_multi_touch_hit: 3686.000
                                 Metric rocksdb_block_cache_multi_touch_bytes_read: 110167611.000
                                 Metric docdb_keys_found: 117950.667
                                 Metric ql_read_latency: sum: 66060.667, count: 97.000
                                 Estimated Seeks: 1
                                 Estimated Nexts: 999999
 Planning Time: 0.219 ms
 Execution Time: 268.629 ms
 Storage Read Requests: 291
 Storage Read Execution Time: 248.245 ms
 Storage Write Requests: 0
 Catalog Read Requests: 0
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Metric rocksdb_block_cache_hit: 11058
 Metric rocksdb_block_cache_index_hit: 873
 Metric rocksdb_block_cache_data_hit: 10185
 Metric rocksdb_block_cache_bytes_read: 330502833
 Metric rocksdb_number_db_seek: 291
 Metric rocksdb_number_db_next: 353852
 Metric rocksdb_number_db_seek_found: 291
 Metric rocksdb_number_db_next_found: 353852
 Metric rocksdb_iter_bytes_read: 309158983
 Metric rocksdb_block_cache_multi_touch_hit: 11058
 Metric rocksdb_block_cache_multi_touch_bytes_read: 330502833
 Metric docdb_keys_found: 353852
 Metric ql_read_latency: sum: 198182, count: 291
 **Storage Execution Time: 248.245 ms**
 Peak Memory Usage: 1329 kB
(56 rows)

**Time: 269.426 ms**

Time: 220.502 ms

-  **latencies with 8 parallel workers**
```plpgsql
yb_colocated=# explain (analyse,dist,debug) /*+ set(max_parallel_workers_per_gather 8) */ select count(*) from pkeyBigint1M_1 as a join pkeyBigint1M_2 as b on a.col_bigint_card1_1=b.col_bigint_card1_1 and a.col_bigint_card3_1=b.col_bigint_card3_1 where b.col_bigint_card1_1=1000000 and  b.col_bigint_card2_1 in (1000058);
                                                                               QUERY PLAN                                                                                
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=312758676.70..312758676.71 rows=1 width=8) (actual time=211.988..211.988 rows=1 loops=1)
   ->  Gather  (cost=312758675.87..312758676.68 rows=8 width=8) (actual time=211.550..225.205 rows=8 loops=1)
         **Workers Planned: 8
         Workers Launched: 7**
         ->  Partial Aggregate  (cost=312757675.87..312757675.88 rows=1 width=8) (actual time=167.360..167.360 rows=1 loops=8)
               ->  Parallel Hash Join  (cost=81465.23..257675.87 rows=125000000000 width=0) (actual time=167.356..167.356 rows=0 loops=8)
                     Hash Cond: (b.col_bigint_card3_1 = a.col_bigint_card3_1)
                     ->  Parallel Index Scan using pkeybigint1m_2_col_bigint_card2_1_idx on pkeybigint1m_2 b  (cost=16.89..133.78 rows=125000 width=16) (never executed)
                           Index Cond: (col_bigint_card2_1 = 1000058)
                           Remote Filter: (col_bigint_card1_1 = 1000000)
                           Estimated Seeks: 101
                           Estimated Nexts: 100
                     ->  Parallel Hash  (cost=81446.78..81446.78 rows=125 width=16) (actual time=162.936..162.937 rows=0 loops=8)
                           Buckets: 1024  Batches: 1  Memory Usage: 0kB
                           ->  Parallel Seq Scan on pkeybigint1m_1 a  (cost=8.44..81446.78 rows=125 width=16) (actual time=159.511..159.511 rows=0 loops=8)
                                 Remote Filter: (col_bigint_card1_1 = 1000000)
                                 Storage Table Read Requests: 18
                                 Storage Table Read Execution Time: 25.193 ms
                                 Metric rocksdb_block_cache_hit: 691.125
                                 Metric rocksdb_block_cache_index_hit: 54.750
                                 Metric rocksdb_block_cache_data_hit: 636.375
                                 Metric rocksdb_block_cache_bytes_read: 20641235.125
                                 Metric rocksdb_number_db_seek: 18.250
                                 Metric rocksdb_number_db_next: 23310.250
                                 Metric rocksdb_number_db_seek_found: 18.250
                                 Metric rocksdb_number_db_next_found: 23310.250
                                 Metric rocksdb_iter_bytes_read: 20365173.875
                                 Metric rocksdb_block_cache_multi_touch_hit: 691.125
                                 Metric rocksdb_block_cache_multi_touch_bytes_read: 20641235.125
                                 Metric docdb_keys_found: 23310.250
                                 **Metric ql_read_latency: sum: 16213.625, count: 18.250**
                                 Estimated Seeks: 1
                                 Estimated Nexts: 999999
 Planning Time: 0.197 ms
 Execution Time: 225.298 ms
 Storage Read Requests: 146
 Storage Read Execution Time: 201.541 ms
 Storage Write Requests: 0
 Catalog Read Requests: 0
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Metric rocksdb_block_cache_hit: 5529
 Metric rocksdb_block_cache_index_hit: 438
 Metric rocksdb_block_cache_data_hit: 5091
 Metric rocksdb_block_cache_bytes_read: 165129881
 Metric rocksdb_number_db_seek: 146
 Metric rocksdb_number_db_next: 186482
 Metric rocksdb_number_db_seek_found: 146
 Metric rocksdb_number_db_next_found: 186482
 Metric rocksdb_iter_bytes_read: 162921391
 Metric rocksdb_block_cache_multi_touch_hit: 5529
 Metric rocksdb_block_cache_multi_touch_bytes_read: 165129881
 Metric docdb_keys_found: 186482
 Metric ql_read_latency: sum: 129709, count: 146
 **Storage Execution Time: 201.541 ms**
 Peak Memory Usage: 1065 kB
(56 rows)

**Time: 226.059 ms**

Table schema

yb_colocated=# \d pkeyBigint1M_1
                         Table "public.pkeybigint1m_1"
       Column        |          Type          | Collation | Nullable | Default 
---------------------+------------------------+-----------+----------+---------
 col_bigint_card1_1  | bigint                 |           | not null | 
 col_bigint_card2_1  | bigint                 |           |          | 
 col_bigint_card3_1  | bigint                 |           |          | 
 col_bigint_id_1     | bigint                 |           | not null | 
 col_bigint_id_2     | bigint                 |           | not null | 
 col_bigint_id_3     | bigint                 |           | not null | 
 col_float2_1        | real                   |           |          | 
 col_float2_2        | real                   |           |          | 
 col_float5_1        | real                   |           |          | 
 col_float5_2        | real                   |           |          | 
 col_boolean_1       | boolean                |           |          | 
 col_varchar10_id_1  | character varying(10)  |           |          | 
 col_varchar100_id_1 | character varying(100) |           |          | 
 col_varchar100_id_2 | character varying(100) |           |          | 
 col_varchar500_id_1 | character varying(500) |           |          | 
Indexes:
    "pkeybigint1m_1_pkey" PRIMARY KEY, lsm (col_bigint_id_1 ASC, col_bigint_id_2 ASC, col_bigint_id_3 ASC, col_bigint_card1_1 ASC), colocation: true
    "pkeybigint1m_1_col_bigint_card2_1_idx" lsm (col_bigint_card2_1 ASC), colocation: true
Colocation: true

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

rthallamko3 commented 6 months ago

@shantanugupta-yb , can you confirm if the scale problem exists for larger datasets/queries that run longer than a few seconds?

rthallamko3 commented 5 months ago

@shantanugupta-yb has revamped the test with larger datasets, we continue to see the problem. See preliminary report