yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL] Even after running ANALYZE on the tables, the cost estimation seems to be incorrect, which is why an inefficient plan is selected for joins #16383

Open shantanugupta-yb opened 1 year ago

shantanugupta-yb commented 1 year ago

Jira Link: DB-5797

Description

For a join condition on the range index of a range-partitioned table, with one additional filter condition, a Nested Loop was again selected and the query executed in 17030 ms. After running ANALYZE on both tables, a Hash Join was selected and the latency dropped to 1627 ms, but with a Merge Join the query executed in 205 ms.

So even after running ANALYZE on both tables, which updates the table statistics, the cost estimation still seems to be incorrect.

Before running ANALYZE, a Nested Loop is selected:

yugabyte=# explain (analyse, verbose, dist) select Rangetbl_1.col_bigint_id_2,Rangetbl_2.col_bigint_id_2 from Rangetbl_1 join Rangetbl_2 on Rangetbl_1.col_bigint_id_2=Rangetbl_2.col_bigint_id_2 where Rangetbl_1.col_bigint_id_2<100001;
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..21.12 rows=50 width=16) (actual time=2.409..17014.271 rows=100000 loops=1)
   Output: rangetbl_1.col_bigint_id_2, rangetbl_2.col_bigint_id_2
   ->  Index Only Scan using rangetbl_1_col_bigint_id_2_idx on public.rangetbl_1  (cost=0.00..5.12 rows=10 width=8) (actual time=2.124..43.668 rows=100000 loops=1)
         Output: rangetbl_1.col_bigint_id_2
         Index Cond: (rangetbl_1.col_bigint_id_2 < 100001)
         Heap Fetches: 0
         Storage Index Read Requests: 98
         Storage Index Execution Time: 2.000 ms
   ->  Index Only Scan using rangetbl_2_col_bigint_id_2_idx on public.rangetbl_2  (cost=0.00..1.50 rows=10 width=8) (actual time=0.163..0.164 rows=1 loops=100000)
         Output: rangetbl_2.col_bigint_id_2
         Index Cond: (rangetbl_2.col_bigint_id_2 = rangetbl_1.col_bigint_id_2)
         Heap Fetches: 0
         Storage Index Read Requests: 1
         Storage Index Execution Time: 0.149 ms
 Planning Time: 0.130 ms
 Execution Time: 17029.421 ms
 Storage Read Requests: 100098
 Storage Write Requests: 0
 Storage Execution Time: 14861.840 ms
 Peak Memory Usage: 24 kB
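
The exact ANALYZE invocations used between this run and the next are not shown in the report; for the two tables in the join they would simply be:

ANALYZE rangetbl_1;
ANALYZE rangetbl_2;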

After running ANALYZE on both tables involved in the join, a Hash Join is selected by default:

yugabyte=# explain (analyse, verbose, dist) select Rangetbl_1.col_bigint_id_2,Rangetbl_2.col_bigint_id_2 from Rangetbl_1 join Rangetbl_2 on Rangetbl_1.col_bigint_id_2=Rangetbl_2.col_bigint_id_2 where Rangetbl_1.col_bigint_id_2<100001;
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1254.00..105104.00 rows=10000 width=16) (actual time=192.835..1611.669 rows=100000 loops=1)
   Output: rangetbl_1.col_bigint_id_2, rangetbl_2.col_bigint_id_2
   Hash Cond: (rangetbl_2.col_bigint_id_2 = rangetbl_1.col_bigint_id_2)
   ->  Seq Scan on public.rangetbl_2  (cost=0.00..100000.00 rows=1000000 width=8) (actual time=1.570..1216.975 rows=1000000 loops=1)
         Output: rangetbl_2.col_bigint_id_1, rangetbl_2.col_bigint_id_2, rangetbl_2.col_bigint_id_3, rangetbl_2.col_bigint_id_4, rangetbl_2.col_bigint_1, rangetbl_2.col_bigint_2, rangetbl_2.col_float2_1, rangetbl_2.col_float2_2, rangetbl_2.col_float5_1, rangetbl_2.col_float5_2, rangetbl_2.col_boolean_1, rangetbl_2.col_varchar10_id_1, rangetbl_2.col_varchar100_id_1, rangetbl_2.col_varchar100_id_2, rangetbl_2.col_varchar500_id_1
         Storage Table Read Requests: 977
         Storage Table Execution Time: 1125.986 ms
   ->  Hash  (cost=1129.00..1129.00 rows=10000 width=8) (actual time=191.141..191.142 rows=100000 loops=1)
         Output: rangetbl_1.col_bigint_id_2
         Buckets: 131072 (originally 16384)  Batches: 2 (originally 1)  Memory Usage: 3073kB
         ->  Index Only Scan using rangetbl_1_col_bigint_id_2_idx on public.rangetbl_1  (cost=0.00..1129.00 rows=10000 width=8) (actual time=1.991..172.984 rows=100000 loops=1)
               Output: rangetbl_1.col_bigint_id_2
               Index Cond: (rangetbl_1.col_bigint_id_2 < 100001)
               Heap Fetches: 0
               Storage Index Read Requests: 98
               Storage Index Execution Time: 144.998 ms
 Planning Time: 5.239 ms
 Execution Time: 1616.398 ms
 Storage Read Requests: 1075
 Storage Write Requests: 0
 Storage Execution Time: 1270.984 ms
 Peak Memory Usage: 5107 kB

But the most efficient query plan with respect to latency uses a Merge Join.
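
The report does not show how this plan was obtained. One way to reproduce it for comparison is to disable the other join methods in the current session before re-running the same EXPLAIN; this is an assumed sketch using the standard PostgreSQL planner settings, not necessarily the reporter's exact steps:

SET enable_nestloop = off;  -- take the Nested Loop plan out of consideration
SET enable_hashjoin = off;  -- take the Hash Join plan out of consideration
-- re-run the same EXPLAIN (ANALYZE, VERBOSE, DIST) query; the planner then chooses the Merge Join shown below
-- restore the defaults afterwards: RESET enable_nestloop; RESET enable_hashjoin;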

yugabyte=# explain (analyse, verbose, dist) select Rangetbl_1.col_bigint_id_2,Rangetbl_2.col_bigint_id_2 from Rangetbl_1 join Rangetbl_2 on Rangetbl_1.col_bigint_id_2=Rangetbl_2.col_bigint_id_2 where Rangetbl_1.col_bigint_id_2<100001;
                                                                                   QUERY PLAN                                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=41.85..113783.00 rows=10000 width=16) (actual time=3.806..198.793 rows=100000 loops=1)
   Output: rangetbl_1.col_bigint_id_2, rangetbl_2.col_bigint_id_2
   Merge Cond: (rangetbl_2.col_bigint_id_2 = rangetbl_1.col_bigint_id_2)
   ->  Index Only Scan using rangetbl_2_col_bigint_id_2_idx on public.rangetbl_2  (cost=0.00..110004.00 rows=1000000 width=8) (actual time=1.826..25.152 rows=100001 loops=1)
         Output: rangetbl_2.col_bigint_id_2
         Heap Fetches: 0
         Storage Index Read Requests: 99
         Storage Index Execution Time: 2.000 ms
   ->  Materialize  (cost=0.00..1154.00 rows=10000 width=8) (actual time=1.973..146.444 rows=100000 loops=1)
         Output: rangetbl_1.col_bigint_id_2
         ->  Index Only Scan using rangetbl_1_col_bigint_id_2_idx on public.rangetbl_1  (cost=0.00..1129.00 rows=10000 width=8) (actual time=1.961..134.984 rows=100000 loops=1)
               Output: rangetbl_1.col_bigint_id_2
               Index Cond: (rangetbl_1.col_bigint_id_2 < 100001)
               Heap Fetches: 0
               Storage Index Read Requests: 98
               Storage Index Execution Time: 107.999 ms
 Planning Time: 0.133 ms
 Execution Time: 203.371 ms
 Storage Read Requests: 197
 Storage Write Requests: 0
 Storage Execution Time: 109.999 ms
 Peak Memory Usage: 84 kB

Table schema

                           Table "public.rangetbl_1"
       Column        |          Type          | Collation | Nullable | Default 
---------------------+------------------------+-----------+----------+---------
 col_bigint_id_1     | bigint                 |           | not null | 
 col_bigint_id_2     | bigint                 |           |          | 
 col_bigint_id_3     | bigint                 |           |          | 
 col_bigint_id_4     | bigint                 |           |          | 
 col_bigint_1        | bigint                 |           |          | 
 col_bigint_2        | bigint                 |           |          | 
 col_float2_1        | real                   |           |          | 
 col_float2_2        | real                   |           |          | 
 col_float5_1        | real                   |           |          | 
 col_float5_2        | real                   |           |          | 
 col_boolean_1       | boolean                |           |          | 
 col_varchar10_id_1  | character varying(10)  |           |          | 
 col_varchar100_id_1 | character varying(100) |           |          | 
 col_varchar100_id_2 | character varying(100) |           |          | 
 col_varchar500_id_1 | character varying(500) |           |          | 
Indexes:
    "rangetbl_1_pkey" PRIMARY KEY, lsm (col_bigint_id_1 ASC)
    "rangetbl_1_col_bigint_id_2_idx" lsm (col_bigint_id_2 ASC)


tanujnay112 commented 1 year ago

@shantanugupta-yb Did you enable the guc variable 'yb_enable_optimizer_statistics' before trying this?

shantanugupta-yb commented 1 year ago

@tanujnay112, yes, I did enable the yb_enable_optimizer_statistics variable. Here are the observations for the same: https://yugabyte.slack.com/archives/C03RB7JHKM0/p1678519579065579?thread_ts=1678442689.915799&cid=C03RB7JHKM0
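
For anyone reproducing this, the setting is an ordinary session-level GUC, so it can be enabled and verified with the usual commands (a minimal sketch):

SET yb_enable_optimizer_statistics = on;   -- enable the planner's use of the collected statistics (per the discussion above)
SHOW yb_enable_optimizer_statistics;       -- confirm the current value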

mtakahar commented 1 year ago

Merge Join (cost=41.85..113783.00 rows=10000 width=16) (actual time=3.806..198.793 rows=100000 loops=1)

The access path to rangetbl_1 is exactly the same and placed on the inner side in both plans. The Materialize node in the Merge Join plan is more expensive than the Hash node but only slightly (+25).

    ->  Index Only Scan using rangetbl_1_col_bigint_id_2_idx on public.rangetbl_1  (cost=0.00..1129.00 rows=10000 width=8) (actual time=1.991..172.984 rows=100000 loops=1)

Hash:

-> Seq Scan on public.rangetbl_2 (cost=0.00..100000.00 rows=1000000 width=8) (actual time=1.570..1216.975 rows=1000000 loops=1)

Merge:

-> Index Only Scan using rangetbl_2_col_bigint_id_2_idx on public.rangetbl_2 (cost=0.00..110004.00 rows=1000000 width=8) (actual time=1.826..25.152 rows=100001 loops=1)

Seq Scan output (Hash):

Output: rangetbl_2.col_bigint_id_1, rangetbl_2.col_bigint_id_2, rangetbl_2.col_bigint_id_3, rangetbl_2.col_bigint_id_4, rangetbl_2.col_bigint_1, rangetbl_2.col_bigint_2, rangetbl_2.col_float2_1, rangetbl_2.col_float2_2, rangetbl_2.col_float5_1, rangetbl_2.col_float5_2, rangetbl_2.col_boolean_1, rangetbl_2.col_varchar10_id_1, rangetbl_2.col_varchar100_id_1, rangetbl_2.col_varchar100_id_2, rangetbl_2.col_varchar500_id_1

Index Only Scan output (Merge):

Output: rangetbl_2.col_bigint_id_2

The difference in the execution time of the join node by itself seems reasonable:

Hash:  1611.669 - (1216.975 + 191.142) = 203.552 ms
Merge:  198.793 - (25.152   + 146.444) =  27.197 ms