yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL] Even after running ANALYZE on the tables, the cost estimation seems to be incorrect, which is why an inefficient plan is selected for joins #16383

Open shantanugupta-yb opened 1 year ago

shantanugupta-yb commented 1 year ago

Jira Link: DB-5797

Description

For a join condition on the range index of a range-partitioned table, with one additional filter condition, a Nested Loop was again selected and the query executed in 17030 ms. After running ANALYZE on both tables, a Hash Join was selected and the latency dropped to 1627 ms, but with a Merge Join the query executed in 205 ms.

So even after running ANALYZE on both tables, which updates the table statistics, the cost estimation still seems to be incorrect.

Before running ANALYZE, a Nested Loop is selected:

yugabyte=# explain (analyse, verbose, dist) select Rangetbl_1.col_bigint_id_2,Rangetbl_2.col_bigint_id_2 from Rangetbl_1 join Rangetbl_2 on Rangetbl_1.col_bigint_id_2=Rangetbl_2.col_bigint_id_2 where Rangetbl_1.col_bigint_id_2<100001;
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..21.12 rows=50 width=16) (actual time=2.409..17014.271 rows=100000 loops=1)
   Output: rangetbl_1.col_bigint_id_2, rangetbl_2.col_bigint_id_2
   ->  Index Only Scan using rangetbl_1_col_bigint_id_2_idx on public.rangetbl_1  (cost=0.00..5.12 rows=10 width=8) (actual time=2.124..43.668 rows=100000 loops=1)
         Output: rangetbl_1.col_bigint_id_2
         Index Cond: (rangetbl_1.col_bigint_id_2 < 100001)
         Heap Fetches: 0
         Storage Index Read Requests: 98
         Storage Index Execution Time: 2.000 ms
   ->  Index Only Scan using rangetbl_2_col_bigint_id_2_idx on public.rangetbl_2  (cost=0.00..1.50 rows=10 width=8) (actual time=0.163..0.164 rows=1 loops=100000)
         Output: rangetbl_2.col_bigint_id_2
         Index Cond: (rangetbl_2.col_bigint_id_2 = rangetbl_1.col_bigint_id_2)
         Heap Fetches: 0
         Storage Index Read Requests: 1
         Storage Index Execution Time: 0.149 ms
 Planning Time: 0.130 ms
 Execution Time: 17029.421 ms
 Storage Read Requests: 100098
 Storage Write Requests: 0
 Storage Execution Time: 14861.840 ms
 Peak Memory Usage: 24 kB
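
The exact ANALYZE invocations used between this run and the next are not shown in the report; for the two tables in the join they would simply be:

ANALYZE rangetbl_1;
ANALYZE rangetbl_2;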

After running ANALYZE on both tables involved in the join, a Hash Join is selected by default:

yugabyte=# explain (analyse, verbose, dist) select Rangetbl_1.col_bigint_id_2,Rangetbl_2.col_bigint_id_2 from Rangetbl_1 join Rangetbl_2 on Rangetbl_1.col_bigint_id_2=Rangetbl_2.col_bigint_id_2 where Rangetbl_1.col_bigint_id_2<100001;
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1254.00..105104.00 rows=10000 width=16) (actual time=192.835..1611.669 rows=100000 loops=1)
   Output: rangetbl_1.col_bigint_id_2, rangetbl_2.col_bigint_id_2
   Hash Cond: (rangetbl_2.col_bigint_id_2 = rangetbl_1.col_bigint_id_2)
   ->  Seq Scan on public.rangetbl_2  (cost=0.00..100000.00 rows=1000000 width=8) (actual time=1.570..1216.975 rows=1000000 loops=1)
         Output: rangetbl_2.col_bigint_id_1, rangetbl_2.col_bigint_id_2, rangetbl_2.col_bigint_id_3, rangetbl_2.col_bigint_id_4, rangetbl_2.col_bigint_1, rangetbl_2.col_bigint_2, rangetbl_2.col_float2_1, rangetbl_2.col_float2_2, rangetbl_2.col_float5_1, rangetbl_2.col_float5_2, rangetbl_2.col_boolean_1, rangetbl_2.col_varchar10_id_1, rangetbl_2.col_varchar100_id_1, rangetbl_2.col_varchar100_id_2, rangetbl_2.col_varchar500_id_1
         Storage Table Read Requests: 977
         Storage Table Execution Time: 1125.986 ms
   ->  Hash  (cost=1129.00..1129.00 rows=10000 width=8) (actual time=191.141..191.142 rows=100000 loops=1)
         Output: rangetbl_1.col_bigint_id_2
         Buckets: 131072 (originally 16384)  Batches: 2 (originally 1)  Memory Usage: 3073kB
         ->  Index Only Scan using rangetbl_1_col_bigint_id_2_idx on public.rangetbl_1  (cost=0.00..1129.00 rows=10000 width=8) (actual time=1.991..172.984 rows=100000 loops=1)
               Output: rangetbl_1.col_bigint_id_2
               Index Cond: (rangetbl_1.col_bigint_id_2 < 100001)
               Heap Fetches: 0
               Storage Index Read Requests: 98
               Storage Index Execution Time: 144.998 ms
 Planning Time: 5.239 ms
 Execution Time: 1616.398 ms
 Storage Read Requests: 1075
 Storage Write Requests: 0
 Storage Execution Time: 1270.984 ms
 Peak Memory Usage: 5107 kB

But the most efficient query plan with respect to latency uses a Merge Join.
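
The report does not show how this plan was obtained. One way to reproduce it for comparison is to disable the other join methods in the current session before re-running the same EXPLAIN; this is an assumed sketch using the standard PostgreSQL planner settings, not necessarily the reporter's exact steps:

SET enable_nestloop = off;  -- take the Nested Loop plan out of consideration
SET enable_hashjoin = off;  -- take the Hash Join plan out of consideration
-- re-run the same EXPLAIN (ANALYZE, VERBOSE, DIST) query; the planner then chooses the Merge Join shown below
-- restore the defaults afterwards: RESET enable_nestloop; RESET enable_hashjoin;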

yugabyte=# explain (analyse, verbose, dist) select Rangetbl_1.col_bigint_id_2,Rangetbl_2.col_bigint_id_2 from Rangetbl_1 join Rangetbl_2 on Rangetbl_1.col_bigint_id_2=Rangetbl_2.col_bigint_id_2 where Rangetbl_1.col_bigint_id_2<100001;
                                                                                   QUERY PLAN                                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=41.85..113783.00 rows=10000 width=16) (actual time=3.806..198.793 rows=100000 loops=1)
   Output: rangetbl_1.col_bigint_id_2, rangetbl_2.col_bigint_id_2
   Merge Cond: (rangetbl_2.col_bigint_id_2 = rangetbl_1.col_bigint_id_2)
   ->  Index Only Scan using rangetbl_2_col_bigint_id_2_idx on public.rangetbl_2  (cost=0.00..110004.00 rows=1000000 width=8) (actual time=1.826..25.152 rows=100001 loops=1)
         Output: rangetbl_2.col_bigint_id_2
         Heap Fetches: 0
         Storage Index Read Requests: 99
         Storage Index Execution Time: 2.000 ms
   ->  Materialize  (cost=0.00..1154.00 rows=10000 width=8) (actual time=1.973..146.444 rows=100000 loops=1)
         Output: rangetbl_1.col_bigint_id_2
         ->  Index Only Scan using rangetbl_1_col_bigint_id_2_idx on public.rangetbl_1  (cost=0.00..1129.00 rows=10000 width=8) (actual time=1.961..134.984 rows=100000 loops=1)
               Output: rangetbl_1.col_bigint_id_2
               Index Cond: (rangetbl_1.col_bigint_id_2 < 100001)
               Heap Fetches: 0
               Storage Index Read Requests: 98
               Storage Index Execution Time: 107.999 ms
 Planning Time: 0.133 ms
 Execution Time: 203.371 ms
 Storage Read Requests: 197
 Storage Write Requests: 0
 Storage Execution Time: 109.999 ms
 Peak Memory Usage: 84 kB

Table schema

                           Table "public.rangetbl_1"
       Column        |          Type          | Collation | Nullable | Default 
---------------------+------------------------+-----------+----------+---------
 col_bigint_id_1     | bigint                 |           | not null | 
 col_bigint_id_2     | bigint                 |           |          | 
 col_bigint_id_3     | bigint                 |           |          | 
 col_bigint_id_4     | bigint                 |           |          | 
 col_bigint_1        | bigint                 |           |          | 
 col_bigint_2        | bigint                 |           |          | 
 col_float2_1        | real                   |           |          | 
 col_float2_2        | real                   |           |          | 
 col_float5_1        | real                   |           |          | 
 col_float5_2        | real                   |           |          | 
 col_boolean_1       | boolean                |           |          | 
 col_varchar10_id_1  | character varying(10)  |           |          | 
 col_varchar100_id_1 | character varying(100) |           |          | 
 col_varchar100_id_2 | character varying(100) |           |          | 
 col_varchar500_id_1 | character varying(500) |           |          | 
Indexes:
    "rangetbl_1_pkey" PRIMARY KEY, lsm (col_bigint_id_1 ASC)
    "rangetbl_1_col_bigint_id_2_idx" lsm (col_bigint_id_2 ASC)


tanujnay112 commented 1 year ago

@shantanugupta-yb Did you enable the guc variable 'yb_enable_optimizer_statistics' before trying this?

shantanugupta-yb commented 1 year ago

@tanujnay112, yes, I did enable the yb_enable_optimizer_statistics variable. Here are the observations for the same: https://yugabyte.slack.com/archives/C03RB7JHKM0/p1678519579065579?thread_ts=1678442689.915799&cid=C03RB7JHKM0
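
For anyone reproducing this, the setting is an ordinary session-level GUC, so it can be enabled and verified with the usual commands (a minimal sketch):

SET yb_enable_optimizer_statistics = on;   -- enable the planner's use of the collected statistics (per the discussion above)
SHOW yb_enable_optimizer_statistics;       -- confirm the current value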

mtakahar commented 1 year ago

Merge Join (cost=41.85..113783.00 rows=10000 width=16) (actual time=3.806..198.793 rows=100000 loops=1)

The access path to rangetbl_1 is exactly the same and placed on the inner side in both plans. The Materialize node in the Merge Join plan is more expensive than the Hash node but only slightly (+25).

    ->  Index Only Scan using rangetbl_1_col_bigint_id_2_idx on public.rangetbl_1  (cost=0.00..1129.00 rows=10000 width=8) (actual time=1.991..172.984 rows=100000 loops=1)

Hash:

-> Seq Scan on public.rangetbl_2 (cost=0.00..100000.00 rows=1000000 width=8) (actual time=1.570..1216.975 rows=1000000 loops=1)

Merge:

-> Index Only Scan using rangetbl_2_col_bigint_id_2_idx on public.rangetbl_2 (cost=0.00..110004.00 rows=1000000 width=8) (actual time=1.826..25.152 rows=100001 loops=1)

Seq Scan output (Hash):

Output: rangetbl_2.col_bigint_id_1, rangetbl_2.col_bigint_id_2, rangetbl_2.col_bigint_id_3, rangetbl_2.col_bigint_id_4, rangetbl_2.col_bigint_1, rangetbl_2.col_bigint_2, rangetbl_2.col_float2_1, rangetbl_2.col_float2_2, rangetbl_2.col_float5_1, rangetbl_2.col_float5_2, rangetbl_2.col_boolean_1, rangetbl_2.col_varchar10_id_1, rangetbl_2.col_varchar100_id_1, rangetbl_2.col_varchar100_id_2, rangetbl_2.col_varchar500_id_1

Index Only Scan output (Merge):

Output: rangetbl_2.col_bigint_id_2

The difference in the execution time of the join node by itself seems reasonable:

Hash:  1611.669 - (1216.975 + 191.142) = 203.552 ms
Merge:  198.793 - (25.152   + 146.444) =  27.197 ms