yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
8.96k stars 1.07k forks source link

[YSQL] YB cost model does not leverage extended statistics #13598

Open ramsrivatsa opened 2 years ago

ramsrivatsa commented 2 years ago

Jira Link: DB-3173

Description

The existing Yugabyte cost model does not leverage user-defined extended statistics. Consider the following example.

SET work_mem = '128kB';

CREATE TABLE ndistinct (
    filler1 TEXT,
    filler2 NUMERIC,
    a INT,
    b INT,
    filler3 DATE,
    c INT,
    d INT
);

INSERT INTO ndistinct (a, b, c, filler1)
     SELECT i/100, i/100, i/100, cash_words((i/100)::money)
       FROM generate_series(1,30000) s(i);
ANALYZE ndistinct;

 EXPLAIN (ANALYZE, SUMMARY off, TIMING off) SELECT COUNT(*) FROM ndistinct GROUP BY a, b;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=6056.90..6386.90 rows=3000 width=16) (actual rows=301 loops=1)
   Group Key: a, b
   ->  Sort  (cost=6056.90..6131.90 rows=30000 width=8) (actual rows=30000 loops=1)
         Sort Key: a, b
         Sort Method: external merge  Disk: 552kB
         ->  Seq Scan on ndistinct  (cost=0.00..3000.00 rows=30000 width=8) (actual rows=30000 loops=1)
(6 rows)

CREATE STATISTICS s10 ON a, b, c FROM ndistinct;

ANALYZE ndistinct;

 SELECT stxkind, stxndistinct FROM pg_statistic_ext WHERE stxrelid = 'ndistinct'::regclass;
 stxkind |                      stxndistinct
---------+---------------------------------------------------------
 {d,f}   | {"3, 4": 301, "3, 6": 301, "4, 6": 301, "3, 4, 6": 301}
(1 row)

EXPLAIN (ANALYZE, SUMMARY off, TIMING off) SELECT COUNT(*) FROM ndistinct GROUP BY a, b;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=6056.90..6386.90 rows=3000 width=16) (actual rows=301 loops=1)
   Group Key: a, b
   ->  Sort  (cost=6056.90..6131.90 rows=30000 width=8) (actual rows=30000 loops=1)
         Sort Key: a, b
         Sort Method: external merge  Disk: 552kB
         ->  Seq Scan on ndistinct  (cost=0.00..3000.00 rows=30000 width=8) (actual rows=30000 loops=1)
(6 rows)

We can see that even after creating statistics, the estimated number of rows is 30000, however, the actual rows are 301. The extended statistics that are listed from pg_statistic_ext clearly show that the number of distinct rows is 301.

From this, we are able to see that the YB query planner does not leverage extended statistics towards row estimation and subsequently, the cost model is unaffected.

mtakahar commented 1 month ago

The following items needs to be addressed: