[YSQL] YB cost model does not leverage extended statistics

Jira Link: DB-3173

Description

The existing Yugabyte cost model does not leverage user-defined extended statistics. Consider the following example.

SET work_mem = '128kB';

CREATE TABLE ndistinct (
    filler1 TEXT,
    filler2 NUMERIC,
    a INT,
    b INT,
    filler3 DATE,
    c INT,
    d INT
);

INSERT INTO ndistinct (a, b, c, filler1)
     SELECT i/100, i/100, i/100, cash_words((i/100)::money)
       FROM generate_series(1,30000) s(i);
ANALYZE ndistinct;

 EXPLAIN (ANALYZE, SUMMARY off, TIMING off) SELECT COUNT(*) FROM ndistinct GROUP BY a, b;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=6056.90..6386.90 rows=3000 width=16) (actual rows=301 loops=1)
   Group Key: a, b
   ->  Sort  (cost=6056.90..6131.90 rows=30000 width=8) (actual rows=30000 loops=1)
         Sort Key: a, b
         Sort Method: external merge  Disk: 552kB
         ->  Seq Scan on ndistinct  (cost=0.00..3000.00 rows=30000 width=8) (actual rows=30000 loops=1)
(6 rows)

CREATE STATISTICS s10 ON a, b, c FROM ndistinct;

ANALYZE ndistinct;

 SELECT stxkind, stxndistinct FROM pg_statistic_ext WHERE stxrelid = 'ndistinct'::regclass;
 stxkind |                      stxndistinct
---------+---------------------------------------------------------
 {d,f}   | {"3, 4": 301, "3, 6": 301, "4, 6": 301, "3, 4, 6": 301}
(1 row)

EXPLAIN (ANALYZE, SUMMARY off, TIMING off) SELECT COUNT(*) FROM ndistinct GROUP BY a, b;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=6056.90..6386.90 rows=3000 width=16) (actual rows=301 loops=1)
   Group Key: a, b
   ->  Sort  (cost=6056.90..6131.90 rows=30000 width=8) (actual rows=30000 loops=1)
         Sort Key: a, b
         Sort Method: external merge  Disk: 552kB
         ->  Seq Scan on ndistinct  (cost=0.00..3000.00 rows=30000 width=8) (actual rows=30000 loops=1)
(6 rows)

We can see that even after creating statistics, the estimated number of rows is 30000, however, the actual rows are 301. The extended statistics that are listed from pg_statistic_ext clearly show that the number of distinct rows is 301.

From this, we are able to see that the YB query planner does not leverage extended statistics towards row estimation and subsequently, the cost model is unaffected.

The following items needs to be addressed:

Remove these lines of the code to enable loading extended stats from the system catalog.

/* YugaByte does not support forwarding statistics to Postgres yet */
if (IsYugaByteEnabled())
    return NIL;

In yb_cost_seqscan:
- Stop stripping off the RestrictInfo from each clause. clauselist_selectivity
- At the beginning of the function, just use already properly computed estimate estimate from baserel->rows, etc. to path->rows, just like other cost_xxxscan functions.
```
/* Mark the path with the correct row estimate */
if (param_info)
path->rows = param_info->ppi_rows;
else
path->rows = baserel->rows;
```
Enable ALTER STATISTICS
- In gram.y (here and several other places, add the keyword to the list, etc.
- Add T_AlterStatsStmt to the DDL list
Fixing #23592 should address partitioned tables.
The tests in the PG11 stats_ext.sql try to verify use of the extended stats via plan changes, which do not work well because of the cost formula differences and unavailability of some plans. Start over porting the tests from PG15 that verify the row count estimate changes.

yugabyte / yugabyte-db

[YSQL] YB cost model does not leverage extended statistics #13598

Description