yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL] Enabling CBO without running ANALYZE may lead to suboptimal plans #16825

Open qvad opened 1 year ago

qvad commented 1 year ago

Jira Link: DB-6175

Description

The full list of queries can be found here; the subqueries set also shows bad execution times: https://github.com/qvad/taqo/blob/main/sql/complex/queries/distinct.sql

Here is a comparison between two runs: the default configuration, and CBO enabled with ANALYZE run on the tables.

This is one example of a query that started performing badly with CBO/ANALYZE:

SELECT DISTINCT t500000.c_int,
                t50000.c_bool
FROM   t1000000
       RIGHT OUTER JOIN t500000
           ON t1000000.c_text = t500000.c_text
       RIGHT OUTER JOIN t50000
           ON t1000000.c_text = t50000.c_text
WHERE  t1000000.c_int < 474525
ORDER BY t500000.c_int, t50000.c_bool DESC LIMIT 1000 OFFSET 10;

Here is execution plan difference:

[Screenshot, 2023-04-12: execution plan diff between the two runs]


mtakahar commented 1 year ago

Those plans highlighted in red and green are the same, including the row counts and the cost estimates. The red one shows the runtime stats and the actual hash-bucket adjustments that happened at runtime, but the green one doesn't.

It is possible that the initial number of hash buckets and the decision whether to do a "batched hash join" (a partitioned hash join, in more general terms) were different, but we can't tell without the runtime information from the green one.

Can you try it again with the ANALYZE option specified on the EXPLAIN command for the faster run and post the plan diff?
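
For reference, here is a minimal sketch of the requested comparison, reusing the table and GUC names already used in this thread (the exact EXPLAIN output will depend on the build and data):

-- Faster run (CBO + ANALYZE), with runtime stats requested via EXPLAIN (ANALYZE):
SET yb_enable_optimizer_statistics = true;
ANALYZE t1000000;
ANALYZE t500000;
ANALYZE t50000;
EXPLAIN (ANALYZE)
SELECT DISTINCT t500000.c_int, t50000.c_bool
FROM   t1000000
       RIGHT OUTER JOIN t500000 ON t1000000.c_text = t500000.c_text
       RIGHT OUTER JOIN t50000 ON t1000000.c_text = t50000.c_text
WHERE  t1000000.c_int < 474525
ORDER BY t500000.c_int, t50000.c_bool DESC LIMIT 1000 OFFSET 10;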

qvad commented 1 year ago

This is a compact reproducer for another case:

CREATE TABLE t1
    WITH (colocation = true) AS
SELECT k1,
       (k1 + 0.0001)::text as k2,
       k1 as v1,
       (k1 + 0.0002)::varchar as v2
FROM   generate_series(1, 50000) k1;

CREATE TABLE t2
    WITH (colocation = true) AS
SELECT k1,
       (k1 + 0.0001)::text as k2,
       k1 as v1,
       (k1 + 0.0002)::varchar as v2
FROM   generate_series(1, 100000) k1;

-- w/o CBO
SELECT t1.k1,
       t1.k2,
       t2.v1,
       t2.v2
FROM   t1 JOIN t2
           ON t1.k1 = t2.k1
WHERE  t1.k1 >= 2500
   AND t1.k1 < 25100
   AND t2.k1 >= 2500
   AND t2.k1 < 25100
GROUP BY t1.k1, t1.k2, t2.v1, t2.v2;

-- enable CBO
SET yb_enable_optimizer_statistics = true;
SELECT t1.k1,
       t1.k2,
       t2.v1,
       t2.v2
FROM   t1 JOIN t2
           ON t1.k1 = t2.k1
WHERE  t1.k1 >= 2500
   AND t1.k1 < 25100
   AND t2.k1 >= 2500
   AND t2.k1 < 25100
GROUP BY t1.k1, t1.k2, t2.v1, t2.v2;

tverona1 commented 1 year ago

In the compact repro above, we are not running ANALYZE. With CBO enabled and without ANALYZE, we do not have cardinality estimates for the tables. It looks like we assume a cardinality of 1, so we end up with a nested loop join that in reality evaluates ~500M row combinations:

With CBO (without running ANALYZE first):
 HashAggregate  (cost=0.02..0.03 rows=1 width=72)
   Group Key: t1.k1, t1.k2, t2.v1, t2.v2
   ->  Nested Loop  (cost=0.00..0.01 rows=1 width=72)
         Join Filter: (t1.k1 = t2.k1)
         ->  Seq Scan on t1  (cost=0.00..0.00 rows=1 width=36)
               Remote Filter: ((k1 >= 2500) AND (k1 < 25100))
         ->  Seq Scan on t2  (cost=0.00..0.00 rows=1 width=40)
               Remote Filter: ((k1 >= 2500) AND (k1 < 25100))

Perhaps, if we are missing cardinality estimates, we should fall back to some hard-coded cardinality (e.g., 10k rows) like we do in the heuristic model.
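
For comparison, a sketch of the workaround implied by the issue title: running ANALYZE before the CBO query so that reltuples is populated (the resulting plan shape will vary by build and data):

SET yb_enable_optimizer_statistics = true;
ANALYZE t1;
ANALYZE t2;
EXPLAIN SELECT t1.k1, t1.k2, t2.v1, t2.v2
FROM   t1 JOIN t2 ON t1.k1 = t2.k1
WHERE  t1.k1 >= 2500 AND t1.k1 < 25100
  AND  t2.k1 >= 2500 AND t2.k1 < 25100
GROUP BY t1.k1, t1.k2, t2.v1, t2.v2;
-- With real row estimates the inputs are no longer assumed to be single rows,
-- so the nested loop plan above becomes much less likely.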

tverona1 commented 1 year ago

Note - in vanilla PG, without stats, the cardinality estimates are better. Perhaps it estimates based on the number of pages:

CREATE TABLE t1
    WITH (autovacuum_enabled = false) AS
SELECT k1,
       (k1 + 0.0001)::text as k2,
       k1 as v1,
       (k1 + 0.0002)::varchar as v2
FROM   generate_series(1, 50000) k1;

CREATE TABLE t2
    WITH (autovacuum_enabled = false) AS
SELECT k1,
       (k1 + 0.0001)::text as k2,
       k1 as v1,
       (k1 + 0.0002)::varchar as v2
FROM   generate_series(1, 100000) k1;

EXPLAIN SELECT t1.k1,
       t1.k2,
       t2.v1,
       t2.v2
FROM   t1 JOIN t2
           ON t1.k1 = t2.k1
WHERE  t1.k1 >= 2500
   AND t1.k1 < 25100
   AND t2.k1 >= 2500
   AND t2.k1 < 25100
GROUP BY t1.k1, t1.k2, t2.v1, t2.v2;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Group  (cost=2464.43..2467.20 rows=222 width=72)
   Group Key: t1.k1, t1.k2, t2.v1, t2.v2
   ->  Sort  (cost=2464.43..2464.98 rows=222 width=72)
         Sort Key: t1.k1, t1.k2, t2.v1, t2.v2
         ->  Hash Join  (cost=816.98..2455.78 rows=222 width=72)
               Hash Cond: (t2.k1 = t1.k1)
               ->  Seq Scan on t2  (cost=0.00..1630.24 rows=298 width=40)
                     Filter: ((k1 >= 2500) AND (k1 < 25100))
               ->  Hash  (cost=815.12..815.12 rows=149 width=36)
                     ->  Seq Scan on t1  (cost=0.00..815.12 rows=149 width=36)
                           Filter: ((k1 >= 2500) AND (k1 < 25100))
mtakahar commented 1 year ago

> In the compact repro above, we are not running ANALYZE.

I briefly tried it myself with TAQO on my MacBook Pro and couldn't reproduce it. Now I see that was because I had specified to run the analyze script from the command line.

> Perhaps, if we are missing cardinality estimates, we should fall back to some hard-coded cardinality (e.g., 10k rows) like we do in the heuristic model.

I agree. I think we should either create a separate issue or repurpose this one to track that.

> Perhaps it estimates based on the number of pages:

I believe that's what it does. Postgres also uses 1000 rows as the default in some places, though.

mtakahar commented 1 year ago

Issue #16097 is the most likely cause of the same hash join plan taking much longer when yb_enable_optimizer_statistics is turned on but the tables are unanalyzed.

gauravk-in commented 11 months ago

When yb_enable_optimizer_statistics is disabled, YB uses the heuristic of 1000 rows per table. When it is enabled, we take the value from pg_class.reltuples. This is 0 when ANALYZE has not been run, and it gets "clamped" to 1.

The problem is that when the estimated number of output rows is 1, the CBO tends to choose a Nested Loop Join, which is very expensive.

PG's logic to estimate reltuples is in the estimate_rel_size function. It computes the width of a row and divides the total size of the table by that to estimate the number of tuples. When the table is empty, it assumes the table has 10 pages. This is why its estimate is never 0.
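
As an illustration of the clamping described above, the stored estimates can be inspected directly (a sketch, using the t1/t2 tables from the compact repro):

SELECT relname, relpages, reltuples
FROM   pg_class
WHERE  relname IN ('t1', 't2');
-- Before ANALYZE, reltuples is 0, which the cost model clamps to 1;
-- after ANALYZE t1; ANALYZE t2; it reflects the actual row counts.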

We can fix the issue in YB in a couple of ways: fall back to a heuristic row count when statistics are missing, or estimate the row count from the table's size in storage.

@rthallamko3 Can you please comment on how challenging it is to extract the size of the SST files from DocDB? We would need to expose this information to the cost model and cache it somewhere, too.

rthallamko3 commented 10 months ago

@gauravk-in, is the SST size the only unknown for the CBO, or are there other aspects? I would recommend going down the simpler path: if ANALYZE has not been run, then assume 1000 rows. Once ANALYZE is run, all the statistics are populated and available, correct?