ydb-platform / ydb

YDB is an open source Distributed SQL Database that combines high availability and scalability with strong consistency and ACID transactions
https://ydb.tech
Apache License 2.0
3.83k stars 531 forks source link

TPC DS s10 slowdown/timeout with pragma RotateJoinTree="false"; #6533

Open yumkam opened 2 months ago

yumkam commented 2 months ago

Affects q18, q21, q30, q40, q62, q64, q65, q81, q99 q18, q64, q81 timeouts, other just slower (q30 by x100)

Some queries can be fixed by enabling CBO (q40, q62, q99), some are not fixed or become worse (q30) (Probably, as CBO currently replaces GraceJoin with MapJoin?)

yumkam commented 1 month ago

tpc ds 10 q18 with RotateJoinTree=false as it enters CBO:

2024-08-07 12:33:18.103 INFO  dqrun(pid=1285165, tid=0x00007F6AE2296F80) [default] dq_opt_join_cost_based.cpp:324: {dummy_op} Converted join tree:
Join: (InnerJoin,Undefined) catalog_sales.cs_item_sk=item.i_item_sk,
    Join: (InnerJoin,Undefined) catalog_sales.cs_sold_date_sk=date_dim.d_date_sk,
        Join: (InnerJoin,Undefined) customer.c_current_addr_sk=customer_address.ca_address_sk,
            Join: (InnerJoin,Undefined) catalog_sales.cs_bill_customer_sk=customer.c_customer_sk,cd2.cd_demo_sk=customer.c_current_cdemo_sk,
                Join: (Cross,Undefined) 
                    Join: (InnerJoin,Undefined) catalog_sales.cs_bill_cdemo_sk=cd1.cd_demo_sk,
                        Rel: catalog_sales
                        Type: BaseTable, Nrows: 1.14987e+06, Ncols: 34, ByteSize: 1.17747e+09, Cost: 1.17747e+09, cs_item_sk, cs_order_number
                        Rel: cd1
                        Type: BaseTable, Nrows: 56.9186, Ncols: 9, ByteSize: 58284.7, Cost: 5.82847e+06, cd_demo_sk
                    Rel: cd2
                    Type: BaseTable, Nrows: 5691.86, Ncols: 9, ByteSize: 5.82847e+06, Cost: 5.82847e+06, cd_demo_sk
                Rel: customer
                Type: BaseTable, Nrows: 14772.3, Ncols: 18, ByteSize: 1.51269e+07, Cost: 2.52115e+07, c_customer_sk
            Rel: customer_address
            Type: BaseTable, Nrows: 3836.55, Ncols: 13, ByteSize: 3.92863e+06, Cost: 5.61233e+06, ca_address_sk
        Rel: date_dim
        Type: BaseTable, Nrows: 160.235, Ncols: 28, ByteSize: 164080, Cost: 1.6408e+06, d_date_sk
    Rel: item
    Type: BaseTable, Nrows: 7762.99, Ncols: 22, ByteSize: 7.9493e+06, Cost: 7.9493e+06, i_item_sk

2024-08-07 12:33:18.104 INFO  dqrun(pid=1285165, tid=0x00007F6AE2296F80) [default] dq_opt_join_cost_based.cpp:334: {dummy_op} Optimizied join tree:
Join: (InnerJoin,GraceJoin) catalog_sales.cs_bill_customer_sk=customer.c_customer_sk,cd2.cd_demo_sk=customer.c_current_cdemo_sk,
Type: ManyManyJoin, Nrows: 549772, Ncols: 133, ByteSize: 5.62967e+08, Cost: 1.23395e+09
    Join: (Cross,GraceJoin) 
    Type: ManyManyJoin, Nrows: 1.30898e+06, Ncols: 102, ByteSize: 1.3404e+09, Cost: 1.20122e+09
        Rel: cd2
        Type: BaseTable, Nrows: 5691.86, Ncols: 9, ByteSize: 5.82847e+06, Cost: 5.82847e+06, cd_demo_sk
        Join: (InnerJoin,GraceJoin) item.i_item_sk=catalog_sales.cs_item_sk,
        Type: FilteredFactTable, Nrows: 1149.87, Ncols: 93, ByteSize: 1.17747e+06, Cost: 1.19407e+09, cs_item_sk, cs_order_number
            Rel: item
            Type: BaseTable, Nrows: 7762.99, Ncols: 22, ByteSize: 7.9493e+06, Cost: 7.9493e+06, i_item_sk
            Join: (InnerJoin,GraceJoin) catalog_sales.cs_sold_date_sk=date_dim.d_date_sk,
            Type: FilteredFactTable, Nrows: 1149.87, Ncols: 71, ByteSize: 1.17747e+06, Cost: 1.18611e+09, cs_item_sk, cs_order_number
                Join: (InnerJoin,GraceJoin) catalog_sales.cs_bill_cdemo_sk=cd1.cd_demo_sk,
                Type: FilteredFactTable, Nrows: 11498.7, Ncols: 43, ByteSize: 1.17747e+07, Cost: 1.18446e+09, cs_item_sk, cs_order_number
                    Rel: catalog_sales
                    Type: BaseTable, Nrows: 1.14987e+06, Ncols: 34, ByteSize: 1.17747e+09, Cost: 1.17747e+09, cs_item_sk, cs_order_number
                    Rel: cd1
                    Type: BaseTable, Nrows: 56.9186, Ncols: 9, ByteSize: 58284.7, Cost: 5.82847e+06, cd_demo_sk
                Rel: date_dim
                Type: BaseTable, Nrows: 160.235, Ncols: 28, ByteSize: 164080, Cost: 1.6408e+06, d_date_sk
    Join: (InnerJoin,GraceJoin) customer.c_current_addr_sk=customer_address.ca_address_sk,
    Type: FilteredFactTable, Nrows: 10340.6, Ncols: 31, ByteSize: 1.05888e+07, Cost: 3.08566e+07, c_customer_sk
        Rel: customer
        Type: BaseTable, Nrows: 14772.3, Ncols: 18, ByteSize: 1.51269e+07, Cost: 2.52115e+07, c_customer_sk
        Rel: customer_address
        Type: BaseTable, Nrows: 3836.55, Ncols: 13, ByteSize: 3.92863e+06, Cost: 5.61233e+06, ca_address_sk

q18 without RotateJoinTree=false:

2024-08-07 12:34:18.300 INFO  dqrun(pid=1285330, tid=0x0000765B30316F80) [default] dq_opt_join_cost_based.cpp:297: {dummy_op} Optimizing join with costs
2024-08-07 12:34:18.300 INFO  dqrun(pid=1285330, tid=0x0000765B30316F80) [default] dq_opt_join_cost_based.cpp:310: {dummy_op} All statistics for join in place
2024-08-07 12:34:18.300 INFO  dqrun(pid=1285330, tid=0x0000765B30316F80) [default] dq_opt_join_cost_based.cpp:324: {dummy_op} Converted join tree:
Join: (InnerJoin,Undefined) customer_address.ca_address_sk=customer.c_current_addr_sk,
    Rel: customer_address
    Type: BaseTable, Nrows: 3836.55, Ncols: 13, ByteSize: 3.92863e+06, Cost: 5.61233e+06, ca_address_sk
    Join: (InnerJoin,Undefined) cd2.cd_demo_sk=customer.c_current_cdemo_sk,
        Rel: cd2
        Type: BaseTable, Nrows: 5691.86, Ncols: 9, ByteSize: 5.82847e+06, Cost: 5.82847e+06, cd_demo_sk
        Join: (InnerJoin,Undefined) customer.c_customer_sk=catalog_sales.cs_bill_customer_sk,
            Rel: customer
            Type: BaseTable, Nrows: 14772.3, Ncols: 18, ByteSize: 1.51269e+07, Cost: 2.52115e+07, c_customer_sk
            Join: (InnerJoin,Undefined) cd1.cd_demo_sk=catalog_sales.cs_bill_cdemo_sk,
                Rel: cd1
                Type: BaseTable, Nrows: 56.9186, Ncols: 9, ByteSize: 58284.7, Cost: 5.82847e+06, cd_demo_sk
                Join: (InnerJoin,Undefined) item.i_item_sk=catalog_sales.cs_item_sk,
                    Rel: item
                    Type: BaseTable, Nrows: 7762.99, Ncols: 22, ByteSize: 7.9493e+06, Cost: 7.9493e+06, i_item_sk
                    Join: (InnerJoin,Undefined) catalog_sales.cs_sold_date_sk=date_dim.d_date_sk,
                        Rel: catalog_sales
                        Type: BaseTable, Nrows: 1.14987e+06, Ncols: 34, ByteSize: 1.17747e+09, Cost: 1.17747e+09, cs_item_sk, cs_order_number
                        Rel: date_dim
                        Type: BaseTable, Nrows: 160.235, Ncols: 28, ByteSize: 164080, Cost: 1.6408e+06, d_date_sk

2024-08-07 12:34:18.300 INFO  dqrun(pid=1285330, tid=0x0000765B30316F80) [default] dq_opt_join_cost_based.cpp:334: {dummy_op} Optimizied join tree:
Join: (InnerJoin,GraceJoin) cd2.cd_demo_sk=customer.c_current_cdemo_sk,
Type: FilteredFactTable, Nrows: 482.946, Ncols: 133, ByteSize: 494536, Cost: 1.23075e+09, cs_item_sk, cs_order_number
    Rel: cd2
    Type: BaseTable, Nrows: 5691.86, Ncols: 9, ByteSize: 5.82847e+06, Cost: 5.82847e+06, cd_demo_sk
    Join: (InnerJoin,GraceJoin) item.i_item_sk=catalog_sales.cs_item_sk,
    Type: FilteredFactTable, Nrows: 482.946, Ncols: 124, ByteSize: 494536, Cost: 1.22492e+09, cs_item_sk, cs_order_number
        Rel: item
        Type: BaseTable, Nrows: 7762.99, Ncols: 22, ByteSize: 7.9493e+06, Cost: 7.9493e+06, i_item_sk
        Join: (InnerJoin,GraceJoin) customer_address.ca_address_sk=customer.c_current_addr_sk,
        Type: FilteredFactTable, Nrows: 482.946, Ncols: 102, ByteSize: 494536, Cost: 1.21696e+09, cs_item_sk, cs_order_number
            Rel: customer_address
            Type: BaseTable, Nrows: 3836.55, Ncols: 13, ByteSize: 3.92863e+06, Cost: 5.61233e+06, ca_address_sk
            Join: (InnerJoin,GraceJoin) customer.c_customer_sk=catalog_sales.cs_bill_customer_sk,
            Type: FilteredFactTable, Nrows: 689.922, Ncols: 89, ByteSize: 706481, Cost: 1.21134e+09, cs_item_sk, cs_order_number
                Rel: customer
                Type: BaseTable, Nrows: 14772.3, Ncols: 18, ByteSize: 1.51269e+07, Cost: 2.52115e+07, c_customer_sk
                Join: (InnerJoin,GraceJoin) catalog_sales.cs_sold_date_sk=date_dim.d_date_sk,
                Type: FilteredFactTable, Nrows: 1149.87, Ncols: 71, ByteSize: 1.17747e+06, Cost: 1.18611e+09, cs_item_sk, cs_order_number
                    Join: (InnerJoin,GraceJoin) catalog_sales.cs_bill_cdemo_sk=cd1.cd_demo_sk,
                    Type: FilteredFactTable, Nrows: 11498.7, Ncols: 43, ByteSize: 1.17747e+07, Cost: 1.18446e+09, cs_item_sk, cs_order_number
                        Rel: catalog_sales
                        Type: BaseTable, Nrows: 1.14987e+06, Ncols: 34, ByteSize: 1.17747e+09, Cost: 1.17747e+09, cs_item_sk, cs_order_number
                        Rel: cd1
                        Type: BaseTable, Nrows: 56.9186, Ncols: 9, ByteSize: 58284.7, Cost: 5.82847e+06, cd_demo_sk
                    Rel: date_dim
                    Type: BaseTable, Nrows: 160.235, Ncols: 28, ByteSize: 164080, Cost: 1.6408e+06, d_date_sk

cd2 (alias to customers_demographics) should be rewritten to INNER JOIN to customers table, but with RotateJoinTree=false it left as CROSS JOIN (and blows up everything)

P.S. if you look at query carefully, join with cd2 is actually no-op and can be just removed without affecting results: this is fk-pk relation, and its results are not used anywhere, neither in output, nor in WHERE conditions.

yumkam commented 1 month ago

tpc ds q30:

2024-08-07 03:36:21.457 INFO  dqrun(pid=1265529, tid=0x000079175FECEF80) [default] dq_opt_join_cost_based.cpp:324: {dummy_op} Converted join tree:
Join: (InnerJoin,Undefined) ctr1.ctr_customer_sk=customer.c_customer_sk,customer_address.ca_address_sk=customer.c_current_addr_sk,
    Join: (Cross,Undefined) 
        Join: (InnerJoin,Undefined) ctr1.ctr_state=ctr2.ctr_state,
            Rel: ctr1
            Type: FilteredFactTable, Nrows: 6403.62, Ncols: 8, ByteSize: 6.5573e+06, Cost: 7.29207e+07, wr_item_sk, wr_order_number
            Rel: ctr2
            Type: FilteredFactTable, Nrows: 6403.62, Ncols: 8, ByteSize: 6.5573e+06, Cost: 7.29207e+07, wr_item_sk, wr_order_number
        Rel: customer_address
        Type: BaseTable, Nrows: 548.079, Ncols: 2, ByteSize: 561233, Cost: 5.61233e+06, ca_address_sk
    Rel: customer
    Type: BaseTable, Nrows: 24620.6, Ncols: 14, ByteSize: 2.52115e+07, Cost: 2.52115e+07, c_customer_sk

Looks like customer_address ends up CROSS JOIN'ed to ctr1 and ctr2 join result, instead of INNER JOIN'ed to customer table

yumkam commented 1 month ago

q64: d2 and d3 are CROSS JOIN'ed to store_sales instead of enreaching customer table

yumkam commented 1 month ago

q21, q40, q62 were just deoptimized join order and fixed by CBO, others are #7565 and/or #7403

qrort commented 1 month ago

The problem

I looked into the queries. It seems that neither RotateJoinTree="false" or RotateJoinTree="true" will produce a correct optimization for them.

Some queries (namely, q18 or q64) can be correctly optimized by introducing sets of equal join keys, i. e. if we have two predicates like where a.key == b.key and where a.key == c.key we can automatically deduce a predicate where b.key == c.key and try to optimize a corresponding CrossJoin.

Other queries (like q30) can not be correctly optimized even with this change: some join rotation is required. Let's look in this example: t1 cross join t2 cross join t3 where t1.column1 = t3.column1 and t2.column2 = t3.column2.

image

Join tree looks like this, with RotateJoinTree="false" the first predicate optimizes root join, the other does nothing. And we can not improve the strategy.

Reverting to RotateJoinTree="true" also produces problems that we wanted to fix in the first place.

The solution

We could either try to optimize the tree in both ways and choose a better plan, or come up with a general-purpose algorithm that i'll explain below.

Let's consider an example and optimize this query:

t1 cross join t2 cross join t3 cross join t4 where t1.column1 = t3.column1 and t2.column2 = t3.column2 and t1.column1 = t4.column1.

The algorithm will do the following:

  1. Extract all filters and build equivalent join key classes. For the query, it'll be [t1.column1, t3.column1, t4.column1] and [t2.column2, t3.column2].
  2. Determine which tables can be joined together using one of the equality predicates, which are obtained from the classes above. The predicates are t1.column1 == t3.column1, t1.column1 == t4.column1, t3.column1 == t4.column1, t2.column2 == t3.column2. So the pairs of tables which can be joined with an inner join are [t1, t3], [t1, t4], [t3, t4], [t2, t3].
  3. Build a graph, where tables are vertexes and predicates are edges. Determine components of this graph. the query above has only one component.
  4. Inner join tables from each component together in any order (or in the order, that is the closest to user order), using join keys from corresponding predicates.
  5. If there is more than one component, use cross joins to connect them together.

Who will implement it and when it will be in main

If I commit to implementing the algo, I consider it to be merged after [1-3] weeks. I discussed the algo with @pavelvelikhov in private matter, and he mentioned that maybe CBO developers can take this issue, as they are already doing something similar in CBO.

qrort commented 1 month ago

I thought about it, and it seems to me we do not have to build equity classes of join keys, because connectivity component does not change and we can traverse it (and construct inner joins) even without extra predicates.

pavelvelikhov commented 1 month ago

I thought about it, and it seems to me we do not have to build equity classes of join keys, because connectivity component does not change and we can traverse it (and construct inner joins) even without extra predicates.

Yes! We don't really need equivalence classes! Even in this example: https://github.com/ydb-platform/ydb/issues/7403

изображение

The join graph looks like this, we just need to traverse it via any spanning tree, as long as we don't try to add unconnected nodes from the same component, everything will work. E.g. basic DFS from any node will work