zhouqingqing / qpmodel

A Relational Optimizer and Executor
MIT License
66 stars 18 forks source link

Add distribution to physic property #194

Closed arzuschen closed 4 years ago

arzuschen commented 4 years ago

Distribution attribute is added into physic property. Since in the new framework, all the requirements are regarded as top-to-down requirements, there are 4 different types of distribution requirements: 1) Singleton: not distributed execution 2) Distributed (on Expr list): need to be hash distributed according to certain list of expressions 3) Replicated: data stream is replicated across the threads 4) Any: can be supplied by any distribution (i.e., anything will work)

The default base requirement is singleton and any order posed on top. For query concerning distributed tables, logic tree comes with remote exchange nodes like redistribution and gather. For memo optimization, these nodes are removed from the logic tree and distribution transformation is enabled. Previously, all distributed queries are blocked from using memo, now with an additional option parameter memo_useremoteexchange, they can be optimized by memo, and no change in the calculation for non-distributed queries. There are two special cases for distribution. Replicated distribution is not considered in Any since it uses a different gather enforcement node. For a hash join, a replicated requirement results in replicated requirement on both children. There is also a round robin for table data distribution, it is considered as distributed with empty expression list. It will need to be redistributed if any particular distributed is required but can satisfy any.

For example, this query: select a1, count(b1) from ad, b where a1 > b2 group by a1 order by a1 In this case, ad is distributed on a1. The base required property is <a1-asc, singleton> and the root group is aggregate. <a1-asc, singleton> is then transformed in to <none, singleton>, and that is further transformed into <none, any> and <none, replicated. Assume hashagg and streamagg is only compatible with singleton, so <none, any> and <none, replicated> cannot be supported. For this group, it may be that <a1-asc, singleton> is directly supported by streamagg, and same property requirement posed on the child group, and <none, singleton> may be supported by hashagg, the requirement on child group is then <none, singleton>. Consider join group after that similarly, the requirements are transformed to get <none, any> and <none, replicated>. For the case of hashjoin, <none, any> required on top will result in 4 different set of child requirements: {<none, singleton>, <none, dist on b2>} {<none, dist on a1>, <none, singleton>} {<none, dist on a1>, <none, dist on b2>} {<none, replicated>, <none, dist on b2>} For {<none, singleton>, <none, singleton>}, the result is singleton, that situation should be considered in the <none, singleton> requirement. Also, the join order is not demonstrated since it is included in the logic join transformation rules. If there are more than one set of child requirements, the inclusive cost for each set is calculation and the least-cost set is select. In this case, it may be that {<none, dist on a1>, <none, singleton>} is the best set since no enforcement will be needed for the table scan.