zhouqingqing / qpmodel

A Relational Optimizer and Executor
MIT License

Change cardinality calculation for distributed plan and fix bugs #217

Closed arzuschen closed 3 years ago

arzuschen commented 3 years ago

The cardinality of a distributed table needs to be computed differently. Previously, the cardinality (the card attribute in LogicNode) was used to compute the cost of a PhysicNode. However, it can only represent the cost of serial execution, so the cost of a distributed plan cannot be represented correctly. In the new calculation, the card attribute of the logic node stays the same and represents the total number of rows to be processed. A new machinecount_ attribute is added to the logic node to indicate whether the node runs in parallel.

The default machinecount_ is 1, and there are three main changes involving this attribute:
1) For a table scan of a distributed table (distributed on a column or roundrobin), machinecount_ is set to the number of distributions of the table.
2) For a join, machinecount_ is the maximum of the child nodes' machinecount_.
3) For a gather, machinecount_ is set back to 1 to show that the output is singleton.
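The three rules above can be sketched as follows. This is only an illustration in Python, not the qpmodel code: the `LogicNode` dataclass, its `kind` and `distributions` fields, and the `machine_count` function are hypothetical names, and node kinds other than scan, join, and gather are deliberately left out because the description does not say how they behave.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LogicNode:
    """Hypothetical stand-in for a logic plan node (not the qpmodel class)."""
    kind: str                                   # "scan", "join", or "gather"
    children: List["LogicNode"] = field(default_factory=list)
    distributions: int = 1                      # scans only; 1 means not distributed


def machine_count(node: LogicNode) -> int:
    """Derive the machine count of a node from the three rules in the PR text."""
    if node.kind == "scan":
        # Rule 1: a scan runs on as many machines as the table has distributions.
        return node.distributions
    if node.kind == "join":
        # Rule 2: a join takes the maximum over its children.
        return max(machine_count(child) for child in node.children)
    if node.kind == "gather":
        # Rule 3: a gather produces a singleton output.
        return 1
    raise ValueError(f"node kind not covered by this sketch: {node.kind}")


distributed = LogicNode("scan", distributions=10)
singleton = LogicNode("scan")
join = LogicNode("join", [singleton, distributed])
gather = LogicNode("gather", [join])

print(machine_count(join))    # 10
print(machine_count(gather))  # 1
```

Computing the value recursively keeps the sketch short; storing it on each node when the plan is built, as the machinecount_ attribute suggests, would give the same results.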

The Card() method of the physic node now returns the average cardinality per machine, rounded up. Since cost estimation uses the Card() method of PhysicNode, it now works with the modified cardinality, that is, the number of rows per machine. Meanwhile, the logic card_ remains unchanged.
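As a worked illustration of the per-machine cardinality, the average rounded up is a ceiling division of the total row count by the machine count. The function name `card_per_machine` and its parameters are hypothetical and chosen only for this sketch; the actual Card() implementation may differ.

```python
def card_per_machine(total_rows: int, machine_count: int) -> int:
    """Average number of rows per machine, rounded up (integer ceiling division)."""
    if machine_count < 1:
        raise ValueError("machine_count must be at least 1")
    return (total_rows + machine_count - 1) // machine_count


print(card_per_machine(1000, 10))  # 100
print(card_per_machine(1001, 10))  # 101
print(card_per_machine(1001, 1))   # 1001
```

With a machine count of 1 the result equals the total, so a serial plan would be costed the same as before under this sketch.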

Other bugs fixed in this PR:
1) Incorrect assignment of having during agg split: the having expression was incorrectly changed when the agg split rule was applied, which could lead to expr matching issues. It is fixed by reassigning the unchanged having.
2) Using anydistribution strictly for distributed and not singleton: the default distribution property is changed to singleton to avoid ambiguity and errors in join-resolver-enabled optimization.

zhouqingqing commented 3 years ago

For join, the machinecount is the maximum of the child nodes' machinecount.

Does that mean both sides of a join must have the same machinecount_?

arzuschen commented 3 years ago

For join, the machinecount is the maximum of the child nodes' machinecount.

Does that mean both sides of a join must have the same machinecount_?

Not necessarily. A join may combine a singleton input with a distributed one, or a replicated input with a distributed one. In both cases it joins machinecount = 1 with machinecount = 10, which results in an output with machinecount = 10.