zhouqingqing / qpmodel

A Relational Optimizer and Executor
MIT License
64 stars 18 forks source link

CTE optimization based on CTE producer/consumer model #307

Closed 9DemonFox closed 3 years ago

9DemonFox commented 3 years ago
  1. inline cte only use one time
  2. cost based cte
  3. predict pushdown in cte is handled now

still need to perfect:

  1. FROM can handle alias tabel name, so I don't emit eliminate FROM under CTEProducer and CTESelect.
  2. CTEAnchor Node and CTESelect Node do nothing and they should be eliminated too.
  3. subquery in CTE is not well handled due to some thing in Bind. so q58 and q83 in tpcds can't run well.
9DemonFox commented 3 years ago

1. InitCteInfo

image

every SQLstatment has a member CteInfo, and each has a CteInfoEntries. For convenience, all of the members are declared as PUBLIC.

1.1 See cte as "select * from cteQuery", so it's a SelectStmt. and we can bind and create plan for it.

when meet with cte as (cteQuery), create a plan for it , see CteSelectStmt's constructor.

    public class CteSelectStmt : SelectStmt
    {
        private CteExpr originCte_;
        public int CteId() => originCte_.cteId_;
        // we have to constract a CteSelection
        // select * from cte
        public CteSelectStmt(CteExpr originCte, CTEQueryRef from, List<CteExpr> previousCtes) : base(
                    new List<Expr>() { new SelStar(originCte.cteName_) },
                    new List<TableRef>() { from },
                    null, null, null,
                    previousCtes, null, null, null, $"select * from {originCte.cteName_}")
        {
            originCte_ = originCte;
        }
    }

1.2 add previous cte to current cte's context

in InitCteInfo, bind and create plan for each CTE, and add previous CTEs into current cte’s context and ctes_, so if current cte refer to previous ctes, it can find them.

List<CteExpr> previousCtes = cteInfo_.GetAllCteExprs();

// prevoious ctes have smaller cteId
Debug.Assert(previousCtes.All(x => x.cteId_ < curCte.cteId_));

SelectStmt cieStmt = new CteSelectStmt(curCte, from, previousCtes);

1.3 add cte to subquery's context so subqueries can find ctes

2. how to delete unused CTE

2.1 Example

It seems easy to do so, if a cte is not referred, it should be delete, but the paper put forward a extreme stupid situation. see:

sql1:

"with cte0 as (select * from a),cte1 as (select * from cte0) select * from b where b.b1 = 2";

cte0 has been referred by cte1, but cte1 is never used. here we use a stack to remove cte0.

TU.ExecuteSQL(sql, "2,3,4,5", out phyplan, option);
var answer = @"PhysicScanTable b (actual rows=1)
                Output: b.b1[0],b.b2[1],b.b3[2],b.b4[3]
                Filter: b.b1[0]=2";
TU.PlanAssertEqual(answer, phyplan);

2.2 Algorithms

Algorithm findUnusedCTE
INPUT: cteStack
OUTPUT: cte will be marked used or not.
step1   findCteConsumer(rootPlan,&cteStack)
step2 
        for subq in subqueres:
            findCteConsumer(subq,&cteStack)
step3   while not cteStack.empty():
            cteid = cteStack.pop()
            findCteConsumer(GetCteInfoEntryByCteId(cteid).plan_,&cteStack)

findCteConsumer will find CteConsumer from LogicNode recursively, push and mark cteInfoEntry as used.

For instance, in sql1, step1 will not push cte1 into cteStack, step2 will not either. so the cte1 will not be marked used. so the cte0 will not be marked used too.

2.3 Confirm the Algorithm and Explan the example

For sql1, cte1 will not be marked used, so it will not push into the cteStack, so cte1 is not mark used. so they are all deleted.

3. Logic Plan

image

3.1 LogicCteAnchor

3.1.1 where Cte Anchor show up.

the LogicCteAnchor will be add at the root of a plan, such as (b) .

3.1.2 where is the code

the realize code is bellow in cteToAnchor

            for (int i = 0; i < cteIsUsed.Count; i++)
            {
                var lca = new LogicCteAnchor(cteInfo_.GetCteInfoEntryByCteId(cteIsUsed[i].cteId_));
                lca.children_.Add(root);
                root = lca;
            }
            return root;

3.2 LogicCteProducer

3.1.1 where Cte Producer show up.

the plan produced for select * fom (cteQuey) is the plan for CTEProducer.

the LogicCteProdicer will be produced through RuleTrans

3.1.2 where is the code

the child of the LogicCteProducer is the select * from cteQuery , see section 1.1.

        public LogicCteProducer(CteInfoEntry cteInfoEntry)
        {
            cteInfoEntry_ = cteInfoEntry;
            children_.Add(cteInfoEntry.plan_);
            // CTEProducer has only one child
            Debug.Assert(children_.Count() == 1);
        }

3.3 LogicCteConsumer

cteQueryRef will be transform as LogicCteConsumer.

3.4 Logic Sequence

3.4.1 where it show up

RuleTrans CteAnchor2CteProd will produce a LogicSequenceNode, and it has the same MemoSignature with CTEAnchor for it.

4. Cost-Based Optimized CTE

4.1 how to produce a memo as this figure below

image

the LogicCteProducer will be add through RulesTrans. when we meet a LogicCTEAnchor, it will producer a LogicSequence Node, and the LogicSequence will add CTEProducer Node as its child.

when meet CTEConsumer, it will producer LogicCTESelect and has the same Soginature with CTEConsumer. it will also create nodes from cteplan. To avoid Group7 produce the same Signature with Group6, the Signature have to consider if it is belong to a child of cte consumer, cteId and cteRefId.

group is child of cte consumer cteid cteRefId Signature
Group6 false 1 -
Group7 true 1 1
Group8 true 1 2

by this , cteid = 2 and cteRefid = 1 will has the same signature with cteid = 1 and cteRefid=2 will has the same Signature. To avoid this , see cteRefId as string.

group is child of cte consumer cteid cteRefId Signature
Group6 false 1 -
Group7 true 1 ref1
Group8 true 1 ref2

4.2 select plan

image Figure 4.1

4.2.1 how to avoid plan(a)

A LogicCteSequence which is a binary operator will bind to a cte, so it has its cteid and will generate requirement (p 0) for left child and (c 0)for right child. And a LogicCteAnchor will generate no requirement for its child.

requirement for CTE will always pass to children and never disappear. The Group Member of a Group will satisfied the requirement only when it's cteSpec is the sub set of the CteReq

a LogicCteAnchor will generate no requirement for its child, when meet consumer0, consumer0 will seen as [(c 0)], it is not the subset of []. So the plan will not pass.

image

4.2.2 c and d are valid

In fact, c and d should be avoid. But now we don't support transform property from left to right child, they are not avoid.

But they will be optimized out by cost.

cost-based select will not choose plan like (c) and (d)

image image