Fix #issue72 subquery in from clause

9DemonFox commented 3 years ago

According to the paper The Complete Story of Joins (in HyPer)

translate the select clause a) translate the result expressions just like in the where clause b) the result is added to the top of the current tree using Π

zhouqingqing commented 3 years ago

the title is more like subquery in selection list
do you want to test one (subquery with correlated subquery) in selection list?
test two subqueries in selection list, with a correlated subquery in the main query

9DemonFox commented 3 years ago

I test this sql as you suggested.

the title is more like subquery in selection list do you want to test one (subquery with correlated subquery) in selection list? test two subqueries in selection list, with a correlated subquery in the main que

select c_name,
    (select count(O.o_orderkey) 
        from orders O 
        where O.o_custkey = 37) as OrderCount, 
    (select count(O2.o_custkey) 
        from orders O2
        where O2.o_custkey = 26) as CustPhone
from customer C
where exists
    (
        select *
        from nation N
        where C.c_nationkey=N.n_nationkey
        and N.n_regionkey=0
    )

It got a wrong result which return zero row(in postgreSQL, it returned 29 rows), but the most top of the plan show that rows=30 .

Total cost: 6511, memory=20
PhysicFilter  (inccost=6511, cost=30, rows=30) (actual rows=0)
    Output: c_name[0],{count(o_orderkey)}[1],{count(o_custkey)}[2]
    Filter: {#marker}[0]
    -> PhysicFilter  (inccost=6481, cost=30, rows=30) (actual rows=150)
        Output: c_name[0],{count(o_orderkey)}[1],{count(o_custkey)}[2]
        Filter: true
        -> PhysicNLJoin Left (inccost=6451, cost=440, rows=30) (actual rows=150)
            Output: c_name[0],{count(o_orderkey)}[1],{count(o_custkey)}[2]
            -> PhysicFilter  (inccost=4486, cost=30, rows=30) (actual rows=150)
                Output: c_name[0],{count(o_orderkey)}[1]
                Filter: true
                -> PhysicNLJoin Left (inccost=4456, cost=440, rows=30) (actual rows=150)
                    Output: c_name[1],{count(o_orderkey)}[2]
                    -> PhysicMarkJoin Left (inccost=2475, cost=750, rows=30) (actual rows=150)
                        Output: #marker,c_name[0]
                        Filter: c_nationkey[1]=n_nationkey[2]
                        -> PhysicGather Threads: 10 (inccost=1650, cost=1500, rows=150) (actual rows=150)
                            Output: c_name[1],c_nationkey[3]
                            -> PhysicScanTable customer as c (inccost=150, cost=150, rows=150) (actual rows=15, loops=10)
                                Output: c_name[1],c_nationkey[3]
                        -> PhysicGather Threads: 10 (inccost=75, cost=50, rows=5) (actual rows=5, loops=150)
                            Output: n_nationkey[0]
                            -> PhysicScanTable nation as n (inccost=25, cost=25, rows=5) (actual rows=5)
                                Output: n_nationkey[0]
                                Filter: n_regionkey[2]=0
                    -> PhysicHashAgg  (inccost=1541, cost=3, rows=1, memory=2) (actual rows=1, loops=150)
                        Output: {sum({count(o_orderkey)})}[0]
                        Aggregates: sum({count(o_orderkey)}[0])
                        Group by: 
                        -> PhysicGather Threads: 10 (inccost=1538, cost=10, rows=1) (actual rows=9, loops=150)
                            Output: {count(o_orderkey)}[0]
                            -> PhysicHashAgg  (inccost=1528, cost=28, rows=1, memory=8) (actual rows=0, loops=10)
                                Output: {count(o_orderkey)}[0]
                                Aggregates: count(o_orderkey[0])
                                Group by: 
                                -> PhysicScanTable orders as o (inccost=1500, cost=1500, rows=26) (actual rows=2, loops=10)
                                    Output: o_orderkey[0]
                                    Filter: o_custkey[1]=37
            -> PhysicHashAgg  (inccost=1525, cost=3, rows=1, memory=2) (actual rows=1, loops=150)
                Output: {sum({count(o_custkey)})}[0]
                Aggregates: sum({count(o_custkey)}[0])
                Group by: 
                -> PhysicGather Threads: 10 (inccost=1522, cost=10, rows=1) (actual rows=7, loops=150)
                    Output: {count(o_custkey)}[0]
                    -> PhysicHashAgg  (inccost=1512, cost=12, rows=1, memory=8) (actual rows=0, loops=10)
                        Output: {count(o_custkey)}[0]
                        Aggregates: count(o_custkey[0])
                        Group by: 
                        -> PhysicScanTable orders as o2 (inccost=1500, cost=1500, rows=10) (actual rows=1, loops=10)
                            Output: o_custkey[1]
                            Filter: o_custkey[1]=26

The plan is similar to HyperDB， but it return zero row. And I don't know what is the Filter:{#marker}[0]

PhysicFilter  (inccost=6511, cost=30, rows=30) (actual rows=0)
    Output: c_name[0],{count(o_orderkey)}[1],{count(o_custkey)}[2]
    Filter: {#marker}[0]

zhouqingqing commented 3 years ago

And I don't know what is the Filter:{#marker}[0]

SingleJoin will generate an extra marker column (boolean) which indicates if the corresponding row is matched. So 'Filter:{#maker}[0]' is a filter based on this marker column.

Looks like all markers are false? Can you check if that's the case - there shall be 30 true to get 30 rows. Also, try to simplify the query a bit (i.e., simplify the query itself, or disable distributed plan etc) and see if you can isolate the problem.

zhouqingqing commented 3 years ago

btw, unittest is failing - we can checkin this fix first and leave complicated cases as a fix but unittest must pass.

9DemonFox commented 3 years ago

btw, unittest is failing - we can checkin this fix first and leave complicated cases as a fix but unittest must pass.

sorry, I push this by mistake. The problem was caused by the logic annotated in the project.

//request from child including reqOutput and filter
List<int> ordinals = new List<int>();
List<Expr> reqFromChild = new List<Expr>(reqOutput);
if (filter_ != null)
    reqFromChild.Add(filter_);

the out put marker was projected out, so it get zero row.

9DemonFox commented 3 years ago

There are many problems, and I think the cause of this is that the code so far is trying to construct the plan and unnesting the subquery at the same time.

LogicNode setReturningExprCreatePlan(LogicNode root, Expr expr)
        {
            var newroot = root;
            var subplans = new List<NamedQuery>();
            expr.VisitEachT<Expr>(e =>
            {
                if (e is SubqueryExpr x)
                {
                    Debug.Assert(expr.HasSubQuery());
                    x.query_.CreatePlan();
                    subplans.Add(new NamedQuery(x.query_, null));

                    // functionally we don't have to do rewrite since above
                    // plan is already runnable
                    if (queryOpt_.optimize_.enable_subquery_unnest_) 
                    {
                        // use the plan 'root' containing the subexpr 'x'
                        var replacement = oneSubqueryToJoin(root, x);
                        newroot = (LogicNode)newroot.SearchAndReplace(root,
                                                                replacement);
                    }
                }

In fact, according to the paper, they are different two steps as the pic is showing blow.

9DemonFox commented 3 years ago

I do meet problem when I try this sql, which contains two markjoin, the path o this sql is "qpmodel\tpch\select\sql3.sql"

select c_name,
    (select count(O.o_orderkey) 
        from orders O 
        where O.o_custkey = 37) as OrderCount, 
    (select count(O2.o_custkey) 
        from orders O2
        where O2.o_custkey = 26) as CustPhone
from customer C
where exists
    (
        select *
        from nation N
        where C.c_nationkey=N.n_nationkey
        and N.n_name like 'A%'
        and exists(
            select * 
            from region R
            where N.n_regionkey = R.r_regionkey
            and R.r_regionkey<2
        )
    )

First is that the output dosen't contain the markJoin, so i get zero output last time. Secont, I will meet problem whether I turn enable_subqueryunnest on or not. When I turn it on, there will be a Assert(filter !=null) error. third, the column ordinal problem, The #mark expr is handled in a very ideal way whitch can only handle one markerJoin. So I suggest to seperate the two steps and realize the algorithm in the way of "The Complete Story of Joins"

9DemonFox commented 3 years ago

The unittest can't pass because the output folder is ignored. And I learned that the .gitkeep can be added only in the network repository.

zhouqingqing commented 3 years ago

The #mark expr is handled in a very ideal way whitch can only handle one markerJoin

Yes, this is a known problem and need to be fixed. One way is to use Expr._ to do the matching - I think we are doing matching by its special naming?

I think the cause of this is that the code so far is trying to construct the plan and unnesting the subquery at the same time.

I am not against separating into two stages but essentially we will see what changes it make. Can you sketch your proposed changes?

9DemonFox commented 3 years ago

Infact, the #markjoin was described and defined in the The Complete Story of Joins (in HyPer) , it also offer a very complete and detailed algorithm to use markjoin. I use the algorithm described in this paper translate some complecated sql in to the relational algebra. Accroidng to this algorithm, The Hyper translate the SQL above into this shape below. Then we can use the algorithm in Unnesting Arbitrary Queries to unnest it Like the pic blow. Then we can do predicate pushdown et.al. What more confused me is that the column ordianl. As we introduce markjoin, the original columnordinalresolver should be changed, as it has filted the #maker. However, when i try to trace the ordinal of the special #marker column, I find that there are really too many columnordinalresolvers. And the Filter filter the columns by columnordinal. But the Hyper may offered a wiser way to do that as blow, it give every value and operator a unique ID respectively. Then we cound save many patience from the column ordoinal for other problems. But as the columnordinalresovlers seems work well when there is only one #marker(because we don't need to keep the #marker), we can also try to comlete it when there are more #markjoin...

zhouqingqing commented 3 years ago

Infact, the #markjoin was described and defined in the The Complete Story of Joins (in HyPer)

This is the algorithm implemented in QPModel.

I find that there are really too many columnordinalresolvers

Column Ordinal resolution is done in a top down fasion: the top output list is pushed to its child node and this node can add its own requirements (say a filter) to this list and further pushed down to its children. That's why ResolveColumnOrdinal is implmented as a virtual function for each logic node.

But the Hyper may offered a wiser way to do that as blow, it give every value and operator a unique ID respectively

Checkout TreeNode._, this is the uniqueID maintained plan wise.

zhouqingqing / qpmodel

Fix #issue72 subquery in from clause #233