zzeekk / spark-temporalquery

Implicit functions for querying interval data with Apache Spark/Scala
MIT License
2 stars 1 forks source link

Cannot rejoin dataframe with 2 equally named columns despite aliases #33

Open kaelte opened 4 years ago

kaelte commented 4 years ago

One cannot rejoin the result of an inner join if it contains 2 equally named columns despite different aliases:

val dfL = dfLeft.withColumnRenamed("wert_l","wert").as("dfL")
val dfM = dfRight.withColumnRenamed("wert_r","wert").as("dfM")
val dfLM = dfL.temporalInnerJoin(dfM,Seq("id"))
assert(3 == dfLM.select($"id",$"dfL.wert",$"dfM.wert").count())
dfLM.printSchema()

root
 |-- id: integer (nullable = false)
 |-- wert: double (nullable = false)
 |-- wert: double (nullable = true)
 |-- gueltig_ab: timestamp (nullable = true)
 |-- gueltig_bis: timestamp (nullable = true)

val dfR = dfMap.withColumnRenamed("img","wert").as("dfR")
val actual = dfLM.temporalLeftJoin(dfR,Seq("id"))

Reference 'wert' is ambiguous, could be: dfL.wert, dfM.wert.;
org.apache.spark.sql.AnalysisException: Reference 'wert' is ambiguous, could be: dfL.wert, dfM.wert.;
...
at ch.zzeekk.spark.temporalquery.TemporalQueryUtil$$anonfun$2.apply(TemporalQueryUtil.scala:213)
...

https://github.com/zzeekk/spark-temporalquery/blob/5f054bfe23ee948dbdd3917c117f0a349fa022d9/src/main/scala/ch/zzeekk/spark/temporalquery/TemporalQueryUtil.scala#L213

kaelte commented 4 years ago

@zzeekk @pgruetter I have created the new branch bug/issue_33 with a test case to highlight this issue: https://github.com/zzeekk/spark-temporalquery/blob/6292907ab7dde3cbef20ed81a36fcd5bf41bc513/src/test/scala/ch.zzeekk.spark.temporalquery/TemporalQueryUtilTest.scala#L771

kaelte commented 4 years ago

@zzeekk please note that in SQL you can have a second join without creating unique columns after the first join:

postgres (PostgreSQL) 12.2

with
 l as (select * from (values (1, 3.14),(2, 2.75)) t(id,wert))
,m as (select * from (values (1, -17.4),(2, -32.5)) t(id,wert))
,r as (select * from (values (1, 42.3)) t(id,wert))
select *
from   l
       inner join m on l.id=m.id
       left outer join r on l.id=r.id
;

Oracle 18c

with
 l as (select 1 ID, 3.14 Wert from dual union all select 2 ID, 2.75 Wert from dual)
,m as (select 1 ID, -17.4 Wert from dual union all select 2 ID, -32.5 Wert from dual)
,r as (select 1 ID, 42.3 Wert from dual)
select *
from   l
       inner join m on l.id=m.id
       left outer join r on l.id=r.id
;
zzeekk commented 4 years ago

Hi @kaelte, that works but it's not the same. You join with "on" condition and not with "using". The following corresponds to the above spark query and it's working, your right...

with
 l as (select 1 ID, 3.14 Wert from dual union all select 2 ID, 2.75 Wert from dual)
,m as (select 1 ID, -17.4 Wert from dual union all select 2 ID, -32.5 Wert from dual)
,r as (select 1 ID, 42.3 Wert from dual)
select *
from l inner join m using (id)
left outer join r using (id)
;
kaelte commented 4 years ago

So, we should try to make temporalJoins to work as well. I am doing some experiments right now but I do not know where it leads to. See branch bug/issue_33

kaelte commented 4 years ago

Test multipleJoins

The problem is that after joining dfL with dfM to dfLM in https://github.com/zzeekk/spark-temporalquery/blob/9a7e82427d22ef18441a1b802af271ef1e35c12e/src/test/scala/ch.zzeekk.spark.temporalquery/TemporalQueryUtilTest.scala#L819 the aliases dfL and dfM of the 2 columns wert are lost.

One could retrieve them from the logical plan by going recursively to the last subquery alias where the 2 colums wert occur. Search for the last alias of columns wert#153 and wert#158 in

Project [coalesce(id#9, id#26) AS id#197, wert#153, wert#158, greatest(gueltig_ab#10, gueltig_ab2#163) AS gueltig_ab#198, least(gueltig_bis#11, gueltig_bis2#168) AS gueltig_bis#199]
+- Join Inner, (((true && (id#9 = id#26)) && (gueltig_ab#10 <= gueltig_bis2#168)) && (gueltig_bis#11 >= gueltig_ab2#163))
   :- SubqueryAlias `dfL`
   :  +- Project [id#9, gueltig_ab#10, gueltig_bis#11, wert_l#12 AS wert#153]
   :     +- Project [...]
   :        +- LocalRelation [...]
   +- Project [id#26, gueltig_ab2#163, gueltig_bis#28 AS gueltig_bis2#168, wert#158]
      +- Project [id#26, gueltig_ab#27 AS gueltig_ab2#163, gueltig_bis#28, wert#158]
         +- SubqueryAlias `dfM`
            +- Project [id#26, gueltig_ab#27, gueltig_bis#28, wert_r#29 AS wert#158]
               +- Project [...]
                  +- LocalRelation [...]
zzeekk commented 2 years ago

Had a similar issue with columns ambiguous after several joins with Spark 3.x, including some self-joins... I was able to solve it by renaming join columns and implemented it, see commit 54faf9cc616df8d248f0dc4bde42c40b48df3f4a.