kaelte opened 4 years ago
@zzeekk @pgruetter I have created the new branch bug/issue_33 with a test case to highlight this issue: https://github.com/zzeekk/spark-temporalquery/blob/6292907ab7dde3cbef20ed81a36fcd5bf41bc513/src/test/scala/ch.zzeekk.spark.temporalquery/TemporalQueryUtilTest.scala#L771
@zzeekk please note that in SQL you can have a second join without creating unique columns after the first join:
with
l as (select * from (values (1, 3.14),(2, 2.75)) t(id,wert))
,m as (select * from (values (1, -17.4),(2, -32.5)) t(id,wert))
,r as (select * from (values (1, 42.3)) t(id,wert))
select *
from l
inner join m on l.id=m.id
left outer join r on l.id=r.id
;
The same query in Oracle syntax (using dual):
with
l as (select 1 ID, 3.14 Wert from dual union all select 2 ID, 2.75 Wert from dual)
,m as (select 1 ID, -17.4 Wert from dual union all select 2 ID, -32.5 Wert from dual)
,r as (select 1 ID, 42.3 Wert from dual)
select *
from l
inner join m on l.id=m.id
left outer join r on l.id=r.id
;
Hi @kaelte, that works, but it's not the same: you join with an "on" condition rather than with "using". The following corresponds to the above Spark query, and it works, you're right...
with
l as (select 1 ID, 3.14 Wert from dual union all select 2 ID, 2.75 Wert from dual)
,m as (select 1 ID, -17.4 Wert from dual union all select 2 ID, -32.5 Wert from dual)
,r as (select 1 ID, 42.3 Wert from dual)
select *
from l inner join m using (id)
left outer join r using (id)
;
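For intuition, here is a toy sketch (plain Scala, hypothetical helper names, not Spark's API) of the difference in the resulting column sets: an "on" join keeps both copies of the key column, while a "using" join coalesces the key into a single column. Note that even with "using", the two non-key wert columns stay duplicated, which is exactly the ambiguity discussed here.

```scala
object JoinColumns {
  // Column names an "on" join produces: both sides, key columns included twice.
  def onJoinCols(left: Seq[String], right: Seq[String]): Seq[String] =
    left ++ right

  // Column names a "using" join produces: keys coalesced into one column,
  // all other columns from both sides kept as-is (duplicates and all).
  def usingJoinCols(left: Seq[String], right: Seq[String], keys: Seq[String]): Seq[String] =
    keys ++ left.filterNot(keys.contains) ++ right.filterNot(keys.contains)

  def main(args: Array[String]): Unit = {
    val l = Seq("id", "wert")
    val m = Seq("id", "wert")
    println(onJoinCols(l, m).mkString(","))               // id,wert,id,wert
    println(usingJoinCols(l, m, Seq("id")).mkString(",")) // id,wert,wert
  }
}
```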
So we should try to make temporalJoins work as well. I'm doing some experiments right now, but I don't know yet where they will lead. See branch bug/issue_33.
The problem is that after joining dfL with dfM into dfLM in https://github.com/zzeekk/spark-temporalquery/blob/9a7e82427d22ef18441a1b802af271ef1e35c12e/src/test/scala/ch.zzeekk.spark.temporalquery/TemporalQueryUtilTest.scala#L819 the aliases dfL and dfM of the two wert columns are lost.
One could retrieve them from the logical plan by descending recursively to the last subquery alias in which the two wert columns occur.
Search for the last alias of columns wert#153
and wert#158
in
Project [coalesce(id#9, id#26) AS id#197, wert#153, wert#158, greatest(gueltig_ab#10, gueltig_ab2#163) AS gueltig_ab#198, least(gueltig_bis#11, gueltig_bis2#168) AS gueltig_bis#199]
+- Join Inner, (((true && (id#9 = id#26)) && (gueltig_ab#10 <= gueltig_bis2#168)) && (gueltig_bis#11 >= gueltig_ab2#163))
:- SubqueryAlias `dfL`
: +- Project [id#9, gueltig_ab#10, gueltig_bis#11, wert_l#12 AS wert#153]
: +- Project [...]
: +- LocalRelation [...]
+- Project [id#26, gueltig_ab2#163, gueltig_bis#28 AS gueltig_bis2#168, wert#158]
+- Project [id#26, gueltig_ab#27 AS gueltig_ab2#163, gueltig_bis#28, wert#158]
+- SubqueryAlias `dfM`
+- Project [id#26, gueltig_ab#27, gueltig_bis#28, wert_r#29 AS wert#158]
+- Project [...]
+- LocalRelation [...]
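That recursive search could look roughly like this. The case classes below are a simplified stand-in for Catalyst's plan nodes (not the real Spark classes), using the expression ids from the plan dump above; the point is the deepest-first traversal that returns the last SubqueryAlias still producing a given column id.

```scala
// Simplified model of a logical plan tree (not Spark's Catalyst API).
sealed trait Plan {
  def output: Seq[Long]      // expression ids of the columns this node produces
  def children: Seq[Plan]
}
case class SubqueryAlias(alias: String, child: Plan) extends Plan {
  def output: Seq[Long] = child.output
  def children: Seq[Plan] = Seq(child)
}
case class Project(output: Seq[Long], child: Plan) extends Plan {
  def children: Seq[Plan] = Seq(child)
}
case class Join(left: Plan, right: Plan) extends Plan {
  def output: Seq[Long] = left.output ++ right.output
  def children: Seq[Plan] = Seq(left, right)
}
case class Leaf(output: Seq[Long]) extends Plan {
  def children: Seq[Plan] = Nil
}

object AliasFinder {
  // Deepest-first: prefer an alias found further down the tree,
  // fall back to this node if it is itself a matching SubqueryAlias.
  def lastAliasOf(plan: Plan, exprId: Long): Option[String] = {
    val fromChildren = plan.children.flatMap(c => lastAliasOf(c, exprId)).headOption
    fromChildren.orElse(plan match {
      case SubqueryAlias(alias, child) if child.output.contains(exprId) => Some(alias)
      case _ => None
    })
  }

  def main(args: Array[String]): Unit = {
    // Mirrors the plan dump: wert#153 originates under dfL, wert#158 under dfM.
    val dfL = SubqueryAlias("dfL",
      Project(Seq(9L, 10L, 11L, 153L), Leaf(Seq(9L, 10L, 11L, 12L))))
    val dfM = Project(Seq(26L, 163L, 168L, 158L),
      SubqueryAlias("dfM", Project(Seq(26L, 27L, 28L, 158L), Leaf(Seq(26L, 27L, 28L, 29L)))))
    val plan = Project(Seq(197L, 153L, 158L, 198L, 199L), Join(dfL, dfM))
    println(AliasFinder.lastAliasOf(plan, 153L)) // Some(dfL)
    println(AliasFinder.lastAliasOf(plan, 158L)) // Some(dfM)
  }
}
```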
I had a similar issue with ambiguous columns after several joins on Spark 3.x, including some self-joins... I was able to solve it by renaming the join columns, and implemented that in commit 54faf9cc616df8d248f0dc4bde42c40b48df3f4a.
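The renaming idea can be sketched as follows: before the join, give each clashing non-key column a unique per-side prefix, so the joined result has no duplicate names, then rename back afterwards. This is only an illustration of the approach (plain Scala on column-name lists, hypothetical helper names; not the code from that commit).

```scala
object RenameJoin {
  // Columns present on both sides that are not join keys would clash after a join.
  def clashes(left: Seq[String], right: Seq[String], keys: Seq[String]): Seq[String] =
    left.intersect(right).filterNot(keys.contains)

  // Rename map for one side: clashing columns get a unique side prefix,
  // all other columns keep their name. The map is invertible, so the
  // original names can be restored after the join.
  def renames(side: String, cols: Seq[String], clash: Seq[String]): Map[String, String] =
    cols.map(c => c -> (if (clash.contains(c)) s"_${side}_$c" else c)).toMap

  def main(args: Array[String]): Unit = {
    val l = Seq("id", "wert")
    val m = Seq("id", "wert")
    val c = clashes(l, m, Seq("id"))
    println(c)                       // List(wert)
    println(renames("L", l, c))      // Map(id -> id, wert -> _L_wert)
    println(renames("M", m, c))      // Map(id -> id, wert -> _M_wert)
  }
}
```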
One cannot rejoin the result of an inner join if it contains two equally named columns, despite their different aliases:
https://github.com/zzeekk/spark-temporalquery/blob/5f054bfe23ee948dbdd3917c117f0a349fa022d9/src/main/scala/ch/zzeekk/spark/temporalquery/TemporalQueryUtil.scala#L213
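The failure mode can be illustrated with a toy name resolver: looking up a plain column name in a schema that contains it twice has no unique answer, regardless of which aliases the columns once carried. This is a simplified model, not Spark's actual resolution logic.

```scala
object Resolver {
  // Resolve a column name against a flat schema of column names:
  // exactly one match succeeds, zero or several matches fail.
  def resolve(schema: Seq[String], name: String): Either[String, Int] =
    schema.zipWithIndex.filter(_._1 == name) match {
      case Seq((_, i)) => Right(i)
      case Seq()       => Left(s"cannot resolve '$name'")
      case many        => Left(s"reference '$name' is ambiguous (${many.size} matches)")
    }

  def main(args: Array[String]): Unit = {
    println(resolve(Seq("id", "wert"), "wert"))         // Right(1)
    println(resolve(Seq("id", "wert", "wert"), "wert")) // ambiguous -> Left(...)
  }
}
```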