yhat / pandasql

sqldf for pandas
MIT License
1.33k stars 185 forks source link

Left Join functioning same as Inner Join #64

Open chelsbells opened 6 years ago

chelsbells commented 6 years ago

When running a Left join, rows in the left table without a match in the right table are excluded from results table. This emulates the expected results of an inner join, whereas in a left join, one would expect these excluded rows to remain with nulls for the values of the joined fields.

I've set up the following dummy tables as pandas data frames.

R = pd.DataFrame({'jid':[1,3,1,2,3], 'date':['2000-02-04','2000-01-05','2000-01-30' \ ,'2000-03-10','2000-04-28'], 'amount':[1,2,3,4,5]}) L = pd.DataFrame({'jid':[1,1,2,3], 'start': ['2000-01-01', '2000-01-02', '2000-03-01' \ , '2000-05-01'], 'end': ['2000-01-31', '2000-02-28', '2000-03-31', '2000-05-31'] })

R.date = pd.to_datetime(R.date) L.start = pd.to_datetime(L.start) L.end = pd.to_datetime(L.end)

Table R

jid date amount
1 2000-02-04 1
3 2000-01-05 2
1 2000-01-30 3
2 2000-03-10 4
3 2000-04-28 5

Table L

jid start end
1 2000-01-01 2000-01-31
1 2000-02-01 2000-02-28
2 2000-03-01 2000-03-31
3 2000-05-01 2000-05-31

sqlLeft = """ SELECT L., R. FROM L LEFT JOIN R ON L.jid = R.jid WHERE R.date BETWEEN L.start AND L.end """ sqlin = """ SELECT L., R. FROM L INNER JOIN R ON L.jid = R.jid WHERE R.date BETWEEN L.start AND L.end """

resultsL = pandasql.sqldf( sqlLeft, locals() ) resultsin = pandasql.sqldf( sqlin, locals() )

Expected Results of Left Join

jid start end jid date amount
1 2000-01-01 2000-01-31 1 2000-01-30 3
1 2000-02-01 2000-02-28 1 2000-02-04 1
2 2000-03-01 2000-03-31 2 2000-03-10 4
3 2000-05-01 2000-05-31 NaN NaN NaN

Actual Results of Left Join

jid start end jid date amount
1 2000-01-01 2000-01-31 1 2000-01-30 3
1 2000-02-01 2000-02-28 1 2000-02-04 1
2 2000-03-01 2000-03-31 2 2000-03-10 4

Results of Inner Join

jid start end jid date amount
1 2000-01-01 2000-01-31 1 2000-01-30 3
1 2000-02-01 2000-02-28 1 2000-02-04 1
2 2000-03-01 2000-03-31 2 2000-03-10 4
yumok commented 5 years ago

If you use R parameters in the where clause, NULL values don't evaluate to true and are dismissed (so you get an INNER JOIN)

change your sqlLeft to """ SELECT L., R. FROM L LEFT JOIN R ON L.jid = R.jid AND R.date BETWEEN L.start AND L.end """