wesm / pandas2

Design documents and code for the pandas 2.0 effort.
https://pandas-dev.github.io/pandas2/

Unified merge API #31

Open chrisaycock opened 8 years ago

chrisaycock commented 8 years ago

We have merge() and merge_asof(), and there may come a time when we also apply functions to overlapping columns. As someone who wants to join two tables together, I just want a single mechanism for doing so.

I wonder if it's possible to have a single API like:

merge(
    left,     # DataFrame or Table
    right,    # DataFrame or Table
    on,       # one or more columns
    asof,     # one or more columns
    how,      # 'left', 'right', 'inner', 'outer'
    overlap,  # optional function to apply to overlapping column names
)

Users must specify at least one of on or asof. There can also be left_on/right_on and left_asof/right_asof. We could even have left_index/right_index for the poor souls who still have indexed data (https://github.com/pydata/pandas-design/issues/17).
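To make the dispatch rule concrete, here is a rough sketch of how such a front end might sit on top of the functions pandas has today. The name unified_merge is hypothetical, not an existing pandas function. The sketch assumes a single asof column, passes on through as merge_asof's by argument, and only accepts how='left' for as-of joins because that is the join merge_asof performs.

```python
import pandas as pd


def unified_merge(left, right, on=None, asof=None, how="inner"):
    """Hypothetical front end; not part of pandas."""
    if on is None and asof is None:
        raise ValueError("specify at least one of `on` or `asof`")
    if asof is None:
        # Plain equality join.
        return pd.merge(left, right, on=on, how=how)
    if how != "left":
        # Assumption of this sketch: merge_asof behaves like a left join.
        raise NotImplementedError("as-of joins here only support how='left'")
    # merge_asof requires both frames to be sorted by the as-of key;
    # the exact-match columns (if any) are passed as `by`.
    return pd.merge_asof(
        left.sort_values(asof), right.sort_values(asof), on=asof, by=on
    )


trades = pd.DataFrame(
    {"time": [1, 5, 10], "ticker": ["A", "B", "A"], "price": [10.0, 20.0, 11.0]}
)
quotes = pd.DataFrame(
    {"time": [0, 4, 9], "ticker": ["A", "B", "A"], "bid": [9.5, 19.5, 10.5]}
)
names = pd.DataFrame({"ticker": ["A", "B"], "name": ["Alpha", "Beta"]})

exact = unified_merge(trades, names, on="ticker", how="left")
asof_joined = unified_merge(trades, quotes, on="ticker", asof="time", how="left")
```

The left_on/right_on and left_asof/right_asof variants are left out here; they would presumably map onto the corresponding left_on/right_on and left_by/right_by arguments in the same way.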

The overlap is for when the same column name appears in both tables. Currently those columns are renamed with a suffix (though I'd be in favor of just raising an error). But there are times when I want to apply a function instead. There are ways to do this with arithmetic operations (https://github.com/pydata/pandas-design/issues/30), though I think any function with two arguments would be nice, including overwriting the left with the right (for handling cases of missing data with a "fill" result).
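As an illustration of what the overlap argument might do, here is a minimal sketch built on the existing suffix behavior. merge_with_overlap is a hypothetical helper, not a pandas API. It assumes the temporary "_left"/"_right" suffixed names do not collide with other columns, and it raises when columns overlap and no function is given, as suggested above.

```python
import operator

import pandas as pd


def merge_with_overlap(left, right, on, how="inner", overlap=None):
    """Hypothetical helper; not part of pandas."""
    keys = [on] if isinstance(on, str) else list(on)
    # Non-key columns that appear in both tables.
    shared = [c for c in left.columns if c in right.columns and c not in keys]
    if shared and overlap is None:
        raise ValueError(f"overlapping columns: {shared}")
    merged = pd.merge(left, right, on=keys, how=how, suffixes=("_left", "_right"))
    for col in shared:
        # Replace the two suffixed columns with overlap(left_col, right_col).
        left_col = merged.pop(f"{col}_left")
        right_col = merged.pop(f"{col}_right")
        merged[col] = overlap(left_col, right_col)
    return merged


left = pd.DataFrame({"key": [1, 2, 3], "value": [1.0, None, 3.0]})
right = pd.DataFrame({"key": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# "Fill" result: keep the left value, fall back to the right where it is missing.
filled = merge_with_overlap(left, right, on="key", overlap=lambda l, r: l.fillna(r))

# Arithmetic on the overlapping column.
summed = merge_with_overlap(left, right, on="key", overlap=operator.add)
```

Any two-argument function over a pair of Series fits this shape, so the arithmetic case and the fill case go through the same code path.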

Note that this doesn't handle my proposed merge_window() (https://github.com/pydata/pandas/issues/13959). The semantics there are very specific, and I'm not sure how to fit them into a unified structure like the one above, though I'd love to hear any ideas.