wesm / pandas2

Design documents and code for the pandas 2.0 effort.
https://pandas-dev.github.io/pandas2/

DESIGN: Wishlist from scikit-learn, keras, tensorflow? #52

Open wesm opened 7 years ago

wesm commented 7 years ago

What can pandas provide in the way of a C/C++/Cython API to better enable upstack ML / statistical libraries? cc @ogrisel and @amueller, who might have some good perspectives.

amueller commented 7 years ago

Thanks for reaching out :)

I think @ogrisel has thought more about this than me, and maybe @gaelvaroquaux and @jnothman too. Just thinking out loud for now.

I guess there are two main reasons why we would like to have access to dataframes on a C/Cython level:

For the first one, we ideally wouldn't want to write any pandas-specific code. So if a dataframe could provide a Cython typed memoryview interface, that might solve the use case, though the question is whether that might be a lot slower than doing a copy if the memory is not laid out nicely (for example, not contiguous)?
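To make the view-versus-copy trade-off concrete, here is a minimal sketch using only Python's built-in `memoryview` and the `array` module. It is not pandas or Cython code: the row-major layout and the names `flat`, `table`, `col0_view`, and `col0_copy` are assumptions made for illustration. It shows that a column taken from row-major storage is a zero-copy but strided view, while a copy is contiguous at the cost of extra memory.

```python
from array import array

# Hypothetical 3x2 table of float64 values stored row-major in one buffer:
# rows are (1.0, 10.0), (2.0, 20.0), (3.0, 30.0).
n_cols = 2
flat = array("d", [1.0, 10.0, 2.0, 20.0, 3.0, 30.0])
table = memoryview(flat)

# Column 0 as a zero-copy view: every n_cols-th element, so the view is strided.
col0_view = table[0::n_cols]

# Column 0 as a contiguous copy: extra memory, but unit-stride access.
col0_copy = memoryview(array("d", col0_view.tolist()))


def total(values):
    """Sum a 1-D buffer of floats; works for strided and contiguous buffers."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


view_is_contiguous = col0_view.contiguous  # False: elements are 2 * 8 bytes apart
copy_is_contiguous = col0_copy.contiguous  # True
same_result = total(col0_view) == total(col0_copy)

# Writing through the view changes the original buffer; the copy is unaffected.
col0_view[0] = 5.0
```

Whether the strided view or the copy is faster in a tight Cython loop depends on the access pattern and the data size, so this only illustrates the layout difference, not a benchmark.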

For the second use case, I would think that writing dataframe-specific Cython (restricted to the trees for categoricals and missing values, and to imputation for missing values) would be OK. Supporting these data types directly without creating boolean masks might speed things up and make them much more convenient for the user.
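As a rough illustration of handling categoricals and missing values without a separate boolean mask, the sketch below partitions rows at a single tree node using integer category codes, with `-1` marking a missing value (the convention pandas uses for `Categorical` codes). The function `split_on_category` and its parameters are hypothetical and are not part of scikit-learn or pandas.

```python
MISSING = -1  # assumption: sentinel code for a missing category


def split_on_category(codes, left_categories, missing_goes_left=True):
    """Partition row indices for one tree node based on category codes.

    Missing values are routed by a per-node flag instead of a boolean mask.
    """
    left, right = [], []
    for row, code in enumerate(codes):
        if code == MISSING:
            goes_left = missing_goes_left
        else:
            goes_left = code in left_categories
        (left if goes_left else right).append(row)
    return left, right


codes = [0, 2, MISSING, 1, 0, MISSING]
left_rows, right_rows = split_on_category(codes, left_categories={0, 1})
# left_rows == [0, 2, 3, 4, 5], right_rows == [1]
```

A real implementation would live in typed Cython loops over the code buffer; the point here is only that the sentinel travels with the data, so no second array is needed.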

We don't really want a pandas dependency, but if the DataFrame API were defined in Cython (that's how it goes for numpy, right?) that would probably work for us. In that case, something like the typed memoryview with indexing and slicing and the right data types would be enough? That might already be possible with pandas; I don't really know the code. I guess apart from our limited bandwidth, what mostly kept us from working more with dataframes was that we don't want a pandas dependency and that we don't want to code against the codebase, as opposed to a well-defined API.
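One way to picture "coding against a well-defined API rather than a codebase" is a small interface that the consumer depends on without importing the provider. The sketch below uses `typing.Protocol` in plain Python; the interface `FloatTable`, its methods, and the stand-in `ToyTable` are all hypothetical and do not correspond to any existing pandas API.

```python
from array import array
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class FloatTable(Protocol):
    """Hypothetical minimal interface a consumer library could code against."""

    def column_names(self) -> Sequence[str]: ...

    def column(self, name: str) -> memoryview: ...


class ToyTable:
    """Stand-in provider; a real dataframe library would expose its own buffers."""

    def __init__(self, columns):
        self._columns = {name: array("d", values) for name, values in columns.items()}

    def column_names(self):
        return list(self._columns)

    def column(self, name):
        # Zero-copy view over the column's float64 storage.
        return memoryview(self._columns[name])


def column_means(table: FloatTable):
    """Consumer code: depends only on the interface, not on the provider."""
    means = {}
    for name in table.column_names():
        buf = table.column(name)
        means[name] = sum(buf) / len(buf)
    return means


table = ToyTable({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0]})
means = column_means(table)  # {"x": 2.0, "y": 15.0}
```

The consumer never imports `ToyTable`, so any library that provides the same two methods would work without becoming a dependency.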

I guess we are pretty simple in that we require homogeneous float dataframes for basically everything, apart from some corner cases where we allow certain more general input. But the types of operations we do and the types of input we consume are pretty restricted. We're not gonna consume any complex nested data structures anytime soon (or hopefully ever).
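The "homogeneous float" requirement amounts to a narrow input check. Here is a minimal sketch in plain Python, assuming columns arrive as `array` objects keyed by name; the helper `as_float_matrix` is hypothetical and is not how scikit-learn validates input.

```python
from array import array


def as_float_matrix(columns):
    """Return rows of floats, rejecting non-float64 or ragged columns."""
    lengths = set()
    for name, col in columns.items():
        if not isinstance(col, array) or col.typecode != "d":
            raise TypeError(f"column {name!r} is not float64")
        lengths.add(len(col))
    if len(lengths) > 1:
        raise ValueError("columns have different lengths")
    # Transpose the column-wise storage into a list of rows.
    return [list(row) for row in zip(*columns.values())]


good = {"a": array("d", [1.0, 2.0]), "b": array("d", [3.0, 4.0])}
matrix = as_float_matrix(good)  # [[1.0, 3.0], [2.0, 4.0]]
```

Anything else, such as an integer or nested column, is rejected up front instead of being converted silently.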