DESIGN: Wishlist from scikit-learn, keras, tensorflow?

Thanks for reaching out :)

I think @ogrisel has thought more about this than me, and maybe @gaelvaroquaux and @jnothman too. Just thinking out loud for now.

I guess there are two main reasons why we would like to have access to dataframes on a C/Cython level:

1. We want to avoid copying the data into a numpy array, purely to save the memory copy.
2. We want to use pandas features like categorical variables or missing values.

For the first one we ideally wouldn't want to write any pandas-specific code. So if a dataframe could provide a Cython typed memoryview interface, that might solve the use case -- though the question is whether that might be a lot slower than doing a copy if the memory is not aligned nicely (for example, not contiguous)?
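To make the copy-versus-view trade-off concrete, here is a minimal sketch using only the standard library's buffer protocol, which is the same mechanism Cython typed memoryviews build on. The row-major `block` layout is an assumption for illustration and says nothing about how pandas actually stores its data.

```python
from array import array

n_rows, n_cols = 4, 3

# Assumed layout: one row-major block of float64 values,
# where the entry at (row, col) is row * 10 + col.
block = array("d", [r * 10 + c for r in range(n_rows) for c in range(n_cols)])

flat = memoryview(block)      # zero-copy view over the whole buffer
column = flat[1::n_cols]      # column 1 as a strided view, still no copy

column_values = column.tolist()
column_strides = column.strides          # bytes between consecutive items
column_is_contiguous = column.contiguous

# The view shares memory with the block: a write to the block shows up in it.
block[1] = -1.0
shared_value = column[0]

# An explicit copy buys contiguous memory at the price of an allocation.
packed = array("d", column.tolist())
packed_is_contiguous = memoryview(packed).contiguous
```

The strided view avoids the copy, but every access jumps `n_cols` items ahead, which is where the "might be slower than a copy" question comes from.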

For the second use case, I would think that writing dataframe-specific Cython (restricted to the trees for categoricals and missing values, and to imputation for missing values) would be ok - supporting these data types directly without creating boolean masks might speed things up and make them much more convenient for the user.
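As a rough illustration of what "without creating boolean masks" could look like for a tree split, here is a sketch in plain Python. It assumes categoricals arrive as integer codes with `-1` marking a missing value; the `encode` and `split_rows` helpers are hypothetical and are not scikit-learn or pandas functions.

```python
from array import array

MISSING = -1  # assumed sentinel code for a missing value


def encode(values):
    """Map raw labels to integer codes; None becomes MISSING (hypothetical helper)."""
    categories = sorted({v for v in values if v is not None})
    index = {cat: code for code, cat in enumerate(categories)}
    codes = array("i", (MISSING if v is None else index[v] for v in values))
    return categories, codes


def split_rows(codes, left_codes, missing_go_left=True):
    """Partition row indices by category membership, with no separate mask array."""
    left, right = [], []
    for row, code in enumerate(codes):
        if code == MISSING:
            go_left = missing_go_left
        else:
            go_left = code in left_codes
        (left if go_left else right).append(row)
    return left, right


categories, codes = encode(["red", "blue", None, "green", "blue"])
left, right = split_rows(codes, left_codes={categories.index("blue")})
```

The missing value is handled in the same pass as the category test, so the caller never has to build or pass a mask alongside the data.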

We don't really want a pandas dependency, but if the DataFrame API was defined in Cython (that's how it goes for numpy, right?) that would probably work for us. In that case something like the typed memoryview with indexing and slicing and the right data types would be enough? That may already be possible with pandas, I don't really know the code. I guess apart from our limited bandwidth, what mostly kept us from working more with dataframes was that we don't want to have a pandas dependency and that we don't want to code against the codebase -- as opposed to a well-defined API.

I guess we are pretty simple in that we require homogeneous float dataframes for basically everything, apart from some corner-cases where we allow certain more general input. But the types of operations we do and the types of input we consume are pretty restricted. We're not gonna consume any complex nested data structures anytime soon (or hopefully ever).
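To sketch what coding against a well-defined API rather than a codebase might mean for the homogeneous-float case, here is a toy example. Everything in it is hypothetical: `ColumnSource`, `ToyFrame`, and `require_float64` are names made up for illustration, and the interface is just one possible shape, not a proposal for the actual pandas API.

```python
from array import array
from typing import Protocol, Sequence


class ColumnSource(Protocol):
    """Hypothetical minimal interface a consumer would code against."""

    def column_names(self) -> Sequence[str]: ...

    def column_buffer(self, name: str) -> memoryview: ...


class ToyFrame:
    """Stand-in container; note that it does not inherit from ColumnSource."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def column_names(self):
        return list(self._columns)

    def column_buffer(self, name):
        # Expose the column through the buffer protocol, without copying.
        return memoryview(self._columns[name])


def require_float64(source: ColumnSource) -> list:
    """Return one view per column, rejecting anything that is not float64."""
    views = []
    for name in source.column_names():
        view = source.column_buffer(name)
        if view.format != "d":
            raise TypeError(
                f"column {name!r} has format {view.format!r}, expected 'd'"
            )
        views.append(view)
    return views


frame = ToyFrame({"x": array("d", [1.0, 2.0]), "y": array("d", [3.0, 4.0])})
views = require_float64(frame)
column_sums = [sum(view) for view in views]
```

The consumer only touches the two methods and the buffer format, so it never needs to import the container's package.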
