zarr-developers / community

An open community with an interest in developing and using new technologies for tensor data storage.

DataFrames & Zarr #31

Open tbenst opened 4 years ago

tbenst commented 4 years ago

Hi, I recently learned about Zarr and am very interested in it, as it seems to solve some issues I have with HDF5.

Increasingly, DataFrames, as popularized by R and now widely used in Python (pandas) and Julia, are a critical structure in data science. I understand that Xarray has a Zarr backend option, but it's not clear to me whether this would support interop with other languages the way that, say, Parquet does.

I'm curious what the current state of affairs is for Zarr & DataFrames, and what the plans are for the future.

Thank you for the hard work!

alimanfoo commented 4 years ago

Hi @tbenst, thanks for asking, it's an interesting question.

I think the short answer is that Parquet provides a good solution for dataframe storage and has good library support and community momentum, so it is currently the best option for dataframe storage for use with distributed & parallel computing.

That said, back in 2016 (how time flies!) @jreback did some work exploring zarr for dataframe storage. The PR is here, with lots of relevant discussion in the comment thread: https://github.com/zarr-developers/zarr-python/pull/84

FWIW I think zarr could be used for columnar dataframe storage, and there are some interesting differences with parquet that haven't been fully explored yet. If someone in the community is interested in working in that direction, we'd be interested in any thoughts or experience, particularly if they might influence any choices we might make in designing the v3 core protocol spec.
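To make "columnar dataframe storage" a bit more concrete, here is a minimal toy sketch of the general idea: each column is kept as its own 1-D array, split into fixed-size chunks that are compressed independently and written to a key-value store alongside a small JSON metadata entry. This uses only the Python standard library and is not the zarr API or the zarr on-disk format; the helper names (`write_column`, `read_rows`), the key layout, and the chunk size are all assumptions made up for illustration.

```python
import json
import zlib
from array import array

CHUNK_ROWS = 4  # hypothetical chunk length, chosen only for this demo


def write_column(store, name, values, typecode):
    """Store one column as fixed-size, independently compressed chunks."""
    # Per-column metadata, kept next to the chunks (illustrative key layout).
    store[f"{name}/.meta"] = json.dumps(
        {"typecode": typecode, "length": len(values), "chunk_rows": CHUNK_ROWS}
    ).encode()
    for i in range(0, len(values), CHUNK_ROWS):
        chunk = array(typecode, values[i : i + CHUNK_ROWS])
        store[f"{name}/{i // CHUNK_ROWS}"] = zlib.compress(chunk.tobytes())


def read_rows(store, name, start, stop):
    """Read rows [start, stop) of one column, touching only the chunks needed."""
    meta = json.loads(store[f"{name}/.meta"])
    size = meta["chunk_rows"]
    stop = min(stop, meta["length"])
    out = []
    if start >= stop:
        return out
    for c in range(start // size, (stop - 1) // size + 1):
        chunk = array(meta["typecode"])
        chunk.frombytes(zlib.decompress(store[f"{name}/{c}"]))
        lo = max(start - c * size, 0)
        hi = min(stop - c * size, len(chunk))
        out.extend(chunk[lo:hi])
    return out


# A plain dict stands in for a directory or object store.
store = {}
write_column(store, "id", list(range(10)), "q")                  # 64-bit ints
write_column(store, "price", [i * 0.5 for i in range(10)], "d")  # doubles

print(read_rows(store, "price", 3, 6))
```

In this sketch, reading a row range from one column decompresses only the chunks that overlap the range and never touches the other columns. Because each column is chunked on its own, columns are also free to use different chunk sizes or compressors.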

But the main focus of the core development team is on N-dimensional arrays, so we are unlikely to have the capacity to do development work in that direction ourselves.

Just my 2c. I'm not a dataframes expert, so I'm very interested to hear other views.