is the format stable? - Githubissues

wesm / feather

Feather: fast, interoperable binary data frame storage for Python, R, and more powered by Apache Arrow

Apache License 2.0


is the format stable? #183

Closed njsmith closed 8 years ago

njsmith commented 8 years ago

My impression from talking to @wesm before was that it wasn't yet, and that you still reserved the right to make breaking changes (esp. since the format isn't even documented outside of the source code, never mind having been peer reviewed). But it looks like people are using it now, and there doesn't seem to be any statement either way anywhere public that I can find.

It would be good to add a prominent notice to the readme describing what the current stability (non-)guarantees actually are.

wesm commented 8 years ago

From the initial announcements (e.g. http://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/):

When Should You Not Use Feather? Feather is not designed for long-term data storage. At this time, we do not guarantee that the file format will be stable across versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.

I agree we should put this prominently in the README here on GitHub. Sorry about this oversight.

wesm commented 8 years ago

I should point out a major motivation in keeping the format malleable was to invite contributions from the community (e.g. peer review). For example, someone recently suggested adding "row groups" (so that you can write a file in multiple chunks rather than all at once).

chasemc commented 5 years ago

I'm trying to implement a solution for storing data with cross-language support in an R-based bioinformatics application, but I have seen some posts warning that the feather format is still not stable, even though the long-term storage warning was removed (https://github.com/wesm/feather/commit/c1052ecc2af7ee7df432b0ef8502c41810211a5a) two years ago.

Can it be confirmed if it will be stable moving forward?

I can track the feather version in the application but would rather not go through the trouble if unnecessary.

Thanks!

xhochy commented 5 years ago

@chasemc The feather format will probably never be stable. For long-term storage it is better to use a format like Apache Parquet, which is supported by pyarrow in Python and arrow in R: https://github.com/apache/arrow/blob/master/r/R/parquet.R

wesm commented 5 years ago

If you do store data as Feather, there will always be a way to migrate the files to Parquet format (e.g. using pyarrow) if there is a breaking change. I have been waiting for a few years for the R community to ship Apache Arrow for R, so as soon as that happens, I will work on the next iteration of the Feather format (based on the Arrow IPC protocol) to improve performance and features.

chasemc commented 5 years ago

Thanks for the quick replies @xhochy @wesm. Maybe it would be worth adding back to the README either the warning or @wesm's reply ("there will always be a way to migrate the files to Parquet format (e.g. using pyarrow) if there is a breaking change"). I think that could be helpful.

wesm commented 5 years ago

A PR would be welcome

multimeric commented 2 years ago

Hi all, sorry to unearth this thread, but this question has surfaced again regarding the viability of replacing other tabular formats (notably CSV) with feather. My understanding was that Feather v2 is identical to the in-memory Arrow format, which itself is stated to be stable:

The Arrow columnar format and protocol is considered stable, and we intend to make only backwards-compatible changes, such as additional data types.

Considering this, can you explain what about Feather v2 is unstable? Can we not make the same claims about stability as Arrow itself?

I take the point about Parquet having good guarantees, but it lacks many of the crucial features of Feather, such as O(1) random access, and for this reason does not solve my dilemma.

wesm commented 2 years ago

You can consider Feather V2 to be stable. If it meets your application requirements, I think it's safe to use in production. At minimum you'll always be able to download and install software artifacts that can read the files, and there will surely be a transition option (a library that can help migrate data) if there are ever breaking changes.