Closed njsmith closed 8 years ago
From the initial announcements (e.g. http://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/):
When Should You Not Use Feather? Feather is not designed for long-term data storage. At this time, we do not guarantee that the file format will be stable across versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.
I agree we should put this prominently in the README here on GitHub. Sorry about this oversight.
I should point out a major motivation in keeping the format malleable was to invite contributions from the community (e.g. peer review). For example, someone recently suggested adding "row groups" (so that you can write a file in multiple chunks rather than all at once).
I'm trying to implement a solution for storing data with cross-language support in an R-based bioinformatics application, but I have seen some posts warning that the Feather format is still not stable, even though the long-term storage warning was removed from the README (https://github.com/wesm/feather/commit/c1052ecc2af7ee7df432b0ef8502c41810211a5a) two years ago.
Can it be confirmed if it will be stable moving forward?
I can track the feather version in the application but would rather not go through the trouble if unnecessary.
Thanks!
@chasemc The Feather format will probably never be stable. For long-term storage it is better to use a format like Apache Parquet, which is supported by pyarrow in Python and arrow in R: https://github.com/apache/arrow/blob/master/r/R/parquet.R
If you do store data as Feather, there will always be a way to migrate the files to Parquet format (e.g. using pyarrow) if there is a breaking change. I have been waiting for a few years for the R community to ship Apache Arrow for R, so as soon as that happens, I will work on the next iteration of the Feather format (based on the Arrow IPC protocol) to improve performance and features.
Thanks for the quick replies @xhochy @wesm. Maybe it would be worth adding back to the README either the warning or @wesm's reply ("there will always be a way to migrate the files to Parquet format (e.g. using pyarrow) if there is a breaking change"). I think that could be helpful.
A PR would be welcome
Hi all, sorry to unearth this thread, but this question has surfaced again regarding the viability of replacing other tabular formats (notably CSV) with feather. My understanding was that Feather v2 is identical to the in-memory Arrow format, which itself is stated to be stable:
The Arrow columnar format and protocol is considered stable, and we intend to make only backwards-compatible changes, such as additional data types.
Considering this, can you explain what about Feather v2 is unstable? Can we not make the same claims about stability as Arrow itself?
I take the point about Parquet having good guarantees, but it lacks many of the crucial features of Feather such as the O(1) random access, and for this reason does not solve my dilemma.
You can consider Feather V2 to be stable. If it meets your application requirements, I think it's safe to use in production. At minimum you'll always be able to download and install software artifacts that can read the files, and there will surely be a transition option (a library that can help migrate data) if there are ever breaking changes.
My impression from talking to @wesm before was that it wasn't yet, and that you still reserved the right to make breaking changes (esp. since the format isn't even documented outside of the source code, never mind having been peer reviewed). But it looks like people are using it now, and there doesn't seem to be any statement either way anywhere public that I can find.
It would be good to add a prominent notice to the README describing what the current stability (non-)guarantees actually are.