Status of JS reader for feather files

ellisonbg commented 7 years ago

JupyterLab will soon have the ability to render very large tables (millions of rows in memory, trillions for virtual datasets). We are starting to look at building native feature support into JupyterLab so a user can simply click on a feather file and view it interactively. The main barrier for us is having a solid JS library for parsing the files into consumable JSON data.

I see there are some JS bindings in this repo and am wondering about their maturity, stability, etc. Are folks open to refactoring them into an installable npm package?

Here is the issue on JupyterLab tracking these things:

https://github.com/jupyterlab/jupyterlab/issues/2422

The table viewer we will be using is in phosphor.js here:

https://github.com/phosphorjs/phosphor/pull/283

wesm commented 7 years ago

Hi Brian -- I'm adding @elahrvivaz, @TheNeuralBit, and @anthonyccri from CCRI who are involved with the GeoMesa project. They have been doing Java and JS (TypeScript, see https://github.com/apache/arrow/tree/master/js) development of Apache Arrow, for a very similar use case to what you're describing -- dealing with large tables and doing slicing/dicing and visualization on the client side.

The Arrow metadata is more general than Feather, but the data memory layout is the same, so providing a Feather interface in JS is possible. Long term, though, it's a much better idea to invest in the main Arrow format because it supports streaming, chunked files, nested data, and other features that aren't in Feather. In fact, I hope to deprecate the Feather metadata as soon as R has a solid binding to the Arrow C++ libraries. So far we're missing a champion in the R community to take on this task.

I'm excited to see this machinery fall into place, and standardization on an on-wire columnar memory format (i.e. Apache Arrow) is a no-brainer -- I was talking with @scottdraves about this recently so he may like to follow the discussion.

TheNeuralBit commented 7 years ago

Thanks @wesm! Yes we've implemented a TypeScript arrow file and stream reader, and we've started work on a library that adds the ability to perform queries and count-bys on arrow data using those readers. We're planning on open sourcing this second library soon, or maybe just adding it to the main apache arrow library if that makes sense.

We've been investigating using an in-memory columnar format for interactively visualizing geo-spatial data, and have had a lot of success. We recently put together a video of our tools displaying 36 million FlightAware records. The chart at the bottom is a histogram of records by aircraft type, filtered by the current time window. Each bar can be selected to display only records with that aircraft type. And it all happens fast enough to render a smooth animation with a CPU, no WebGL.
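To make the idea concrete, here is a minimal sketch of a filtered count-by over columnar data. It is not the CCRI library or the Arrow JS API: the `FlightColumns` layout, the `countByAircraftType` function, and the sample values are all hypothetical, chosen only to show why one typed array per column keeps this kind of scan cheap.

```typescript
// Hypothetical columnar layout (an assumption for illustration, not Arrow JS):
// one typed array per column, with aircraft type stored as a dictionary index.
interface FlightColumns {
  timestamp: Float64Array;   // epoch milliseconds, one entry per record
  aircraftCode: Uint16Array; // index into aircraftTypes, one entry per record
  aircraftTypes: string[];   // dictionary of distinct aircraft type names
}

// Count records per aircraft type whose timestamp falls in [start, end).
function countByAircraftType(
  cols: FlightColumns,
  start: number,
  end: number
): Map<string, number> {
  // One counter per dictionary entry; no per-row objects are allocated.
  const counts = new Uint32Array(cols.aircraftTypes.length);
  for (let i = 0; i < cols.timestamp.length; i++) {
    const t = cols.timestamp[i];
    if (t >= start && t < end) {
      counts[cols.aircraftCode[i]]++;
    }
  }
  const result = new Map<string, number>();
  cols.aircraftTypes.forEach((name, code) => result.set(name, counts[code]));
  return result;
}

// Made-up sample data: five records, three aircraft types.
const sample: FlightColumns = {
  timestamp: Float64Array.from([1000, 2000, 3000, 4000, 5000]),
  aircraftCode: Uint16Array.from([0, 1, 0, 2, 1]),
  aircraftTypes: ["B738", "A320", "E175"],
};

// Histogram for the time window [1000, 4000).
const histogram = countByAircraftType(sample, 1000, 4000);
console.log(histogram);
```

Because the loop touches only two flat arrays, re-running it whenever the time window moves is a single linear pass, which is the property an interactive histogram depends on.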

@ellisonbg to answer your questions - at this point I'd say our arrow JS bindings are still pretty immature, but it would be great to have another party using the tools to help solidify the interface. I've tried to make sure both libraries will be installable via npm (see https://github.com/apache/arrow/pull/663). We haven't actually published a release yet, but you can "install" the library with npm link.

wesm commented 7 years ago

> We're planning on open sourcing this second library soon, or maybe just adding it to the main apache arrow library if that makes sense.

I would say this is in scope for Arrow. We're going to be starting a C++ analytics library for Arrow soon, and I hope that Java starts one as well.

wesm commented 7 years ago

Could we start a document someplace to enumerate requirements for an Arrow JS implementation for JupyterLab? I think it would help rally the troops to have a TODO list in JIRA.

scottdraves commented 7 years ago

This would be awesome to have!

jakevdp commented 6 years ago

I would love to be able to push toward feather as a data source format for vega/vega-lite visualizations. Any updates on this?

wesm commented 6 years ago

This is likely pretty straightforward to do now given the progress in the Arrow JS library -- that's where I would do the work (with the caveat that the Feather format is likely to see a major internal iteration in the next 12 months, so any work done now will need to be done again for JS -- R and Python will share the same C++ code as now). cc @trxcllnt @TheNeuralBit

TheNeuralBit commented 6 years ago

@jakevdp @wesm yes! I would love to use Arrow as a data source for vega/vega-lite, and like Wes said I think we should be pretty close given the current state of Arrow JS.

Unfortunately I'm about to be traveling for a month, but after that I'll actually be moving to Seattle, and I'm very interested in helping out with this. In the meantime, if you want to start to tackle this yourself check out:

Source: https://github.com/apache/arrow/tree/master/js
Docs: http://arrow.apache.org/docs/js/
Example usage on Observable: https://beta.observablehq.com/@theneuralbit/introduction-to-apache-arrow
Apache Arrow Slack (https://apachearrow.slack.com): @trxcllnt and I are usually available in the javascript channel

ellisonbg commented 6 years ago

@jakevdp - are you thinking of adding native support to VegaLite itself, or adding support in Altair and the various renderers?

jakevdp commented 6 years ago

I think it would be worth exploring pushing support into vega itself, so that data can be serialized to file more efficiently.

ellisonbg commented 6 years ago

I think that would be great! I also imagine JupyterLab growing native support for Arrow over time. But the benefits of that won't be fully realized if no other libraries can work with the format.

wesm commented 6 years ago

I think as soon as JupyterLab has some basic widgets for interacting with Arrow binary data (like a table viewer and Altair-based plotting widgets), it would be a pretty good carrot to get other systems sending data to JLab in that format.

domoritz commented 6 years ago

I'd love to have support for some column-based/binary format in Vega. I might have an undergrad working with me this summer who could help with some of this. Can you add me to the Slack (it doesn't work with my @cs.washington.edu address)?

Also, could you clarify the distinction between .arrow, parquet, and feather files? I wanted to use this in a Vega test project but got stuck at the point where I create the binary files from a pandas df.

xhochy commented 6 years ago

@domoritz For slack you can register via the app at https://apachearrowslackin.herokuapp.com/

wesm commented 6 years ago

@domoritz I replied on the GitHub issue. To answer your questions about the file formats:

There aren't technically .arrow files; for lack of a better term, we use "file format" to describe the random access format described in http://arrow.apache.org/docs/ipc.html.
Parquet format is the columnar storage format defined in the Apache Parquet project (https://github.com/apache/parquet-format). It doesn't have anything specifically to do with Arrow.
Feather files are a simpler, more limited file format that predates the Arrow IPC stream and file formats noted above. As soon as there are R bindings available for the Arrow C++ libraries, I plan to replace the Feather internals with the Arrow IPC format to bring more features to the table: http://wesmckinney.com/blog/feather-arrow-future/
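One practical way to tell the three containers apart is by their magic bytes. The sketch below is an illustration only, written on the assumption (taken from the published format descriptions, not from this thread) that Arrow IPC files are framed by the ASCII string `ARROW1`, original Feather files by `FEA1`, and Parquet files by `PAR1`, each appearing at both the start and the end of the file. The names `sniffFormat`, `hasAscii`, and `fakeFile` are hypothetical helpers, not part of any library.

```typescript
type ContainerFormat = "arrow-file" | "feather-v1" | "parquet" | "unknown";

// True if `text` (ASCII) appears in `bytes` starting at `offset`.
function hasAscii(bytes: Uint8Array, offset: number, text: string): boolean {
  if (offset < 0 || offset + text.length > bytes.length) return false;
  for (let i = 0; i < text.length; i++) {
    if (bytes[offset + i] !== text.charCodeAt(i)) return false;
  }
  return true;
}

// Guess the container format from leading and trailing magic bytes.
// Assumption: each format writes its magic string at both ends of the file.
function sniffFormat(bytes: Uint8Array): ContainerFormat {
  const framed = (magic: string): boolean =>
    bytes.length >= magic.length * 2 &&
    hasAscii(bytes, 0, magic) &&
    hasAscii(bytes, bytes.length - magic.length, magic);

  if (framed("ARROW1")) return "arrow-file";
  if (framed("FEA1")) return "feather-v1";
  if (framed("PAR1")) return "parquet";
  // The Arrow IPC *stream* format has no such framing, so it lands here too.
  return "unknown";
}

// Hypothetical helper: fabricate a buffer with only the framing bytes set.
// Real files carry schema and data between the two magic strings.
function fakeFile(magic: string): Uint8Array {
  const bytes = new Uint8Array(magic.length * 2 + 8);
  for (let i = 0; i < magic.length; i++) {
    bytes[i] = magic.charCodeAt(i);
    bytes[bytes.length - magic.length + i] = magic.charCodeAt(i);
  }
  return bytes;
}

console.log(sniffFormat(fakeFile("ARROW1")));
console.log(sniffFormat(fakeFile("FEA1")));
console.log(sniffFormat(fakeFile("PAR1")));
```

A check like this only identifies the container; actually decoding the contents still needs a real reader for the corresponding format.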

wesm commented 4 years ago

Closing this. The recommended path for JavaScript is to use the Arrow IPC protocol, which is supported and integration tested in JavaScript and is being used in a variety of places (https://github.com/finos/perspective is a good example). I don't think there's a great deal of immediate value in implementing Feather support in JS for this use case.

wesm / feather

Status of JS reader for feather files #308