Evaluating Weld functions on Arrow memory without serialization

weld-project / weld

High-performance runtime for data analytics applications

https://www.weld.rs

BSD 3-Clause "New" or "Revised" License


Evaluating Weld functions on Arrow memory without serialization #236

Open wesm opened 7 years ago

wesm commented 7 years ago

Hi folks,

I'm interested to see if this is something that we could work on together. Weld has its own memory format, e.g. as expressed in

https://github.com/weld-project/weld/blob/master/python/grizzly/numpy_weld_convertor.cpp

and elsewhere. Weld doesn't support nulls at the moment, so at some point you will need to define a consistent memory model to accommodate missing data across all supported data types. Rather than doing something "proprietary" to Weld, it would make Weld composable / embeddable in other systems if it ran on top of a standard memory format. (Aside: it would probably be a good idea to implement missing data support in Weld soon since the systems you're benchmarking against support it, which makes the benchmarks not quite apples-to-apples.)

Having spent many years dealing with these issues in the pandas world, I assert that the Arrow memory layout is the best one available, and it is where I'll be investing all my effort over the coming years for the pandas2 effort. See my recent write-up about this: http://wesmckinney.com/blog/apache-arrow-pandas-internals/

I would be excited to put Weld to use in pandas2 if Arrow were its native memory format for table columns, or if there were at least an "Arrow mode" for Weld. If this is something I can help with, please let me know.

fsaintjacques commented 6 years ago

I think this is a blocker for serious/competitive usage of Weld. Zero-copy is mandatory.
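For readers less familiar with the term, here is a minimal sketch of what "zero-copy" means at the buffer level. It uses only the Python standard library and no Weld or Arrow APIs; the variable names are illustrative assumptions, and `array.array` merely stands in for a column buffer owned by some producer.

```python
import array

# A producer-owned buffer holding four int64 values (stand-in for a column).
column = array.array("q", [10, 20, 30, 40])

# Zero-copy path: the consumer gets a view over the very same memory.
view = memoryview(column)

# Copying path (what a serialize/deserialize round trip amounts to):
# the consumer gets its own duplicate of the bytes.
copied = array.array("q", column)

# A write by the producer is visible through the view but not in the copy.
column[0] = 99
view_sees_write = view[0] == 99
copy_sees_write = copied[0] == 99

# Slicing a view is also zero-copy: no element data is duplicated.
window = view[1:3]
print(view_sees_write, copy_sees_write, window.tolist())
```

The copying path costs time and memory proportional to the column size on every hand-off between systems, which is why a shared in-memory layout matters for embedding a runtime in another system.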

sppalkia commented 6 years ago

This is something we've been discussing for some time, and we're definitely interested in moving toward Arrow in the long run (as you said, we basically have an ad-hoc internal format that loosely resembles NumPy ndarrays right now)! Efficient missing data support is also something Weld currently lacks and Arrow seems to provide: so far we've handled missing data in benchmarks by generating explicit nullity bitmaps (as opposed to encoding them in the underlying format).
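To make the contrast concrete, here is a minimal sketch, in plain Python with no Arrow or Weld dependency, of the kind of layout Arrow uses for missing data: a validity bitmap (bit set means valid, least-significant bit first) stored next to a fixed-width data buffer. The helper names are hypothetical, and the sketch omits the buffer padding and alignment that the Arrow format recommends.

```python
import struct
from typing import List, Optional, Tuple


def to_arrow_style_buffers(values: List[Optional[int]]) -> Tuple[bytes, bytes, int]:
    """Split values into (validity bitmap, little-endian int64 data, null count)."""
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per slot, rounded up
    data = bytearray()
    null_count = 0
    for i, v in enumerate(values):
        if v is None:
            null_count += 1
            # A null slot still occupies space; its contents are not meaningful.
            data += struct.pack("<q", 0)
        else:
            bitmap[i // 8] |= 1 << (i % 8)  # LSB-first bit numbering
            data += struct.pack("<q", v)
    return bytes(bitmap), bytes(data), null_count


def is_valid(bitmap: bytes, i: int) -> bool:
    """Return True if slot i is marked valid in the bitmap."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


def masked_sum(bitmap: bytes, data: bytes, length: int) -> int:
    """Sum only the valid slots, skipping nulls via the bitmap."""
    total = 0
    for i in range(length):
        if is_valid(bitmap, i):
            total += struct.unpack_from("<q", data, i * 8)[0]
    return total


bitmap, data, null_count = to_arrow_style_buffers([1, None, 3, 4, None])
print(bin(bitmap[0]), null_count, masked_sum(bitmap, data, 5))
```

Because the bitmap is part of the column's layout rather than a separate array built per benchmark, any consumer that understands the format can honor nulls without extra conventions.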

If there are people interested in working on this immediately, we’d be happy to help out and provide support! Wes, do you know anyone who would be interested in taking this on as a side project? It’s something we may be able to prioritize after the summer.

On Thu, Apr 19, 2018 at 7:24 AM François Saint-Jacques wrote:

I think this is a blocker for serious/competitive usage of weld. Zero-copy is mandatory.


-- Shoumik

wesm commented 6 years ago

I'm very interested in the subgraph compiler problem for Arrow. It might be that we need to define a slightly higher level Arrow analytics IR that lowers to Weld DSL or IR. I'm not sure when I will personally have time for it, but I hope to fund this work once we raise more money at https://ursalabs.org/

SemanticBeeng commented 5 years ago

@wesm, @sppalkia: would you be able to plot a path (goals, list of todos, online resources) for 2-3 months' worth of work on this?

I'm quite interested and have a "stake" in this. I'm part of a small dev group that might be interested in putting some effort into it. We can handle JVM, Python, and C/C++ quite well, but only basic Rust.

I would appreciate vision- and architecture-level input, such as how this fits with Gandiva, Arrow Flight, etc., and how the two communities see it fitting with @rustlang memory management.

See https://youtu.be/HgtRAbE1nBM?t=2364 for the implications of memory management for "software composition", which is especially important for distributed data analyses sharing the same data fabric.

It's a very long video, so I posted some highlights here: http://www.evernote.com/l/AK98eIiEFRVJ-4weYv4ZtHIOyjuTDgJ58iY/.

@mateiz, would you care to share some insights on whether such an integration is of interest to you, and whether you can see benefits of such a data fabric for ML flow?