What is the use case of xnd over apache arrow?

nmichaud commented 6 years ago

Hello, was wondering what you plan on doing with xnd that isn't well supported by the arrow format. A github issue is probably not the best place for this discussion, but I couldn't find a mailing list. Thanks.

teoliphant commented 6 years ago

Primarily any container that is described by datashape which is more general than arrow --- as far as I am aware.

We will focus initially on tensor use-cases and look to be compatible in areas of overlap.

Xnd is a general purpose meta-container more than a single container.

Travis

On Mar 7, 2018 5:19 PM, "Naveen Michaud-Agrawal" notifications@github.com wrote:

Hello, was wondering what you plan on doing with xnd that isn't well supported by the arrow format. A github issue is probably not the best place for this discussion, but I couldn't find a mailing list. Thanks.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/plures/xnd/issues/6, or mute the thread https://github.com/notifications/unsubscribe-auth/AAPjoDfrgN_8J5rJoRFYdTEkDIHbig91ks5tb_fdgaJpZM4SgnHR .

nmichaud commented 6 years ago

Ah ok, it seems most of the underlying types and structured types (lists, structs, ragged hierarchies) are already well supported in arrow. Anyway looking at the docs it should be pretty cheap to convert from one format to another through memory copying.

skrah commented 6 years ago

Are they? I thought Arrow was limited to int32_t in sizes. Xnd is for in-memory computations, so we definitely need int64_t.

Also, the types that xnd uses (ndt_t) are a standard algebraic datatype that is relatively easy to use for traversing memory, which is needed in gumath. There is no dependency on an external C++ library like flatbuffers.

skrah commented 6 years ago

That said, we might translate Arrow to ndt_t in the future, but it is not an immediate priority (gumath is).

wesm commented 6 years ago

Are they? I thought Arrow was limited to int32_t in sizes. Xnd is for in-memory computations, so we definitely need int64_t.

This isn't quite true, see https://issues.apache.org/jira/browse/ARROW-750 -- support for very large variable-length collections is something we will eventually need to add to the format whenever there is demand for it.

In general, datasets will not be expected to be in a contiguous columnar memory block, but instead split across a collection of smaller chunks. We have discussed the 32- vs 64-byte issue for encoding collection lengths and the consensus has been that it is not worth the extra 4 bytes of overhead per value when the "large collection" case represents a very small percentage of use cases.

wesm commented 6 years ago

Additionally, we have changed 1-dimensional array sizes to use int64 almost a year ago https://github.com/apache/arrow/commit/ced9d766d70e84c4d0542c6f5d9bd57faf10781d#diff-520b20e87eb508faa3cc7aa9855030d7

skrah commented 6 years ago

On Wed, Mar 07, 2018 at 07:49:25PM +0000, Wes McKinney wrote:

Additionally, we have changed 1-dimensional array sizes to use int64 almost a year ago https://github.com/apache/arrow/commit/ced9d766d70e84c4d0542c6f5d9bd57faf10781d#diff-520b20e87eb508faa3cc7aa9855030d7

Thanks, I think I looked at the Arrow layout shortly before that date. We still need n-dimensional large fixed arrays though.

BTW, our ragged array type also uses int32_t, and is Arrow compatible in the offset and bitmap layout. I agree that it makes sense to use int32_t for the offsets, since they can get quite large.

wesm commented 6 years ago

Makes sense. In our experience, Tensors are a different beast and use case from structured columnar data, so we are handling ndarrays / tensors with metadata separate from 1D record batches: https://github.com/apache/arrow/blob/master/format/Tensor.fbs#L35. These use 64-bit shape and strides. This is used actively by the Ray project

teoliphant commented 6 years ago

Ideally we can convert without memory copying. I have not seen that arrow is general enough.

Ndtypes should support an arrow memory container. Arrow does not support general ndtypes descriptors.

One simplistic analogy: Arrow is a generalization of pandas. Xnd is a generalization of NumPy. Arrow could use libxnd for functionality, and the two should be complementary. There is overlap but the tradeoffs are quite different.

Also, if we construct libgumath correctly, Arrow could use it for chaining graphs of functionality -- or at least the two could be used together.

I am happy to be shown how the vision of xnd could be accomplished with just arrow. But, Stefan and I don't see that yet.

On Mar 7, 2018 8:08 PM, "Naveen Michaud-Agrawal" notifications@github.com wrote:

Ah ok, it seems most of the underlying types and structured types (lists, structs, ragged hierarchies) are already well supported in arrow. Anyway looking at the docs it should be pretty cheap to convert from one format to another through memory copying.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/plures/xnd/issues/6#issuecomment-371225023, or mute the thread https://github.com/notifications/unsubscribe-auth/AAPjoD9AiYjW59ku_oe0cFcp03xkuO8gks5tcB7xgaJpZM4SgnHR .

wesm commented 6 years ago

There is overlap but the tradeoffs are quite different.

Agreed. We should look for opportunities to share code and infrastructure where possible. Note that the Arrow columnar format is but one type of data structure that we support -- it's a very important one for databases, Spark, pandas, etc. In order to implement zero-overhead memory sharing for structured datasets, many lower-levels of platform tooling must be created. I want to make sure we don't miss out on the collaboration opportunities for not having agreed on a "universal" data structure. The Arrow columnar format was never intended to be a universal data structure.

skrah commented 6 years ago

[Repost because of broken markdown in email replies.]

I agree it would be nice to have a standard low-level data structure. For C, Ndtypes is pretty standard: It describes all basic C types (including nested types, pointer types) using a regular algebraic data type. One could use it e.g. for the type part in "Modern Compiler Implementation in C" (Appel et al.) without changes.

The tagged union convention is also the same as in the quoted book, and incidentally also the same as in Python's own compiler (whose author probably also read Appel, given that he used ASDL to describe the AST :).

I think columnar data can be modeled in ndtypes as a record of arrays. The example from the Arrow home page:

>>> data = {'session_id': [1331247700, 1331247702, 1331247709, 1331247799],
...         'timestamp': [1515529735.4895875, 1515529746.2128427, 1515529756.4485607, 1515529766.2181058],
...         'source_ip': ['8.8.8.100', '100.2.0.11', '99.101.22.222', '12.100.111.200']}
x = xnd(data)
>>> x.type
ndt("{session_id : 4 * int64, timestamp : 4 * float64, source_ip : 4 * string}")

There is categorical data, the representation of which is an array of indices into the categories:

>>> levels = ['January', 'August', 'December', None]
>>> x = xnd(['January', 'January', None, 'December', 'August', 'December', 'December'], levels=levels)
>>> x.value
['January', 'January', None, 'December', 'August', 'December', 'December']
>>> x.type
ndt("7 * categorical('January', 'August', 'December', NA)")

There are nested tuples, which are more general than ragged arrays:

>>> unbalanced_tree = (((1.0, 2.0), (3.0)), 4.0, ((5.0, 6.0, 7.0), ()))
>>> x = xnd(unbalanced_tree)
>>> x.value
(((1.0, 2.0), 3.0), 4.0, ((5.0, 6.0, 7.0), ()))
>>> x.type
ndt("(((float64, float64), float64), float64, ((float64, float64, float64), ()))")
>>>
>>> x[0]
xnd(((1.0, 2.0), 3.0), type="((float64, float64), float64)")
>>> x[0][0]
xnd((1.0, 2.0), type="(float64, float64)")

In general, xnd just takes any basic Python value -- nested or not -- and unpacks it to typed memory.

wesm commented 6 years ago

I am skeptical about the idea of an all-powerful / can-describe-anything data structure. With generalization comes added complexity for computational frameworks and producers/consumers.

@teoliphant stated "I have not seen that arrow is general enough." What does this mean? At this point, the Arrow columnar format is only one part of a much larger project. I think this means "the Arrow columnar format is not a universal data structure", which I agree with, but that was never the goal. I see the work here in libndtypes / xnd as complementary and not in conflict -- there are problems being solved (extending the notion of NumPy's structured dtypes to support things like variable-length cells and pointers) that were never in scope for Arrow's columnar format.

The columnar format was the focus of the project at the outset because that was the most immediate and high value problem to solve around data interoperability and in-memory analytics. The rapid uptake of the project and developer community growth suggests we made a good bet on this.

At this point Arrow a multi-layered project of memory management, shared memory (Plasma), metadata serialization, IO, streaming messaging, memory formats (including the columnar format), file format interop, computation kernels, etc. The work that is being done here could even become an additional component of Apache Arrow if you wanted to work with a larger developer community. At minimum it would be helpful to have a broader design/architecture discourse about problems and use cases in a public venue.

pearu commented 5 years ago

Ideally we can convert without memory copying.

As demonstrated in ArrayViews, one can wrap Arrow arrays with xnd, and vice versa, without memory copying. However, currently the wrapping does not support null buffers (Arrow) or bitmaps (xnd) because xnd does not expose bitmaps.

wesm commented 5 years ago

@pearu the cases where the memory is compatible IMHO reflect a minority (and a small minority at that) of real world use of Arrow. To suggest "compatible, with some exceptions" will mislead people

pearu commented 5 years ago

@wesm, I am not sure that I follow your comments meaning. If you refer to the fact xnd does not expose bitmaps, then this issue can be easily fixed as xnd bitmap is compatible with Arrow null buffer. I guess the reason of not exposing xnd bitmaps is that it is consider as internal structure while in Arrow null buffer is not that.

wesm commented 5 years ago

You stated

As demonstrated in ArrayViews, one can wrap Arrow arrays with xnd, and vice versa, without memory copying

I think it's worth making a list of different Arrow use cases:

Primitive (C-like) arrays with no nulls
Primitive arrays with nulls
Dictionary-encoded arrays
Varbinary / utf8 (variable-length) arrays with nulls
Nested type arrays (e.g. list<binary> or list<int32>)
Unions

Do you support them all and export all of their semantics in xnd? If the answer is "no", then I think you need to qualify the statement to say that "In certain limited cases, one can wrap Arrow arrays [and expose their semantics], without memory copying"

xnd-project / libxnd

What is the use case of xnd over apache arrow? #6