xnd-project / libxnd

Subsumed into xnd
https://xnd.io/
BSD 3-Clause "New" or "Revised" License
80 stars 12 forks source link

Store xnd objects in plasma #4

Closed saulshanabrook closed 6 years ago

saulshanabrook commented 6 years ago

We should be able to store xnd objects in the Plasma store, similar to how you can store arrow memory. In the blog post introducing Apache Arrow's usage with Plasma, they outline the motivation for the store. Similar to arrow, it would be beneficial to store xnd objects, so that multiple processes on on one computer could all access them without copying or serialization.

saulshanabrook commented 6 years ago

Originally I thought I could store an xnd object in Plasma without copying, but I no longer believe this to be the case. The create command returns a new memory buffer that Plasma creates and hands you a pointer to. It doesn't seem to support taking ownership over an existing piece of memory.

This does make sense, since Plasma stores immutable data.

teoliphant commented 6 years ago

I think the idea is that you create an object in plasma, then seal it. Plasma becomes the "malloc". It is the owner of the shared memory. This makes sense.

1) We need to make sure xnd can take a reference to other memory and 2) We need to make sure that plasma can accept ndtypes descriptions for the memory

-Travis


Travis Oliphant Quansight

On Mon, Mar 5, 2018 at 11:25 AM, Saul Shanabrook notifications@github.com wrote:

Originally I thought I could store an xnd object in Plasma without copying, but I no longer believe this to be the case. The create command returns a new memory buffer that Plasma creates and hands you a pointer to. It doesn't seem to support taking ownership over an existing piece of memory.

This does make sense, since Plasma stores immutable data.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/plures/xnd/issues/4#issuecomment-370495933, or mute the thread https://github.com/notifications/unsubscribe-auth/AAPjoL45TQuxr8rAWByFsriZPnRU2ubTks5tbXTvgaJpZM4Scd2O .

saulshanabrook commented 6 years ago

Yeah that makes sense.

So do we have to be able to serialize the XndObject to that contiguous buffer?

typedef struct {
    PyObject_HEAD
    MemoryBlockObject *mblock; /* owner of the primary type and memory block */
    PyObject *type;            /* owner of the current type */
    xnd_t xnd;                 /* typed view, does not own anything */
} XndObject;

In trying to get a sense of exactly where the memory is stored, is it possible to inspect these fields in a Python REPL on an an xnd instance?

skrah commented 6 years ago

On Mon, Mar 05, 2018 at 06:18:54PM +0000, Saul Shanabrook wrote:

So do we have to be able to serialize the XndObject to that contiguous buffer?

As I understand this https://arrow.apache.org/docs/python/plasma.html#putting-and-getting-python-objects

"This works with all Python objects supported by the Arrow Python object serialization.",

we'd need Arrow serialization. I'm not familiar with plasma though.

teoliphant commented 6 years ago

We don't want to serialize anything (except the description of the buffer). We want to

1) allocate memory in plasma and ensure that the ndtypes description is also stored. 2) get a handle the to the buffer and place it in the xnd container 3) get the ndtypes description of the buffer and ensure the xnd container has the same one.

-Travis


Travis Oliphant Quansight

On Mon, Mar 5, 2018 at 12:18 PM, Saul Shanabrook notifications@github.com wrote:

Yeah that makes sense.

So do we have to be able to serialize the XndObject to that contiguous buffer?

typedef struct { PyObject_HEAD MemoryBlockObject mblock; / owner of the primary type and memory block / PyObject type; / owner of the current type / xnd_t xnd; / typed view, does not own anything / } XndObject;

In trying to get a sense of exactly where the memory is stored, is it possible to inspect these fields in a Python REPL on an an xnd instance?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/plures/xnd/issues/4#issuecomment-370512504, or mute the thread https://github.com/notifications/unsubscribe-auth/AAPjoD4qwbJ2jUel_gPf4f2m5RuHOUykks5tbYE4gaJpZM4Scd2O .

saulshanabrook commented 6 years ago

So would the desired user flow be something like:

1) Create plasma client connected to store 2) Tell xnd to use this client 3) Create some containers with xnd 4) Retrieve their plasma ids 5) Start another process 6) Create plasma client again 7) Get access to xnd containers using plasma ids

Right now, memory allocation in xnd uses the ndt_aligned_calloc function. This happens when creating a new xnd container, when creating a Ref, and when creating bytes.

If we could get a pointer back from Plasma's create function, then we could use that instead of ndt_aligned_calloc. If we are storying bytes or references, then this means storing multiple Plasma objects.

Is it right to assume that different plasma clients could have different pointer addresses for the same plasma object? Because we are using virtual memory?

If this is the case, then it might not be good to store the pointer for a Ref directly in the xnd data, but instead store the plasma ID instead (same with bytes).

saulshanabrook commented 6 years ago

I opened an issue asking if it was possible to get a pointer back from the plasma store, to write to https://github.com/apache/arrow/issues/1706

skrah commented 6 years ago

Here is my current take on it.

Option 1:

I would not like to replace existing allocation functions. Instead, I'd like to proceed cautiously and create a separate pxnd (p for parallel) object.

We would only support pointer-free types. The pxnd object would have a datashape Python string object and a plasma object id and could therefore be pickled.

Creating this object would allocate a data block in plasma.

I assume that the picklable pxndobject would be sent via a pipe to consumer processes.

In such a process, the consumer would call something like pxnd.get(), which would use the existing MemoryBlockObject in the same way as the from_buffer function and return a regular xnd object.

That is to say, the memory in the xnd_master buffer is marked as external, but creation of the xnd object largely proceeds in the usual way. This is the same way the existing PEP-3118 imports are handled.

This all hinges on the possibility of getting a raw pointer from the plasma data store.

If Arrow headers are required for this, I'd like to check for these headers in ./configure and make the compilation of the pxnd object conditional on the presence of the headers.

In general, I'd like to keep the pxnd object in a separate section or file and be able to compile and use xnd without plasma support.

Option 2:

Only allow NumPy types and hope that the existing plasma functions take a PEP-3118 buffer. In that case, we'd need to add PEP-3118 exporting facilities to xnd.

teoliphant commented 6 years ago

Option 2 is a non-starter in my mind as we need something that extends NumPy types. The data-shape and xnd should be usable where PEP-3118 is not.

A variant for Option 1 is fine in my mind too. Plasma must be an optional part. It is only an important use-case and not something xnd should rely on.

Because XND can accept foreign buffers as it's memory, this should be straightforward. You just have another module for interfacing with plasma. Introducing ndtypes to plasma is a good opportunity to see how ndtypes and Arrow interact.


Travis Oliphant Quansight

On Tue, Mar 6, 2018 at 6:19 AM, Stefan Krah notifications@github.com wrote:

Here is my current take on it.

Option 1:

I would not like to replace existing allocation functions. Instead, I'd like to proceed cautiously and create a separate pxnd (p for parallel) object.

We would only support pointer-free types. The pxnd object would have a datashape Python string object and a plasma object id and could therefore be pickled.

Creating this object would allocate a data block in plasma.

I assume that the picklable `pxnd``object would be sent via a pipe to consumer processes.

In such a process, the consumer would call something like pxnd.get(), which would use the existing MemoryBlockObject in the same way as the from_buffer function and return a regular xnd object.

That is to say, the memory in the xnd_master buffer is marked as external, but creation of the xnd object largely proceeds in the usual way. This is the same way the existing PEP-3118 imports are handled.

This all hinges on the possibility of getting a raw pointer from the plasma data store.

If Arrow headers are required for this, I'd like to check for these headers in ./configure and make the compilation of the pxnd object conditional on the presence of the headers.

In general, I'd like to keep the pxnd object in a separate section or file and be able to compile and use xnd without plasma support.

Option 2:

Only allow NumPy types and hope that the existing plasma functions take a PEP-3118 buffer. In that case, we'd need to add PEP-3118 exporting facilities to xnd.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/plures/xnd/issues/4#issuecomment-370763793, or mute the thread https://github.com/notifications/unsubscribe-auth/AAPjoEbuyKM_mcI1lN7_BkSLhfYfTogDks5tbn7tgaJpZM4Scd2O .

skrah commented 6 years ago

On Tue, Mar 06, 2018 at 01:12:04PM +0000, Travis E. Oliphant wrote:

Option 2 is a non-starter in my mind as we need something that extends NumPy types. The data-shape and xnd should be usable where PEP-3118 is not.

Agreed. It's more of a fallback if some unknown problems arise.

saulshanabrook commented 6 years ago

The C++ plasma code looks like it can give you back a pointer to the buffer(https://arrow.apache.org/docs/cpp/classplasma_1_1_plasma_client.html#ac18ab9cc792c620a97a3dcb165e0ecd7), but I am not sure if it's possible for use that, since we are using C and they don't export anything to C.

The Cython wrapper might also be helpful, but I also don't think it's possible for us to use that in our C code, since they don't make anything public. I opened a thread on the arrow listserve to see if they have any guidance on using the plasma API from C.

teoliphant commented 6 years ago

We may have to make a pull request to add a simple C-export of the buffer. Alternatively the C++ code that binds libxnd to plasma can create that API. We can start there. This should be a separate library from libxnd.

-Travis


Travis Oliphant Quansight

On Tue, Mar 6, 2018 at 4:01 PM, Saul Shanabrook notifications@github.com wrote:

The C++ plasma code looks like it can give you back a pointer to the buffer(https://arrow.apache.org/docs/cpp/classplasma_1_1_ plasma_client.html#ac18ab9cc792c620a97a3dcb165e0ecd7), but I am not sure if it's possible for use that, since we are using C and they don't export anything to C.

The Cython wrapper https://github.com/apache/arrow/blob/master/python/pyarrow/plasma.pyx might also be helpful, but I also don't think it's possible for us to use that in our C code, since they don't make anything public. I opened a thread https://lists.apache.org/thread.html/0ba7b568fd29a1250c73103cf7da9e6f5cd54498184ef8a79d49b408@%3Cdev.arrow.apache.org%3E on the arrow listserve to see if they have any guidance on using the plasma API from C.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/plures/xnd/issues/4#issuecomment-370944392, or mute the thread https://github.com/notifications/unsubscribe-auth/AAPjoFW0fo8oRHDjrdot_CGa12pq9A1bks5tbwcggaJpZM4Scd2O .

wesm commented 6 years ago

adding @robertnishihara @pcmoritz @pitrou to go along with the mailing list thread. Plasma buffers implement the buffer protocol so it seems like you should be able to use that

pitrou commented 6 years ago

Is xnd able to use foreign buffers as backing memory? If not, is there a reason why? Basically Plasma needs to allocate the memory itself in a certain way, to be able to hand it to its various client processes.

(I'm assuming there may be limitations such as contiguity or restriction to certain well-behaved types)

pitrou commented 6 years ago

To be clear, my question is whether xnd supports the equivalent of np.frombuffer() to create xnd objects.

wesm commented 6 years ago

There are some comments here about type metadata:

2) We need to make sure that plasma can accept ndtypes descriptions for the memory

Plasma is unconcerned with the contents of the memory. So you will need to take responsibility for writing a serialized metadata header to prefix the xnd memory. We are using Flatbuffers for our serialized metadata for Arrow

saulshanabrook commented 6 years ago

Is xnd able to use foreign buffers as backing memory?

Yes. So we should be able to call the Plasma Python API from our C code that exposes our own Python API. By accessing that buffer in C, then we can get access to the pointer.

I will try that approach before moving forward on any kind of C wrapper around the Plasma C++ API.

skrah commented 6 years ago

On Wed, Mar 07, 2018 at 04:03:56PM +0000, Antoine Pitrou wrote:

Is xnd able to use foreign buffers as backing memory?

Yes, there's xnd.from_buffer(), using PEP-3118. xnd has the MemoryBlockObject that manages foreign memory.

teoliphant commented 6 years ago

The problem with using from_buffer is that PEP-3118 does not support full ndtypes. What I'd like to see is plasma support for any structure that can be described by libndtypes (i.e. datashape). We will need this eventually. It would be good to get collaborating early if possible.

saulshanabrook commented 6 years ago

Does plasma set the format of the the buffer that is returned?

wesm commented 6 years ago

No -- the Plasma data is just bytes. You'll need to define a serialized data header for xnd data. The way we do it for Arrow objects is:

See https://github.com/apache/arrow/blob/master/format/IPC.md#encapsulated-message-format

saulshanabrook commented 6 years ago

No -- the Plasma data is just bytes. You'll need to define a serialized data header for xnd data.

@wesm OK, that makes sense. Just to make sure I understand correctly, the plasma python API provides a PEP 3118 buffer but without the shape, format, strides, ndim, suboffets fields, so it's really just a pointer to the data. So you can write whatever data format you want, and any type information if you want to use that later to read the data. i.e. there isn't anything arrow specific in plasma.

Is xnd able to use foreign buffers as backing memory? If not, is there a reason why? Basically Plasma needs to allocate the memory itself in a certain way, to be able to hand it to its various client processes.

@pitrou Ideally, we would access Plasma with libxnd, which isn't specific to Python. That is more direct than going through the Python API and also other language bindings for libxnd will be able to use that integration.

wesm commented 6 years ago

So you can write whatever data format you want, and any type information if you want to use that later to read the data. i.e. there isn't anything arrow specific in plasma.

Yep, that's right. We aren't forcing anyone to use any particular metadata representation. So you're free to define a metadata prefix that uses libndtypes, or anything you like

saulshanabrook commented 6 years ago

I am closing this for now, since I moved the plasma integration to it's own repo. https://github.com/plures/pxnd If I hit any roadblocks there that I think need to be addressed on the Arrow side, I will open an issue on that repo and reach out on the mailing list.

wesm commented 6 years ago

My two cents: I suspect that a more monorepo-like approach will make your life much easier rather than having to maintain interconnected C APIs amongst various Python packages with C libraries / C extensions. You can also make changes to multiple components at once without causing your CI setup to fall apart (i.e. if you have lots of repos, you may end up with multiple PRs related to a single change, each of which may cause the CI for any of the repos in the stack to break temporarily)

skrah commented 6 years ago

This is true. It worked because I was really the only developer up to now. :)

I would like to stabilize gumath a bit -- currently it's sometimes convenient to change ndtypes/xnd and worry later about gumath. But in a couple of months I think we can have something like a monorepo.

skrah commented 6 years ago

Also, if we want to add __truediv__ etc. to xnd objects instead of just using libgumath functions like truediv(), we have a circular relationship between xnd/gumath.

pcmoritz commented 6 years ago

Hey Stefan,

Why don't we move the plasma C bindings into the arrow repo as a first step now? I'm happy to help with setting up the build system. They might also be useful for other people!

-- Philipp.

On Fri, Mar 16, 2018 at 10:27 AM, Stefan Krah notifications@github.com wrote:

Also, if we want to add truediv etc. to xnd objects instead of just libgumath functions like truediv(), we have a circular relationship between xnd/gumath.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/plures/xnd/issues/4#issuecomment-373786549, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG6pMQq7TZtWjBxuXi-1LqUHAOf_fStks5te_XsgaJpZM4Scd2O .

skrah commented 6 years ago

@pcmoritz Do you mean the part that is independent from xnd? I think that is a good idea, it would be very nice to have general plasma C bindings.

pcmoritz commented 6 years ago

Yeah, I mean this part here: https://github.com/plures/pxnd/tree/master/ libplasma

We can put it here: https://github.com/apache/arrow/tree/master/cpp/src/ plasma (for example as libplasma.h)

Feel free to create a PR and ping me if you need help with CMake.

On Fri, Mar 16, 2018 at 10:58 AM, Stefan Krah notifications@github.com wrote:

@pcmoritz https://github.com/pcmoritz Do you mean the part that is independent from xnd? I think that is a good idea, it would be very nice to have general plasma C bindings.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/plures/xnd/issues/4#issuecomment-373795803, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG6pN_s6EiH6vgMOKqBRFEVKI_Z0E9Eks5te_1CgaJpZM4Scd2O .

saulshanabrook commented 6 years ago

@wesm Yeah I am finding I might need to make a few changes to xnd as I am developing pxnd, so a monorepo might make sense eventually. It is nice to have the build separate so that xnd doesn't have to depend on Plasma and since authorship is so far split between them.

@pcmoritz Sure, I can do that soon. I am still iterating on them a lot to make sure they work correctly for us. It's my first time writing this sort of code, so I wanna see it working end to end before I believe it's correct. Once they are stable, I am happy to move them out, probably next week or the week after.