wesm / pandas2

Design documents and code for the pandas 2.0 effort.
https://pandas-dev.github.io/pandas2/
306 stars 41 forks source link

[pandas 2.0] Prototype array view interface, buffer copying #40

Closed wesm closed 7 years ago

wesm commented 7 years ago

Please leave code comments on Gerrit: https://review.gerrithub.io/#/c/298035/

wesm commented 7 years ago

Here's basically my thoughts on how array reference counting and copy-on-write will work:

Construct an array and an array view (the basic array container used in pandas Series / DataFrame):

std::vector<double> values = {0, 1, 2, 3, 4, 5, 6, 7};

auto buffer = std::make_shared<Buffer>(reinterpret_cast<const uint8_t*>(values.data()),
    values_.size() * sizeof(double));

ArrayView view(std::make_shared<DoubleArray>(values_.size(), buffer));

Suppose we want to mutate view:

const Array* ap = view.data().get();
ASSERT_OK(view.EnsureMutable());
ASSERT_EQ(ap, view.data().get());

If there is only one reference to the underlying array, then EnsureMutable is a no-op. Now, let's make a copy of the view:

ArrayView view2 = view_;

Now we have:

ASSERT_EQ(2, view_.ref_count());

ASSERT_OK(view.EnsureMutable());

// View has a new array (with copied memory)
ASSERT_NE(ap, view.data().get());

// view2 has the original array
ASSERT_EQ(ap, view2.data().get());

// both have reference count 1
ASSERT_EQ(1, view_.ref_count());
ASSERT_EQ(1, view2.ref_count());

NumPy has an explicit notion of data ownership. My opinion is that in pandas there is only data sharing -- if you share a pointer to an array, you cannot mutate it without making a copy first. This prevents parents and children from having side effects on each other.

jreback commented 7 years ago

how does this reconcile with python reference semantics, e.g.

In [4]: df = tm.makeMixedDataFrame()

In [5]: df
Out[5]: 
     A    B     C          D
0  0.0  0.0  foo1 2009-01-01
1  1.0  1.0  foo2 2009-01-02
2  2.0  0.0  foo3 2009-01-05
3  3.0  1.0  foo4 2009-01-06
4  4.0  0.0  foo5 2009-01-07

In [6]: s = df['B']

In [7]: df
Out[7]: 
     A    B     C          D
0  0.0  0.0  foo1 2009-01-01
1  1.0  1.0  foo2 2009-01-02
2  2.0  0.0  foo3 2009-01-05
3  3.0  1.0  foo4 2009-01-06
4  4.0  0.0  foo5 2009-01-07

In [8]: s
Out[8]: 
0    0.0
1    1.0
2    0.0
3    1.0
4    0.0
Name: B, dtype: float64

so its clear that using __getitem__ on df increments the refcount. So mutating ops on s ([8]) will cause a copy to that object and leave the original df alone.

Further this implicates

In [9]: df2 = df

In [10]: s2 = df2['A']

mutation in df2 ([9]) WILL be the same as if I am changing df I would think. This is a shared reference (and by-definition you cannot increment the refcounter as its a python reference assignement).

Since s2 ([10]) points to df2 which points to df, I think this has the same semantics as change s ([8]).

wesm commented 7 years ago

These views are independent of Python object ref counts. So if you did

s = df['A']
s2 = s.copy()

This actually does not need to do a deepcopy of s's data, it only bumps the refcount on the underlying buffer. A Series that's created from a DataFrame shared memory ownership of the underlying array.

Creating a new Python reference to a pandas object wouldn't change the internal array reference counts.

One issue to be aware of with this framework is that chained assignment on Series may stop working, unless we go to some effort to make it continue to work, e.g.:

df['A'][5:10] = 10

But we already have

In [1]: import pandas.util.testing as tm

In [2]: df = tm.makeMixedDataFrame()

In [3]: df
Out[3]: 
     A    B     C          D
0  0.0  0.0  foo1 2009-01-01
1  1.0  1.0  foo2 2009-01-02
2  2.0  0.0  foo3 2009-01-05
3  3.0  1.0  foo4 2009-01-06
4  4.0  0.0  foo5 2009-01-07

In [4]: df['B'][2:4] = 10
[15:23:53.134 WARNING] /home/wesm/.conda/envs/pandas/bin/ipython:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  #!/home/wesm/.conda/envs/pandas/bin/python

In [5]: df
Out[5]: 
     A     B     C          D
0  0.0   0.0  foo1 2009-01-01
1  1.0   1.0  foo2 2009-01-02
2  2.0  10.0  foo3 2009-01-05
3  3.0  10.0  foo4 2009-01-06
4  4.0   0.0  foo5 2009-01-07

Frankly it would make our lives easier as developers if we deprecated this behavior (the user would be free to assign directly though, e.g. df.iloc[0, 2:4] = 10, and that would work)

leifwalsh commented 7 years ago

How does the Buffer know about the lifetime of values? Or does it copy the data once in the constructor? What's value_t?

I'm not clear (from your description, haven't looked at code) on what exactly happens when you copy a view. It looks like it copies the pointer, but then how does the refcount stay 1 and why is view still mutable? When would you actually copy the data, in the EnsureMutable call?

How would you feel about an API that has a View and a MutableView class, and you can get a zero-copy View from another View, get a singly-owned View from a MutableView, and get a MutableView from a View which copies he data iff the refcount is > 1? That is, encode mutability into the type system, rather than needing to remember to call EnsureMutable all over the place?

wesm commented 7 years ago

How does the Buffer know about the lifetime of values? Or does it copy the data once in the constructor? What's value_t?

Buffer has a virtual dtor, so it depends on what kind of buffer. Instances of the base class have no data ownership, but memory allocated from a MemoryPool does, for example https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/buffer.cc#L49

I'm not clear (from your description, haven't looked at code) on what exactly happens when you copy a view. It looks like it copies the pointer, but then how does the refcount stay 1 and why is view still mutable? When would you actually copy the data, in the EnsureMutable call?

The statement:

ArrayView view2 = view_;

Creates a copy of the std::shared_ptr member, making the reference count two. When you call EnsureMutable, it create a copy of the array if the use count is > 1, otherwise it is a no-op.

How would you feel about an API that has a View and a MutableView class, and you can get a zero-copy View from another View, get a singly-owned View from a MutableView, and get a MutableView from a View which copies he data iff the refcount is > 1? That is, encode mutability into the type system, rather than needing to remember to call EnsureMutable all over the place?

This is a slippery slope. Every pandas array container (what's inside a Series) will have an array pointer, possibly with ownership shared with another array (or table / DataFrame internals). When you invoke a libpandas function that mutates an array, you need to ask permission somehow to do so. So you could make EnsureMutable virtual and have:

mutable_view.EnsureMutable() -> OK
immutable_view.EnsureMutable() -> Error

but adding a bool mutable_ flag in the ArrayView would accomplish the same thing.

In practice very few methods in pandas have side effects -- e.g. assignments and inplace versions of functions like fillna -- so I'm not worried about having lots of usages of this "copy-on-write (maybe)" method.

The Buffer objects are another story. arrow::Buffer is immutable, whereas there is an arrow::MutableBuffer : public arrow::Buffer which has uint8_t* mutable_data() method -- these control actual physical access to memory.

It's worth contrasting these details with the way that NumPy works, which is not ideal for our purposes:

In [10]: arr = np.arange(10)

In [11]: arr2 = arr[2:5]

In [12]: arr.flags
Out[12]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In [13]: arr2.flags
Out[13]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In [14]: arr.flags.writeable = False

In [15]: arr2[:] = 12345

In [16]: arr
Out[16]: array([    0,     1, 12345, 12345, 12345,     5,     6,     7,     8,     9])

In [17]: arr2
Out[17]: array([12345, 12345, 12345])

Some obvious not good things:

The OWNDATA flag is also something you can set with the C API -- here there will only be shared_ptr's keeping track of how many objects have access to a particular array.

Relatedly, in Arrow I made the following Buffer subclass:

class PYARROW_EXPORT NumPyBuffer : public arrow::Buffer {
 public:
  NumPyBuffer(PyArrayObject* arr)
    : Buffer(nullptr, 0) {
    arr_ = arr;
    Py_INCREF(arr);

    data_ = reinterpret_cast<const uint8_t*>(PyArray_DATA(arr_));
    size_ = PyArray_SIZE(arr_) * PyArray_DESCR(arr_)->elsize;
    capacity_ = size_;
  }

  virtual ~NumPyBuffer() {
    Py_XDECREF(arr_);
  }

 private:
  PyArrayObject* arr_;
};

This lets us treat a NumPy array as a Buffer with zero copy

leifwalsh commented 7 years ago

This is a slippery slope.

...that leads to what?

I'm concerned with programming errors where you forget to call EnsureMutable but then use the mutating parts of the API anyway. Rather than calling a side-effecting method that makes sure the object you had was mutable, I'd rather you call a function that gives you a new "value" which is mutable. When you're done mutating and you want to make more views without copying, you can use the move constructor of the immutable view to ensure that the programmer gives up the right to mutate. The Buffer/MutableBuffer split in arrow sounds exactly like this, why would pandas not want this?

wesm commented 7 years ago

We should try to look at this in the actual code. In Arrow, arrays are immutable, so this is not something that the library deals with. In pandas, people expect mutability, so we need to support it in a way that is safe (i.e. parent and child can mutate their data without side effects on the other). As an aside, I'm not worried about coding errors around this -- this is why we write unit tests, and these details will not be exposed to pandas users.

In this patch, here is what a floating point array looks like:

class FloatingArray : public NumericArray {
 protected:
  FloatingArray(const std::shared_ptr<DataType>& type, int64_t length, const std::shared_ptr<Buffer>& data);
  std::shared_ptr<Buffer> data_;
};

template <typename TYPE>
class FloatingArrayImpl : public FloatingArray {
 public:
  using T = typename TYPE::c_type;

  FloatingArrayImpl(int64_t length, const std::shared_ptr<Buffer>& data);

  Status Copy(int64_t offset, int64_t length, std::shared_ptr<Array>* out) const override;
  PyObject* GetItem(int64_t i) override;
  Status SetItem(int64_t i, PyObject* val) override;

  int64_t GetNullCount() override;

  bool owns_data() const override;

  const T* data() const;
  T* mutable_data() const;
};

To support zero-copy views and slicing, we have two options:

  1. Compose the array with an offset and length. This is the ArrayView object we've been talking about
  2. Create a new Array that references the parent array's buffers

Option 2 has a number of issues, the first is that it's difficult for the buffer itself to understand the parent-child relationships. If you create a new buffer that references the parent, then the parent needs to know that the child exists (for copy-on-write). If you introduce instead an array offset, e.g.

class FloatingArray : public Array {
 private:
  std:shared_ptr<Buffer> data_;  // <-- SAME BUFFER AS PARENT
  int64_t offset_;
  int64_t length_;
};

Now we have to introduce an offset into every indexing operation (this is can't really be avoided with null bitmaps, though, because the view offset might split a byte).

My gut feeling (which won't be proven out until quite a bit of code gets written) is to use Arrays as the unit of shared reference for copy-on-write (rather than slicing or offsetting into an array's internal buffers) and instead use the Buffer/MutableBuffer distinction only for controlling pandas's manipulation of internally or externally allocated memory. For example, we can construct zero-copy views of NumPy arrays, but we cannot mutate then without copying (e.g. the NumPyBuffer above does not have a non-const reference to the NumPy array's data). Memory that pandas itself allocates would be held in a MutableBuffer, so that can be mutated as long as there is only one reference to it.

wesm commented 7 years ago

Another problem with dealing with copy-on-write inside arrays will have to do with other data structures that compose arrays, for example

class Table {
 public:
   std::shared_ptr<Array> column(int i) const;

 private:
   std::vector<std::shared_ptr<Array>> columns_;
};

If you get a column from the table, where is the need to copy-on-write made known to the array? that contrasts with

class Table {
 public:
   const ArrayView& column(int i) const;

 private:
   std::vector<ArrayView> columns_;
};

Here, the copy-on-write contract is more explicit, because there's an object that is aware of the number of references to a particular array.

wesm commented 7 years ago

@deads I'd be interested in your thoughts on this topic

wesm commented 7 years ago

I'm going to merge this as WIP and continue prototyping basic scaffolding