wesm / pandas2

Design documents and code for the pandas 2.0 effort.
https://pandas-dev.github.io/pandas2/
306 stars 41 forks source link

Revisit Series implicit size mutability and implicit type conversions #32

Open wesm opened 7 years ago

wesm commented 7 years ago

I haven't been able to wrap my head around these behaviors:

In [10]: s = pd.Series()

In [11]: s['a'] = 7

In [12]: s
Out[12]: 
a    7
dtype: int64

or

In [2]: s = pd.Series([1, 2, 3])

In [3]: s['4'] = 'b'

In [4]: s
Out[4]: 
0    1
1    2
2    3
4    b
dtype: object

On first principles, I think these should raise KeyError and ValueError/TypeError, respectively. I'm concerned that preserving these APIs (especially implicit type changes) is going to be problematic for pandas 2.0 if our goal is to provide more precise / consistent / explicit behavior, particularly with respect to data types. If you want to change the type of something, you should cast (unless there is a standard implicit cast, e.g. int -> float in arithmetic). In the case of implicit size mutation, it seems like a recipe for bugs to me (or simply working around coding anti-patterns -- better to explicitly reindex then assign values).

Let me know other thoughts about this.

chris-b1 commented 7 years ago

Personally, I'd let the type conversion on mutation go away completely, I don't see any upside to that, especially with a first class NA value that doesn't force casts.

More ambivalent on setting with expansion - sometimes that is useful (or at least convenient), e.g., below with a MultiIndex would be kind of annoying to express with a reindex? Granted this isn't very idiomatic and allowing this does add plenty of complexity to the indexing code.

In [27]: df = pd.DataFrame({'a': [1, 2, 3, 4]},
    ...:    index=pd.MultiIndex.from_product([['a', 'b'], ['k', 'd']]))

In [28]: idx = pd.IndexSlice

In [29]: df.loc[idx['a','j'], :] = 55

In [30]: df
Out[30]: 
        a
a k   1.0
  d   2.0
b k   3.0
  d   4.0
a j  55.0
shoyer commented 7 years ago

I'm strongly supportive here -- this is a variation of #21.

If we disable automatic reindexing in __setitem__, for consistency I think we should do the same for __getitem__, too. We've been quite happy with only an explicit .reindex() method for doing reindexing in xarray.

jreback commented 7 years ago

@shoyer IIUC xarray is mostly immutable, so this is quite different than pandas, where the core objects are mutable by-default / definition. In any event I agree we have too much magic currently on implicit type conversions.

However there is a special case that needs addressing. In particular, even though this is a discouraged pattern, its used a lot. e.g.

In [3]: s = Series()
   ...: s['a'] = 2

In [4]: s
Out[4]: 
a    2
dtype: int64

so an empty Series really can't default to dtype, and needs to be coerced on first fill. IOW, do we have a concept of .empty? This is further compilcated by residual dtypes from removals, IOW.

so this is clear

In [1]: s = Series([1,2,3])

In [2]: s[0:0]
Out[2]: Series([], dtype: int64)

but is this the same as Series()? or is Series() a dtype of 'empty' (or maybe 'any')? IOW, what happens if I now add an element to [2]?

shoyer commented 7 years ago

@shoyer IIUC xarray is mostly immutable, so this is quite different than pandas, where the core objects are mutable by-default / definition. In any event I agree we have too much magic currently on implicit type conversions.

Xarray is immutable in the same way proposed here for pandas: DataFrame columns can be mutated and existing rows can be mutated, but rows cannot be inserted or removed. This requires users to do things in a reasonable efficient way, because, like pandas, data is ultimately stored in non-resizeable arrays.

Implicit size mutability is problematic because a naive user might expect that data is backed by a data structure that can be efficiently resized, like Python's list, but this isn't the case (the .append method is problematic for similar reasons). Now that we are redoing the guts of pandas in C++, we could actually store data in resizeable arrays, but on the other hand that does entail some additional complexity and memory overhead.

wesm commented 7 years ago

The fact that this is used a lot tells me there is some API deficiency:

In [3]: s = Series()
   ...: s['a'] = 2

In [4]: s
Out[4]: 
a    2
dtype: int64

Series is not a dict and was never intended to be fully substitutable for one -- the lack of an empty method may be the culprit. We should look more at this

jreback commented 7 years ago

related: https://github.com/pandas-dev/pandas/issues/9738