wesm opened this issue 8 years ago
Personally, I'd let the type conversion on mutation go away completely; I don't see any upside to it, especially with a first-class NA value that doesn't force casts.
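For reference, a minimal sketch of the implicit conversion being criticized (this is the behavior of pandas at the time of the discussion; later versions deprecate the silent upcast):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])   # int64
s[1] = np.nan              # the whole Series is silently upcast to float64
s.dtype                    # -> float64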
More ambivalent on setting with expansion - sometimes that is useful (or at least convenient), e.g., below with a MultiIndex it would be kind of annoying to express with a reindex. Granted this isn't very idiomatic, and allowing it does add plenty of complexity to the indexing code.
In [27]: df = pd.DataFrame({'a': [1, 2, 3, 4]},
...: index=pd.MultiIndex.from_product([['a', 'b'], ['k', 'd']]))
In [28]: idx = pd.IndexSlice
In [29]: df.loc[idx['a','j'], :] = 55
In [30]: df
Out[30]:
        a
a k   1.0
  d   2.0
b k   3.0
  d   4.0
a j  55.0
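For comparison, here is a sketch of how the same expansion might be spelled with an explicit reindex (one guess at what "annoying to express" looks like; note the reindex introduces NaN first, which is also what forces the float64 dtype seen above):

# append the new label to the index, reindex, then assign
new_index = df.index.append(pd.MultiIndex.from_tuples([('a', 'j')]))
df = df.reindex(new_index)
df.loc[idx['a', 'j'], :] = 55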
I'm strongly supportive here -- this is a variation of #21.
If we disable automatic reindexing in __setitem__, for consistency I think we should do the same for __getitem__, too. We've been quite happy with only an explicit .reindex() method for doing reindexing in xarray.
@shoyer IIUC xarray is mostly immutable, so this is quite different from pandas, where the core objects are mutable by default / by definition. In any event I agree we have too much magic currently on implicit type conversions.
However, there is a special case that needs addressing. In particular, even though this is a discouraged pattern, it's used a lot, e.g.
In [3]: s = Series()
...: s['a'] = 2
In [4]: s
Out[4]:
a 2
dtype: int64
so an empty Series really can't default to a dtype, and needs to be coerced on first fill. IOW, do we have a concept of .empty? This is further complicated by residual dtypes from removals. IOW, this is clear:
In [1]: s = Series([1,2,3])
In [2]: s[0:0]
Out[2]: Series([], dtype: int64)
but is this the same as Series()? Or is Series() a dtype of 'empty' (or maybe 'any')?
IOW, what happens if I now add an element to the result in [2]?
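To make the question concrete, a small sketch (the exact dtypes vary across pandas versions, so take the comments as approximate):

import pandas as pd

s1 = pd.Series()                 # dtype is version-dependent (historically
                                 # float64, later object) -- no 'empty' dtype
s2 = pd.Series([1, 2, 3])[0:0]   # empty, but the residual int64 dtype survives

# the open question: does s2 keep its residual dtype when an element is added,
# or is the dtype re-coerced from the assigned value?
s2['a'] = 2.5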
@shoyer IIUC xarray is mostly immutable, so this is quite different from pandas, where the core objects are mutable by default / by definition. In any event I agree we have too much magic currently on implicit type conversions.
Xarray is immutable in the same ways proposed here for pandas: DataFrame columns can be mutated and existing rows can be mutated, but rows cannot be inserted or removed. This requires users to do things in a reasonably efficient way because, as in pandas, data is ultimately stored in non-resizeable arrays.
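Roughly, that model looks like the following in xarray (a sketch; the exact exception raised on the commented-out line may vary by version):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(4.0), dims='x', coords={'x': ['a', 'b', 'c', 'd']})

da.loc[{'x': 'b'}] = 99.0    # mutating an existing element works
# da.loc[{'x': 'z'}] = 1.0   # setting with expansion does not -- this raises
# growing requires an explicit, allocating operation instead:
da2 = xr.concat(
    [da, xr.DataArray([1.0], dims='x', coords={'x': ['z']})], dim='x')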
Implicit size mutability is problematic because a naive user might expect that data is backed by a data structure that can be efficiently resized, like Python's list, but this isn't the case (the .append method is problematic for similar reasons). Now that we are redoing the guts of pandas in C++, we could actually store data in resizeable arrays, but on the other hand that does entail some additional complexity and memory overhead.
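A sketch of how this bites naive users, assuming the usual copy-on-expansion internals (exact costs depend on the implementation):

import pandas as pd

# anti-pattern: growing a Series element by element; every expansion
# reallocates and copies the underlying array, so the loop is roughly quadratic
s = pd.Series(dtype='float64')
for i in range(1000):
    s[str(i)] = float(i)

# cheaper: accumulate in a genuinely resizable container, construct once
s = pd.Series([float(i) for i in range(1000)],
              index=[str(i) for i in range(1000)])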
The fact that this is used a lot tells me there is some API deficiency:
In [3]: s = Series()
...: s['a'] = 2
In [4]: s
Out[4]:
a 2
dtype: int64
Series is not a dict and was never intended to be fully substitutable for one -- the lack of an empty method may be the culprit. We should look more at this.
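As a sketch of what the explicit alternative looks like today (not a proposed API), one can accumulate in an actual dict and construct once, or spell out the dtype of an initially empty Series:

import pandas as pd

# accumulate in a dict, construct once -- the dtype is inferred a single time
# instead of being re-coerced on each insertion
data = {}
data['a'] = 2
s = pd.Series(data)              # int64

# or be explicit about the dtype of an initially empty Series
s_empty = pd.Series([], dtype='int64')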
I haven't been able to wrap my head around these behaviors: setting with a label that is not in the index, or an assignment that implicitly changes the dtype.
On first principles, I think these should raise KeyError and ValueError/TypeError, respectively. I'm concerned that preserving these APIs (especially implicit type changes) is going to be problematic for pandas 2.0 if our goal is to provide more precise / consistent / explicit behavior, particularly with respect to data types. If you want to change the type of something, you should cast (unless there is a standard implicit cast, e.g. int -> float in arithmetic). In the case of implicit size mutation, it seems like a recipe for bugs to me (or simply working around coding anti-patterns -- better to explicitly reindex then assign values).
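A hedged illustration of the kind of cases meant (the specific snippets here are illustrative assumptions, not taken from the thread):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])   # int64

# 1) setting with a label not in the index: today this silently expands;
#    under the proposal it would raise KeyError
s['d'] = 4

# 2) assignment that implicitly changes the dtype: today this silently
#    upcasts to float64; under the proposal it would raise ValueError/TypeError
s['a'] = np.nan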
Let me know other thoughts about this.