wesm / pandas2

Design documents and code for the pandas 2.0 effort.
https://pandas-dev.github.io/pandas2/

Separate pd2.NaT for datetime vs timedelta #74

Open jbrockmendel opened 6 years ago

jbrockmendel commented 6 years ago

A lot of headaches are caused by the fact that pd.NaT is usually a datetime but occasionally a timedelta. In some cases this leads to unavoidable internal inconsistency (pandas-dev/pandas#19124). For pandas2 it is worth considering splitting it into two distinct constants with unambiguous types.
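
To make the proposal concrete, here is a minimal standard-library sketch of two separately typed missing-value scalars. The names `DatetimeNaT` and `TimedeltaNaT` and the arithmetic rules are assumptions for illustration only; nothing here is an existing pandas or pandas2 API.

```python
import datetime as dt


class _TimedeltaNaT:
    """Hypothetical missing scalar that always behaves like a timedelta."""

    def __repr__(self):
        return "TimedeltaNaT"

    def __add__(self, other):
        # timedelta + datetime -> datetime, so the missing result is datetime-typed
        if isinstance(other, (dt.datetime, _DatetimeNaT)):
            return DatetimeNaT
        # timedelta + timedelta -> timedelta
        if isinstance(other, (dt.timedelta, _TimedeltaNaT)):
            return self
        return NotImplemented

    __radd__ = __add__


class _DatetimeNaT:
    """Hypothetical missing scalar that always behaves like a datetime."""

    def __repr__(self):
        return "DatetimeNaT"

    def __add__(self, other):
        # datetime + timedelta -> datetime; datetime + datetime stays undefined
        if isinstance(other, (dt.timedelta, _TimedeltaNaT)):
            return self
        return NotImplemented

    __radd__ = __add__

    def __sub__(self, other):
        # datetime - datetime -> timedelta
        if isinstance(other, (dt.datetime, _DatetimeNaT)):
            return TimedeltaNaT
        # datetime - timedelta -> datetime
        if isinstance(other, (dt.timedelta, _TimedeltaNaT)):
            return self
        return NotImplemented


DatetimeNaT = _DatetimeNaT()
TimedeltaNaT = _TimedeltaNaT()

print(TimedeltaNaT + dt.timedelta(days=1))          # TimedeltaNaT
print(dt.datetime(2016, 12, 31) + TimedeltaNaT)     # DatetimeNaT
print(DatetimeNaT - dt.datetime(2016, 12, 31))      # TimedeltaNaT
```

With two scalars, every operation has exactly one result type, and `DatetimeNaT + DatetimeNaT` raises `TypeError` just as adding two real datetimes does.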

chris-b1 commented 6 years ago

I believe the API in arrow/pandas2 is currently pushing in the opposite direction, using a unified NA scalar (e.g. below). However, it will probably be easier than it sounds because missing-ness is always tracked in a separate bitmap, rather than as special sentinel values.

In [132]: import pyarrow as pa

In [133]: pa.array([1, 2, None])
Out[133]: 
<pyarrow.lib.Int64Array object at 0x000000000BCBBB88>
[
  1,
  2,
  NA
]

In [134]: pa.array([1, 2, None])[-1]
Out[134]: NA

In [135]: import datetime

In [136]: pa.array([datetime.datetime(2016, 12, 31), None])
Out[136]: 
<pyarrow.lib.TimestampArray object at 0x000000000BD2CB38>
[
  Timestamp('2016-12-31 00:00:00'),
  NA
]

In [137]: pa.array([datetime.datetime(2016, 12, 31), None])[-1]
Out[137]: NA

In [138]: type(_)
Out[138]: pyarrow.lib.NAType
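
The "separate bitmap" point can be sketched in plain Python. This is a toy model, not Arrow's implementation: Arrow packs validity into bits, while this sketch uses a list of booleans, and `MaskedColumn`, `fill`, and `NA` are hypothetical names.

```python
import datetime as dt


class _NAType:
    """Single untyped missing-value scalar, shared by every column type."""

    def __repr__(self):
        return "NA"


NA = _NAType()


class MaskedColumn:
    """Toy column: values in one list, validity flags in a parallel list."""

    def __init__(self, values, fill):
        # True where a real value is present, False where it is missing.
        self.valid = [v is not None for v in values]
        # Missing slots hold an arbitrary filler; only the mask marks them missing.
        self.data = [fill if v is None else v for v in values]

    def __getitem__(self, i):
        return self.data[i] if self.valid[i] else NA


ints = MaskedColumn([1, 2, None], fill=0)
stamps = MaskedColumn([dt.datetime(2016, 12, 31), None], fill=dt.datetime.min)

print(ints[-1], stamps[-1])  # NA NA
```

Because missingness lives in the mask, no type-specific sentinel value is needed inside the data, which is why one NA scalar can serve both the integer and the timestamp column.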

jbrockmendel commented 6 years ago

@chris-b1 thanks for filling me in. Is pyarrow the repo to keep an eye on to follow pd2 development?

Big if, but IIUC what you’re discussing is how a null is represented inside an array, where the array holds a dtype. I’m talking about a scalar, where NaT + TimedeltaIndex(...) is ambiguous because NaT currently quacks as both a datetime and a timedelta.
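
The scalar ambiguity can be spelled out with a small rule table. This is only a sketch: the `ADD_RULES` table and `possible_results` helper are hypothetical and model nothing beyond the usual datetime/timedelta addition rules.

```python
# Result kind of `left + right` for the combinations where addition is defined.
ADD_RULES = {
    ("datetime", "timedelta"): "datetime",
    ("timedelta", "datetime"): "datetime",
    ("timedelta", "timedelta"): "timedelta",
}


def possible_results(scalar_kinds, array_kind):
    """Return every result kind that `scalar + array` could have."""
    return sorted(
        {
            ADD_RULES[(kind, array_kind)]
            for kind in scalar_kinds
            if (kind, array_kind) in ADD_RULES
        }
    )


# A scalar with one known kind gives exactly one answer.
print(possible_results(["timedelta"], "timedelta"))              # ['timedelta']

# A scalar that quacks as both kinds gives two, so the result dtype is ambiguous.
print(possible_results(["datetime", "timedelta"], "timedelta"))  # ['datetime', 'timedelta']
```

Inside an array the dtype picks the row of the table; a bare scalar carries no dtype, so the caller has to guess.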

chris-b1 commented 6 years ago

Yeah, the vision has evolved over time, but my current (possibly incorrect) understanding is:

Arrow issues are tracked on JIRA: https://issues.apache.org/jira/projects/ARROW/issues

In pyarrow, NA is also the scalar type. I'm not sure how this will actually work, since numeric ops etc. are not implemented yet, but in theory an operation like the following could be supported (right now it raises):


In [144]: pa.array([1, 2, 3]) + pa.NA
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-144-09f1116a2f04> in <module>()
----> 1 pa.array([1, 2, 3]) + pa.NA

TypeError: unsupported operand type(s) for +: 'pyarrow.lib.Int64Array' and 'pyarrow.lib.NAType'