wesm / pandas2

Design documents and code for the pandas 2.0 effort.
https://pandas-dev.github.io/pandas2/
306 stars 41 forks source link

DESIGN: NA values in floating point arrays #46

Open wesm opened 7 years ago

wesm commented 7 years ago

Do we want to continue to use NaN? There are possible computational benefits to doing so, but we have the opportunity with pandas 2.0 to change to using bitmaps everywhere, which brings a lot of consistency to how missing data is handled (for example: isnull becomes free under this scenario). We may want to do some benchmarking to understand the overhead of using bitmaps in arithmetic and aggregations (rather than letting the CPU propagate NaN where relevant).

One benefit of the bitmap route is that we can use hardware popcount to skip null checking on groups of 64 or 128 bits at a time (or more, if there are AVX popcount instructions available, not actually sure about this), so the performance in aggregations may actually get a lot better on mostly non-null data.

jreback commented 7 years ago

I think there are a number of considerations here:

1) we need to do input validation on mutating generally, because we DO allow other missing indicators (e.g. python None), so you could set a float dtyped Array with a list / object array that includes None and/or np.nan. This validation needs to then include a transfer step to the internal Array repr (bitmask / sentinel). performance is a bit less sensitive here. 2) are float32 and float64, NaN identical? 3) exporting is easy if we guarantee the correct sentinel, no copies need be made; I think this is a prime performance consideration. 4) the code is going to be simpler if we choose only bitmasks (and no sentinels at all), across all dtypes. 5) There can be some upstream confusion if you have bitmasks AND allow np.nans, meaning it is a value, but traditionally is a null, so you have 2 concepts of null then.

A possibility is actually to fill in np.nan where we have nulls (e.g. when set / created); this way we have the advantages of zero-copy export & general advantages of bitmasks, at the cost of some complexity and having to keep the bit / value in sync.

prob just muddying the waters......

wesm commented 7 years ago

We also will consistently want to track null counts, so having a single code path for doing anything with nulls would be nice (i.e. bitmaps everwhere). We can think some more about this. If we're making a performance argument for continuing to use NaN, we should definitely have the hard numbers before making a decision based on that.

wesm commented 7 years ago

My inclination on thinking some about this would be to use bitmaps for everything. The downside of this is that people have code littered with

s[...] = np.nan

they could change this to

s[...] = pd.NA

to get the same behavior. We could always alias NaN to NA, or make it configurable so that people can run their code and get a warning if they're assigning NaN values somewhere in their code.

jreback commented 7 years ago

why would this matter? we need to check nulls on assignment anyhow (what if I assign None)? or try to do some incorrect type assignment

leifwalsh commented 7 years ago

I support bitmaps everywhere, for what it's worth.

wesm commented 7 years ago

@jreback we could make the design concession that NaN is treated as NA by default, so that these are all equivalent

from pandas import NA
s[...] = NA
s[...] = np.NaN
s[...] = None

I know that this is already the way things are (save for the lack of an NA singleton value), but would want to make sure that we don't wish to introduce a distinction between NaN and NA. I'm kind of betting not given the amount of legacy code out there that uses NaN to mask out values.

chrisaycock commented 7 years ago

How will the NA appear in an array when displayed to the user? Currently we have

[1.0, nan, 3.0]

Is the plan that any numerical array or column will have NA in it when displayed?

jreback commented 7 years ago

well as I said above, we have coerce on the setitem (e.g. translate and null-like to pa.NaN) or use a bit-mask or both, no matter how its internally represented.

I think that's an orthogonal, mainly perf decision.

E.g. also consider if pd.NaT is set (on a datetimelike), same principle. we currently accept None & np,nan for that matter. Just a convenience feature. If we are going to eliminate using a generic np.nan/None to set missing values then that is a larger discussion about changing the API.

wesm commented 7 years ago

@chrisaycock not sure, but my guess is the display will become NA rather than exposing the internal representation to the user

wesm commented 7 years ago

@jreback in setitem we can definitely map a set of sentinel values (NaN, NaT, None) to NA, it's more a memory representation question -- if incoming data has NaNs we will need to construct a bitmap and then ignore the NaN values from that point forward.

shoyer commented 7 years ago

Even if we use a separate bit mask, we could do zero copy conversions to arrays with NaNs in most cases, because pandas won't care about these values, so we can modify the original array (as long as we have ownership). We could even set a flag so we don't have to do this more than once. I think this solves most of the issues for exporting data.

wesm commented 7 years ago

That's a good point. Mutating the null slots is safe, so that would safe on memory / copying

jreback commented 7 years ago

@shoyer that was my point 5) however, you introduce more complexity in the code. Then you have to make sure to actually set the value if a bit is changed from not-null to null internally (rather than just do it once on exporting). It may not be that big of a deal (and may or may not matter for perf). I would vote for keeping the code as simple as possible and so having only bitmaps may be preferable.

shoyer commented 7 years ago

that was my point 5) however, you introduce more complexity in the code. Then you have to make sure to actually set the value if a bit is changed from not-null to null internally (rather than just do it once on exporting).

We're already planning on having flags that are invalidated whenever there is mutation (#27). This would just be another one, e.g., null_values_are_nan. If that flag is unset or false, then exporting a float array to numpy would trigger setting all null values to NaN and set the flag to true. (If we don't have ownership over the source buffer, then we'd make a copy first.) This feels pretty straightforward to me -- you don't need to worry about maintaining both types of NA markers, just invalidating one more flag whenever values change.

chris-b1 commented 7 years ago

No idea if this is a representative (probably not even particularly good code) but I was curious so here's a reduction microbenchmark. This is MSVC, with a only a 64 bit popcnt, but at least in this case checking NaN was cheaper in any case.

https://gist.github.com/chris-b1/59743744458a2e0a628f84a817b14a50

N = 100000000 missing = 0 blocks = 1
Bitmask / NAN = 1.14141
N = 100000000 missing = 0 blocks = 2
Bitmask / NAN = 1.0297
N = 100000000 missing = 0 blocks = 4
Bitmask / NAN = 1.05

N = 100000000 missing = 0.01 blocks = 1
Bitmask / NAN = 1.30476
N = 100000000 missing = 0.01 blocks = 2
Bitmask / NAN = 1.34951
N = 100000000 missing = 0.01 blocks = 4
Bitmask / NAN = 1.40952

N = 100000000 missing = 0.05 blocks = 1
Bitmask / NAN = 1.40559
N = 100000000 missing = 0.05 blocks = 2
Bitmask / NAN = 1.3557
N = 100000000 missing = 0.05 blocks = 4
Bitmask / NAN = 1.34058
wesm commented 7 years ago

I get somewhat different results on my machine (1.2 ghz core m5, clang 8.0)

$ gcc -O3 -msse4.2 -std=c++1y -lc++ bitmap-test.cc 
22:39 ~/code/tmp $ ./a.out 
N = 100000000 missing = 0 blocks = 1
Bitmask / NAN = 0.820946
N = 100000000 missing = 0 blocks = 2
Bitmask / NAN = 0.747475
N = 100000000 missing = 0 blocks = 4
Bitmask / NAN = 0.8
N = 100000000 missing = 0.01 blocks = 1
Bitmask / NAN = 0.81982
N = 100000000 missing = 0.01 blocks = 2
Bitmask / NAN = 1.02516
N = 100000000 missing = 0.01 blocks = 4
Bitmask / NAN = 0.98503
N = 100000000 missing = 0.05 blocks = 1
Bitmask / NAN = 1.28716
N = 100000000 missing = 0.05 blocks = 2
Bitmask / NAN = 1.1372
N = 100000000 missing = 0.05 blocks = 4
Bitmask / NAN = 1.26923
wesm commented 7 years ago

My understanding is that AVX2 instructions are even faster for popcnt, so we can check larger blocks

jreback commented 7 years ago

nice writeup by @ResidentMario http://www.residentmar.io/2016/06/12/null-and-missing-data-python.html

dhirschfeld commented 7 years ago

Just cross-referencing in case there's anything to be learnt from previous discussions: https://docs.scipy.org/doc/numpy-dev/neps/missing-data.html

(though it's also linked to from @ResidentMario's blog)

wesm commented 7 years ago

FWIW, my 6-years-later digest of NumPy's problems with missing data is that NumPy's user base is too heterogeneous. pandas's user base of "analytics" users is a smaller [EDIT: I had a typo "small"... pandas's user base is not small] subset of NumPy, but is also part of a broader ecosystem of tools, including SQL engines, parts of R, etc.

I also think that efforts to unify tensor memory layout (aka ndarrays) with columnar/tabular memory are deeply problematic. I don't see a great deal of overlapping needs between the in-memory columnar problem (i.e. 95-99% of how pandas is used) and the tensor problem (i.e. how NumPy is used for scientific computing a la MATLAB / TensorFlow).

For us, I think that simplifying the world to one-dimensional contiguous arrays with bitmaps to encode validity / nullness will make pandas much easier to understand and develop.

If you can do a zero-copy cast from numeric data to a tensor / ndarray and use math functions in NumPy, then apply a bitmap to the result (also zero-copy), then you don't have the "throw the baby out with the bathwater" problem.