xarray-contrib / pint-xarray

Interface for using pint with xarray, providing convenience accessors
https://pint-xarray.readthedocs.io/en/latest/
Apache License 2.0
105 stars 12 forks source link

avoid setting `force_ndarray_like = True` #216

Open keewis opened 1 year ago

keewis commented 1 year ago

Moving to a separate issue to be able to properly track this.

Why does pint-xarray require pint.application_registry.force_ndarray_like = True in the first place? I don't like when my dependencies force me to force something... This has become an issue because that very same setting is incompatible with pint-pandas: https://github.com/hgrecco/pint-pandas/issues/165. Given that xarray bases on pandas, this seems to be somewhat serious.

Originally posted by @burnpanck in https://github.com/xarray-contrib/pint-xarray/issues/189#issuecomment-1532145519

The reason is that xarray detects a duck array using a certain interface, casting everything that does not satisfy that to numpy using np.asarray. Since python scalars (float, int) don't have the ndim, dtype and shape properties (though pint does default ndim to 0), the result of some numpy functions will be cast, dropping the units.

Requiring force_ndarray_like avoids that, since pint will do the casting before xarray can do it. I agree that this feels like a hack, though.

Originally posted by @keewis in https://github.com/xarray-contrib/pint-xarray/issues/189#issuecomment-1542549750

Hm. I have a feeling that this is again a reason for implementing separate pint.QuantityArray and pint.QuantityScalar types - because IMHO pint should never respond to np.asarray by stripping units.

Either way, pint-pandas does not require it, and managed to get most operations to work without stripping units. There, it is up to the user to make sure that their objects make it into pandas as a PintArray, e.g. by setting force_ndarray_like = True, or simply by directly building a PintArray. But once the quantity arrays are in, pandas operations do not strip units (at least things improved significantly). Given that, in most cases, users of both xarray and pandas work with arrays anyways, it feels like forcing force_ndarray_like = True on every user is rather harsh. Even more so by doing it at import time - causing an import order dependency!

Originally posted by @burnpanck in https://github.com/xarray-contrib/pint-xarray/issues/189#issuecomment-1542765093

keewis commented 1 year ago

I have a feeling that this is again a reason for implementing separate pint.QuantityArray and pint.QuantityScalar types

If we can make sure the numpy facet always returns that suggested QuantityArray (or simply defaults to force_ndarray_like=True for the results of numpy calls) that would align with the direction of the array API, which requires arguments to be arrays where before it could have been anything castable to an array.

Given that, in most cases, users of both xarray and pandas work with arrays anyways, it feels like forcing force_ndarray_like = True on every user is rather harsh.

I'm not saying that I insist on setting that option, it's just that I don't think there is a way to keep xarray working with pint without it. As soon as that is not necessary anymore (for example with the suggestion above) I will happily remove the modification of the registry. Doing this at import time, though, is something I would like to keep, as that avoids having to set it up manually or having to modify the registry before the first use of pint-xarray.

burnpanck commented 1 year ago

By modifying global state at import time, the behaviour of my program depends on if pint-xarray is imported or not - independent of if the code importing pint-xarray is ever used. This is generally frowned-upon, as evidenced by the result of an online search for "import-time side-effects", e.g.:

In our specific case, it breaks code that serialises python quantity scalars to json, which has a custom json converter for pint.Quantity, but none for numpy array scalars - caused by an innocent looking change to an import in a base library module. The end effect is that we have to ban importing pint_xarray in any of our library modules, and allow it only from application top-level imports.

Would it be possible to at least delay automatic patching of the registry to the first actual use of anything pint_xarray, rather than import time? Of course, it's still modifying global state which potentially causes headaches, but at least not triggered by the mere import. Modifying global state is always a risk, which is the reason units registries have been implemented in the first place.

keewis commented 1 year ago

As I said, I'm not sure what to do about this. However, I really would like to avoid doing this on first use of the .pint accessor, because that will make the situation even more confusing:

import pint_xarray

q = pint.Quantity(0, "m")
ds1 = xr.Dataset({"a": ("x", q)})  # UnitStrippedWarning
ds2 = xr.Dataset({"a": ("y", 0, {"units": "m"})}).pint.quantify()  # works, since it patches the registry
ds3 = xr.Dataset({"a": ("x", q)})  # works

whereas clearly stating that the import patches the registry makes the behavior a lot easier to understand, both as a user and when reading / modifying the code (even though I agree that modifying global state is not good design and should be avoided as much as possible).

Considering that, I think our time is best spent figuring out how we can change pint (or xarray) so we can get rid of the workaround.


In this particular case, though, I wonder if you have considered extending the custom json converter to treat 0d numpy arrays the same as scalars? That would allow you to side-step the issue entirely. For example, you could try normalizing the object with:

normalized = getattr(obj, "tolist", lambda: obj)()

or

if hasattr(obj, "tolist"):
    normalized = obj.tolist()
else:
    normalized = obj
burnpanck commented 1 year ago

Of course, the fix in our code wouldn't be particularly difficult. One can argue that any generic code dealing with a pint quantity today will have to be prepared for array scalars - if it isn't, it's broken.

Considering that, I think our time is best spent figuring out how we can change pint (or xarray) so we can get rid of the workaround.

You are probably right.

burnpanck commented 11 months ago

I'm again very much tempted discard pint-xarray completely because of the tons of things that break because of the force_ndarray_like = True. pandas is basically broken with this setting. I don't remember the details of each and every time I had to jump through hoops in the past months, but now I came across a particularly severe one. I will now just add each specific case I encounter from now on in the thread of this issue, so we at least have a list.

In this particular case, I have a pint_pandas array in a Series with a time-index. Resampling that data injects pd.NA. With force_ndarray_like=True, these end up as np.array(pd.NA,dtype=object) 0-dim arrays even if I do series.pint.m_as(...) to remove the unit. It is almost impossible to get rid of that NA now, being hidden behind the array. The series is now a dtype=object. pd.to_numeric(series) has no effect. Neither has series.replace({pd.NA:np.nan}). Any aggregation (such as series.sum()) fail, because pandas internally does not manage to get rid of those NA's and then fails to perform numeric operations.

It may not be possible to fix that from within pint_pandas, because these issues appear when pandas iterates over individual array elements which are no pint_pandas.PintArrays at all. Those are plain quantities, and pandas doesn't expect to find zero-dimensional arrays of object type inside an array because that just doesn't make any sense.

burnpanck commented 11 months ago

It's just that: Requiring a change to a global setting make pint-xarray a bad player, like a library that only works when the operating system is set to the C-locale (or the EN_US locale for that matter).

What was the reasoning again that led to the conclusion that force_ndarray_like=True was needed in the first place? I just tested disabling force_ndarray_like=True (and the corresponding check) in pint-xarray, and the exact same number of tests pass (244 passed, 12 skipped, 1 warning on my machine under python 3.12.0).

I'll submit a pull-request shortly, so we may get some feedback from CI as-well.

burnpanck commented 11 months ago

The workflows will require some approval before we get CI feedback...

keewis commented 11 months ago

the relevant issues are hgrecco/pint#950 and hgrecco/pint#1342. You won't see this in pint-xarray's test suite because technically this is a property of xarray (pint-xarray is just a glue / convenience package), so you'd have to modify and run xarray's testsuite to find the failing tests. We've been discussing to also have pint-xarray do a full test / move everything to pint-xarray, but I didn't have time to actually do the work for this.

As a summary, though, the issue is this:

In [1]: import pint
   ...: import xarray as xr
   ...: 
   ...: ureg1 = pint.UnitRegistry()
   ...: ureg2 = pint.UnitRegistry(force_ndarray_like=True)

In [2]: import pint
   ...: import xarray as xr
   ...: 
   ...: ureg1 = pint.UnitRegistry(force_ndarray_like=True)
   ...: ureg2 = pint.UnitRegistry()

In [3]: xr.Dataset({"a": ((), ureg1.Quantity(1, "m"))})
Out[3]: 
<xarray.Dataset>
Dimensions:  ()
Data variables:
    a        int64 <Quantity(1, 'meter')>

In [4]: xr.Dataset({"a": ((), ureg2.Quantity(1, "m"))})
...
AttributeError: Neither Quantity object nor its magnitude (1) has attribute 'shape'

While I'm sure we could define shape (and ndim and dtype) with a fallback, this is just a symptom: by defining __array_function__, __array_ufunc__ and ndim, shape, and dtype, pint.Quantity appears to be a duck array, but when wrapping a python float or int that's not actually the case. force_ndarray_like=True changes pint.Quantity such that it is always a duck array, which is why we're setting it.

TomNicholas commented 11 months ago

force_ndarray_like=True changes pint.Quantity such that it is always a duck array, which is why we're setting it.

So arguably the root problem here is that pint is playing badly in the ecosystem by providing an array-like class which although it purports to be a "duck array" in fact:

  1. Does not conform to the python array API standard by default and
  2. Can only be made to conform via changing a global flag that has all sorts of undesired side-effects.

Is that the case? Or is the way in which pint is divergent not actually covered by the (current) array API standard?

keewis commented 11 months ago

pint does not implement the array API, at all: all of the work that made it a "duck array" happened before that standard even existed, so xarray support is via __array_ufunc__ and __array_function__. Note that the array API also requires 0d arrays instead of scalars (but requires raising instead of converting silently), so that would not resolve the issue with pint-pandas.

pint somewhat recently introduced "facets", which internally are pretty complicated (I so far was not able to understand the way they work), but nicely separate the different parts of the registry. This means that you have one "facet" responsible for scalar / standard python quantities and one for numpy (among others). This is probably something to discuss on the pint issue tracker, but I wonder if it should be possible to select facets depending on the use case: that way, pint-pandas would be able to select the scalar facet when needed without being affected by the settings for the numpy facet (I might be misunderstanding the use of pint in pint-pandas, though)

TomNicholas commented 11 months ago

Thank you @keewis for all the details.

pint does not implement the array API, at al

Array API support in pint has been suggested here https://github.com/hgrecco/pint/issues/1592, but I don't see any evidence of progress yet.

That should really be the long-term aim though - pint should support the array API, so pint-xarray can stop setting force_ndarray_like=True, and pint-pandas can hopefully separately get rid of whatever workarounds it currently has.

keewis commented 11 months ago

very true, Array API support in pint would remove the need for force_ndarray_like=True. This is a huge undertaking, though, so not sure when we will get to that (my personal priority right now is #163 then #143)

TomNicholas commented 11 months ago

It's also really on the pint devs, not on the xarray team. (I'm also very happy to see that you're working on #163, and happy to help out with finishing #143 after that!)


In the meantime what's the best way forward here?

Does this problem only come up at the point we call ds.pint.quantify? If so can we deal with it using a context manager approach inside .quantify like suggested here? Presumably once the data has been coerced into a wrapped 0d array xarray wouldn't convert the data back to a wrapped scalar?

burnpanck commented 11 months ago

So arguably the root problem here is that pint is playing badly in the ecosystem by providing an array-like class which although it purports to be a "duck array" in fact...

I believe it may be more complicated than that. pint should not purport to be a "duck array" in the first place unless it really semantically provides an array. The issues with pandas arise whenever it falls back to iterating over arrays, where it expects to see scalars, but the force_ndarray_like scalars now don't behave as such. So we have:

  1. pint fundamentally wants to support both scalar and array use-cases
  2. xarray needs everything that looks like a duck array to behave like an array
  3. pandas (and many other cases not involving pandas; round(1*u.m/(2*u.mm)), json.dumps(2*u.mm), ...) expect scalars to behave like scalars (which array-like scalars don't)

Thus, I don't think there is all too much that pint-pandas can do to solve the issue.

The fundamental problem indeed remains pint whose quantities look like arrays even when they aren't. It has come up over and over again to split Quantity into separate classes for arrays and scalars (https://github.com/hgrecco/pint/issues/753, https://github.com/hgrecco/pint/pull/764, https://github.com/hgrecco/pint/pull/875, https://github.com/hgrecco/pint/issues/1128, ...). However, the introduction of "facades" basically made previous progress on a PR to that end useless, and I never had the time to dive deep enough into facades to help working that out. Unfortunately, it seems that nobody has the resources to solve that quickly.

Given that so many things break when scalars do not behave like scalars, I wonder if there is really no way to introduce a workaround around pint's issues in pint-xarray instead, potentially with a bit of help of xarray itself. To me at least it appears the only actual issue is that it currently fails coerce a scalar quantity into a zero-dimensional array (as in xr.Dataset({"a": ((), ureg2.Quantity(1, "m"))})). To me, one could even argue that this is correct by design: You ask xarray to store a 0-dimensional array, but provide something which isn't an array. It appears to me that it is xarray which is fooled by the looks of scalar quantities, so we should be looking to work around that in xarray / pint-xarray rather than expecting every other scalar library to deal with arrays.