keewis opened this issue 1 year ago
> I have a feeling that this is again a reason for implementing separate `pint.QuantityArray` and `pint.QuantityScalar` types
If we can make sure the numpy facet always returns that suggested `QuantityArray` (or simply defaults to `force_ndarray_like=True` for the results of numpy calls), that would align with the direction of the array API, which requires arguments to be arrays where before it could have been anything castable to an array.
Given that, in most cases, users of both `xarray` and `pandas` work with arrays anyways, it feels like forcing `force_ndarray_like = True` on every user is rather harsh.
I'm not saying that I insist on setting that option, it's just that I don't think there is a way to keep `xarray` working with `pint` without it. As soon as that is not necessary anymore (for example with the suggestion above) I will happily remove the modification of the registry. Doing this at import time, though, is something I would like to keep, as that avoids having to set it up manually or having to modify the registry before the first use of `pint-xarray`.
By modifying global state at import time, the behaviour of my program depends on whether `pint-xarray` is imported or not, independent of whether the code importing `pint-xarray` is ever used. This is generally frowned upon, as evidenced by the results of an online search for "import-time side-effects".

In our specific case, it breaks code that serialises python quantity scalars to JSON: we have a custom JSON converter for `pint.Quantity`, but none for numpy array scalars, and the breakage was caused by an innocent-looking change to an import in a base library module. The end effect is that we have to ban importing `pint_xarray` in any of our library modules, and allow it only from application top-level imports.
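To make the failure mode concrete, here is a minimal sketch (the registry setup and the `quantity_to_json` converter are illustrative, not the actual application code) of how a converter that only knows about quantities breaks once the magnitude becomes a 0d array:

```python
import json
import pint

# roughly what importing pint-xarray does to the registry it patches
ureg = pint.UnitRegistry(force_ndarray_like=True)

def quantity_to_json(obj):
    # converter for quantities; has no branch for numpy (array) scalars
    if isinstance(obj, ureg.Quantity):
        return {"magnitude": obj.magnitude, "units": str(obj.units)}
    raise TypeError(f"Object of type {type(obj)} is not JSON serializable")

json.dumps(ureg.Quantity(1.5, "m"), default=quantity_to_json)
# without force_ndarray_like the magnitude is a plain float and this works;
# with force_ndarray_like=True the magnitude is a 0d ndarray, the converter
# has no branch for it, and json.dumps raises TypeError
```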
Would it be possible to at least delay the automatic patching of the registry to the first actual use of anything in `pint_xarray`, rather than doing it at import time? Of course, it would still be modifying global state, which potentially causes headaches, but at least it would not be triggered by the mere import. Modifying global state is always a risk, which is the reason units registries were implemented in the first place.
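For illustration, lazy patching could look roughly like the sketch below (a hypothetical helper, not existing `pint-xarray` code; it assumes the flag can simply be toggled on an existing registry):

```python
def ensure_registry_patched(registry):
    # hypothetical: called from inside the .pint accessor on first use,
    # instead of modifying the registry when pint_xarray is imported
    if not getattr(registry, "force_ndarray_like", False):
        registry.force_ndarray_like = True
```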
As I said, I'm not sure what to do about this. However, I really would like to avoid doing this on first use of the `.pint` accessor, because that would make the situation even more confusing:
```python
import pint
import xarray as xr

import pint_xarray

q = pint.Quantity(0, "m")
ds1 = xr.Dataset({"a": ("x", q)})  # UnitStrippedWarning
ds2 = xr.Dataset({"a": ("y", 0, {"units": "m"})}).pint.quantify()  # works, since it patches the registry
ds3 = xr.Dataset({"a": ("x", q)})  # works
```
whereas clearly stating that the import patches the registry makes the behavior a lot easier to understand, both as a user and when reading / modifying the code (even though I agree that modifying global state is not good design and should be avoided as much as possible).
Considering that, I think our time is best spent figuring out how we can change `pint` (or `xarray`) so we can get rid of the workaround.
In this particular case, though, I wonder if you have considered extending the custom JSON converter to treat 0d numpy arrays the same as scalars? That would allow you to side-step the issue entirely. For example, you could try normalizing the object with:
```python
normalized = getattr(obj, "tolist", lambda: obj)()
```

or

```python
if hasattr(obj, "tolist"):
    normalized = obj.tolist()
else:
    normalized = obj
```
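As a sketch of how that normalization could slot into the JSON `default` handler (hypothetical code; it assumes the quantities come from a registry named `ureg`):

```python
def quantity_to_json(obj):
    if isinstance(obj, ureg.Quantity):
        # tolist() turns a 0d ndarray into a plain python scalar,
        # and the fallback is a no-op for magnitudes that already are scalars
        magnitude = getattr(obj.magnitude, "tolist", lambda: obj.magnitude)()
        return {"magnitude": magnitude, "units": str(obj.units)}
    raise TypeError(f"Object of type {type(obj)} is not JSON serializable")
```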
Of course, the fix in our code wouldn't be particularly difficult. One can argue that any generic code dealing with a pint quantity today will have to be prepared for array scalars - if it isn't, it's broken.
> Considering that, I think our time is best spent figuring out how we can change `pint` (or `xarray`) so we can get rid of the workaround.
You are probably right.
I'm again very much tempted to discard `pint-xarray` completely because of the tons of things that break because of `force_ndarray_like = True`. `pandas` is basically broken with this setting. I don't remember the details of each and every time I had to jump through hoops in the past months, but now I came across a particularly severe one. From now on I will just add each specific case I encounter to the thread of this issue, so we at least have a list.
In this particular case, I have a `pint_pandas` array in a Series with a time index. Resampling that data injects `pd.NA`. With `force_ndarray_like=True`, these end up as `np.array(pd.NA, dtype=object)` 0-dim arrays even if I do `series.pint.m_as(...)` to remove the unit. It is almost impossible to get rid of that `NA` now, hidden behind the array. The series now has `dtype=object`. `pd.to_numeric(series)` has no effect. Neither has `series.replace({pd.NA: np.nan})`. Any aggregation (such as `series.sum()`) fails, because pandas internally does not manage to get rid of those NAs and then fails to perform numeric operations.
It may not be possible to fix that from within `pint_pandas`, because these issues appear when `pandas` iterates over individual array elements, which are not `pint_pandas.PintArray`s at all. Those are plain quantities, and pandas doesn't expect to find zero-dimensional arrays of object dtype inside an array, because that just doesn't make any sense.
It's just that: requiring a change to a global setting makes `pint-xarray` a bad player, like a library that only works when the operating system is set to the `C` locale (or the `EN_US` locale, for that matter).
What was the reasoning again that led to the conclusion that `force_ndarray_like=True` was needed in the first place? I just tested disabling `force_ndarray_like=True` (and the corresponding check) in `pint-xarray`, and the exact same number of tests pass (244 passed, 12 skipped, 1 warning on my machine under Python 3.12.0).
I'll submit a pull request shortly, so we may get some feedback from CI as well.
The workflows will require some approval before we get CI feedback...
The relevant issues are hgrecco/pint#950 and hgrecco/pint#1342. You won't see this in `pint-xarray`'s test suite because technically this is a property of `xarray` (`pint-xarray` is just a glue / convenience package), so you'd have to modify and run `xarray`'s test suite to find the failing tests. We've been discussing also having `pint-xarray` run a full test / moving everything to `pint-xarray`, but I didn't have time to actually do the work for this.

As a summary, though, the issue is this:
```python
In [2]: import pint
   ...: import xarray as xr
   ...:
   ...: ureg1 = pint.UnitRegistry(force_ndarray_like=True)
   ...: ureg2 = pint.UnitRegistry()

In [3]: xr.Dataset({"a": ((), ureg1.Quantity(1, "m"))})
Out[3]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    a        int64 <Quantity(1, 'meter')>

In [4]: xr.Dataset({"a": ((), ureg2.Quantity(1, "m"))})
...
AttributeError: Neither Quantity object nor its magnitude (1) has attribute 'shape'
```
While I'm sure we could define `shape` (and `ndim` and `dtype`) with a fallback, this is just a symptom: by defining `__array_function__`, `__array_ufunc__`, `ndim`, `shape`, and `dtype`, `pint.Quantity` appears to be a duck array, but when wrapping a python `float` or `int` that's not actually the case. `force_ndarray_like=True` changes `pint.Quantity` such that it is always a duck array, which is why we're setting it.
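As an illustration of that asymmetry (a minimal sketch, assuming `numpy` is installed so pint's numpy support is active):

```python
import pint

plain = pint.UnitRegistry()
forced = pint.UnitRegistry(force_ndarray_like=True)

q_plain = plain.Quantity(1, "m")
q_forced = forced.Quantity(1, "m")

hasattr(q_plain, "__array_function__")  # True: the class advertises duck-array behaviour
hasattr(q_plain, "shape")               # False: the wrapped python int has no shape
hasattr(q_forced, "shape")              # True: the magnitude is now a 0d ndarray
```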
> `force_ndarray_like=True` changes `pint.Quantity` such that it is always a duck array, which is why we're setting it.
So arguably the root problem here is that `pint` is playing badly in the ecosystem by providing an array-like class which, although it purports to be a "duck array", in fact is not one when it wraps a plain python scalar.

Is that the case? Or is the way in which pint is divergent not actually covered by the (current) array API standard?
`pint` does not implement the array API at all: all of the work that made it a "duck array" happened before that standard even existed, so `xarray` support is via `__array_ufunc__` and `__array_function__`. Note that the array API also requires 0d arrays instead of scalars (but requires raising instead of converting silently), so that would not resolve the issue with `pint-pandas`.
`pint` somewhat recently introduced "facets", which internally are pretty complicated (so far I have not been able to understand how they work), but which nicely separate the different parts of the registry. This means that you have one "facet" responsible for scalar / standard python quantities and one for `numpy` (among others). This is probably something to discuss on the `pint` issue tracker, but I wonder if it should be possible to select facets depending on the use case: that way, `pint-pandas` would be able to select the scalar facet when needed without being affected by the settings for the `numpy` facet (I might be misunderstanding the use of `pint` in `pint-pandas`, though).
Thank you @keewis for all the details.
> pint does not implement the array API, at all

Array API support in pint has been suggested in https://github.com/hgrecco/pint/issues/1592, but I don't see any evidence of progress yet.

That should really be the long-term aim though: pint should support the array API, so that pint-xarray can stop setting `force_ndarray_like=True`, and pint-pandas can hopefully separately get rid of whatever workarounds it currently has.
Very true, array API support in `pint` would remove the need for `force_ndarray_like=True`. This is a huge undertaking, though, so I'm not sure when we will get to that (my personal priority right now is #163, then #143).
It's also really on the pint devs, not on the xarray team. (I'm also very happy to see that you're working on #163, and happy to help out with finishing #143 after that!)
In the meantime what's the best way forward here?
Does this problem only come up at the point we call `ds.pint.quantify`? If so, can we deal with it using a context-manager approach inside `.quantify`, like suggested here? Presumably once the data has been coerced into a wrapped 0d array, xarray wouldn't convert the data back to a wrapped scalar?
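Something along these lines, perhaps (a sketch only; `forced_ndarray_like` is a hypothetical helper, not existing pint-xarray API, and it assumes the flag can be toggled on an existing registry, which is what the current import-time patch already relies on):

```python
from contextlib import contextmanager

@contextmanager
def forced_ndarray_like(registry):
    # temporarily enable force_ndarray_like while quantify() wraps the data,
    # then restore whatever the user had configured
    old = registry.force_ndarray_like
    registry.force_ndarray_like = True
    try:
        yield registry
    finally:
        registry.force_ndarray_like = old
```

`.quantify` could then wrap its body in `with forced_ndarray_like(registry): ...`, so the setting never leaks outside the call.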
> So arguably the root problem here is that pint is playing badly in the ecosystem by providing an array-like class which although it purports to be a "duck array" in fact...
I believe it may be more complicated than that. `pint` should not purport to be a "duck array" in the first place unless it really, semantically, provides an array. The issues with `pandas` arise whenever it falls back to iterating over arrays, where it expects to see scalars, but the `force_ndarray_like` scalars now don't behave as such. So we have:

- `pint` fundamentally wants to support both scalar and array use-cases
- `xarray` needs everything that looks like a duck array to behave like an array
- `pandas` (and many other cases not involving pandas: `round(1*u.m/(2*u.mm))`, `json.dumps(2*u.mm)`, ...) expects scalars to behave like scalars (which array-like scalars don't)

Thus, I don't think there is all too much that `pint-pandas` can do to solve the issue.
The fundamental problem indeed remains `pint`, whose quantities look like arrays even when they aren't. It has come up over and over again to split Quantity into separate classes for arrays and scalars (https://github.com/hgrecco/pint/issues/753, https://github.com/hgrecco/pint/pull/764, https://github.com/hgrecco/pint/pull/875, https://github.com/hgrecco/pint/issues/1128, ...). However, the introduction of "facets" basically made previous progress on a PR to that end useless, and I never had the time to dive deep enough into facets to help work that out. Unfortunately, it seems that nobody has the resources to solve that quickly.

Given that so many things break when scalars do not behave like scalars, I wonder if there is really no way to introduce a workaround for `pint`'s issues in `pint-xarray` instead, potentially with a bit of help from `xarray` itself. To me, at least, it appears the only actual issue is that it currently fails to coerce a scalar quantity into a zero-dimensional array (as in `xr.Dataset({"a": ((), ureg2.Quantity(1, "m"))})`). One could even argue that this is correct by design: you ask `xarray` to store a 0-dimensional array, but provide something which isn't an array. It appears to me that it is `xarray` which is fooled by the looks of scalar quantities, so we should be looking to work around that in `xarray` / `pint-xarray` rather than expecting every other scalar library to deal with arrays.
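For example, rather than flipping the registry-wide flag, `pint-xarray` (or `xarray`) could coerce the offending objects individually; `as_duck_array` below is a hypothetical helper to sketch the idea, not existing API:

```python
import numpy as np

def as_duck_array(quantity):
    # if the magnitude is a plain python scalar, re-wrap it as a 0d ndarray
    # so the resulting quantity really behaves like a duck array
    if not hasattr(quantity.magnitude, "shape"):
        return quantity.__class__(np.asarray(quantity.magnitude), quantity.units)
    return quantity
```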
Moving to a separate issue to be able to properly track this.
Originally posted by @burnpanck in https://github.com/xarray-contrib/pint-xarray/issues/189#issuecomment-1532145519
Originally posted by @keewis in https://github.com/xarray-contrib/pint-xarray/issues/189#issuecomment-1542549750
Originally posted by @burnpanck in https://github.com/xarray-contrib/pint-xarray/issues/189#issuecomment-1542765093