xarray-contrib / xarray-simlab

Xarray extension and framework for computer model simulations
http://xarray-simlab.readthedocs.io
BSD 3-Clause "New" or "Revised" License
73 stars 9 forks source link

Sparse arrays #165

Open jvail opened 3 years ago

jvail commented 3 years ago

:+1: for the new release.

I was wondering if it would be laborious to support sparse (https://github.com/pydata/sparse) arrays. I have an issue with this line

https://github.com/benbovy/xarray-simlab/blob/master/xsimlab/stores.py#L193 dtype = getattr(value, "dtype", np.asarray(value).dtype)

I guess in case of sparse it would just be: dtype = getattr(value, "dtype", value.dtype)

But maybe after that line there are other issues as well. I just wanted to ask asap if you think sparse could be easily supported with a few tweaks. Because if not then I might just have to skip the idea of using sparse arrays at all.

Thank you, Jan

benbovy commented 3 years ago

Sorry for the late reply @jvail !

Yes it would be nice if xarray-simlab could support sparse arrays (I haven't tried yet). Xarray-simlab uses zarr to store the model outputs and uses xarray.open_zarr to load back the zarr store as a xarray Dataset. So the question is whether zarr and xarray.open_zarr would easily support that. There seems to be interest in that, but I don't think it could be supported with a few tweaks: https://github.com/zarr-developers/zarr-python/issues/152, https://github.com/pydata/sparse/issues/222, https://github.com/zarr-developers/zarr-python/issues/424.

jvail commented 3 years ago

Sorry for the late reply @jvail !

No need to be sorry!

Alright - Thank you. I'll try to read the zarr related stuff and try a dev install of simlab to see how far I can get.

jvail commented 3 years ago

Hi @benbovy ,

here is a test for some minimal sparse support: https://github.com/jvail/xarray-simlab/commit/107d2e0a2bebff4706043d9708f4fba3e6f2b53d#diff-932f643c875a75e89dbf205d2764a1ac25b260f5c12503543242f833e8359a62

It turns out it is impossible to use sparse with intent='inout' because zarr wont accept it and if you auto densify you are back at square one because you would have to "re-sparse" after reading the input.

It works well internally - meaning inside a process without having the variable either in input_vars or output_vars - without any changes in the source code.

If the sparse variable is a model output (in output_vars) you would have to call todense() on it (see my commit). Not sure if you want that. But at least it would provide a bit of support.

Another way to handle sparse in outputs could be to just drop all sparse when writing to zarr and issue a warning.

benbovy commented 3 years ago

Hi @jvail,

Thanks for your example.

Calling todense() before writing to zarr (and issue a warning) seems reasonable to me. It's better than no support at all.

Hopefully, better support for sparse arrays (and maybe other array backends) will be fixed (in zarr and xarray) upstream in the future.

In the meantime, maybe we could implement some ad-hoc conversion in xarray-simlab, i.e., for sparse arrays having coords and data saved as two zarr arrays, and then reconstruct the sparse COO array after loading the zarr dataset and before returning the xarray Dataset from Dataset.xsimlab.run().

jvail commented 3 years ago

In the meantime, maybe we could implement some ad-hoc conversion in xarray-simlab, i.e., for sparse arrays having coords and data saved as two zarr arrays, and then reconstruct the sparse COO array after loading the zarr dataset and before returning the xarray Dataset from Dataset.xsimlab.run().

Thank you Benoit. That sounds like a pretty good idea. Should work with 'inout' as well - I'll give it a try.

jvail commented 3 years ago

Salut @benbovy,

if you find time please take a look at this commit/branch: https://github.com/jvail/xarray-simlab/commit/cfa37a69796985beca19ffadcef70d8005ccb1d7

Instead of adding two zarr items - as you suggested - I tried to use the DOK format which nicely translates into a structured numpy array. That in turn is digestible by zarr.

Unfortunately I could not get it working with the "infamous" mask_and_scale: True setting. Maybe you have an idea?

There is a little notebook as well.

benbovy commented 3 years ago

Unfortunately I could not get it working with the "infamous" mask_and_scale: True setting. Maybe you have an idea?

Without having looked at your branch yet, my first idea would be to fix that in Xarray :-) !

jvail commented 3 years ago

Unfortunately I could not get it working with the "infamous" mask_and_scale: True setting. Maybe you have an idea?

Without having looked at your branch yet, my first idea would be to fix that in Xarray :-) !

hmpf - that is beyond my reach. But maybe we can can get around it by injecting some encoding/decoding options - I hope.

benbovy commented 3 years ago

I just had a look at your https://github.com/jvail/xarray-simlab/commit/cfa37a69796985beca19ffadcef70d8005ccb1d7 branch. That's nice! I think that such support for sparse arrays would be a nice addition in xarray-simlab before proper support for sparse is available in zarr (and zarr->xarray).

A comment about your implementation: the approach that you use won't scale well if later we want to support other special cases. The VariableCoder approach used in Xarray would nicely fit here, even for one special case. Unfortunately it is not part (yet?) of the public API, but we could reuse the same approach here.

jvail commented 3 years ago

A comment about your implementation: the approach that you use won't scale well if later we want to support other special cases. The VariableCoder approach used in Xarray would nicely fit here, even for one special case. Unfortunately it is not part (yet?) of the public API, but we could reuse the same approach here.

Yes, that's true. I'll try to propose something more general. But it seems you are d'accord with this approach - generally speaking? I'd be interested in moving this forward quickly so it might have a chance to be part of the next release.

Do you have an idea how to fix the mask_and_scale: True issue? Could it be that now xarray does complain about structured arrays coming from zarr? If so, I might be back at square one :\

benbovy commented 3 years ago

Yes generally I'm d'accord !

Do you have an idea how to fix the mask_and_scale: True issue? Could it be that now xarray does complain about structured arrays coming from zarr? If so, I might be back at square one :\

I'm afraid there's currently no workaround other than mask_and_scale: False. I don't think Xarray's CFMaskCoder supports fill values defined for structured dtypes, but I guess it wouldn't be hard to support it.

jvail commented 3 years ago

I'm afraid there's currently no workaround other than mask_and_scale: False.

Apparently setting the zarr fill_value to None solves it.

jvail commented 3 years ago

It turned out that we do not desperately need sparse arrays right now. With our current model/data we stay below a "critical" size (wherever that is) with our indices. But if they grow much larger in future and therefore arrays get sparser and sparser support for sparse would be very welcome. Also I'd like to wait until the dust has settled a bit around all the refactoring and new features in open PRs and a new version has been released.