Zarr extension for stacked / concatenated virtual views

jbms commented 9 months ago

Splitting out discussion from https://github.com/zarr-developers/zarr-specs/issues/287

See Ryan's comment here:

https://github.com/zarr-developers/zarr-specs/issues/287#issuecomment-1931017077

Stacking/virtual views are implemented in tensorstore here: https://google.github.io/tensorstore/driver/stack/index.html

A few points that I think need to be addressed:

Specifying cropping and other coordinate transforms (see https://google.github.io/tensorstore/index_space.html#index-transform for inspiration)
URL syntax to support (see https://github.com/zarr-developers/zeps/pull/48 for one possible solution)
Recommendations for avoiding confused deputy problem (https://en.wikipedia.org/wiki/Confused_deputy_problem)
Efficient representation: if you have just a small number of component arrays, then just specifying them in json directly works fine, but if you have a very large number of component arrays, it would be better to have a data structure that enables loading them on demand. The tensorstore stack driver internally constructs a non-regular (i.e. variably chunked) grid in-memory to represent the stack, which enables efficient access; each grid cell contains the tensorstore spec that indicates how to access the data within that cell. This grid could instead be stored as a zarr array (where each element of the zarr array represents one variably-sized grid cell of the concenated array); the data type would be string or json or similar and would somehow specify a reference to (a portion of) another zarr array.

TomNicholas commented 9 months ago

I want to make a more concrete proposal of how a ZEP for concatenation might work, based on some discussions we had in-person during the NYC Zarr Sprint last week, and building off this suggestion https://github.com/zarr-developers/zarr-specs/issues/287#issuecomment-1931017077.

Abstract

Provides a way to define arrays in a zarr store as concatenations of other arrays in the store.

Motivation and Scope

A common problem is providing a zarr-compliant interface to sets of existing files, often provided in some legacy file format (e.g. netCDF4). For instance imagine a series of netCDF4 files containing daily meteorological data, written out as one file per month. In general, these netCDF4 files might have different codecs (as compression schemes and parameters change) and variable lengths along the concatenation dimension time (as months have different numbers of days in them). We nevertheless want to make a single Zarr store that represents this entire dataset.

The "chunk manifest" proposal (#287) deals with how we could provide access to the bytes in the legacy files from a zarr store interface, via defining a storage transformer for zarr arrays that redirects requests to fetch specific byte ranges within the netCDF files. (For access to cloud URLs we would also need ZEP008)

However, with one chunk manifest per zarr array (as proposed in #287), and each zarr array conventionally having one set of codecs (and regular chunks), a single standard zarr array cannot represent all the chunks of one variable across the entire time dimension of this hypothetical dataset.

Instead, we propose defining a way to represent concatenation of a series of arrays containing the data in each file, and exposing that concatenated array as part of the store.

Usage and Impact

With this ZEP implementations would be able to define concatenation methods for Zarr arrays (either lazy or eager), which would generate the definition of the new array in the store as a reference to multiple existing arrays.

Together with #287 (and an implementation of both in zarr-python), that would allow us to standardize the "MultiZarrToZarr" workflow pioneered by the kerchunk package, but in a fully-specified, array-level, and language-agnostic way.

Implementations might then be able to write code like

import zarr

a1 = zarr.open("array1.zarr")  # zarr.Array
a2 = zarr.open("array2.zarr")  # zarr.Array 

# this would create a Zarr array with the correct extensions to support indirection to a1 and a2
# (need to figure out the right API)
a3 = zarr.concat([a1, a2], axis=0, store="array3.zarr")

a3[0] = 1  # write some data...automatically goes into the correct underlying array

With a1 and a2 each pointing to one or more legacy netCDF4 files via a chunk manifest, and a3 exposed as a single concatenated array, this gives us a zarr-native representation of the entire dataset.

Array-like concatenation like this could then open the door to wrapping via more expressive higher-level APIs like xarray, allowing users to go from many netCDFs to a single all-encompassing Zarr store using the same API calls they currently use to open all those netCDFs together using xarray (see https://github.com/pydata/xarray/issues/8699)

ds = xr.open_mfdataset(
    '/my/files*.nc'
    engine='zarrify',  # an xarray IO backend that uses kerchunk to read byte ranges then returns lazy representations of zarr.Array objects with concat defined
    combine='nested',  # using 'by_coords' instead would actually check coordinate data
    parallel=True,  # would use dask.delayed to generate reference dicts for each file in parallel
)

ds  # now wraps a bunch of lazily-concatenated zarr.Array objects

ds.zarrify.to_zarr(store='out.zarr')  # xarray accessor that uses zarr-python to concatenate the lazy zarr arrays and writes the resulting zarr store out (which would conform to this ZEP)

Concatenation of arbitrary-length chunks along the concatenation axis might be another way to get the variable-length chunks functionality proposed in ZEP003 (or ZEP003 might be necessary to support for implementations of this ZEP to be able to concatenate variable-length files).

This proposal might find other use outside of the legacy file context, if for some reason you wanted to create a zarr array that used different codecs for different parts of the array.

Backward Compatibility

This change is fully backwards-compatible - all old data would remain usable. However, zarr arrays written as concatenated arrays would only be readable using v3 implementations that support this ZEP.

Detailed description

This section should provide a detailed description of the proposed change. It should include examples of how the new functionality would be used, intended use-cases and pseudo-code illustrating its use.

A concatenated array "c" would have a different set of metadata under the chunk_grid key:

"name": "concatenated"
"concatenation": {
    "axis": 0,   # int or list of ints
    "arrays": ["../a", "../b"]   # list of relative paths to other arrays in store 
}

Array-specific attributes (such as codecs) would not be specified, as they can be inferred by looking at the metadata of arrays a and b.

Cycles of references to arrays (e.g. "a" refers to "b" which refers to "a") are forbidden.

Multi-dimensional concatenation should be supported and one way would be by requiring "axis" to be an ordered list of integers, and "arrays" to be a nested list (see xarray.combine_nested for a similar syntax).

Q: Should "chunk_shape" be inferred by readers, or written out explicitly like in ZEP003? Q: Should there be any restrictions on which groups these arrays live in (e.g. you can only concatenate arrays in the same group)? Q: Concatenation could alternatively be defined to only work along one axis, and then multi-dimensional concatenation can only be represented as a series of concatenation steps of different arrays within the store (I prefer the list of ints approach though). Q: Could the the list of array paths under "arrays" be stored as a (string) zarr array itself? As this list could potentially contain thousands of entries. I think there is a subtlety here with zarr storing variable-length string dtypes... Q: Is there then any need for parquet? Q: Is this proposal just another type of storage transformer?

Diagram (3) on the right shows where the metadata for array "c" would live (turquoise box), and how it could be defined as a concatenation of arrays "a" and "b" even if "a" and "b" have their own chunk manifests (orange boxes).

Related Work

fsspec + kerchunk (specifically kerchunk.combine.MultiZarrToZarr)
ncml

Implementation

For an implementation to support reading array "c" above for example, it would need to identify arrays that have "name" equal to "concatenated", read the names of the concatenated arrays, translate any request into which of arrays "a" or "b" the requested data must live in, use its normal methods for fetching the required data, possibly concatenate (e.g. using numpy.concatenate), then return the data. The end result for the user would be that they can treat concatenated array "c" as if it were a normal zarr array, i.e. they specify a key and get bytes back, without having to know that those bytes actually came from multiple other arrays in the same store.

Alternatives

Not moving this up into Zarr and instead using the JSON representation used by kerchunk has a few issues, one of which is that it would only allow fsspec-aware clients to read the concatenated arrays, and is also therefore currently not language-agnostic. Another is that kerchunk's interface defines concatenation at the store level, whereas it makes more sense to define it at the array level.
Defining concatenation as the merging of two chunk manifests would not solve the problem of different underlying files having different codecs.

Discussion

This issue: https://github.com/zarr-developers/zarr-specs/issues/288
Chunk manifest: https://github.com/zarr-developers/zarr-specs/issues/287

TomNicholas commented 9 months ago

A few points that I think need to be addressed:

Specifying cropping and other coordinate transforms (see https://google.github.io/tensorstore/index_space.html#index-transform for inspiration) URL syntax to support (see https://github.com/zarr-developers/zeps/pull/48 for one possible solution)

I'm not sure I understand this - does this happen before requesting the key from the store?

Recommendations for avoiding confused deputy problem (https://en.wikipedia.org/wiki/Confused_deputy_problem)

Not sure I understand this either - isn't this problem independent of this proposal, so long as we only allow concatenation of existing arrays within the same store? This seems more to do with how the chunk manifest or URL syntax would work.

Efficient representation

Mentioned above.

jbms commented 9 months ago

A few points that I think need to be addressed:

Specifying cropping and other coordinate transforms (see https://google.github.io/tensorstore/index_space.html#index-transform for inspiration) URL syntax to support (see Add ZEP 8 (URL syntax) draft zeps#48 for one possible solution)

I'm not sure I understand this - does this happen before requesting the key from the store?

Is this question regarding URL syntax? If only arrays within a single store can be concatenated then URL syntax isn't needed. But concatenating arrays that are not in a single key-value store is an important use case. For example you might wish to combine some portion of a public dataset stored on s3 with something generated locally. I suppose this problem could be punted to the key-value store layer by just saying that you have to use a "url" key-value store where everything is part of a single store. But we still need to standardize on the URL syntax to allow different implementations to be compatible with concatenation arrays that specify arrays by URL. Currently nothing in zarr v3 relies on references to anything else in the key-value store so we haven't had to address any of these issues.

Recommendations for avoiding confused deputy problem (https://en.wikipedia.org/wiki/Confused_deputy_problem)

Not sure I understand this either - isn't this problem independent of this proposal, so long as we only allow concatenation of existing arrays within the same store? This seems more to do with how the chunk manifest or URL syntax would work.

Yes the problem doesn't really apply if you can only reference arrays within the same store (especially if you prohibit "../" in the paths), but I imagine that users may often want to concatenate arrays that are not in the same store.

Efficient representation

Mentioned above.

rabernat commented 9 months ago

I think that initially scoping this to arrays within the same store is a great place to start in terms of prototyping.

TomNicholas commented 9 months ago

Is this question regarding URL syntax?

No I meant I don't understand the context of the coordinate transform stuff.

But concatenating arrays that are not in a single key-value store is an important use case.

I hadn't really thought about this. But it's certainly easier to restrict scope for now as Ryan says.

For example you might wish to combine some portion of a public dataset stored on s3 with something generated locally.

If you used a chunk manifest to point to the public dataset then you could create a store that concatenated the remote data with the local data couldn't you?

jbms commented 9 months ago

Q: Should "chunk_shape" be inferred by readers, or written out explicitly like in ZEP003?

In tensorstore it is inferred, but for each array in the stack it is required to specify the domain as part of the JSON spec, so that the overall grid can be determined without doing any I/O. That means that the metadata for each component array (called a "layer" in tensorstore) can be opened lazily on demand.

Therefore I would say that the JSON metadata should definitely be designed to allow the grid to be determined without having to read all of the component arrays, but I don't think the grid needs to be explicitly represented if it can be otherwise inferred (e.g. from a coordinate transform that is specified).

Q: Should there be any restrictions on which groups these arrays live in (e.g. you can only concatenate arrays in the same group)?

I don't think it would be helpful to limit to a single group but confused deputy problem issues should definitely be considered.

Q: Concatenation could alternatively be defined to only work along one axis, and then multi-dimensional concatenation can only be represented as a series of concatenation steps of different arrays within the store (I prefer the list of ints approach though).

I think it would definitely be desirable to do it all as a single virtual zarr array, having possibly many nested virtual zarr arrays to do one multidimensional concatenation/stack would be unfortunate.

However, I would suggest considering the tensorstore approach of "layers" with arbitrary coordinate transformations instead of concatenation. Just plain concatenation does not permit a common use case of stacking e.g. a bunch of 2-d arrays into a 3-d array.

Q: Could the the list of array paths under "arrays" be stored as a (string) zarr array itself? As this list could potentially contain thousands of entries. I think there is a subtlety here with zarr storing variable-length string dtypes...

Yes this would require variable-length data types, so we could defer this until we have standardized that, but should perhaps think about how it could work.

Q: Is there then any need for parquet?

I don't think so.

Q: Is this proposal just another type of storage transformer?

I don't think this fits well as a storage transformer. To make it work with a storage transformer you would also need to make use of ZEP3 variable-size chunking and include the chunking of each component array, which would be annoying.

jbms commented 9 months ago

Is this question regarding URL syntax?

No I meant I don't understand the context of the coordinate transform stuff.

I'm not sure how the concatenation would interact with storage transformers. But basically I'd say the goal should be to allow you to do any nested combination of np.stack or np.concatenate([arbitrary_indexing_op_0(array_0), arbitrary_indexing_op_1(array_1), ...]).

But concatenating arrays that are not in a single key-value store is an important use case.

I hadn't really thought about this. But it's certainly easier to restrict scope for now as Ryan says.

For example you might wish to combine some portion of a public dataset stored on s3 with something generated locally.

If you used a chunk manifest to point to the public dataset then you could create a store that concatenated the remote data with the local data couldn't you?

Yes but the same URL syntax issues apply to the chunk manifest proposal so if it has been addressed there then the same solution could equally well be used for concatenation.

TomNicholas commented 9 months ago

Thanks for all this quick feedback @jbms !

However, I would suggest considering the tensorstore approach of "layers" with arbitrary coordinate transformations instead of concatenation. Just plain concatenation does not permit a common use case of stacking e.g. a bunch of 2-d arrays into a 3-d array.

I agree about stacking - I completely forgot that was a separate function from np.concatenate (in xarray-land they are both just xr.concat). If we are stacking to create new dimensions we should also think about the fact those new dimensions might need to have names written into _ARRAY_DIMENSIONS too.

I'm not sure how the concatenation would interact with storage transformers. But basically I'd say the goal should be to allow you to do any nested combination of np.stack or np.concatenate([arbitrary_indexing_op_0(array_0), arbitrary_indexing_op_1(array_1), ...]).

Huh, so the cropping you were referring to would be an example of an arbitrary indexing operation? That's a pretty big generalization...

Yes but the same URL syntax issues apply to the chunk manifest proposal so if it has been addressed there then the same solution could equally well be used for concatenation.

They are just separate issues aren't they? You solve the URL problems in the chunk manifest, but the concatenated array doesn't care how exactly you solved them, only that there is an array with that name in the store.

rabernat commented 9 months ago

those new dimensions might need to have names written into _ARRAY_DIMENSIONS too.

_ARRAY_DIMENSIONS is unneeded with Zarr V3 since it has named dimensions in the spec.

jbms commented 9 months ago

Thanks for all this quick feedback @jbms !

However, I would suggest considering the tensorstore approach of "layers" with arbitrary coordinate transformations instead of concatenation. Just plain concatenation does not permit a common use case of stacking e.g. a bunch of 2-d arrays into a 3-d array.

I agree about stacking - I completely forgot that was a separate function from np.concatenate (in xarray-land they are both just xr.concat). If we are stacking to create new dimensions we should also think about the fact those new dimensions might need to have names written into _ARRAY_DIMENSIONS too.

I'm not sure how the concatenation would interact with storage transformers. But basically I'd say the goal should be to allow you to do any nested combination of np.stack or np.concatenate([arbitrary_indexing_op_0(array_0), arbitrary_indexing_op_1(array_1), ...]).

Huh, so the cropping you were referring to would be an example of an arbitrary indexing operation? That's a pretty big generalization...

In tensorstore, index transforms provide a relatively compact json representation of any combination of:

numpy indexing by integers, slices, index arrays, or bool arrays (but bool arrays are converted to index arrays)
np.transpose
np.expand_dims

This isn't tied to anything else in tensorstore and I do think it could make a lot of sense to just use the same json representation for zarr a concatenation/stack representation as well.

Yes but the same URL syntax issues apply to the chunk manifest proposal so if it has been addressed there then the same solution could equally well be used for concatenation.

They are just separate issues aren't they? You solve the URL problems in the chunk manifest, but the concatenated array doesn't care how exactly you solved them, only that there is an array with that name in the store.

What I mean is that if we solve the issue for chunk manifest, then we will have a section in the zarr spec that describes how URLs are handled for that case. So we could simply refer to that for the concatenation case as well, and that would, in my opinion, be much more convenient than having to create a chunk manifest representation of an array just to concatenate it. After all, you could similarly restrict the chunk manifest feature to only work in the same store.

In any case I agree that it would make sense to limit it to just the same store for now/for prototyping purposes, since the URL syntax is orthogonal to everything else.

TomNicholas commented 9 months ago

n tensorstore, index transforms provide a relatively compact json representation of any combination of:

So when implemented, the original proposal above implies that implementations might want to provide a lazy serializable concatenatable array, and you're suggesting adding lazy serializable indexing to that too.

The addition of indexing seems like a lot for one ZEP? Can we write this one in such a way that adding indexing later would still be compatible? Or you think it all needs to be done in one go?

I'm just struggling to imagine what a JSON representation of the process of chaining that many operations on that many separate (and intermediate) arrays would be...

be much more convenient than having to create a chunk manifest representation of an array just to concatenate it

I was never suggesting that we needed a chunk manifest representation of an array just to concatenate it! I just want to be able to concatenate any array, no matter whether it used a chunk manifest or URL or inlined data or whatever.

since the URL syntax is orthogonal to everything else.

I think we are on the same page about that.

jbms commented 9 months ago

n tensorstore, index transforms provide a relatively compact json representation of any combination of:

So when implemented, the original proposal above implies that implementations might want to provide a lazy serializable concatenatable array, and you're suggesting adding lazy serializable indexing to that too.

The addition of indexing seems like a lot for one ZEP? Can we write this one in such a way that adding indexing later would still be compatible? Or you think it all needs to be done in one go?

I think they could be potentially be separated, but we should design this proposal to accommodate "inline" composition of virtual views, i.e. without having to actually store the intermediate arrays as separate metadata files.

I do think that without support for coordinate transforms/indexing, a significant fraction of what could be use cases for this virtual concatenation view will not be possible. For example, to get the equivalent of np.stack assuming this virtual view does concatenation you need to add an extra dim. Therefore while it may make sense to standardize virtual indexing views separately from virtual concatenation views, we should plan on adding both.

I'm just struggling to imagine what a JSON representation of the process of chaining that many operations on that many separate (and intermediate) arrays would be...

You can see the tensorstore documentation for one example:

https://google.github.io/tensorstore/python/api/tensorstore.stack.html

be much more convenient than having to create a chunk manifest representation of an array just to concatenate it

I was never suggesting that we needed a chunk manifest representation of an array just to concatenate it! I just want to be able to concatenate any array, no matter whether it used a chunk manifest or URL or inlined data or whatever.

since the URL syntax is orthogonal to everything else.

I think we are on the same page about that.

jbms commented 9 months ago

It occurred to me that concatenation could be supported as an array -> bytes codec in conjunction with ZEP 3 (variable-size chunks):

The codec (maybe called zarr_reference or something like that) would encode each chunk as a reference to another zarr array. By default this would lead to one key in the store per component array. However, if that is undesirable it can be avoided either by:

Combining this zarr_reference codec with the sharding codec, to encode references to multiple component arrays per key in the store.
Combining with the chunk manifest proposal (with support for specifying values inline). The downside of this is that the chunk manifest representation would not really be human readable or writable.

I'm not sure if the codec approach is the right one but it may be worth considering.

TomNicholas commented 8 months ago

For example, to get the equivalent of np.stack assuming this virtual view does concatenation you need to add an extra dim.

I agree that stacking is very important to support, but I don't understand why indexing/coordinate transforms are required to support stacking. Why can't we just have something like

"name": "concatenated"
"stack": {
    "axis": 1,   # int position of new axis
    "arrays": ["../a", "../b"]   # list of relative paths to other arrays in store 
}

and then the extra dim is just listed in the array's dimension_names metadata attribute?

You can see the tensorstore documentation for one example:

https://google.github.io/tensorstore/python/api/tensorstore.stack.html

Okay thank you, I think I'm beginning to understand what you've created in tensorstore with the layers. That's pretty cool!

So we basically we would need a json-serializable representation of chaining arbitary operations, involving multiple intermediate arrays (which don't exist in the store). Is there an example of this kind of thing being done before? Ryan mentioned ncml, and you've mentioned tensorstores layers, but are there other examples to copy?

The codec (maybe called zarr_reference or something like that) would encode each chunk as a reference to another zarr array.

This seems clever, but also rather magical... Also wouldn't it not leave any space for indexing and coordinate transforms? Also also wouldn't this break the idea that you can know the shape of all arrays in the store only by reading metadata? You would actually have to fetch chunk data, decode it, then go back to the store now you know which arrays it refers to and fetch their metadata...

TomNicholas commented 8 months ago

Another possible use case for concatenation that came up talking to someone today: padding with NaNs to make dimensions match.

The use case was zarr-ifying satellite swaths that have dimension lengths that almost but don't quite align. I think we could use concatenation to fix this by creating a VirtualZarrArray which only contained a fill value.

TomNicholas commented 8 months ago

Some things that came up in the ZEP meeting today:

1) Concatenation and (range-)indexing are in some sense inverse operations of one another, so including them together in one ZEP does sort of make sense from that point of view.

2) One kinda unfortunate thing about this proposal is that the arrays to be concatenated will still be present and visible in the store, so the user will see the concatenated array but also potentially a lot of sub-arrays that are really just implementation details. I don't really see how to get around this without either "hiding" some arrays by default (which seems bad) or pointing to arrays in a different store (which could work but seems a bit gross too).

jbms commented 8 months ago

Some things that came up in the ZEP meeting today:

Concatenation and (range-)indexing are in some sense inverse operations of one another, so including them together in one ZEP does sort of make sense from that point of view.

One kinda unfortunate thing about this proposal is that the arrays to be concatenated will still be present and visible in the store, so the user will see the concatenated array but also potentially a lot of sub-arrays that are really just implementation details. I don't really see how to get around this without either "hiding" some arrays by default (which seems bad) or pointing to arrays in a different store (which could work but seems a bit gross too).

I think we could have a concept of "virtual arrays" that exist purely as JSON metadata (this would be very similar to the tensorstore spec concept). We could then define a way to specify these purely as a self-contained JSON object that isn't necessarily stored anywhere. In the concatenation virtual array metadata, the component arrays could then be specified either as paths/urls to other stored arrays, or as inline JSON virtual arrays. This would provide a way to arbitrarily compose virtual arrays without creating separate files for storing them. I would suppose that in these inline JSON virtual arrays, any paths/urls would be interpreted relative to some "base url" implied by the context.

rabernat commented 8 months ago

I think we could have a concept of "virtual arrays" that exist purely as JSON metadata (this would be very similar to the tensorstore spec concept). We could then define a way to specify these purely as a self-contained JSON object that isn't necessarily stored anywhere

This sounds like a good direction!

TomNicholas commented 8 months ago

I'm not sure I understand - are you saying these JSON objects would live outside of a Zarr store entirely?? Or is this suggestion an idea for implementing "hidden arrays"? The arrays we want to "hide" are the ones that have the real data in them (or they point to the real data via chunk manifests).

jbms commented 8 months ago

I imagined that the json object would be essentially interchangeable with a path/url to a stored zarr array, and could therefore be passed directly to e.g. zarr.open but could also be embedded within the metadata of another virtual array.

zarr-developers / zarr-specs