zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Remove 'order' from Specs and make 'C' default #126

Closed meggart closed 1 year ago

meggart commented 2 years ago

Following up on the discussion in yesterday's community call, I want to suggest removing or deprecating the 'order' field in the metadata and making 'C' order the default.

I think the main reason is that most, if not all, published datasets use 'C' storage order, so it would be hard to argue that this is a feature a large part of the community needs. In addition, AFAIK zarr-python is currently the only zarr implementation that supports 'F' storage; all others do not.

The initial reason to add Fortran order as an option was that compression might be more efficient in some cases (https://github.com/zarr-developers/zarr-python/issues/7), but I don't know whether that has ever been tested or used systematically.

There was a lot of discussion on the question of how programming languages that are column-major by default should deal with the storage order (https://github.com/zarr-developers/community/issues/41), and I think some of the confusion can be resolved by simply not allowing 'F' storage order and always reversing the dimension order when reading/writing array metadata (just like the current behavior). Note that reversing the dimensions is also the default behavior in all HDF5 and NetCDF packages that I have worked with in column-major languages (R, Matlab and Julia).

jakirkham commented 2 years ago

@constantinpape, do you know how this plays on the N5 side of things?

constantinpape commented 2 years ago

@constantinpape, do you know how this plays on the N5 side of things?

Good question. The n5 data format uses F order. So I assume that F order still needs to be supported by zarr-python in order to read/write n5 data format with zarr-python (but that should not be a factor in this decision, just mentioning it as a tangential fact).

But when writing zarr data format with n5-zarr it writes in C order. (I am not sure if it supports reading F order or has optional support for writing in F order). Maybe @bogovicj would know more.

But in summary I don't see any big concern in removing the 'order' from the spec from the n5 perspective.

bogovicj commented 2 years ago

Correct: n5-zarr writes in C order by default. There was a time when n5-zarr "handled" F order differently, but the current release errs on the side of doing less and lets downstream applications decide what to do.

On writing, yes, it writes in C order by default, but it can write in F order as well.

jbms commented 2 years ago

Note: TensorStore and Neuroglancer support both C and F order with zarr.

While it is true that you can always just permute the overall dimensions of the array to achieve the desired storage layout, it seems to me that there is value in being able to control the storage layout independently from the dimension order presented to the user, e.g. in order to present dimensions in an expected canonical order.

On the other hand it would be more natural if the chunk keys used the same order as used within the chunk, which is what you get if you permute the overall dimensions.

Regardless of what is decided as far as allowed storage orders, I would strongly urge that zarr implementations not automatically permute the dimension order relative to what is stored in the array metadata json. I know that Zarr.jl and the zarr-python n5 module do reverse the dimension order, but in my view that is a major source of confusion and interoperability problems and I hope we can avoid that being an issue with zarr v3. As one example of a problem, if other custom attributes are used to specify per-dimension information, then even if the implementation permutes the array dimensions it will not be able to permute custom attributes.

rabernat commented 2 years ago

I would strongly urge that zarr implementations not automatically permute the dimension order relative to what is stored in the array metadata json

I see your point. However, this is contrary to what is done with NetCDF and HDF5 libraries. Those libraries only ever store C order but permute dimension order when presenting data to column-major languages such as Fortran and Julia. This exacerbates the user confusion because those users may think that the storage library natively supports F order when in fact it does not.

I agree that permutation needs to be done consistently with named dimensions.

jbms commented 2 years ago

I can understand the desire for consistency with HDF5 and Netcdf but in my view a key aim of zarr is to correct the shortcomings of HDF5 and we should therefore not allow compatibility with HDF5 to result in sub-optimal design decisions.

I think zarr v3 should support user-specified storage orders, and I also think the confusion and interoperability issues that result from reversing the dimension order in some implementations strongly outweighs any compatibility benefits with existing formats like HDF5.

Dimension names are not necessarily the only per-dimension attribute that users may wish to specify --- for custom attributes the implementation will naturally be unable to permute them leading to a high chance of problems.

I don't know about Fortran, but Julia does support a general StridedArray. Additionally, with codecs, filters, chunking and any number of additional indexing operations applied there is anyway not necessarily much correspondence between the on-disk format and layout of the in-memory arrays that a user may pass to read and write operations.

meggart commented 2 years ago

I know that Zarr.jl and the zarr-python n5 module do reverse the dimension order

As I mentioned above, all HDF5 and NetCDF implementations I have come across in Matlab, R, Octave and Julia reverse the dimension order they present to the user compared to how it is returned by the underlying C lib. So I would still have to be convinced that everything breaks when we decide to do this.

I would strongly urge that zarr implementations not automatically permute the dimension order relative to what is stored in the array metadata json

What would be the alternative? I think the only alternatives are to a) transpose the array or b) present some transposed view to the user (for example a PermutedDimsArray in Julia). However, both have downsides. For a) we allocate another array and pay the computing time to transpose it, which might be negligible when reading compressed arrays from the cloud but can be significant for uncompressed in-memory arrays. For b) the downside is that many user scripts will not be optimized for this storage layout: you will find lots of loops like for j in 1:size(x,2), i in 1:size(x,1), which iterate cache-efficiently over the first dimension in the inner loop, and these become inefficient on a transposed view. A common problem of a) and b) is that, imo, we create a lot of confusion, e.g. in a geoscience context, when we present data in time,lat,lon order although time is the record dimension. When you load data from any other source (HDF5, NetCDF, ArchGDAL), time will be the last dimension, because it is the slowest-varying dimension and this is how Julia presents the data to you. These other data formats will not go away in the near future, no matter how successful zarr becomes, so some consistency is really helpful. A hypothetical Julia or R implementation of xarray will always have to support multiple data backends, and it would be a big hassle to deal with backend-specific dimension orders and optimisation schemes for loop orders.
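
For concreteness, here is a rough NumPy analogue of the two alternatives (x stands in for a chunk read from storage in C order; the calls are standard NumPy, nothing zarr-specific):

import numpy as np

x = np.arange(6, dtype='i4').reshape(2, 3)   # C-order data, as read from storage

# a) physical transpose: allocate a new array and copy, so downstream code sees
#    a contiguous layout in the new dimension order (costs time and memory)
a = np.ascontiguousarray(x.T)
print(a.shape, np.shares_memory(x, a))       # (3, 2) False -> a copy was made

# b) lazy transposed view: no copy, but element access no longer follows the
#    underlying byte order, which can slow down code tuned for the native layout
b = x.T
print(b.shape, np.shares_memory(x, b))       # (3, 2) True -> just a strided view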

meggart commented 2 years ago

Dimension names are not necessarily the only per-dimension attribute that users may wish to specify --- for custom attributes the implementation will naturally be unable to permute them leading to a high chance of problems.

This is true, and for me it is the only technical argument against reversing dimensions. Of course, downstream libraries can handle this manually, as is e.g. done here https://github.com/meggart/YAXArrayBase.jl/blob/ef90b2b07e53540d4612711b2f51eeb169dd796c/src/datasets/zarr.jl#L9 where the xarray dimension names are parsed and reversed.

Maybe we can solve this through another convention/addition to the zarr specs? Can we label attributes that refer to the dimensions of an array in a special way? Then all implementations could check that the number of entries in such an attribute equals the number of dims of the array, and for implementations that reverse dimension orders it would be a signal to reverse the order of the attribute as well.
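
As a minimal sketch of what such a convention might look like (the dimension_attributes key and the helper below are purely hypothetical, not part of any spec):

# Hypothetical convention: a reserved key lists which attributes are per-dimension.
attrs = {
    "dimension_attributes": ["units", "coordinates"],   # hypothetical marker
    "units": ["days", "degrees_north", "degrees_east"],
    "coordinates": ["time", "lat", "lon"],
    "title": "example dataset",                         # not per-dimension, left alone
}

def present_column_major(attrs, ndim):
    # Reverse every labelled per-dimension attribute for a column-major view.
    out = dict(attrs)
    for key in attrs.get("dimension_attributes", []):
        value = attrs[key]
        if len(value) == ndim:                          # the sanity check suggested above
            out[key] = list(reversed(value))
    return out

print(present_column_major(attrs, ndim=3)["coordinates"])   # ['lon', 'lat', 'time']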

jbms commented 2 years ago

What would be the alternative? I think the only alternatives are to a) transpose the array or to b) present some transposed view to the user (for example a TransposedDimsArray in Julia)

A Zarr array is already a different representation from an in-memory array --- it is already a form of view. I am simply proposing that any operations on it treat the dimensions in the same order they are specified in the array metadata file --- I wouldn't consider that a "transposed" view. When creating a new zarr array from Julia, it probably would indeed make sense to default to Fortran order. And it is true that if you attempt to read a single chunk that is stored in C order (and not encoded with some special codec that already may transpose the order) into a Julia Fortran order array, then there would indeed be a transpose happening as part of the read operation. But already you have to account for different strides when copying between the stored chunk representation and the output array. I'm not too familiar with Julia, but it seems it would be best if Zarr.jl could support reading/writing directly to/from e.g. StridedArray rather than just the basic DenseArray. Then the user has full control over the resultant data layout, but all transposing happens explicitly rather than implicitly.

An imaginative Julia or R implementation of xarray will always have to support multiple data backends, and it would be a big hassle to deal with backend-specific dimension orders and optimisation schemes for loop orders.

Already with zarr v2 we have support for both Fortran and C order, and therefore it is already important to efficiently support multiple dimension orders. Additionally, in many cases you wouldn't want to work directly with the zarr array itself, but rather may want to work with a view that applies various indexing operations, including permuting the dimensions. So already it is necessary to deal with arbitrary dimension orders.
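
For reference, a minimal sketch of how this already looks in practice (assuming the zarr-python v2 zarr.create API):

import numpy as np
import zarr  # zarr-python, v2 API assumed

# The v2 'order' metadata field is chosen at array creation time; each chunk
# is then serialised column-major before compression.
z = zarr.create(shape=(100, 200), chunks=(10, 20), dtype='i4', order='F')
z[:] = np.arange(100 * 200, dtype='i4').reshape(100, 200)
print(z.order)   # 'F'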

Maybe we can solve this through another convention/addition to the zarr specs? Can we label attributes that refer to the dimensions of an array in a special way?

I think this is a great idea, because it also allows transformed views of various sorts to correctly transform the attributes.

But I still think it will be problematic to cover all of the cases --- e.g. maybe you want to store a matrix where both the rows and columns are indexed by dimensions, or you are storing an array of coordinates, etc. Much better to force the user to explicitly create a transposed view if they want to, that way they are aware that the transpose is happening and will be more prepared to transpose any attributes and coordinates as well.

meggart commented 2 years ago

When creating a new zarr array from Julia, it probably would indeed make sense to default to Fortran order. And it is true that if you attempt to read a single chunk that is stored in C order (and not encoded with some special codec that already may transpose the order) into a Julia Fortran order array, then there would indeed be a transpose happening as part of the read operation. But already you have to account for different strides when copying between the stored chunk representation and the output array. I'm not too familiar with Julia, but it seems it would be best if Zarr.jl could support reading/writing directly to/from e.g. StridedArray rather than just the basic DenseArray. Then the user has full control over the resultant data layout, but all transposing happens explicitly rather than implicitly.

I think that the operation you describe here, reading a single chunk that is stored in C order, is something that happens at least as often as reading many chunks of a Zarr array into a single array. Any efficient mapreduce-like operation will read the data chunk by chunk and operate on these chunks, so this is not a rare case. And regarding your second point: yes, there are data structures in Julia that can represent transposed views of arrays (although not StridedArray, which is just a name for a union of many array types, such as views and reshapes, that can be passed to BLAS for strided operations). However, my point was that we cannot assume that the rest of the ecosystem can deal with these arrays as well as it can deal with DenseArrays. I would claim this is true for Python as well; I don't know how many libraries that use numpy arrays are written with optimizations for the case that one might have Fortran-ordered arrays. So I completely agree with you on the following point:

Much better to force the user to explicitly create a transposed view if they want to,

This is exactly what I am saying: let us present the data to the user as-is, in the native array type of the respective programming language, and if the PL supports different orderings (I doubt that JavaScript does), then let the user decide to use them.

I think we have similar design goals in mind but we just disagree on the question of whether the same zarr array should "look" exactly the same in every programming language. To me, as someone switching between Python and Julia, I am completely used to the convention that an x,y,z array in Python will appear as z,y,x in Julia. I would be much more surprised by the opposite. However, I can understand that users who have been in only one of these worlds will be surprised when they see this for the first time.

jbms commented 2 years ago

Let us present the data to the user as-is, in the native array type of the respective programming language.

The whole idea of zarr is to provide an abstraction --- the data is chunked and encoded, but we are viewing it as a single cohesive array. The conflict here is that the data "as-is" has a given dimension order listed in the metadata, and a given storage order (at least if the chunks are uncompressed or compressed as raw byte streams). Since Julia code is normally written assuming Fortran dimension order, there is sometimes a conflict between presenting a Fortran dimension order and presenting the data "as is".

To me the basic issue is that it is very important to be able to agree on the "dimension order" of an array. It is fine to work with a transposed view, but it should be explicit that you are using a transposed view, so that there is no confusion as to what the "real" dimension order is. With some zarr implementations transposing the dimensions by default, though, the concept of a "real" dimension order becomes entirely lost, and when specifying dimension order you must qualify "dimension order as seen from Julia" vs "dimension order as seen from Python". From discussions with @axtimwalde I understand that this problem already exists with the zarr n5 module, which also transposes the dimensions. Users seem to be forced to rely on named dimensions in order to unambiguously identify dimensions. I certainly appreciate the value of named dimensions, but I also think the zarr standard should allow the dimensions to be unambiguously identified without relying on them.

For an array indexed by xyz, perhaps there is already sufficient intuition that Julia normally uses xyz and Python normally uses zyx that this is less confusing, but that assumption may not always hold; for more general arrays there may not even be such intuition to rely upon.

I think if the zarr julia implementation forced users to specify a transpose=true option, or similar, when opening an array in order to get the transposed view, my concerns would be satisfied, as that would hopefully nudge users towards still referring to dimensions in the "original" order.

As far as other languages:

I believe that virtually all of the operations built into NumPy itself work equally well on Fortran and C order arrays. However, it is certainly possible that more specialized operations in scipy, scikit-image, etc. may only work efficiently with certain dimension orders.

JavaScript has no built-in support for multi-dimensional arrays, and no third-party library has reached widespread usage the way NumPy has. Therefore it is still up to each JavaScript tool/library that supports zarr to determine how to represent multi-dimensional arrays. Neuroglancer, for example, by its nature already needs to support arbitrary affine transforms of the data and is therefore agnostic to the dimension order.

I think what I had in mind as far as "strided array" support in Julia is provided by this package: https://github.com/Jutho/Strided.jl

ToucheSir commented 2 years ago

I just wanted to chime in as a user of Zarr-Python and someone who contributes to Julia packages: there is already a precedent for what @meggart describes in other areas where we interop with Python/C. ONNX and DLPack are good examples I'm familiar with, and in both we just reverse the logical order of dimensions.

The problem with a package like Strided.jl is that a) everyone has to use it, and b) it doesn't work with all array types. Thus changing the default ordering of dimensions to look "correct" vis a vis other languages would be at best murderous for downstream performance.

jbms commented 2 years ago

Not being a Julia user myself, I suppose I have no business saying whether I think Julia should use a reversed dimension order or not by default, if that is the convention that users expect.

But I do think there is a real problem to be addressed as far as being able to unambiguously specify dimensions and coordinates in a concise way.

I noticed that zarr.jl also uses 1-based indexing, so there is an additional issue there. We can't just say "coordinates (1, 2, 3) in zarr array at path xxx"; we must instead say "1-based reverse-order coordinates (1, 2, 3) in zarr array at path xxx" or "0-based zarr spec-order coordinates (1, 2, 3) in zarr array at path xxx".

Do you have any suggestions on what conventions could be adopted to resolve that issue? This is an issue for both textual communication (e.g. emails, chat) as well as machine-readable formats, like attributes, coordinate arrays, etc.

Additionally, I am advocating for the following two proposals:

  • allowing the array metadata to specify a non-zero origin for the array, and
  • allowing an arbitrary dimension order (permutation) for the chunk storage layout.

I would say there are clear use cases for both of these features, and they are not technically difficult to support in any zarr implementation. However, there are some obvious conflicts with the existing zarr.jl API: if zarr.jl is already using an origin of 1 by default it is problematic to support a non-default origin specified in the metadata.

Supporting an arbitrary dimension order causes no API problems in itself but it would mean zarr.jl can no longer assume that reversing the dimension order guarantees that Fortran order matches the chunk storage order.

rabernat commented 2 years ago

Now that #149 is out, I want to return to this discussion. I really appreciate @jbms's thoughtful and detailed comments. Jeremy raises some very good points. I disagree, but I hope I can do so with respect and care.

I think my viewpoint here reflects the same fundamental difference of opinion as I expressed in https://github.com/zarr-developers/zarr-specs/pull/144#issuecomment-1157956438 - I don't believe it is Zarr's job to reconcile different language conventions around how arrays are presented to users. Ultimately this is a question about the scope of Zarr. I am arguing for the core spec to have a smaller scope, limited to simply passing arrays from the storage to the language's natural, idiomatic array representation in the most straightforward possible way.

Not being a Julia user myself, I suppose I have no business saying whether I think Julia should use a reversed dimension order or not by default, if that is the convention that users expect.

For readers of this thread, I think it is super important to note that the actual users of F-major languages (mostly Julia here, but also the Unidata Fortran group) are in support of this proposal (to remove "order" from the spec and make C the default). The main person arguing against it is self-admittedly not a user of F-major languages. Jeremy's arguments are on the grounds of seeking some type of inter-language consistency that Julia users neither want nor expect.

But I do think there is a real problem to be addressed as far as being able to unambiguously specify dimensions and coordinates in a concise way.

I noticed that zarr.jl also uses 1-based indexing, so there is an additional issue there.

Yes, I agree that this issue is very related to 0 vs 1-based indexing. If you support the idea that all Zarr implementations should present the same on-disk array using the same exact shape in both row-major and column-major languages, then it is consistent to also expect Zarr to mediate indexing, such that the Zarr indexing operation array[3, 2] returns the same element in both Julia and Python. I guess this is what is meant by "unambiguously specify dimensions and coordinates".

However, this would also be strongly inconsistent with what actual users of those languages expect. I think it is simply not Zarr's job to resolve the fact that different programming languages use different conventions (row-major vs column-major, 0-based vs 1-based indexing) for arrays. Attempting to do so will introduce undesirable complexity and confusion into Zarr.

A Minimal Example

I think it's useful to have a very minimal example which concisely illustrates the way that Python and Julia represent the same on-disk data. I'll create the data from Python because that's the language I know best.

import numpy as np

# create array using default row-major shape
# the dimension with length 3 is the most contiguous dimension
shape = (2, 3)
data = np.arange(6, dtype='i4').reshape(shape)

# -> array([[0, 1, 2],
#           [3, 4, 5]], dtype=int32)

# write to disk in a very explicit way
as_bytes = data.tobytes(order='C')
assert len(as_bytes) == data.size * data.dtype.itemsize
with open('data_c_order.bin', mode='wb') as fp:
    fp.write(as_bytes)

# verify we can read it back from python
with open('data_c_order.bin', mode='rb') as fp:
    read_bytes = fp.read()
data_read = np.frombuffer(read_bytes, dtype='i4').reshape(shape)
np.testing.assert_equal(data, data_read)

Now let's read that same data from Julia in the most standard way

# we use the shape 3, 2 because that's what Julia expects: the most contiguous array dimension is FIRST
data = Array{Int32}(undef, 3, 2)
read!("data_c_order.bin", data)

# -> 3×2 Array{Int32,2}:
#     0  3
#     1  4
#     2  5

# to verify that the array is "right", we can iterate it in the idiomatic julia way
for i in eachindex(data)
    print(i, ": ", data[i], "\n")
end

# 1: 0
# 2: 1
# 3: 2
# 4: 3
# 5: 4
# 6: 5

This example concisely illustrates the following points:

  • The same sequence of bytes on disk is naturally presented with shape (2, 3) in Python and shape (3, 2) in Julia.
  • Neither language needs to copy or transpose anything; each simply interprets the bytes in its native order.
  • Iterating idiomatically in each language (C/Python over the last index fastest, Julia over the first) visits the elements in their on-disk order.

I believe that Zarr should not attempt to reconcile these differences. On disk, there is really no such thing as C order or F order. Once you pick an order, the shape of the array becomes determined. In the Zarr spec and metadata, I propose we should default to describing everything in C-order conventions. F-order implementations should handle the shape permutation, as already done by Zarr.jl.

jbms commented 2 years ago

I see your example as more a motivation for defining array formats and standards like zarr, that provide an abstraction over the on-disk binary representation, rather than for any particular storage order within zarr. E.g. if we run your same examples but write on a little endian machine and read on a big endian machine, the numbers printed out will differ.

In my view we can say that a particular array in-memory or on-disk has a storage order, but I don't agree that in general we can say that a programming language has a single array storage order, at least for many popular programming languages like Python, C++, and Java; just that perhaps a particular array storage order is more commonly used. I think it is more just a particular characteristic of Julia, Fortran, and Matlab that they are oriented around column-major order.

One relevant precedent to consider is image formats: image formats use various on-disk encodings. For example, the BMP format stores pixels with all color channels interleaved, approximately row-major HWC (height, width, channel) order, usually from the bottom up. JPEG uses a transformed representation, with each color channel encoded separately, that would most closely map to CHW order. Regardless of the stored representation, though, image libraries provide a consistent view of the image.

rabernat commented 2 years ago

Regardless of the stored representation, though, image libraries provide a consistent view of the image.

That's a useful point of comparison, and I think it points clearly at the crux of our disagreement. I don't think it's Zarr's job to do what image libraries do (rearrange dimension order based on a domain-specific data model). I see it as the job of tools higher up in the stack. Zarr implementations should simply expose the data in a way that makes sense for that language.

I think the comparison with big vs. little endian is spurious. No one would ever want a big endian value to be interpreted as a little endian one: the result would be nonsensical. But Julia users want the most contiguous dimension to be the first one, while Python users want it to be the last one. Let's give the users what they want! (Rather than trying to define some abstract, inter-language standard for how to index arrays.)

jbms commented 2 years ago

In my view, Zarr's role is to map the stored representation to an abstract array data model. We had some disagreement over exactly what that data model should be (e.g. whether arrays should be allowed to have a specified/non-zero origin) but I don't think any of those disagreements are relevant to this issue. There seems to be pretty broad agreement that this mapping from stored representation to the abstract array may potentially involve almost arbitrary transformations, either through codecs or through storage_transformers. For example I would certainly like to have zarr support for various image codecs, which means the way image formats treat dimensions becomes directly relevant to zarr. To support all of these types of elaborate transformations, but to specifically exclude the very simple transformation, transpose, doesn't make much sense to me.

However, thinking about this more, assuming, as in #153, that we support a list of codecs/filters rather than just a single compressor, I would say that it would be logical to eliminate chunk_memory_layout and instead just have a transpose codec (that supports an arbitrary dimension permutation). I think this might satisfy everyone's concerns.
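
As a rough sketch of that idea (the field names below are illustrative only, not finalized spec wording):

# Illustrative sketch only: a codec pipeline in which a "transpose" codec
# records the dimension permutation applied before the bytes are compressed.
array_metadata = {
    "shape": [10000, 1000, 3],
    "chunk_shape": [100, 100, 3],
    "codecs": [
        {"name": "transpose", "configuration": {"order": [2, 0, 1]}},
        {"name": "gzip", "configuration": {"level": 5}},
    ],
}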

rabernat commented 2 years ago

In my view we can say that a particular array in-memory or on-disk has a storage order

Actually, can we? Storage is ultimately one-dimensional: an array is just a sequence of bytes. In order to have a storage order, we need to introduce the concept of a multi-dimensional array. That is done at the level of the programming language or library.

The programming language defines what we mean by "first" and "last" dimension. C/Python do foo[k, j, i], such that the first dimension is accessed via the last index position. (I agree this is highly confusing, but practitioners have learned to accept it.) Fortran/Matlab/Julia do the more natural foo[i, j, k]. This is not just notational; processing code is expected to iterate through the data in a specific order, always doing the innermost loop over the i dimension. (See Julia Performance Tips). If our goal is to get data from disk into memory and then into a computational pipeline in the most efficient way, then we want to stream the bytes on disk into memory, and then into the computational routine, without changing their ordering. This requires us to present the same data to C (Python) vs. Fortran (Julia) using different conventions.

There are only two ways to have the same on-disk array have the same shape in both languages:

  • Explicitly transpose the array in memory in one language (but not the other), by completely reversing the order of the bytes and their contiguity properties. That has a performance cost. I imagine this is what the imaging libraries do.
  • Create some additional layer of abstraction (e.g. a wrapper) around the array that performs the transposition lazily (e.g. creating an array with order='F' in numpy). That also has a performance cost: processing code may think it is accessing elements contiguously while in fact they are not contiguous.

I don't agree that in general we can say that a programming language has a single array storage order

According to https://en.wikipedia.org/wiki/Row-_and_column-major_order

Programming languages or their standard libraries that support multi-dimensional arrays typically have a native row-major or column-major storage order for these arrays.

rabernat commented 2 years ago

instead just have a transpose codec (that supports an arbitrary dimension permutation)

This is interesting. But would the transpose codec be applied for all implementations (regardless of their "native" order)? If so, how would this solve the problem?

jbms commented 2 years ago

instead just have a transpose codec (that supports an arbitrary dimension permutation)

This is interesting. But would the transpose codec be applied for all implementations (regardless of their "native" order)? If so, how would this solve the problem?

Yes, all implementations would apply it, just like gzip or any other filter/codec.

It would be up to the implementation to decide whether it wants to do anything special with regard to this codec. Julia probably would not do anything special, and would just do a physical transpose in memory. zarr-python might have special logic, for example recognizing when the codec's permutation simply reverses the dimensions and decoding directly into a Fortran-order numpy array, so that no separate physical transpose is needed.

Julia would still presumably do its usual dimension reversal, and maintain the invariant that dimension order in Julia is reversed compared to Python and most other implementations.

jbms commented 2 years ago

In my view we can say that a particular array in-memory or on-disk has a storage order

Actually, can we? Storage is ultimately one-dimensional: an array is just a sequence of bytes. In order to have a storage order, we need to introduce the concept of a multi-dimensional array. That is done at the level of the programming language or library.

To be clear, by "array" I was referring to a multi-dimensional array.

The programming language defines what we mean by "first" and "last" dimension. C/Python do foo[k, j, i], such that the first dimension is accessed via the last index position. (I agree this is highly confusing, but practitioners have learned to accept it.)

If we have a numpy array with shape (A, B, C) --- I would certainly say that the first dimension has size A and the last dimension has size C, regardless of whether the numpy array uses C order, Fortran order, or any other order --- I would be very confused if someone said the dimension of size C was the first dimension, and I don't believe I've come across that way of referring to dimensions. If it is using row-major / "C" order / lexicographic order, then I would say that the inner dimension is the last dimension (of size C), and the outer dimension is the first dimension (of size A). If it is using column-major / "F" order / colexicographic order, then I would say that the inner dimension is the first dimension (of size A) and the outer dimension is the last dimension (of size C).

Fortran/Matlab/Julia do the more natural foo[i, j, k]. This is not just notational; processing code is expected to iterate through the data in a specific order, always doing the innermost loop over the i dimension. (See Julia Performance Tips). If our goal is to get data from disk into memory and then into a computational pipeline in the most efficient way, then we want to stream the bytes on disk into memory, and then into the computational routine, without changing their ordering. This requires us to present the same data to C (Python) vs. Fortran (Julia) using different conventions.

For libraries / languages that support only a single memory layout for arrays (which may be true for Matlab and Julia, but is certainly not true of Python), then I would agree that a physical transpose would by default be required as part of the encode/decode process. However, the zarr library could provide the user with a way to create a virtual transposed view of the array. Then users of the library could explicitly request a transposed view of the data, and thereby get the data in their desired order, and if this agrees with the stored order, also avoid any physical transpose operation. At the same time, by making this transpose explicit, it avoids confusion about the dimension order.

There are only two ways to have the same on-disk array have the same shape in both languages:

  • Explicitly transpose the array in memory in one language (but not the other), by completely reversing the order of the bytes and their contiguity properties. That has a performance cost. I imagine this is what the imaging libraries do.
  • Create some additional layer of abstraction (e.g. a wrapper) around the array that performs the transposition lazily (e.g. creating an array with order='F' in numpy). That also has a performance cost: processing code may think it is accessing elements contiguously while in fact they are not contiguous.

Note that a numpy array with order="F" is not a wrapper: the internal representation used by NumPy is (roughly) a pointer to the start of the data, a data type, a list of dimension sizes, and a list of byte strides (one per dimension).

If the byte strides happen to exactly match those strides that would be generated for a C order array, then we say the array is C order, and similarly for F order. But there is no preferred order in the representation, other than what is the default value for various array creation functions. A transpose operation (that converts a C order array to a corresponding F order array with reversed dimensions) simply permutes both the dimension sizes and the byte strides.
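
A small NumPy illustration of that description: transposing only permutes the dimension sizes and byte strides, while the underlying bytes are untouched.

import numpy as np

x = np.zeros((4, 5, 6), dtype='f8', order='C')
print(x.shape, x.strides)        # (4, 5, 6) (240, 48, 8)

t = x.T                          # permutes sizes and strides, copies nothing
print(t.shape, t.strides)        # (6, 5, 4) (8, 48, 240)
print(t.flags['F_CONTIGUOUS'])   # True -> the same bytes, viewed as an F-order array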

I don't agree that in general we can say that a programming language has a single array storage order

According to https://en.wikipedia.org/wiki/Row-_and_column-major_order

Programming languages or their standard libraries that support multi-dimensional arrays typically have a native row-major or column-major storage order for these arrays.

I think Julia and Matlab and Fortran definitely have a "native" order.

NumPy supports both orders equally well, though it does have a default.

C and C++ have built-in multi-dimensional array support, but as it is limited to arrays where the size is known at compile time, it is basically unused in practice outside of toy examples or small test cases, and therefore can be ignored.

Eigen (C++ library for multi-dimensional arrays) supports both storage orders but defaults to Fortran order.

Vigra (C++ library for multi-dimensional arrays) mostly just supports Fortran order, though it does also allow views with arbitrary order.

xtensor (C++ library for multi-dimensional arrays) supports both storage orders but defaults to C order.

opencv (C++ library) supports only C order.

Tensorflow supports only C order.

JAX supports C and Fortran order.

TensorStore supports arbitrary storage orders like NumPy, but defaults to C order.

Neuroglancer (JavaScript) supports arbitrary orders, doesn't really have a preferred order at all.

N5 (Java) uses Fortran order.

HDF5 (Java) uses C order (I believe).

rabernat commented 2 years ago

That makes a lot of sense Jeremy. Thanks for the review of all the different language conventions.

I have come across numerical code in Python that does not work with order='F' numpy data. For example numba does not support F-order. So the ecosystem as a whole doesn't quite have equal support for both orders.

I am :+1: on the transpose filter. I am also open to keeping the chunk_memory_layout if that's what others strongly want.

I think it would be great to get some more voices from @zarr-developers/python-core-devs on this important topic.

mkitti commented 2 years ago

I'll chime in with a few anecdotes about Julia, strides, and C-ordered arrays. I'm an active contributor to HDF5.jl, so some of this does sound familiar. I've also written packages such as NumPyArrays.jl and Napari.jl where these issues arise.

In general, these packages offer mechanisms to access the underlying data via both Julia's conventions as well as foreign conventions. For example, consider how PythonCall.jl, one of the main packages for Julia-Python interop, treats arrays. It provides a generic Py wrapper type. For Py, indexing follows Python's 0-based C-ordered conventions. It also provides a PyArray view interface that allows for 1-based, F-ordered indexing.

julia> using PythonCall

julia> np = pyimport("numpy");

julia> A = np.zeros((3,5), order = 'C'); A[2,1] = 1; A # using Python's 0-based, C-ordered indexing
Python ndarray:
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

julia> typeof(A)
Py

julia> B = PyArray(A)
3×5 PyArray{Float64, 2}:
 0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0

julia> B[3,2] # using Julia's 1-based, F-order indexing
1.0

julia> strides(B)
(5, 1)

julia> A
Python ndarray:
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 4.]])

julia> C = np.zeros((3,5), order = 'F'); C[2,1] = 2; C # using Python's 0-based, C-ordered indexing
Python ndarray:
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 2., 0., 0., 0.]])

julia> D = PyArray(C) # using Julia's 1-based, F-order indexing
3×5 PyArray{Float64, 2}:
 0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0
 0.0  2.0  0.0  0.0  0.0

julia> D[3,2]
2.0

julia> strides(D)
(1, 3)

julia> D[3,5] = 9 # note that D is a Julian view of C
9

julia> C
Python ndarray:
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 2., 0., 0., 9.]])

julia> E = PermutedDimsArray(D, [2,1])
5×3 PermutedDimsArray(::PyArray{Float64, 2}, (2, 1)) with eltype Float64:
 0.0  0.0  0.0
 0.0  0.0  2.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  9.0

julia> strides(E)
(3, 1)

I don't know about Fortran, but Julia does support a general StridedArray

As was pointed out above, StridedArray is actually a Union, also known as a sum type: https://cs.github.com/JuliaLang/julia/blob/d94ed88696824c82cdb6a4c4f1195c8cf831e278/base/reinterpretarray.jl#L146

Strided.jl is not necessarily needed for strided array support in Julia; the strides method already exists in Base. Rather, Strided.jl is a utility that makes it easier to write optimized code.

The immediate consequence of this is that StridedArray does not necessarily encompass all arrays with known strides. If I wanted to introduce a new array type, I could not make it a subtype of StridedArray. A better way to handle this is to check whether the strides method is applicable and perhaps dispatch on that as a trait.

julia> typeof(B)
PyArray{Float64, 2, true, false, Float64}

julia> typeof(B) <: StridedArray
false

julia> applicable(strides, B)
true

julia> f(x::AbstractArray) = f(x, Val(applicable(strides,x)))
f (generic function with 6 methods)

julia> f(x::AbstractArray, isstrided::Val{true}) = println("Using strided array optimizations...")
f (generic function with 6 methods)

julia> f(x::AbstractArray, isstrided::Val{false}) = println("The array is not strided!")
f (generic function with 6 methods)

julia> f(B)
Using strided array optimizations...

Presently, the ZArray type in Zarr.jl does not define strides. We could add a parameter or a field to ZArray so that we can define strides for ZArray. In this way, we could have a ZArray that supports either C or F ordering just as PyArray does above.

Also note that Julia dispatches indexing based on type. A common type to use in Julia is a CartesianIndex which is used when converting linear indexing to cartesian indexing.

julia> B
3×5 PyArray{Float64, 2}:
 0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  4.0

julia> CartesianIndices(B)[6]
CartesianIndex(3, 2)

julia> ci = CartesianIndex(3,2)
CartesianIndex(3, 2)

julia> B[ci]
1.0

Thus, another approach would be to define a ZarrIndex type that uses 0-based, C-order conventions:

julia> struct ZarrIndex{N}
           index::Dims{N}
           ZarrIndex(args::Integer...) = new{length(args)}(args)
       end

julia> Base.getindex(aa::AbstractArray, zi::ZarrIndex) = getindex(aa, (reverse(zi.index) .+ 1)...)

julia> zi = ZarrIndex(1,2)
ZarrIndex{2}((1, 2))

julia> B[zi]
1.0

With regard to indexing, Julia is actually incredibly flexible. Above I have outlined two mechanisms to address foreign indexing conventions.

  1. Create a new array type that has distinct custom indexing.
  2. Create a new index type that has distinct conventions

In general, for an AbstractArray type in Julia, one should neither assume 0- or 1-based indexing nor F or C ordering. While the Array type in Julia does use F ordering and 1-based indexing, this does not preclude the existence of other kinds of arrays which may not follow those conventions. Many view this flexibility as a defect rather than a feature.

I'm not sure why the focus turned to Julia here since the issue is not necessarily on Julia's limitations. There are a number of ways to address the Julia interface issues. Those probably should be discussed at https://github.com/JuliaIO/Zarr.jl/issues .

Rather, the issue here is whether other languages can support column-major storage order or not. The interface and storage issues are separable. An implementation could support a particular kind of interface, or there may be multiple implementations for a language, each presenting a distinct interface. There are good arguments for an interface that follows the language's default conventions as well as for one that follows some canonical convention. To me the question largely depends on the user. The question here, in this issue, is what compliant implementations store on disk and whether other implementations will be able to read it.

If order stays, then all implementations should figure out how to implement it. Perhaps adopting a permuted dimensions filter (a generalization of transposition to n dimensions) makes the most sense.

jbms commented 2 years ago

Thanks for the clarification and added context, Mark.

One point of clarification: you said that the Julia PyArray function provided by your library provides a Julia-style 1-based F order indexer view. However, it looks like it just shifts to 1-based indexing but does not change the dimension order at all (different from hdf5.jl and zarr.jl, which I believe both shift to 1-based indexing and reverse the dimension order). Is that correct?

In general I think it would be very beneficial to the zarr v3 ecosystem if we can avoid any implementation reversing the dimension order by default, so that there is a simple, unambiguous way to describe the dimensions of a zarr array. Given the capabilities of Julia that you described, do you think that would be a possibility for a Julia zarr v3 implementation? The decision to reverse the dimension order in some c++ and Python n5 implementations has effectively brought major ambiguity regarding dimension order to the n5 ecosystem, and I'd very much like to avoid that in the zarr v3 ecosystem if possible.

rabernat commented 2 years ago

Thanks a lot @mkitti for this helpful perspective from Julia.

Rather, the issue here is whether other languages can support column-major storage order or not. The interface and storage issues are separable. An implementation could support a particular kind of interface, or there may be multiple implementations for a language, each presenting a distinct interface. There are good arguments for an interface that follows the language's default conventions as well as for one that follows some canonical convention. To me the question largely depends on the user. The question here, in this issue, is what compliant implementations store on disk and whether other implementations will be able to read it.

I think I agree with everything you say here. So I think the question for the Zarr spec is whether we need to permit multiple storage orders, or whether we can just get away with just saying "we always use row-major storage order". It's clear that, when reading data, Julia is flexible enough to present this data with whatever order / indexing conventions the user wants. The situation when writing is less clear to me. Given a base Julia array using the default column-major order, how should this be stored in Zarr? Are there significant performance penalties for converting to row-major order for storage?

jstriebel commented 2 years ago

So I think the question for the Zarr spec is whether we need to permit multiple storage orders, or whether we can just get away with just saying "we always use row-major storage order".

:+1:

I'll try to add another perspective: besides the language, the available data also matters. If data is already available in a specific order (before storing it as zarr), it is much more efficient to persist the zarr array in that original order (if one wants to keep the original dimension order). This argument might not hold for frameworks or languages that only support C or F order, but for many it does (e.g. Python with numpy).

alimanfoo commented 2 years ago

Just a brief comment to say that the original motivation for introducing the ability to change the way the data for a chunk is serialised into a contiguous sequence of bytes was because of compression. I have some arrays in my genomics work where compression ratio is much better if chunk data are serialised using F rather than C order, although these are less common. This is usually because you get much longer runs of zeros if you use F rather than C order.

In principle this improvement to compression could be achieved via a "transpose" filter, and always using C order to perform the initial serialisation of chunk data when encoding. However, if you did this it might be then harder for an implementation to figure out when it can skip the transpose operation because it isn't needed, either when writing or reading data. If so, this could introduce more memory copies into the chunk encoding and/or decoding process, which might be tolerable but I generally tried to avoid memory copies where possible for performance reasons.
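
As a rough, synthetic illustration of that effect (contrived data, not real genomics arrays):

import zlib
import numpy as np

# Non-zero values clustered along the first dimension: F-order serialisation
# groups them together and leaves one long run of zero bytes.
a = np.zeros((1000, 100), dtype='i1')
a[:, 0] = (np.arange(1000) % 100) + 1

c_stream = a.tobytes(order='C')   # one non-zero byte per 100-byte row
f_stream = a.tobytes(order='F')   # 1000 non-zero bytes, then 99000 zero bytes

print(len(zlib.compress(c_stream)), len(zlib.compress(f_stream)))
# The F-order stream typically compresses substantially better for data like this.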

mkitti commented 2 years ago

One point of clarification: you said that the Julia PyArray function provided by your library provides a Julia-style 1-based F order indexer view. However, it looks like it just shifts to 1-based indexing but does not change the dimension order at all (different from hdf5.jl and zarr.jl, which I believe both shift to 1-based indexing and reverse the dimension order). Is that correct?

The point was to show how the indexing conventions could be flexible in Julia as well as in Python. I can see how the message became confused. B, a PyArray, has 3 rows and 5 columns, just as it does in numpy. The indexing order is as one would expect for a 3x5 matrix in Julia. However, the underlying memory layout is in C order. In Julia, the C-order memory layout is shown by strides(B) == (5,1). The strides are in terms of the element type, in this case a 64-bit float.

Alternatively, I could wrap a normal Julia Array around the pointer underlying the numpy ndarray. In that case, the dimension order is reversed, as suggested by the result of strides, and the matrix is represented transposed, with 5 rows and 3 columns.

julia> pytype(A)
Python type: <class 'numpy.ndarray'>

julia> A.__array_interface__["data"][0]
Python int: 35843792

julia> ptr = Ptr{Float64}( pyconvert(UInt64, A.__array_interface__["data"][0]) )
Ptr{Float64} @0x000000000222eed0

julia> B_prime = unsafe_wrap(Array, ptr, (5,3))
5×3 Matrix{Float64}:
 0.0  0.0  0.0
 0.0  0.0  1.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

julia> B_prime[2,3]
1.0

julia> strides(B_prime)
(1, 5)

Regarding HDF5, let's save an array with h5py.

In [1]: import h5py, numpy as np

In [2]: E = np.zeros((3,4,5))

In [3]: E[2,3,4] = 1

In [4]: E
Out[4]: 
array([[[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1.]]])

In [5]: with h5py.File("test.h5", "w") as h5f:
   ...:     h5f["data"] = E

Then load it with HDF5.jl.

julia> E = h5open("test.h5") do h5f
           h5f["data"][]
       end
5×4×3 Array{Float64, 3}:
[:, :, 1] =
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

[:, :, 2] =
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

[:, :, 3] =
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0

julia> E[5,4,3]
1.0

julia> h5open("test.h5") do h5f
           h5f["data"][5,4,3]
       end
1.0

Yes, you are correct. The convention of HDF5.jl is to "shift to 1-based indexing and reverse the dimension order". In particular, array data read into memory by HDF5.jl is returned as the normal Array type from the Core module.

mkitti commented 2 years ago

In general I think it would be very beneficial to the zarr v3 ecosystem if we can avoid any implementation reversing the dimension order by default, so that there is a simple, unambiguous way to describe the dimensions of a zarr array. Given the capabilities of Julia that you described, do you think that would be a possibility for a Julia zarr v3 implementation?

If I wanted to take a Core.Array and address the memory as in C, I would do as follows with existing types. Note that the representation changes.

julia> strides(E)
(1, 5, 20)

julia> using OffsetArrays

julia> Z = OffsetArray(PermutedDimsArray(E, [3,2,1]), (-1, -1, -1))
3×4×5 OffsetArray(PermutedDimsArray(::Array{Float64, 3}, (3, 2, 1)), 0:2, 0:3, 0:4) with eltype Float64 with indices 0:2×0:3×0:4:
[:, :, 0] =
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

[:, :, 1] =
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

[:, :, 2] =
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

[:, :, 3] =
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

[:, :, 4] =
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0

julia> strides(Z)
(20, 5, 1)

julia> Z[2,3,4]
1.0

julia> typeof(Z)
OffsetArray{Float64, 3, PermutedDimsArray{Float64, 3, (3, 2, 1), (3, 2, 1), Array{Float64, 3}}}

julia> typeof(Z) <: Array
false

julia> typeof(Z) <: AbstractArray
true

As long as you also change the representation, the Julia user should be fine regardless of what's in memory or on disk. The physical layout is rarely exposed directly. Also note that neither OffsetArray nor PermutedDimsArray is a subtype of Array; they are subtypes of AbstractArray. The AbstractArray interface does not specify an index order, a dimension order, or 1-based indexing. Non-Array AbstractArrays are quite common, so no assumptions along those lines should be made for methods accepting an AbstractArray.

Should C-order 0-based indexing be the default? For Zarr.jl, that would really be up to @meggart because he has the most code invested based on the current convention. In my opinion, this would be quite difficult. The path to do so would probably involve deprecating most of the current API, adding an explicit indexing parameter to the type and its constructors, and then later making the indexing and representation implicit.

An easier approach for a new default might be to start a Zarr3.jl package. However, it would probably be best to make the permutation of dimensions explicit in that case. A low effort approach would be to wrap zarr-python.

jstriebel commented 1 year ago

@alimanfoo and I propose to proceed with having support for both C and F order in the core specification, mostly to keep backwards-compatibility with v2. Arbitrary memory layout order as proposed in #129 may be added as an extension. Please also see point 16 in https://hackmd.io/@alimanfoo/SJ1DNVRxo.

This still leaves room for changes such as having a transpose codec instead of the current chunk_memory_layout metadata field (as discussed in the comments above), or other modifications and clarifications around this topic.

I'm asking everyone involved in this conversation to signal whether this resolution seems acceptable to them to move forward with v3 via :+1:/:-1: on this comment. This discussion has gone on for almost 11 months without resolution, so we hope that this might be an acceptable compromise for all participants. cc @meggart @jakirkham @constantinpape @rabernat @jbms @ToucheSir @mkitti @bogovicj

jbms commented 1 year ago

I am okay with deferring arbitrary order support to an extension.

I think that a transpose codec would be more natural than a separate chunk_memory_layout field, since the purpose is for encode/decode efficiency, and that approach would also be more consistent with how we now have an endian codec.

jbms commented 1 year ago

Note: zarr v2 effectively supports encoding orders that are neither C nor F order, when using a structured dtype:

Consider the following .zarray metadata:

{"zarr_format": 2,
 "shape": [100, 200],
  "chunks": [10, 20],
  "dtype": ["a", "<u2", [2, 3]],
  "compressor": null,
  "fill_value": null,
  "filters": [],
  "order": "F"}

Here each chunk has an outer array of shape [10, 20] encoded in F order, where each element is an array of shape [2, 3] encoded in C order (because per NumPy, inner arrays of structured dtypes are always encoded in C order). Therefore, each chunk effectively has a shape of [10, 20, 2, 3] with dimension encoding order (outermost to innermost) [1, 0, 2, 3].
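
A small NumPy check of that behaviour (a sketch that only inspects the first item's bytes):

import numpy as np

# Outer array in F order, structured dtype with a [2, 3] sub-array per element.
dt = np.dtype([('a', '<u2', (2, 3))])
x = np.zeros((10, 20), dtype=dt, order='F')
x['a'][0, 0] = np.arange(6).reshape(2, 3)

# The first item's 12 bytes come out as 0,1,2,3,4,5, i.e. the inner sub-array
# is stored row-major (C order) even though the outer array uses 'F'.
print(np.frombuffer(x.tobytes()[:12], dtype='<u2'))   # [0 1 2 3 4 5]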

jstriebel commented 1 year ago

I think that a transpose codec would be more natural than a separate chunk_memory_layout field, since the purpose is for encode/decode efficiency, and that approach would also be more consistent with how we now have an endian codec.

:+1: , would you like to add a PR for this, @jbms?

jbms commented 1 year ago

I will create a PR.

jstriebel commented 1 year ago

#189 was merged and implements the compromise proposed above. Since there were no further objections, I'm closing this issue.