Open DennisHeimbigner opened 1 year ago
I'm not sure I understand what you're asking.
A data type extension just means adding support to the spec or implementation for a new data type. It is referenced in the metadata by name, just like any existing data type.
Sorry, let me try to be clearer. Consider this example from the spec:
"data_type": {
    "name": "datetime",
    "configuration": {
        "unit": "ns"
    }
},
which is defined in the metadata for an array. The question is: suppose I want to have two arrays of type "datetime"; do I need to repeat the above declaration in each array?
Yes, because in principle the arrays could have different data types. How else should the data type of an array be specified, if not in the metadata for the array?
But does each reference to "datetime" need to include the "configuration", or can just the name "datetime" be used, similar to how I can just say "float64"?
datetime hasn't been standardized so it is kind of hypothetical at this point. But the intent of configuration is to allow parameterized data types without having to encode all of the parameters as a single string. Until there actually are any parameterized data types, though, you might just ignore the possibility in your implementation. That is what I've done in tensorstore.
The specific example -- datetime -- is irrelevant, but I was planning on adding a type "char" as an extension in support of NCZarr, so I have an interest in this issue.
Is the idea that char will be a fixed-length string, and you would then use a configuration option to indicate the length in bytes, e.g. "data_type": {"name": "char", "configuration": {"length": 10}}?
See also https://github.com/zarr-developers/zeps/pull/47 regarding a variable-length string proposal.
No, char will be an 8-bit unsigned integer whose purpose is to hold a single character in some encoding, typically ASCII or ISO-LATIN-8859, or such. I do not want to use "uint8" because I need to identify the type for NCZarr.
I see, in that case you would not need any configuration options, unless you want to use a configuration option to indicate the character encoding.
Yes, with the option of specifying the encoding.
I think there is in general a question as to what should go into the data type configuration and what should go into a separate attribute to indicate "units".
For example, if we are storing a temperature in degrees C, we would probably not have a separate "degrees C stored as float64" data type. Instead, we would store it with a data type of "float64" and use some other attribute to indicate that the unit is "degrees C".
For datetime, a unit rather than a separate data type would also seem to me to be cleaner, but many data storage formats, including zarr v2 as implemented by zarr-python, do have a separate data type for datetime.
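The two options contrasted above can be written out side by side. This is only an illustrative sketch: the dictionaries are partial array metadata with the other required fields left out, and the attribute key "units" is an assumed name, not something defined by the spec.

```python
# Option 1: plain numeric data type, with the unit carried as a
# user attribute. The key "units" is an assumption for illustration.
temperature_array = {
    "data_type": "float64",
    "attributes": {"units": "degrees C"},
}

# Option 2: the unit is part of a parameterized data type, as in the
# (not yet standardized) datetime example from the spec.
datetime_array = {
    "data_type": {"name": "datetime", "configuration": {"unit": "ns"}},
}
```

In the first form a reader that knows nothing about units can still decode the array as float64; in the second form a reader must understand the "datetime" data type to decode the array at all.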
We seem to have strayed from my original question, namely, does the full datatype definition, including "configuration", need to be included in every array in which it is used? I gather from the above discussion that there is currently no answer to this question, and an answer must wait until there is an actual use case.
If the data type has no configuration options, then it can be specified as a plain string, {"data_type": "float64", ...}. If the data type has configuration options, like {"data_type": {"name": "char", "configuration": {"encoding": "ascii"}}, ...}, then indeed those configuration options must be specified each time the data type is specified. There isn't any way to specify "default" configuration options for a data type, like saying that any time we reference char we should assume a configuration of {"encoding": "ascii"}.
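A reader that accepts both forms described above might normalize them as follows. This is a minimal sketch, not taken from any existing implementation; the function name normalize_data_type is hypothetical.

```python
from typing import Any, Dict, Tuple


def normalize_data_type(data_type: Any) -> Tuple[str, Dict[str, Any]]:
    """Return (name, configuration) for either form of "data_type".

    Hypothetical helper: a plain string means "no configuration";
    an object must carry "name" and may carry "configuration".
    """
    if isinstance(data_type, str):
        return data_type, {}
    if isinstance(data_type, dict) and "name" in data_type:
        return data_type["name"], dict(data_type.get("configuration", {}))
    raise ValueError(f"unsupported data_type metadata: {data_type!r}")


plain = normalize_data_type("float64")
parameterized = normalize_data_type(
    {"name": "char", "configuration": {"encoding": "ascii"}}
)
```

Because no defaults exist, the helper never fills in a configuration that the metadata did not state: a bare "char" would normalize to an empty configuration, not to {"encoding": "ascii"}.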
An alternative is to reify the type and declare it once in, say, the zarr.info of a group, and then allow arrays to reference it by name without the configuration. The problem with repeating the whole type declaration at each use is well known in programming language semantics as the structural type equality problem. Consider two type declarations in two different arrays. Just because the declarations look identical is no reason to assume they refer to the same type.
For the cases we've discussed so far, char and datetime, I don't think nominal vs structural typing would be an issue.
However, I can see that if we had a data type representing a record type containing multiple fields, this might be more of an issue. In some ways the idea of nominal vs structural typing is also related to the issue of units: you could imagine storing a "units" attribute for this record-typed array that indicates some unique identifier for the record type.
In general I think nominal typing is problematic because a given program may be working with arrays from more than one group. For example, suppose you are copying an array from local disk storage to s3. If the local disk array somehow references a data type defined in its parent group, and you compare data types nominally rather than structurally, there would be no way for the array stored on s3 to specify the same data type.
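The copy scenario above can be made concrete with a small sketch. Everything here is hypothetical: the helper names, the NominalRef class, and the store URLs are made up for illustration, and no such reference mechanism exists in the spec.

```python
from dataclasses import dataclass


def as_pair(data_type):
    """Normalize the string form and the object form to (name, configuration)."""
    if isinstance(data_type, str):
        return (data_type, {})
    return (data_type["name"], data_type.get("configuration", {}))


def structurally_equal(a, b):
    """Structural comparison: same name and same configuration."""
    return as_pair(a) == as_pair(b)


@dataclass(frozen=True)
class NominalRef:
    """Nominal identity: a type is identified by where it was declared."""
    store: str  # hypothetical store identifier
    path: str   # hypothetical path of the type definition in that store


# Inline declarations in the local array and in its copy on s3.
local_inline = {"name": "char", "configuration": {"encoding": "ascii"}}
s3_inline = {"name": "char", "configuration": {"encoding": "ascii"}}

# The same two arrays, if each referenced a type declared in its own store.
local_ref = NominalRef("file:///data/local.zarr", "/g/T")
s3_ref = NominalRef("s3://bucket/copy.zarr", "/g/T")

same_structurally = structurally_equal(local_inline, s3_inline)
same_nominally = local_ref == s3_ref
```

Under structural comparison the copy has the same data type as the original; under nominal comparison keyed on the declaring store it does not, which is the difficulty raised in the comment.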
Sure there is. The standard approach is to use fully qualified names (FQNs).
Can you give an example of what you have in mind?
I think we run into a problem if the "fully-qualified name" is both the unique identifier and the location of the definition. If the "fully-qualified name" is independent of the location of the definition, or the definition is always provided inline with the fully-qualified name, then it seems fine.
I don't follow your last paragraph. But I would do something like this:
- suppose we have group /g and /g/h
- in /g/zarr.info, we define a type {"name": "T", "configuration": {"param1": value1, "param2": value2}}
- we could have the following array /g/i with {..., "data_type": "T", ...}
- and another array /g/h/j with {..., "data_type": "/g/T", ...}. If there is a problem with using e.g. /g/T, then we could use a path separator other than '/'.
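The lookup rule proposed in the list above could be sketched as follows. This is a sketch of the proposal only, not of anything in the current spec: the function resolve_data_type and the in-memory group_types table (standing in for the per-group zarr.info contents) are hypothetical, and the datetime definition is borrowed from the earlier example.

```python
def resolve_data_type(ref, array_path, group_types):
    """Resolve a data type reference to its definition.

    group_types maps a group path to {type name: definition} and stands
    in for the type declarations stored in each group (an assumption).
    A reference starting with '/' is a fully qualified name relative to
    the root group; a bare name is looked up in the array's parent group.
    """
    if ref.startswith("/"):
        group_path, _, name = ref.rpartition("/")
        group_path = group_path or "/"
    else:
        group_path = array_path.rpartition("/")[0] or "/"
        name = ref
    try:
        return group_types[group_path][name]
    except KeyError:
        raise KeyError(f"no data type {name!r} declared in group {group_path!r}") from None


# Type T declared once in group /g.
group_types = {
    "/g": {"T": {"name": "datetime", "configuration": {"unit": "ns"}}},
}

from_sibling = resolve_data_type("T", "/g/i", group_types)        # array /g/i
from_nested = resolve_data_type("/g/T", "/g/h/j", group_types)    # array /g/h/j
```

Both references resolve to the single declaration in /g, so the configuration is written once. Note that a bare "T" used from /g/h/j would fail under this rule, because only the immediate parent group is searched.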
Can you clarify how the name works? Using the datetime example, would T be "datetime"? But we might want to define two different datetime data types, e.g. one stored as seconds and the other as milliseconds. Do we need a separate group in order to have a different datetime data type?
Regarding the "fully-qualified" name, suppose we are working with two arrays, one stored on the local disk and one stored on s3. How do we know that "/g/T" relative to some group on our local disk is supposed to be the same as "/g/T" relative to some group on s3? Even if they have identical base data type name and configuration parameters, per the idea of nominal typing we would not want to assume that they are identical.
Yes, T would be e.g. datetime. If you want two types with different precision, you would have to have two types, datetime_sec and datetime_ms, for example. But this seems to me analogous to float32 vs float64. For your second comment, in standard usage the FQN is always relative to the root group.
Regarding the "fully-qualified" name, suppose we are working with two arrays, one stored on the local disk and one stored on s3. How do we know that "/g/T" relative to some group on our local disk is supposed to be the same as "/g/T" relative to some group on s3? Even if they have identical base data type name and configuration parameters, per the idea of nominal typing we would not want to assume that they are identical.
I misread this comment. My assumption has always been that the metadata for a given file was self-contained, so it was not possible to reference things in other files. Is this incorrect? Is the scope of the metadata of a file described somewhere?
Currently there isn't really any type of "reference" to other arrays/groups/attributes/metadata fields of any kind anywhere in the spec.
Interesting. That validates my belief that array declarations are intended to be self-contained. In any case, yes, I am proposing references to other semantic objects in the metadata tree of a file.
It appears that a data type extension must be re-defined at every use (specifically in each array's metadata). It would certainly be useful if a data type extension could be defined once somewhere and used where needed.