Open christophenoel opened 5 months ago
The NCZarr convention link above is not the most current version. The most current version is in the netCDF-C docs at this very ugly URL [1].
The main difference is the change to storing NCZarr-specific information as extra keys within the Zarr JSON objects (e.g. `_nczarr_array` in `.zarray`) instead of the earlier use of non-Zarr JSON objects (like `.nczarray` and `.nczattr`).
[1] Sorry for the multiple versions and ugly URL, we are working our way through a big clean-up/reorganization of our netCDF documentation.
Text edited.
As reported by @ethanrd and agreed, we aim to align GeoZarr terminology with CF terminology wherever possible. CF itself relies heavily on the NetCDF User Guide (NUG).
About dataset
A netCDF dataset contains dimensions, variables, and attributes, which all have both a name and an ID number by which they are identified. (I have not found a formal definition of dataset.)
About group
Groups, like directories in a Unix file system, are hierarchically organized, to arbitrary depth. They can be used to organize large numbers of variables. Each group acts as an entire netCDF dataset in the classic model. That is, each group may have attributes, dimensions, and variables, as well as other groups. The default group is the root group, which allows the classic netCDF data model to fit neatly into the new model.
About dimensions
A dimension may be used to represent a real physical dimension, for example, time, latitude, longitude, or height. A dimension might also be used to index other quantities, for example station or model-run-number. A netCDF dimension has both a name and a length.
About variables
Variables are used to store the bulk of the data in a netCDF dataset. A variable represents an array of values of the same type. A scalar value is treated as a 0-dimensional array. A variable has a name, a data type, and a shape described by its list of dimensions specified when the variable is created. A variable may also have associated attributes, which may be added, deleted or changed after the variable is created.
About coordinate variables
A variable with the same name as a dimension is called a coordinate variable. It typically defines a physical coordinate corresponding to that dimension. The above CDL example includes the coordinate variables lat, lon, level and time, defined as follows:
About attributes
NetCDF attributes are used to store data about the data (ancillary data or metadata), similar in many ways to the information stored in data dictionaries and schema in conventional database systems. Most attributes provide information about a specific variable. These are identified by the name (or ID) of that variable, together with the name of the attribute. Some attributes provide information about the dataset as a whole and are called global attributes. These are identified by the attribute name together with a blank variable name (in CDL) or a special null "global variable" ID (in C or Fortran). In netCDF-4 file, attributes can also be added at the group level.
auxiliary coordinate variable
Any netCDF variable that contains coordinate data, but is not a coordinate variable (in the sense of that term defined by the NUG and used by this standard - see below). Unlike coordinate variables, there is no relationship between the name of an auxiliary coordinate variable and the name(s) of its dimension(s).
coordinate variable
We use this term precisely as it is defined in the NUG section on coordinate variables. It is a one-dimensional variable with the same name as its dimension [e.g., time(time)], and it is defined as a numeric data type with values in strict monotonic order (all values are different, and they are arranged in either consistently increasing or consistently decreasing order). Missing values are not allowed in coordinate variables.
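The coordinate variable rules quoted above can be turned into a small check. This is only a sketch: the function name `is_coordinate_variable` and the choice to model missing values as `None` or NaN are assumptions made for illustration, not part of CF or the NUG (netCDF itself marks missing data through attributes such as `_FillValue`).

```python
import math
from numbers import Real


def is_coordinate_variable(name, dims, values):
    """Return True if a variable meets the coordinate variable rules quoted above."""
    # One-dimensional and named after its own dimension, e.g. time(time).
    if len(dims) != 1 or dims[0] != name:
        return False
    # Numeric values only, with no missing values (modelled here as None or NaN).
    for v in values:
        if isinstance(v, bool) or not isinstance(v, Real):
            return False
        if isinstance(v, float) and math.isnan(v):
            return False
    # Strictly monotonic: consistently increasing or consistently decreasing.
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing
```

Under this sketch, `lat(lat)` with values `[90.0, 0.0, -90.0]` qualifies, while a two-dimensional `lat2d(y, x)` would instead be an auxiliary coordinate variable.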
I'm a little confused by:
- 📂 GeoZarr Dataset (represents a 'product' which contains multiple variables, and children dataset): Maps to a Zarr Group holding multiple types of data (variables) and possibly other datasets (sub-groups). It stores dataset-wide metadata in a file (`.zattrs`) and outlines the structure of its contents based on children items (data arrays, coordinates, etc.) in another file (`.zmetadata`).
I think the `.zmetadata` is just a consolidated copy of all the metadata in all the `.zattrs` and `.zarray` files with the top level `.zgroup` metadata. That's been my experience with the `zarr.convenience.consolidate_metadata` function and that's what the documentation says. So the `.zmetadata` file does show you the structure of all the metadata but that's only because it just has all the metadata.
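To make the point concrete, here is a standard-library sketch of what consolidation amounts to for a Zarr v2 store. It is not the zarr library's implementation: the helper `consolidate` and the sample store contents are made up for illustration, and only the resulting layout (a `zarr_consolidated_format` marker plus a `metadata` mapping keyed by store path) follows the consolidated metadata format.

```python
import json


def consolidate(store):
    """Build a .zmetadata document from a dict-like store of key -> JSON bytes."""
    metadata_names = (".zgroup", ".zarray", ".zattrs")
    # Copy every metadata document verbatim; chunk keys are skipped.
    metadata = {
        key: json.loads(value)
        for key, value in sorted(store.items())
        if key.split("/")[-1] in metadata_names
    }
    return {"zarr_consolidated_format": 1, "metadata": metadata}


# Hypothetical store: one root group holding a single "lat" array.
store = {
    ".zgroup": b'{"zarr_format": 2}',
    ".zattrs": b'{"title": "example product"}',
    "lat/.zarray": b'{"zarr_format": 2, "shape": [3], "chunks": [3], '
                   b'"dtype": "<f8", "compressor": null, "fill_value": null, '
                   b'"filters": null, "order": "C"}',
    "lat/.zattrs": b'{"_ARRAY_DIMENSIONS": ["lat"]}',
    "lat/0": b"\x00" * 24,  # chunk data, not metadata
}
store[".zmetadata"] = json.dumps(consolidate(store)).encode()
```

The resulting document adds no information of its own. A reader sees the hierarchy only because every `.zgroup`, `.zarray` and `.zattrs` key is listed in one place.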
@christine-e-smit Absolutely, the .zmetadata indeed consolidates all metadata for groups and arrays within the specified store into a singular resource.
This statement in the definition doesn't contradict but rather implies that having this consolidated metadata at the dataset level is mandatory, allowing libraries (like xarray) to understand the structure without needing to read each object individually.
Improvement:
📂 GeoZarr Dataset (represents a 'product' which contains multiple variables, and children dataset): Maps to a Zarr Group holding multiple types of data (variables) and possibly other datasets (sub-groups). It stores dataset-wide metadata in a file (.zattrs) and outlines the structure of its contents based on children items (data arrays, coordinates, etc.) in another file (.zmetadata) through consolidated metadata.
I have not made so much progress, but I would like to share some thoughts about the concept of dataset (coming from xarray, itself based on NetCDF).
The GeoZarr specification must balance two key objectives:
For this reason, I think that providing requirements around Dataset (group with coordinates and variables) is essential. It identifies a minimal Zarr structure for interpreting a set of raster variables while still allowing (not excluding) other types of data (e.g., secondary or auxiliary data, point clouds, ...) in other Zarr groups.
For example, the conformance class "http://www.opengis.net/spec/ogc-geozarr/1.0/conf/dataset" might include a requirement that defines the minimal aspects expected by a client. Following the xarray encoding of NetCDF:
Requirement 1 | /req/core/dataset |
---|---|
A | A GeoZarr may include a GeoZarr dataset at the root Zarr Group level or any children level. |
B | A GeoZarr dataset must include the coordinates in children Zarr arrays. |
C | A GeoZarr dataset must include the variables in children Zarr arrays. |
D | A GeoZarr dataset must include only variables sharing the same coordinates. |
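One possible reading of requirements B, C and D can be sketched as a validation function. Everything here is an assumption for discussion: the function name `check_dataset`, the rule that a coordinate is a child array named after its single dimension, and the reading of "sharing the same coordinates" as "using the same set of dimensions". Dimension names are taken to come from the xarray `_ARRAY_DIMENSIONS` attribute.

```python
def check_dataset(arrays):
    """Check a group description against one reading of requirements B, C and D.

    `arrays` maps each child array name to its list of dimension names.
    Returns a list of violation messages (empty when the group conforms).
    """
    # Assumption: a coordinate is an array named after its only dimension.
    coords = {name for name, dims in arrays.items() if dims == [name]}
    variables = {n: dims for n, dims in arrays.items() if n not in coords}
    problems = []
    # Requirement C: the variables are stored as child arrays.
    if not variables:
        problems.append("C: no variable arrays found")
    # Requirement B: every dimension used by a variable has a coordinate array.
    for name, dims in variables.items():
        missing = [d for d in dims if d not in coords]
        if missing:
            problems.append(f"B: {name} lacks coordinate arrays for {missing}")
    # Requirement D: all variables share the same set of coordinates.
    if len({frozenset(dims) for dims in variables.values()}) > 1:
        problems.append("D: variables do not share the same coordinates")
    return problems
```

A group with `time`, `lat`, `lon` coordinates and two variables on `(time, lat, lon)` passes. Adding a variable on `(time, level)` triggers both B (no `level` array) and D.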
The relationship with metadata (which is key in Cloud native geospatial), is that I expect a STAC Item/STAC Collection to define asset objects (links) for each dataset, indicating a dedicated dataset media type that informs the client it can be easily displayed on a map, or used in a Jupyter Notebook.
--- Reminder ---
📂 GeoZarr Dataset: a collection of one or more EO data arrays that represents information about a measured or observed geospatial phenomenon captured at one or more locations and times. It can encompass various formats and types of data, such as granules (individual data points or images), geospatial time series (3D datasets capturing changes over time), or hyperspectral data (capturing a wide spectrum of light beyond visible light for each pixel).
📦 GeoZarr Groups, like Zarr Groups, act as directories in a Unix file system and are hierarchically organized, to arbitrary depth. They can be used to organize large numbers of variables. Each group can have attributes, dimensions, variables, and other nested groups. A GeoZarr Group may act as a Dataset and contain multiple Dataset children groups.
ℹ️ Edit: This post has been updated to more accurately capture my original message's intent
One of the foundational steps in developing GeoZarr specifications should involve detailing its organizational structure (typically based on the Zarr objects). The initial version of GeoZarr outlines the GeoZarr Classes but doesn't detail the data model storage structure and format. GeoZarr conventions rely on XArray (including its terminology, which borrows from CF conventions), which itself does not explicitly document the format.
Implicit structure of GeoZarr/ xArray Zarr
GeoZarr organizes data in a way that is compatible with the structure of Zarr. This structure should be clearly defined, similar to how it is done in the documentation for NCZarr.
Example of structure:
Here’s a simplified breakdown of how GeoZarr organizes its data, using XArray concepts as a foundation:
- 📂 GeoZarr Dataset (represents a 'product' which contains multiple variables, and children dataset): Maps to a Zarr Group holding multiple types of data (variables) and possibly other datasets (sub-groups). It stores dataset-wide metadata in a file (`.zattrs`) and outlines the structure of its contents based on children items (data arrays, coordinates, etc.) in another file (`.zmetadata`).
Dataset .zattrs
- […] (`.zarray`). Additionally, it holds geospatial information (e.g., type of observation, units, CF conventions) in another metadata file (`.zattrs`), including:
  - `_ARRAY_DIMENSIONS`: provides the names of the dimensions (whose sibling arrays provide the coordinates)
  - `grid_mapping`: mapping of data to geographical projections (based on CF)
Data Array .zattrs
- […] (`.zattrs`) specifying CF attributes and the dimensions it relates to (ensuring it matches the size of the dimensions of the data arrays it references).
An explicit explanation of how coordinates work within the GeoZarr context, especially their interaction with data arrays and how they enable spatial indexing, could provide clarity.
Coordinate .zattrs
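As a rough illustration of how a data array's `.zattrs` ties it to sibling coordinate arrays, the sketch below resolves `_ARRAY_DIMENSIONS` against the other arrays in the same group and checks that the sizes agree. The group contents, array names and the helper `resolve_dimensions` are hypothetical. Only `_ARRAY_DIMENSIONS` (xarray) and `grid_mapping` / `grid_mapping_name` (CF) are existing attribute names.

```python
# Hypothetical metadata for one group, keyed by store path (values abbreviated).
group = {
    "lat/.zarray": {"shape": [2]},
    "lat/.zattrs": {"_ARRAY_DIMENSIONS": ["lat"], "units": "degrees_north"},
    "lon/.zarray": {"shape": [3]},
    "lon/.zattrs": {"_ARRAY_DIMENSIONS": ["lon"], "units": "degrees_east"},
    "crs/.zarray": {"shape": []},
    "crs/.zattrs": {"_ARRAY_DIMENSIONS": [], "grid_mapping_name": "latitude_longitude"},
    "sst/.zarray": {"shape": [2, 3]},
    "sst/.zattrs": {"_ARRAY_DIMENSIONS": ["lat", "lon"], "grid_mapping": "crs", "units": "K"},
}


def resolve_dimensions(group, name):
    """Map each dimension of array `name` to its length, checking sibling coordinates."""
    dims = group[f"{name}/.zattrs"]["_ARRAY_DIMENSIONS"]
    shape = group[f"{name}/.zarray"]["shape"]
    sizes = {}
    for dim, size in zip(dims, shape):
        # The coordinate for a dimension is the sibling array with the same name.
        coord_shape = group[f"{dim}/.zarray"]["shape"]
        if coord_shape != [size]:
            raise ValueError(f"{name}: dimension {dim} has size {size}, coordinate has {coord_shape}")
        sizes[dim] = size
    return sizes


sizes = resolve_dimensions(group, "sst")
# The grid_mapping attribute names the sibling array that carries the CRS description.
crs_attrs = group[group["sst/.zattrs"]["grid_mapping"] + "/.zattrs"]
```

Here `sst` resolves to `lat` (length 2) and `lon` (length 3), and its `grid_mapping` points at the scalar `crs` array.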
Structure Overview
With SMOS dataset example:
Structure Specification
🔍 The new structure might differ from XArray's typical approach. For example, the following changes may be considered:
Original (geo) Zarr discussions
The following old discussions related to the conventions initially created by xarray, NCZarr, etc. may help:
Early draft data model structure spec
🚧 List of statements to be assessed, improved and agreed:
Definitions of core elements:
Structure of Dataset:
- […] `zgeo` set to `Dataset`.
Structure of DataArray:
- […] `zgeo` set to `DataArray`.
- […] TBD: exact list of recommended CF attributes ❓
- `_ARRAY_DIMENSIONS` shall provide the names of the dimension coordinates (whose siblings provide the coordinates) as defined in the Dataset indexes. ❓
Structure of Coordinate:
- […] `zgeo` set to `Coordinate`.
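If the draft `zgeo` marker were adopted, a client could classify nodes from their attributes alone. The sketch below assumes `zgeo` is stored in each node's `.zattrs` and takes exactly one of the three values listed in this draft. The attribute is still under discussion, and the function name `classify_nodes` and the sample paths are hypothetical.

```python
# Values taken from the draft statements above; subject to change.
ZGEO_KINDS = {"Dataset", "DataArray", "Coordinate"}


def classify_nodes(zattrs_by_path):
    """Return {path: zgeo kind}, rejecting nodes with a missing or unknown marker."""
    kinds = {}
    for path, attrs in zattrs_by_path.items():
        kind = attrs.get("zgeo")
        if kind not in ZGEO_KINDS:
            raise ValueError(f"{path!r}: unexpected zgeo value {kind!r}")
        kinds[path] = kind
    return kinds


# Hypothetical hierarchy: a root dataset with one coordinate and one data array.
nodes = {
    "": {"zgeo": "Dataset"},
    "time": {"zgeo": "Coordinate", "_ARRAY_DIMENSIONS": ["time"]},
    "sst": {"zgeo": "DataArray", "_ARRAY_DIMENSIONS": ["time"]},
}
kinds = classify_nodes(nodes)
```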