zarr-developers / geozarr-spec

This document aims to provide a geospatial extension to the Zarr specification. Zarr specifies a protocol and format used for storing Zarr arrays, while the present extension defines conventions and recommendations for storing multidimensional georeferenced grids of geospatial observations (including rasters).

Specifying the Organizational Structure of GeoZarr #34

Open christophenoel opened 5 months ago

christophenoel commented 5 months ago

ℹ️ Edit: This post has been updated to more accurately capture my original message's intent

One of the foundational steps in developing the GeoZarr specification should be to detail its organizational structure (typically based on the Zarr objects). The initial version of GeoZarr outlines the GeoZarr Classes but doesn't detail the data model's storage structure and format. GeoZarr conventions rely on XArray (including its terminology, which borrows from CF conventions), which itself does not explicitly document the format.

Implicit structure of GeoZarr / XArray Zarr

GeoZarr organizes data in a way that is compatible with the structure of Zarr. This structure should be clearly defined, similar to how it is done in the documentation for NCZarr.

Example of structure:

SMOS.zarr/
├── .zgroup
├── .zattrs
├── .zmetadata
├── sea_ice_thickness/
│   ├── .zarray
│   └── .zattrs
├── time/
│   ├── .zarray
│   └── .zattrs
├── x/
└── y/
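
As a rough illustration of the layout above, the following sketch writes only the Zarr v2 metadata files by hand, using the Python standard library. The array sizes, dtypes and the helper names (`write_json`, `array_meta`, `build_layout`) are assumptions made for the example; they are not taken from the SMOS product, and no chunk data is written.

```python
import json
import tempfile
from pathlib import Path


def write_json(path: Path, payload: dict) -> None:
    """Write one Zarr v2 metadata document as JSON."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=4))


def array_meta(shape, chunks, dtype="<f4") -> dict:
    """Minimal Zarr v2 .zarray document (uncompressed, C order)."""
    return {
        "zarr_format": 2,
        "shape": list(shape),
        "chunks": list(chunks),
        "dtype": dtype,
        "compressor": None,
        "fill_value": None,
        "filters": None,
        "order": "C",
    }


def build_layout(root: Path) -> None:
    # Root group: .zgroup marks the directory as a Zarr group,
    # .zattrs carries the dataset-wide attributes.
    write_json(root / ".zgroup", {"zarr_format": 2})
    write_json(root / ".zattrs", {"product_version": "3.3"})

    # Illustrative sizes only (assumption); the real grid is not given here.
    sizes = {"time": 2, "y": 4, "x": 3}

    # Data variable: dimensions are named through _ARRAY_DIMENSIONS.
    write_json(root / "sea_ice_thickness" / ".zarray",
               array_meta(sizes.values(), sizes.values()))
    write_json(root / "sea_ice_thickness" / ".zattrs",
               {"_ARRAY_DIMENSIONS": ["time", "y", "x"], "units": "m"})

    # One 1-D coordinate array per dimension, named after that dimension.
    for name, length in sizes.items():
        write_json(root / name / ".zarray", array_meta([length], [length], "<f8"))
        write_json(root / name / ".zattrs", {"_ARRAY_DIMENSIONS": [name]})


root = Path(tempfile.mkdtemp()) / "SMOS.zarr"
build_layout(root)
for path in sorted(root.rglob(".z*")):
    print(path.relative_to(root))
```

The `.zmetadata` file is left out on purpose, since it is derived from the other metadata files.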

Here’s a simplified breakdown of how GeoZarr organizes its data, using XArray concepts as a foundation:

Dataset .zattrs

{
    "date_created": "Mon Dec 12 09:29:59 2022",
    "grid": "NSIDC polar stereographic projection. https://nsidc.org/data/polar-stereo/ps_grids.html ",
    "institution": "Alfred-Wegener-Institut Helmholtz Zentrum (AWI)",
    "platform": "ESA Soil Moisture and Ocean Salinity (SMOS) mission",
    "processing_level": "l3c",
    "product_version": "3.3"
}

Data Array .zattrs

{
    "_ARRAY_DIMENSIONS": ["time","y","x"],
    "coordinates": "longitude latitude",
    "long_name": "SMOS sea ice thickness",
    "standard_name": "sea ice thickness",
    "units": "m",
    "grid_mapping": [...]
}
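
To make the role of these attributes more concrete, here is a minimal sketch of how a reader might resolve a data array's dimensions and coordinates from its `.zattrs`. The function name `describe_variable` and the set of sibling array names are hypothetical; in particular, the `longitude` and `latitude` arrays are assumed to exist in the group, although the tree above does not list them.

```python
def describe_variable(attrs: dict, sibling_arrays: set) -> dict:
    """Split a data array's attributes into dimension and auxiliary coordinates.

    attrs          -- parsed .zattrs of the data array
    sibling_arrays -- names of the arrays stored in the same group
    """
    dims = list(attrs.get("_ARRAY_DIMENSIONS", []))
    # CF-style "coordinates" attribute: a blank-separated list of names.
    aux = attrs.get("coordinates", "").split()
    return {
        "dimensions": dims,
        # A dimension with an array of the same name has a coordinate array.
        "dimension_coordinates": [d for d in dims if d in sibling_arrays],
        "auxiliary_coordinates": [c for c in aux if c in sibling_arrays],
        "unresolved": [c for c in aux if c not in sibling_arrays],
    }


attrs = {
    "_ARRAY_DIMENSIONS": ["time", "y", "x"],
    "coordinates": "longitude latitude",
    "units": "m",
}
# Assumption: these arrays are present next to the data array.
available = {"time", "y", "x", "longitude", "latitude", "sea_ice_thickness"}

info = describe_variable(attrs, available)
print(info)
```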

An explicit explanation of how coordinates work within the GeoZarr context, especially their interaction with data arrays and how they enable spatial indexing, could provide clarity.

Coordinate .zattrs

{
    "_ARRAY_DIMENSIONS": [
        "x"
    ],
    "grid_spacing": "12.5 km",
    "long_name": "x coordinate of projection",
    "standard_name": "projection_x_coordinate",
    "units": "km"
}

Structure Overview

With SMOS dataset example:

[Image: structure overview of the SMOS dataset example]

Structure Specification

🔍 The new structure might differ from XArray's typical approach. For example, the following changes may be considered:

Original (geo) Zarr discussions

The following old discussions related to the conventions initially created by xarray, NCZarr, etc. may help:

Early draft data model structure spec

🚧 List of statements to be assessed, improved and agreed:

Definitions of core elements:

Structure of Dataset:

Structure of DataArray:

Structure of Coordinate:

ethanrd commented 4 months ago

The NCZarr convention link above is not the most current version. The most current version is in the netCDF-C docs at this very ugly URL [1].

The main difference is the change to storing NCZarr specific information as extra keys within the Zarr JSON objects (e.g. _nczarr_array in .zarray) instead of the earlier use of non-Zarr JSON objects (like .nczarray and .nczattr).

[1] Sorry for the multiple versions and ugly URL, we are working our way through a big clean-up/reorganization of our netCDF documentation.

christophenoel commented 3 months ago

Text edited.

christophenoel commented 3 months ago

As reported by @ethanrd and agreed, we aim to align GeoZarr terminology whenever possible with CF terminology, which itself relies heavily on the NetCDF User Guide (NUG).

NetCDF

About dataset

A netCDF dataset contains dimensions, variables, and attributes, which all have both a name and an ID number by which they are identified. (I have not found a formal definition of dataset.)

About group

Groups, like directories in a Unix file system, are hierarchically organized, to arbitrary depth. They can be used to organize large numbers of variables. Each group acts as an entire netCDF dataset in the classic model. That is, each group may have attributes, dimensions, and variables, as well as other groups. The default group is the root group, which allows the classic netCDF data model to fit neatly into the new model.

About dimensions

A dimension may be used to represent a real physical dimension, for example, time, latitude, longitude, or height. A dimension might also be used to index other quantities, for example station or model-run-number. A netCDF dimension has both a name and a length.

About variables

Variables are used to store the bulk of the data in a netCDF dataset. A variable represents an array of values of the same type. A scalar value is treated as a 0-dimensional array. A variable has a name, a data type, and a shape described by its list of dimensions specified when the variable is created. A variable may also have associated attributes, which may be added, deleted or changed after the variable is created.

About coordinate variables

A variable with the same name as a dimension is called a coordinate variable. It typically defines a physical coordinate corresponding to that dimension. The above CDL example includes the coordinate variables lat, lon, level and time, defined as follows:

About attributes

NetCDF attributes are used to store data about the data (ancillary data or metadata), similar in many ways to the information stored in data dictionaries and schema in conventional database systems. Most attributes provide information about a specific variable. These are identified by the name (or ID) of that variable, together with the name of the attribute. Some attributes provide information about the dataset as a whole and are called global attributes. These are identified by the attribute name together with a blank variable name (in CDL) or a special null "global variable" ID (in C or Fortran). In netCDF-4 file, attributes can also be added at the group level.

CF definitions

auxiliary coordinate variable

Any netCDF variable that contains coordinate data, but is not a coordinate variable (in the sense of that term defined by the NUG and used by this standard - see below). Unlike coordinate variables, there is no relationship between the name of an auxiliary coordinate variable and the name(s) of its dimension(s).

coordinate variable

We use this term precisely as it is defined in the NUG section on coordinate variables. It is a one-dimensional variable with the same name as its dimension [e.g., time(time)], and it is defined as a numeric data type with values in strict monotonic order (all values are different, and they are arranged in either consistently increasing or consistently decreasing order). Missing values are not allowed in coordinate variables.
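
The CF definition above can be turned into a small check. This is only a sketch of the stated rules (one dimension with the same name as the variable, strictly monotonic values, no missing values); the function name is hypothetical, missing values are assumed to be represented as `None` or NaN, and the numeric data type requirement is not verified.

```python
import math


def is_coordinate_variable(name, dims, values) -> bool:
    """Check the CF coordinate variable rules on in-memory values."""
    # Rule 1: one-dimensional, and the single dimension shares the name.
    if list(dims) != [name]:
        return False
    # Rule 2: no missing values (assumed here to be None or NaN).
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            return False
    # Rule 3: strictly increasing or strictly decreasing.
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing


print(is_coordinate_variable("time", ["time"], [0, 1, 2]))
print(is_coordinate_variable("longitude", ["y", "x"], [10.0, 11.0]))
```

Under this check, a two-dimensional `longitude(y, x)` array is not a coordinate variable; in CF terms it would be an auxiliary coordinate variable.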

christine-e-smit commented 3 months ago

I'm a little confused by:

  • 📂 GeoZarr Dataset (represents a 'product' which contains multiple variables, and children dataset): Maps to a Zarr Group holding multiple types of data (variables) and possibly other datasets (sub-groups). It stores dataset-wide metadata in a file (.zattrs) and outlines the structure of its contents based on children items (data arrays, coordinates, etc.) in another file (.zmetadata).

I think the .zmetadata is just a consolidated copy of all the metadata in all the .zattrs and .zarray files with the top level .zgroup metadata. That's been my experience with the zarr.convenience.consolidate_metadata function and that's what the documentation says. So the .zmetadata file does show you the structure of all the metadata but that's only because it just has all the metadata.
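
As a rough illustration of that point, the sketch below rebuilds a consolidated document from the individual Zarr v2 metadata files using only the standard library. It is not the zarr library's implementation; the `consolidate` helper and the tiny demo store are assumptions made for the example, while the `zarr_consolidated_format` and `metadata` keys follow the layout of a Zarr v2 `.zmetadata` file.

```python
import json
import tempfile
from pathlib import Path

METADATA_NAMES = {".zgroup", ".zattrs", ".zarray"}


def consolidate(root: Path) -> dict:
    """Collect every .zgroup/.zattrs/.zarray document under root."""
    metadata = {}
    for path in sorted(root.rglob(".z*")):
        if path.is_file() and path.name in METADATA_NAMES:
            # Keys are store-relative paths, e.g. "time/.zattrs".
            key = path.relative_to(root).as_posix()
            metadata[key] = json.loads(path.read_text())
    return {"zarr_consolidated_format": 1, "metadata": metadata}


# Tiny demo store (hypothetical content).
root = Path(tempfile.mkdtemp()) / "demo.zarr"
(root / "time").mkdir(parents=True)
(root / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
(root / ".zattrs").write_text(json.dumps({"platform": "SMOS"}))
(root / "time" / ".zattrs").write_text(json.dumps({"_ARRAY_DIMENSIONS": ["time"]}))

consolidated = consolidate(root)
(root / ".zmetadata").write_text(json.dumps(consolidated, indent=4))
print(sorted(consolidated["metadata"]))
```

The consolidated file adds no information of its own: every entry is a copy of a document that already exists elsewhere in the store.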

christophenoel commented 3 months ago

@christine-e-smit Absolutely, the .zmetadata indeed consolidates all metadata for groups and arrays within the specified store into a singular resource.

The statement in the definition doesn't contradict this; rather, it implies that having this consolidated metadata at the dataset level is mandatory, allowing libraries (like xarray) to understand the structure without needing to read each object individually.

christophenoel commented 3 months ago

Improvement:

📂 GeoZarr Dataset (represents a 'product' which contains multiple variables and child datasets): Maps to a Zarr Group holding multiple types of data (variables) and possibly other datasets (sub-groups). It stores dataset-wide metadata in a file (.zattrs) and outlines the structure of its contents based on child items (data arrays, coordinates, etc.) in another file (.zmetadata) through consolidated metadata.

christophenoel commented 1 month ago

I have not made much progress, but I would like to share some thoughts about the concept of dataset (coming from xarray, itself based on NetCDF).

The GeoZarr specification must balance two key objectives:

For this reason, I think that providing requirements around Dataset (a group with coordinates and variables) is essential. It identifies a minimal Zarr structure for interpreting a set of raster variables while still allowing (not excluding) other types of data (e.g., secondary or auxiliary data, point clouds, ...) in other Zarr groups.

For example, the conformance class "http://www.opengis.net/spec/ogc-geozarr/1.0/conf/dataset" might include a requirement that defines the minimal aspects expected by a client. Following the xarray encoding of NetCDF:

Requirement 1: /req/core/dataset
A. A GeoZarr may include a GeoZarr dataset at the root Zarr Group level or at any child level.
B. A GeoZarr dataset must include the coordinates in child Zarr arrays.
C. A GeoZarr dataset must include the variables in child Zarr arrays.
D. A GeoZarr dataset must include only variables sharing the same coordinates.
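
A client-side check of parts B to D of the requirement could look like the sketch below. It is one possible reading and not part of the draft: "sharing the same coordinates" is interpreted here as "having identical `_ARRAY_DIMENSIONS`", any array whose only dimension is its own name is treated as a coordinate, and auxiliary coordinates (for example 2-D latitude/longitude arrays) are not handled. The function name `check_dataset` is hypothetical.

```python
def check_dataset(arrays: dict) -> list:
    """Return a list of problems found in one candidate dataset group.

    arrays maps each child array name to its _ARRAY_DIMENSIONS list.
    """
    # Assumption: a coordinate is a 1-D array named after its dimension.
    coords = {name for name, dims in arrays.items() if list(dims) == [name]}
    variables = {n: list(d) for n, d in arrays.items() if n not in coords}

    problems = []
    if not coords:
        problems.append("B: no coordinate arrays found")
    if not variables:
        problems.append("C: no variable arrays found")
    for name, dims in variables.items():
        missing = [d for d in dims if d not in coords]
        if missing:
            problems.append(f"B: {name} has dimensions without coordinates: {missing}")
    # Assumption: "same coordinates" means identical dimension lists.
    if len({tuple(dims) for dims in variables.values()}) > 1:
        problems.append("D: variables do not share the same dimensions")
    return problems


group = {
    "time": ["time"],
    "y": ["y"],
    "x": ["x"],
    "sea_ice_thickness": ["time", "y", "x"],
}
print(check_dataset(group))
print(check_dataset({**group, "land_mask": ["y", "x"]}))
```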

The relationship with metadata (which is key in cloud-native geospatial) is that I expect a STAC Item/STAC Collection to define asset objects (links) for each dataset, indicating a dedicated dataset media type that informs the client that the dataset can easily be displayed on a map or used in a Jupyter Notebook.

--- Reminder ---

📂 GeoZarr Dataset: a collection of EO data arrays (one or more) that represents information about a measured or observed geospatial phenomenon captured at one or more locations and times. It can encompass various formats and types of data, such as granules (individual data points or images), geospatial time series (3D datasets capturing changes over time), or hyperspectral data (capturing a wide spectrum of light beyond visible light for each pixel).

📦 GeoZarr Group: like Zarr Groups, GeoZarr Groups act as directories in a Unix file system and are hierarchically organized, to arbitrary depth. They can be used to organize large numbers of variables. Each group can have attributes, dimensions, variables, and other nested groups. A GeoZarr Group may act as a Dataset and contain multiple Dataset child groups.