zarr-developers / geozarr-spec

This document aims to provide a geospatial extension to the Zarr specification. Zarr specifies a protocol and format used for storing Zarr arrays, while the present extension defines conventions and recommendations for storing multidimensional georeferenced grids of geospatial observations (including rasters).

Relationship of the GeoZarr-spec convention to NetCDF-CF and the CF data model. #14

Closed dblodgett-usgs closed 1 year ago

dblodgett-usgs commented 1 year ago

I want to record a dialogue I had with @JonathanGregory for the record. I've included the substance of an email chain here but have removed some superfluous content (Dear Dave, Best Regards, etc.).

TL;DR:

@JonathanGregory points out:

CF started with netCDF but zarr is a different file format. Would it be possible to consider how zarr can represent the CF data model, which is a logical abstraction, rather than the CF-netCDF file format? The data model is Appendix I (letter after H) of the CF convention, in recent versions. A major reason for writing down the data model is to make CF ideas applicable to other file formats. Its introduction says, "The CF data model should be independent of the encoding."

...and...

Because the data model says there's a set of numerical coordinates in a coordinate construct, software which is compliant with the data model ought to have a method which returns that vector of numbers. However, it doesn't have to be stored like that in memory or in the dataset. Behind the scenes it could be origin and offset, with the vector computed upon demand.

...and...

@dblodgett-usgs concludes:

I like Jonathan's suggestion -- we are actually creating a ZARR-CF which doesn't exist yet. Given that, we don't really have to use the NetCDF-CF precedent and can talk about how this new format implements the CF data model including how it handles "compressed" coordinate variables by supporting geotiff style coordinate metadata.

Complete exchange follows.

DB:

Has anyone ever seriously raised the prospect of supporting origin / offset style data in NetCDF-CF? e.g. geotiff coordinates. Going through back issues, I don’t see anything but figured I’d check with you to see if you were aware of anything…

JG:

I can't remember for sure, but I would be very surprised if we had never discussed it in the last 22 years, whether in trac or on the email list, if not in GitHub. I feel a bit nervous about your raising the possibility.

DB:

I don't mean to make you nervous. I'm working with these folks https://github.com/zarr-developers/geozarr-spec to find a workable path to satisfy both remote sensing (multispectral geotiff) and NetCDF-CF data. One avenue that's been suggested is, rather than skirting around CF or doing something outside the context of CF, just proposing support for origin / offset style coordinate variables in CF. Do you have any suggestion for an approach on that?

JG:

Of course you don't, I know. :-) It's because my initial reaction is not to like the idea of including origin and offset in CF. It's less general than explicit coordinates. It's an alternative way to handle a limited case which we can do perfectly well in an existing way.

One could argue for it on the grounds of saving space, and as such it could be considered as a compression method in CF. Space isn't usually an issue for 1D coordinate variables - but perhaps that's the reason for the interest in it?

I'm glad to see that https://github.com/zarr-developers/geozarr-spec adopts some CF ideas. Thanks for being involved with it. The more consistency can be achieved, the better things will be. Is there a reason why zarr can't adopt the CF data model entirely? Maybe it requires some extensions to the data model if there are concepts CF doesn't include?

DB:

We are rapidly approaching the collision of two worlds -- the world that has depended on "raster" grid topologies forever and the world that has depended on COARDS/CF coordinate constructs forever. Your suggestion that this is about saving space is probably the key driver deep down -- when the coordinates of an array can be collapsed to an origin / offset, it is simply not optimal to carry around all the data needed for a CF coordinate variable. Of course, once you've made the decision to support origin / offset, there are numerous implementation patterns that fall out. As a result, a huge amount of software in the wild has a hard time supporting anything but origin / offset "raster" coordinates. For this reason, there will no doubt be myriad reasons (related to implementation details) people find origin / offset attractive.

Let me see if I can paint a picture of where we find ourselves... The group working on GeoZARR is trying to figure out the right way to approach this problem (how to encode both CF and GeoTIFF data in a geo-enabled ZARR package). NASA people are very interested in this being an OGC standard for contracting reasons. If we went that route, it would really be a lift and shift from CF with some ability to satisfy the needs of remote sensing use cases (there are a few). The issue there is it would not be right to work on a community standard off to the side of CF and publish it through the OGC without some plan to work the capability the standard supports back into the CF convention. So we are pondering if we should actually work on a contribution to the CF convention then introduce something based on that as an OGC community standard down the road a while.

I could see such a thing being developed off to the side, initially as an extension that we could all look at and consider how (or I suppose if) it fits in the convention. It may be that we could define a simple geotiff encoding extension that would sit off to the side and, more or less, be a file that looked like a NetCDF file without coordinate variables.

Anyhow -- I am curious if you have any thoughts on how we might forge ahead on this. There are a lot of threads to weave in here - moving beyond the HDF5 binary encoding for NetCDF among them. This issue of support for geotiff is clearly one of the sticky spots, but it is in the midst of a lot of other factors. No decisions or actual direction have been taken yet, though.

JG:

Thank you for your time and raising these issues. It certainly sounds like the right time to address them.

CF started with netCDF but zarr is a different file format. Would it be possible to consider how zarr can represent the CF data model, which is a logical abstraction, rather than the CF-netCDF file format? The data model is Appendix I (letter after H) of the CF convention, in recent versions. A major reason for writing down the data model is to make CF ideas applicable to other file formats. Its introduction says, "The CF data model should be independent of the encoding. This means that it should not be constrained by the parts of the CF conventions which describe explicitly how to store (i.e. encode) metadata in a netCDF file. The virtue of this is that should netCDF ever fail to meet the community needs, the groundwork for applying CF to other file formats will already exist." That sounds like this situation.

The data model comprises various "constructs", including dimension coordinate constructs. The text says, "Dimension coordinate constructs unambiguously describe cell locations for a single domain axis, thus providing independent variables on which the field construct's data depend. A dimension coordinate construct contains numeric coordinates for a single domain axis that are non-missing and strictly monotonically increasing or decreasing. CF-netCDF coordinate variables and numeric scalar coordinate variables correspond to dimension coordinate constructs."

Because the data model says there's a set of numerical coordinates in a coordinate construct, software which is compliant with the data model ought to have a method which returns that vector of numbers. However, it doesn't have to be stored like that in memory or in the dataset. Behind the scenes it could be origin and offset, with the vector computed upon demand. Equally you could provide a method which returned origin and offset for a dimension coordinate construct. If that's how it was stored, that would be easy. If it was stored as a vector, the method would have to analyse it to see whether the numbers had a constant interval; if so, they could be summarised as origin and offset; otherwise, it would be an error. Thus in software you can support both CF-like and "geotiff/raster"-like ways of viewing dimension coordinates.
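A minimal sketch (Python; the function names are invented for illustration) of the two complementary methods Jonathan describes -- expanding an origin/offset pair into an explicit coordinate vector on demand, and summarising an explicit vector as origin/offset only when its spacing is constant:

```python
import numpy as np

def coords_from_origin_offset(origin, offset, size):
    """Expand a GeoTIFF-style origin/offset into an explicit coordinate vector."""
    return origin + offset * np.arange(size)

def origin_offset_from_coords(coords, rtol=1e-9):
    """Summarise an explicit coordinate vector as (origin, offset).

    Raises if the spacing is not constant, mirroring the "otherwise,
    it would be an error" behaviour described above.
    """
    coords = np.asarray(coords)
    if coords.size < 2:
        raise ValueError("need at least two coordinates to infer an offset")
    steps = np.diff(coords)
    if not np.allclose(steps, steps[0], rtol=rtol):
        raise ValueError("coordinates are not evenly spaced")
    return float(coords[0]), float(steps[0])

# Example: a 0.25-degree longitude axis
lon = coords_from_origin_offset(-180.0, 0.25, 1440)
origin, offset = origin_offset_from_coords(lon)   # (-180.0, 0.25)
```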

It wouldn't be necessary to introduce origin and offset to the CF convention if geozarr could focus on the data model instead. Origin and offset could be introduced, in principle, in Sect 8 (for reduction of dataset size) but the argument for that wouldn't be strong, I think, since it requires more complexity in software (for everyone who reads and writes CF-netCDF) in return for very small saving of space, and it breaks the principle about not adding a new way to do something we can already do perfectly well. The argument you are making is more about interoperability, I think, which belongs at a level "above" or "outside" the CF-netCDF convention.

DB:

This is really good. And it makes perfect sense. I was not thinking about this in quite this way. I was thinking of ZARR more like HDF5. If instead we think of ZARR as a fresh start, where we aren't necessarily bound to the core NetCDF4 data model, then we really free up a lot of potential.

My only concern is that the relationship back to the CF community will be broken using this model. We would be building an OGC standard or a ZARR community convention - which may be OK, but the intent with this work is to start moving away from HDF5 as the binary encoding of data and have the community move to an open binary encoding rather than the closed HDF group encoding. It sounds like the group will be creating a charter for an OGC specification that will define a ZARR convention which would just be registered in a list of ZARR community conventions but governed through the OGC. Not on its face a bad thing, but it will certainly have side effects -- some good, some bad.

Thanks so much for your advice here -- No matter the outcome, this work will grow from the core and legacy of CF.

DB: to @rabernat, @briannapagan, and @christophenoel

I like Jonathan's suggestion -- we are actually creating a ZARR-CF which doesn't exist yet. Given that, we don't really have to use the NetCDF-CF precedent and can talk about how this new format implements the CF data model including how it handles "compressed" coordinate variables by supporting geotiff style coordinate metadata.
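Purely as an illustration of what "geotiff style coordinate metadata" on a Zarr store could look like -- the attribute names and values below are hypothetical, not drawn from the GeoZarr spec -- a minimal sketch using the zarr-python v2 API:

```python
import zarr  # zarr-python v2 API

# Hypothetical layout: instead of explicit x/y coordinate variables, store a
# GeoTIFF-style affine transform and CRS as attributes and let readers
# reconstruct the coordinate vectors on demand.
root = zarr.open_group("example.zarr", mode="w")
band = root.create_dataset("band1", shape=(1024, 1024), chunks=(256, 256), dtype="f4")

# Attribute names here are illustrative only, not GeoZarr spec names.
band.attrs["crs"] = "EPSG:32633"
# GDAL-style geotransform: [x_origin, x_offset, row_rotation, y_origin, col_rotation, y_offset]
band.attrs["geotransform"] = [500000.0, 30.0, 0.0, 4650000.0, 0.0, -30.0]
```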

If we are serious about going down the route of having GeoZarr be a full standard in the OGC Two Track system[1] then we would just reference the CF data model [2] and describe how this is an implementation of it.

If this is making sense, then let's get this all written down in the README (I'm happy to alter my PR or have one of you take it over) then start on an OGC charter?

[1] https://docs.opengeospatial.org/pol/05-020r27/05-020r27.html#the-two-track-standards-process-characteristics

[2] Hassell, D., Gregory, J., Blower, J., Lawrence, B. N., and Taylor, K. E.: A data model of the Climate and Forecast metadata conventions (CF-1.6) with a software implementation (cf-python v2.1), Geosci. Model Dev., 10, 4619–4646, https://doi.org/10.5194/gmd-10-4619-2017, 2017.

@rabernat:

This is a really interesting perspective. Can you ask Jonathan if we can share his response on GitHub?

DB:

Hi again Jonathan, do you mind if I replicate this thread onto the GeoZarr github for the record? If you'd rather post yourself, we could more or less parrot this thread there?

JG:

Certainly, if it's helpful, please copy it. If there are responses which you'd like to discuss, we can do that.

I don't think this would necessarily break a link between CF and zarr. If zarr can adopt the CF data model, it's no problem at all. If zarr requires things that CF doesn't consider, or wants to exclude some aspect of the CF data model, that shouldn't be a difficulty. Many uses of CF-netCDF do those things e.g. CMIP has a lot of extra conventions that require things to be done a certain way where CF offers various possibilities, and others that require additional attributes that are not part of CF. I should think the same is true for uses of the data model.

zarr development might suggest additions to the CF data model, if there are missing concepts it needs. In the first instance, it would be worth discussing in CF issues whether the concept can be represented in CF in some existing way, rather than with something new. Often this turns out to be possible. So far, the data model has followed the development of the CF-netCDF convention, but it seems reasonable in principle to me that it could happen the other way. A new data model concept could be added for which there is no use-case so far in netCDF, and hence no defined encoding.

The most difficult case would arise if the aims of zarr seem somehow inconsistent with the CF data model, such that zarr would like to change things in CF or do something a different way. We will have to see if that happens. It would take thought and discussion to decide what to do. Within CF, we have often applied the principle that we won't invent a new means to do an old thing, even if it looks nicer. I can see that with a new file format the incentive to forget the old way and adopt a nice new way would be much greater, but I hope the temptation can be resisted. It's much harder work to negotiate and compromise, but it's better for interoperability in the end, I believe! I am in favour of productive dialogue!

rabernat commented 1 year ago

Thanks so much for sharing this exchange David.

I like Jonathan's suggestion -- we are actually creating a ZARR-CF which doesn't exist yet. Given that, we don't really have to use the NetCDF-CF precedent and can talk about how this new format implements the CF data

This basically implies a hierarchy of data models / formats like this

flowchart TD
    CF-Data-Model --> NetCDF
    CF-Data-Model --> Zarr
    NetCDF --> HDF5
    NetCDF --> NetCDF3-Classic

This is a bit in contrast to the current way things have been implemented, which is more like this

flowchart TD
    CF-Data-Model --> NetCDF
    NetCDF --> HDF5
    NetCDF --> NetCDF3-Classic
    NetCDF --> Zarr

It's important to recognize that NetCDF is more than just a file container. It's a data model itself, far simpler and more generic / flexible than CF. There are already lots of applications, most prominently Xarray, that use the NetCDF data model (but not necessarily the full CF data model) on top of Zarr. Not to mention Unidata's NCZarr implementation, which uses the same sort of hierarchy (but with an out-of-spec flavor of Zarr).
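For instance, xarray already round-trips the NetCDF-style data model (named dimensions, variables, attributes) through Zarr without anything CF-specific; a minimal sketch with made-up variable names:

```python
import numpy as np
import xarray as xr

# A dataset expressed in NetCDF-data-model terms: named dimensions,
# variables, and attributes -- nothing here is CF-specific.
ds = xr.Dataset(
    {"tas": (("time", "lat", "lon"), np.random.rand(4, 3, 5))},
    coords={"time": np.arange(4), "lat": [-30.0, 0.0, 30.0], "lon": np.linspace(0.0, 288.0, 5)},
    attrs={"title": "toy example"},
)
ds.to_zarr("toy.zarr", mode="w")

reopened = xr.open_zarr("toy.zarr")
```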

Deciding which hierarchy we want to pursue seems like a very important design decision.

christophenoel commented 1 year ago

Can you please confirm if the intended meaning of the arrow in this context represents the relationship "A specifies B"?

From my understanding:

Therefore (I'm not claiming that's the case, I'm trying to understand):

flowchart TD
    NetCDF --> HDF5
    NetCDF --> CF-Data-Model
    GeoZarr--> CF-Data-Model
    GeoZarr--> Zarr
christophenoel commented 1 year ago

(updated)

rabernat commented 1 year ago

The arrow means "sits on top of" or "is a subset of". Currently:

In set notation

$$ \mbox{CF} \subset \mbox{NetCDF4} \subset \mbox{HDF5} $$

christophenoel commented 1 year ago

f "all CF-compliant datasets adhere to the CF data model," then "GeoZarr" (a CF-compliant dataset) would also conform to the CF data model.

I think that it is essential to differentiate between the CF data model and CF-NetCDF. The CF data model defines a standard set of metadata conventions for describing the content and structure of gridded and ungridded scientific datasets. In contrast, CF-NetCDF refers to the application of these CF conventions specifically to datasets stored in the NetCDF file format.

GeoZarr is an example of a CF-compliant dataset that uses the Zarr storage format instead of NetCDF. Thus, GeoZarr adheres to the CF data model but is stored in a different file format, showcasing the adaptability of the CF conventions across various storage technologies.

flowchart TD
    CF-NetCDF --> NetCDF
    CF-NetCDF --> CF-Data-Model
    NetCDF --> HDF5
    GeoZarr--> CF-Data-Model
    GeoZarr--> Zarr
rabernat commented 1 year ago

The complication here is that netCDF is not just a file format. It is a data model. The file format for NetCDF can be NetCDF3 (classic), HDF5, or Zarr (NCZarr and various other flavors).

christophenoel commented 1 year ago

Sure, but does this change much ?

flowchart TD
    CF-NetCDF --> NetCDF
    CF-NetCDF --> CF-Data-Model
    NetCDF --> HDF5
    NetCDF --> Others
    GeoZarr--> CF-Data-Model
    GeoZarr--> Zarr
rabernat commented 1 year ago

I see the CF data model as dependent on the NetCDF data model. First sentence of the CF Conventions abstract:

This document describes the CF conventions for climate and forecast metadata designed to promote the processing and sharing of files created with the netCDF Application Programmer Interface [NetCDF].

First paragraph of section 1.1 of the CF conventions ("Goals")

The NetCDF library [NetCDF] is designed to read and write data that has been structured according to well-defined rules and is easily ported across various computer platforms. The netCDF interface enables but does not require the creation of self-describing datasets. The purpose of the CF conventions is to require conforming datasets to contain sufficient metadata that they are self-describing in the sense that each variable in the file has an associated description of what it represents, including physical units if appropriate, and that each value can be located in space (relative to earth-based coordinates) and time.

christophenoel commented 1 year ago

From the Introduction section of the CF Conventions document version 1.9 (dated 2021-05-20):

"Although the CF conventions were originally designed for climate and forecast data encoded in the netCDF binary format, they are also applicable to other forms of data storage, including other binary formats and databases. In the following, we use the term 'dataset' to mean a collection of data in any storage format to which the CF conventions can be usefully applied."

Source: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html

christophenoel commented 1 year ago

It seems the source version is wrong... I'm trying to find the URL again.

christophenoel commented 1 year ago

(sorry for the spam)

In the last 3 versions of CF, I could only find:

Although this is specifically a netCDF standard, we feel that most of the ideas are of wider application. The metadata objects could be contained in file formats other than netCDF. Conversion of the metadata between files of different formats will be facilitated if conventions for all formats are based on similar ideas

And also (before the sections for CF-netCDF):

The CF data model should also be independent of the encoding. This means that it should not be constrained by the parts of the CF conventions which describe explicitly how to store (i.e. encode) metadata in a netCDF file. The virtue of this is that should netCDF ever fail to meet the community needs, the groundwork for applying CF to other file formats will already exist.

christophenoel commented 1 year ago

Ok, there is also:

"The elements of the CF data model (Figure I.2, Figure I.3 and Figure I.4) are called "constructs", a term chosen to differentiate from the CF-netCDF elements previously defined and to be programming language-neutral (i.e. as opposed to "object" or "structure"). The constructs, listed in Table I.2, are related to CF-netCDF elements (Figure I.1), which in turn relate to the components of netCDF file.

dblodgett-usgs commented 1 year ago

https://cfconventions.org/ links to all versions of the convention.

The relevant document for the CF Data Model is Appendix I of the cf-conventions (https://cfconventions.org/cf-conventions/cf-conventions.html#appendix-CF-data-model), which was introduced in CF-1.9.

"This appendix contains the explicit data model for CF to provide an interpretation of the conceptual structure of CF which is consistent, comprehensive, and as far as possible independent of the netCDF encoding."

There is a NetCDF data model (https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html) that is independent of CF and has been implemented with numerous binary carriers. (CDL ASCII, NetCDF3 32bit, NetCDF3 64bit, NetCDF4/HDF5 Classic, NetCDF4/HDF5 Extended, NCZarr)

The CF data model uses the term constructs ... " a term chosen to differentiate from the CF-netCDF elements previously defined and to be programming language-neutral (i.e. as opposed to "object" or "structure")."

Constructs are said to have properties rather than variables having attributes

This approach fully decouples the data model from its implementation -- something that has not been done with previous implementations of CF-NetCDF.

We should look at this through a traditional data modelling lens. The CF data model is a conceptual data model that corresponds to a few particular logical and physical data models. What we are doing here (GeoZARR) will be in the form of a logical data model that corresponds to the concepts of the CF data model and will be implemented as a physical data model using ZARR V3.

rabernat commented 1 year ago

Thanks for weighing in David. Can you clarify one thing: is the CF data model a subset of the NetCDF data model? I.e. all data compliant with the CF data model is also compliant with the NetCDF data model?

dblodgett-usgs commented 1 year ago

No. The CF data model is abstract as is the NetCDF data model. So they are actually kind of mutually exclusive.

In terms of the conceptual/logical/physical data model tiers, I see NetCDF (without any modifiers) as the conceptual data model that hinges on dimensions, variables, attributes and in its extended form, groups.

If we think of both the CF data model and the NetCDF data model as providing conceptual building blocks, then CF-NetCDF is a logical data model that combines the two in a specific set of conventions. Those conventions have been implemented in a range of physical data models that include NetCDF in their name, which just confuses everyone.

What we are going to do may actually draw on the concepts of both the CF data model and the NetCDF data model but will be an independent logical data model from CF-NetCDF. This will free us up to implement things that are incompatible with the particulars of CF-NetCDF while maintaining compatibility as we will build on the same core concepts. ... I think, I'm not being particularly careful with my response here.

rabernat commented 1 year ago

No. The CF data model is abstract as is the NetCDF data model. So they are actually kind of mutually exclusive.

Ok. This is definitely challenging the way I have been thinking about these concepts for years. I don't understand how they can be mutually exclusive, since all netCDF files are by construction compatible with the netCDF data model, and we store CF-compliant data in netCDF files. It will take me some time to process what you have said.

christophenoel commented 1 year ago

@rabernat , @dblodgett-usgs : Thank you for bringing the whole discussion. I understand much more.

dblodgett-usgs commented 1 year ago

Let me write out some more for instances here...

I'm just using conceptual / logical / physical as an analysis tool here and recognizing the quite recent work to abstract the CF Data Model out of the CF-NetCDF convention. (https://doi.org/10.5194/gmd-10-4619-2017)

So as of CF-1.9, the CF-NetCDF convention is an implementation of the NetCDF data model and the CF data model.

So take a typical NetCDF4-Classic file that advertises CF-1.9 conventions:

The physical data model of that file is HDF5-NetCDF4-Classic with CF-NetCDF conventions

The logical data model of that file is NetCDF-Classic with CF-NetCDF conventions

The conceptual data model of that file is the NetCDF Data Model and the CF Data Model

In GeoZarr: (work in progress)

The physical data model of that file is ZARR V3 with GeoZarr V* conventions

The logical data model of that file is NetCDF-ZARR with GeoZarr conventions

The conceptual data model of that file is the NetCDF Data Model with CF Data Model

dblodgett-usgs commented 1 year ago

I think this conversation is mostly closed. The discussion has largely circled back to "we are basing this work more or less on the CF baseline but not being overly formal."

I think we are, in essence, building something based on the CF Data Model and the NetCDF data model but not necessarily planning on having complete compatibility with the CF-NetCDF conventions. We should put those issues aside though and focus on some of the key use cases and how to get them implemented in xarray and gdal.

rabernat commented 1 year ago

get them implemented in xarray and gdal.

Should we try to get R in here too?

dblodgett-usgs commented 1 year ago

Oh probably -- most of the R ecosystem is going in the direction of using GDAL though.

rabernat commented 1 year ago

most of the R ecosystem is going in the direction of using GDAL though.

Even for climate data? Like, you would open CMIP6 data using GDAL?

dblodgett-usgs commented 1 year ago

That does seem to be the direction things are heading, especially with the advent of the Multidimensional Array API: https://gdal.org/api/index.html#multi-dimensional-array-api
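For reference, a minimal sketch of that Multidimensional Array API via GDAL's Python bindings (GDAL >= 3.1); the R packages discussed in this thread ultimately wrap the same C API, and the file path and variable name here are hypothetical:

```python
from osgeo import gdal

# Open a multidimensional dataset (NetCDF, Zarr, ...) through the
# multidim raster interface rather than the classic 2-D raster API.
ds = gdal.OpenEx("example.nc", gdal.OF_MULTIDIM_RASTER)
rg = ds.GetRootGroup()

print(rg.GetMDArrayNames())          # arrays available in the root group
arr = rg.OpenMDArray("tas")          # hypothetical variable name

for dim in arr.GetDimensions():      # named dimensions, as in the NetCDF data model
    print(dim.GetName(), dim.GetSize())

srs = arr.GetSpatialRef()            # coordinate reference system, if georeferenced
data = arr.ReadAsArray()             # read the whole array as a NumPy array
```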

There are some alternatives that use the RNetCDF wrapper on the NetCDF-C API but it seems that the general consensus is that the logic in the GDAL package is strong enough that carrying around the GDAL installation is less problematic than needing to maintain an R package that has all that capability.

@edzer may have a bit more nuanced opinion though?

I would be willing to help with a base R zarr reader package that more or less paralleled the RNetCDF function signatures so we had an equivalent without the NetCDF-C API as a dependency, but it would be a fairly major undertaking.

This appears to be the furthest along base R implementation.

https://github.com/keller-mark/pizzarr

edzer commented 1 year ago

@edzer may have a bit more nuanced opinion though?

I have had good experience with the GDAL multidimensional array API, reported here. This API is relatively new, though, and maybe not yet used on a wide variety of Zarr files. I would be happy to see experiments reading Zarr files building on R packages that only use the NetCDF (NCZarr?) interface, if only to cross-check.

We successfully convinced the CRAN maintainers to include the BLOSC library in the packaging systems for binary (macos & windows) package distros (these come with a complete statically linked copy of GDAL + all its dependencies).

The main advantage of GDAL IMO is that it can read and write the coordinate reference system in a way the geo world is used to.

rabernat commented 1 year ago

This is so interesting to me. Never in a million years would we consider using GDAL to read NetCDF files for climate data analysis from python world. I've never heard of anyone ever doing that.

edzer commented 1 year ago

I wouldn't recommend doing it using the raster or vector API, but the multidimensional array API (and of course the spatial reference system API) are good at this!

dblodgett-usgs commented 1 year ago

Agreed. It seems odd. People working in R specifically on climate data would be in the same boat, but anyone who wants to integrate with geospatial data ends up having to implement the integration themselves. Classically, this has been done through the NetCDF-C API with RNetCDF (which mirrors the NetCDF-C API) and ncdf4 (which implements an opinionated approach to reading CF-like NetCDF data).

I implemented the direct NetCDF adapter for stars (https://r-spatial.github.io/stars/reference/read_ncdf.html) as well as some of the adapter components in nc-meta (https://hypertidy.github.io/ncmeta/) that allow you to go from CF-NetCDF to in-memory geospatially-referenced data. Trick is -- the multidimensional GDAL API does the same thing and is as good or better. So as long as it's not a blocker to get GDAL into an environment, there's very little argument to maintain an alternate implementation in base R.