Thanks so much for sharing this exchange, David.
I like Jonathan's suggestion -- we are actually creating a ZARR-CF, which doesn't exist yet. Given that, we don't really have to use the NetCDF-CF precedent and can talk about how this new format implements the CF data model.
This basically implies a hierarchy of data models / formats like this:

```mermaid
flowchart TD
    CF-Data-Model --> NetCDF
    CF-Data-Model --> Zarr
    NetCDF --> HDF5
    NetCDF --> NetCDF3-Classic
```
This is a bit in contrast to the current way things have been implemented, which is more like this:

```mermaid
flowchart TD
    CF-Data-Model --> NetCDF
    NetCDF --> HDF5
    NetCDF --> NetCDF3-Classic
    NetCDF --> Zarr
```
It's important to recognize that NetCDF is more than just a file container. It's a data model itself, far simpler and more generic / flexible than CF. There are already lots of applications, most prominently Xarray, that use the NetCDF data model (but not necessarily the full CF data model) on top of Zarr. Not to mention Unidata's NCZarr implementation, which uses the same sort of hierarchy (but with an out-of-spec flavor of Zarr).
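To make that concrete: xarray reads a Zarr store straight into its NetCDF-shaped data model of dimensions, variables, and attributes, with no CF interpretation required. A minimal sketch (the store path is hypothetical):

```python
import xarray as xr

# Open a Zarr store through xarray's NetCDF-shaped data model
# (the store path is hypothetical).
ds = xr.open_zarr("s3://example-bucket/demo.zarr")

print(ds.dims)       # named dimensions, as in NetCDF
print(ds.data_vars)  # variables defined over those dimensions
print(ds.attrs)      # global attributes
# None of this requires CF; CF conventions would add meaning via
# attributes such as "units" or "standard_name" on the variables.
```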
Deciding which hierarchy we want to pursue seems like a very important design decision.
Can you please confirm that the arrow in this context represents the relationship "A specifies B"?
From my understanding:
Therefore (I'm not claiming that's the case, I'm trying to understand):
```mermaid
flowchart TD
    NetCDF --> HDF5
    NetCDF --> CF-Data-Model
    GeoZarr --> CF-Data-Model
    GeoZarr --> Zarr
```
(updated)
The arrow means "sits on top of" or "is a subset of". Currently:
In set notation:

$$\text{CF} \subset \text{NetCDF4} \subset \text{HDF5}$$
f "all CF-compliant datasets adhere to the CF data model," then "GeoZarr" (a CF-compliant dataset) would also conform to the CF data model.
I think that it is essential to differentiate between the CF data model and CF-NetCDF. The CF data model defines a standard set of metadata conventions for describing the content and structure of gridded and ungridded scientific datasets. In contrast, CF-NetCDF refers to the application of these CF conventions specifically to datasets stored in the NetCDF file format.
GeoZarr is an example of a CF-compliant dataset that uses the Zarr storage format instead of NetCDF. Thus, GeoZarr adheres to the CF data model but is stored in a different file format, showcasing the adaptability of the CF conventions across various storage technologies.
```mermaid
flowchart TD
    CF-NetCDF --> NetCDF
    CF-NetCDF --> CF-Data-Model
    NetCDF --> HDF5
    GeoZarr --> CF-Data-Model
    GeoZarr --> Zarr
```
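To illustrate that separation between the CF conventions and the storage format, here is a minimal, hedged sketch of a dataset whose metadata follows CF conventions but whose storage is Zarr rather than NetCDF (all names and values are made up):

```python
import numpy as np
import xarray as xr

# CF-style metadata on the variable and its coordinates...
temp = xr.DataArray(
    np.zeros((3, 4), dtype="float32"),
    dims=("lat", "lon"),
    coords={
        "lat": ("lat", [10.0, 20.0, 30.0],
                {"units": "degrees_north", "standard_name": "latitude"}),
        "lon": ("lon", [0.0, 90.0, 180.0, 270.0],
                {"units": "degrees_east", "standard_name": "longitude"}),
    },
    attrs={"units": "K", "standard_name": "air_temperature"},
)
ds = xr.Dataset({"temp": temp}, attrs={"Conventions": "CF-1.9"})

# ...stored in Zarr rather than NetCDF.
ds.to_zarr("cf_style.zarr", mode="w")
```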
The complication here is that netCDF is not just a file format; it is a data model. The file format for NetCDF can be NetCDF3 (classic), HDF5, or Zarr (NCZarr and various other flavors).
Sure, but does this change much?
```mermaid
flowchart TD
    CF-NetCDF --> NetCDF
    CF-NetCDF --> CF-Data-Model
    NetCDF --> HDF5
    NetCDF --> Others
    GeoZarr --> CF-Data-Model
    GeoZarr --> Zarr
```
I see the CF data model as dependent on the NetCDF data model. First sentence of the CF Conventions abstract:
This document describes the CF conventions for climate and forecast metadata designed to promote the processing and sharing of files created with the netCDF Application Programmer Interface [NetCDF].
First paragraph of section 1.1 of the CF conventions ("Goals"):
The NetCDF library [NetCDF] is designed to read and write data that has been structured according to well-defined rules and is easily ported across various computer platforms. The netCDF interface enables but does not require the creation of self-describing datasets. The purpose of the CF conventions is to require conforming datasets to contain sufficient metadata that they are self-describing in the sense that each variable in the file has an associated description of what it represents, including physical units if appropriate, and that each value can be located in space (relative to earth-based coordinates) and time.
From the Introduction section of the CF Conventions document version 1.9 (dated 2021-05-20):
"Although the CF conventions were originally designed for climate and forecast data encoded in the netCDF binary format, they are also applicable to other forms of data storage, including other binary formats and databases. In the following, we use the term 'dataset' to mean a collection of data in any storage format to which the CF conventions can be usefully applied."
Source: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html
It seems the source version is wrong... I'm trying to find the URL again.
(sorry for the spam)
In the last 3 versions of CF, I could only find:
Although this is specifically a netCDF standard, we feel that most of the ideas are of wider application. The metadata objects could be contained in file formats other than netCDF. Conversion of the metadata between files of different formats will be facilitated if conventions for all formats are based on similar ideas.
And also (before the sections for CF-netCDF):
The CF data model should also be independent of the encoding. This means that it should not be constrained by the parts of the CF conventions which describe explicitly how to store (i.e. encode) metadata in a netCDF file. The virtue of this is that should netCDF ever fail to meet the community needs, the groundwork for applying CF to other file formats will already exist.
Ok, there is also:
"The elements of the CF data model (Figure I.2, Figure I.3 and Figure I.4) are called "constructs", a term chosen to differentiate from the CF-netCDF elements previously defined and to be programming language-neutral (i.e. as opposed to "object" or "structure"). The constructs, listed in Table I.2, are related to CF-netCDF elements (Figure I.1), which in turn relate to the components of netCDF file.
https://cfconventions.org/ links to all versions of the convention.
The relevant document for the CF Data Model is Annex I of the cf-conventions, https://cfconventions.org/cf-conventions/cf-conventions.html#appendix-CF-data-model, which was introduced in CF-1.9:
"This appendix contains the explicit data model for CF to provide an interpretation of the conceptual structure of CF which is consistent, comprehensive, and as far as possible independent of the netCDF encoding."
There is a NetCDF data model (https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html) that is independent of CF and has been implemented with numerous binary carriers. (CDL ASCII, NetCDF3 32bit, NetCDF3 64bit, NetCDF4/HDF5 Classic, NetCDF4/HDF5 Extended, NCZarr)
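For concreteness, those data-model elements (dimensions, variables, attributes, and, in the extended model, groups) are what the NetCDF APIs expose regardless of which carrier sits underneath. A minimal sketch using the netCDF4-python bindings (the file name is made up):

```python
from netCDF4 import Dataset

nc = Dataset("demo.nc", "w", format="NETCDF4")     # HDF5 carrier in this case
nc.createDimension("time", None)                   # a named (unlimited) dimension
temp = nc.createVariable("temp", "f4", ("time",))  # a variable over that dimension
temp.units = "K"                                   # an attribute
grp = nc.createGroup("forecasts")                  # a group (extended model only)
nc.close()
```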
The CF data model uses the term constructs ... " a term chosen to differentiate from the CF-netCDF elements previously defined and to be programming language-neutral (i.e. as opposed to "object" or "structure")."
Constructs are said to have properties rather than variables having attributes
This approach fully decouples the data model from its implementation -- something that has not been done with previous implementations of CF-NetCDF.
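As an illustration of the constructs-with-properties vocabulary, here is a hedged sketch using cf-python, an implementation of the CF data model (the file name is made up):

```python
import cf  # cf-python, an implementation of the CF data model

# Read a CF "field construct" (the file name is made up).
f = cf.read("example.nc")[0]

print(f.properties())  # constructs carry properties...
print(f.constructs)    # ...and are linked to other constructs
                       # (coordinates, cell methods, domain axes, ...)
```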
We should look at this through a traditional data modelling lens. The CF data model is a conceptual data model that corresponds to a few particular logical and physical data models. What we are doing here (GeoZarr) will be in the form of a logical data model that corresponds to the concepts of the CF data model and will be implemented as a physical data model using Zarr V3.
Thanks for weighing in David. Can you clarify one thing: is the CF data model a subset of the NetCDF data model? I.e. all data compliant with the CF data model is also compliant with the NetCDF data model?
No. The CF data model is abstract, as is the NetCDF data model. So they are actually kind of mutually exclusive.
In terms of the conceptual/logical/physical data model tiers, I see NetCDF (without any modifiers) as the conceptual data model that hinges on dimensions, variables, attributes and, in its extended form, groups.
If we think of both the CF data model and the NetCDF data model as providing conceptual building blocks, then CF-NetCDF is a logical data model that combines the two in a specific set of conventions. Those conventions have been implemented in a range of physical data models that include NetCDF in their name, which just confuses everyone.
What we are going to do may actually draw on the concepts of both the CF data model and the NetCDF data model but will be an independent logical data model from CF-NetCDF. This will free us up to implement things that are incompatible with the particulars of CF-NetCDF while maintaining compatibility as we will build on the same core concepts. ... I think, I'm not being particularly careful with my response here.
> No. The CF data model is abstract, as is the NetCDF data model. So they are actually kind of mutually exclusive.
Ok. This is definitely challenging the way I have been thinking about these concepts for years. I don't understand how they can be mutually exclusive, since all netCDF files are by construction compatible with the netCDF data model, and we store CF-compliant data in netCDF files. It will take me some time to process what you have said.
@rabernat, @dblodgett-usgs: Thank you for laying out this whole discussion. I understand much more now.
Let me write out some more for instances here...
I'm just using conceptual / logical / physical as an analysis tool here and recognizing the quite recent work to abstract the CF Data Model out of the CF-NetCDF convention. (https://doi.org/10.5194/gmd-10-4619-2017)
So as of CF-1.9, the CF-NetCDF convention is an implementation of the NetCDF data model and the CF data model.
So take a typical NetCDF4-Classic file that advertises CF-1.9 conventions:

- The physical data model of that file is HDF5-NetCDF4-Classic with CF-NetCDF conventions.
- The logical data model of that file is NetCDF-Classic with CF-NetCDF conventions.
- The conceptual data model of that file is the NetCDF Data Model and the CF Data Model.
In GeoZarr (work in progress):

- The physical data model is Zarr V3 with GeoZarr V* conventions.
- The logical data model is NetCDF-Zarr with GeoZarr conventions.
- The conceptual data model is the NetCDF Data Model with the CF Data Model.
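One way to see this layering in practice: the same in-memory (logical) dataset can be serialized to either physical carrier without changing its content. A minimal sketch, assuming xarray and made-up file names:

```python
import xarray as xr

ds = xr.Dataset(
    {"temp": (("time",), [280.0, 281.5])},
    coords={"time": [0, 1]},
    attrs={"Conventions": "CF-1.9"},
)

# Same conceptual/logical content, two physical encodings:
ds.to_netcdf("example.nc")            # NetCDF4/HDF5 physical model
ds.to_zarr("example.zarr", mode="w")  # Zarr physical model
```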
I think this conversation is mostly closed. The discussion has largely circled back to "we are basing this work more or less on the CF baseline but not being overly formal."
I think we are, in essence, building something based on the CF Data Model and the NetCDF data model, but not necessarily planning on having complete compatibility with the CF-NetCDF conventions. We should put those issues aside, though, and focus on some of the key use cases and how to get them implemented in xarray and GDAL.
> get them implemented in xarray and GDAL.
Should we try to get R in here too?
Oh probably -- most of the R ecosystem is going in the direction of using GDAL though.
> most of the R ecosystem is going in the direction of using GDAL though.
Even for climate data? Like, you would open CMIP6 data using GDAL?
That does seem to be the direction things are heading. Especially with the advent of the Multidimensional Array API. https://gdal.org/api/index.html#multi-dimensional-array-api
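For context, the multidimensional array API exposes groups, named dimensions, and n-dimensional arrays directly rather than forcing data through GDAL's classic 2-D raster model; R's stars wraps the same underlying C++ API. A sketch using the Python bindings (the dataset path is made up):

```python
from osgeo import gdal

# Open a store through GDAL's multidimensional array API
# (the path is made up).
ds = gdal.OpenEx("example.zarr", gdal.OF_MULTIDIM_RASTER)
root = ds.GetRootGroup()

names = root.GetMDArrayNames()  # variables, NetCDF-style
arr = root.OpenMDArray(names[0])

print([d.GetName() for d in arr.GetDimensions()])  # named dimensions
print(arr.GetSpatialRef())  # CRS handling, GDAL's strength
data = arr.ReadAsArray()    # read into a numpy array
```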
There are some alternatives that use the RNetCDF wrapper on the NetCDF-C API but it seems that the general consensus is that the logic in the GDAL package is strong enough that carrying around the GDAL installation is less problematic than needing to maintain an R package that has all that capability.
@edzer may have a bit more nuanced opinion though?
I would be willing to help with a base R zarr reader package that more or less paralleled the RNetCDF function signatures so we had an equivalent without the NetCDF-C API as a dependency, but it would be a fairly major undertaking.
This appears to be the furthest along base R implementation.
> @edzer may have a bit more nuanced opinion though?
I have had good experience with the GDAL multidimensional array API, reported here. This API is relatively new, though, and maybe has not been used on a wide variety of Zarr files. I would be happy to see experiments reading Zarr files building on R packages that only use the NetCDF (NCZarr?) interface, if only to cross check.
We successfully convinced the CRAN maintainers to include the BLOSC library in the packaging system for the binary (macOS & Windows) package distributions (these come with a complete, statically linked copy of GDAL plus all its dependencies).
The main advantage of GDAL IMO is that it can read and write the coordinate reference system in a way the geo world is used to.
This is so interesting to me. Never in a million years would we consider using GDAL to read NetCDF files for climate data analysis in the Python world. I've never heard of anyone ever doing that.
I wouldn't recommend doing it using the raster or vector API, but the multidimensional array API (and of course the spatial reference system API) are good at this!
Agreed. It seems odd. People working in R specifically on climate data would be in the same boat, but anyone who wants to integrate with geospatial data ends up having to implement the integration themselves. Classically, this has been done through the NetCDF-C API with RNetCDF (which mirrors the NetCDF-C API) and ncdf4 (which implements an opinionated approach to reading CF-like NetCDF data).
I implemented the direct NetCDF adapter for stars (https://r-spatial.github.io/stars/reference/read_ncdf.html) as well as some of the adapter components in ncmeta (https://hypertidy.github.io/ncmeta/) that allow you to go from CF-NetCDF to in-memory, geospatially referenced data. The trick is that the multidimensional GDAL API does the same thing and is as good or better. So as long as it's not a blocker to get GDAL into an environment, there's very little argument for maintaining an alternate implementation in base R.
I want to record a dialogue I had with @JonathanGregory for the record. I've included the substance of an email chain here but have removed some superfluous content (Dear Dave, Best Regards, etc.).
TL;DR:
@JonathanGregory points out:
...and...
...and...
@dblodgett-usgs concludes:
Complete exchange follows.
DB:
JG:
DB:
JG:
DB:
JG:
DB:
DB: to @rabernat, @briannapagan, and @christophenoel
@rabernat:
DB:
JG: