zarr-developers / geozarr-spec

This document aims to provides a geospatial extension to the Zarr specification. Zarr specifies a protocol and format used for storing Zarr arrays, while the present extension defines conventions and recommendations for storing multidimensional georeferenced grid of geospatial observations (including rasters).
106 stars 10 forks source link

Add GeoTransform as implemented by GDAL #19

Open dblodgett-usgs opened 1 year ago

dblodgett-usgs commented 1 year ago

For consideration. It's rough, but I think this is more or less what we are after in #17 ??

I think the logic to just adopt what's already in GDAL makes a lot of sense.

This isn't incompatible with CF in that data that uses this approach just wouldn't have georeference information in a client that expects normal coordinate variables.

dblodgett-usgs commented 1 year ago

@christophenoel -- I've altered my terminology here to avoid the Auxiliary term per discussion in #17

briannapagan commented 1 year ago

@dblodgett-usgs can you identify 1-2 reviewers you feel that would have substantive inputs? Is this something we can already prototype interoptibility?

dblodgett-usgs commented 1 year ago

I don't have permissions to add reviewers formally, but perhaps @christophenoel, @rouault, @edzer (others in the R spatial community?), @snowman2 (others in the python spatial community)?

We could certainly prototype it -- it's actually more or less in use in the wild since it's adopting what GDAL already does.

briannapagan commented 1 year ago

We could certainly prototype it -- it's actually more or less in use in the wild since it's adopting what GDAL already does.

Could you attach an example zarr file then?

snowman2 commented 1 year ago

Sounds like a good idea. Relevant references:

dblodgett-usgs commented 1 year ago

@briannapagan -- I'm not really in a position to be able to create a zarr mock up. Maybe someone a little closer to the implementation could do it? The NetCDF CDL representation would look like:

float data(latitude, longitude) ;
    data:grid_mapping = "crs: latitude, longitude" ;
    ...
  int crs ;
    crs:geotransformation_matrix = Xoff, X0, X1, Yoff, Y0, Y1 ;
    crs:grid_mapping_name = "latitude_longitude";
    crs:longitude_of_prime_meridian = 0.0 ;
    crs:semi_major_axis = 6378137.0 ;
    crs:inverse_flattening = 298.257223563 ;
    crs:crs_wkt =
     GEODCRS["WGS 84",
     DATUM["World Geodetic System 1984",
       ELLIPSOID["WGS 84",6378137,298.257223563,
         LENGTHUNIT["metre",1.0]]],
     PRIMEM["Greenwich",0],
     CS[ellipsoidal,3],
       AXIS["(lat)",north,ANGLEUNIT["degree",0.0174532925199433]],
       AXIS["(lon)",east,ANGLEUNIT["degree",0.0174532925199433]],
       AXIS["ellipsoidal height (h)",up,LENGTHUNIT["metre",1.0]]]
dblodgett-usgs commented 1 year ago

@snowman2 -- so what I take away from your references is that this is basically already supported in more than just GDAL? What do you think about space separated vs an actual vector of doubles?

e.g. https://corteva.github.io/rioxarray/stable/rioxarray.html#rioxarray.rioxarray.XRasterBase.write_transform

Based on this, I think it is sensible to just move ahead with this proposal. The only catch is that gdal uses space seperated strings and it may be good to support both strings and explicit vector attributes?

snowman2 commented 1 year ago

What do you think about space separated vs an actual vector of doubles?

The approach rioxarray has taken is to maximize compatibility. So, the format that works with GDAL is the version used by rioxarray. It would be nice if the storage format here is compatible with both netCDF and zarr in GDAL.

dblodgett-usgs commented 1 year ago

Per updates and discussion on the community call today, we generally have agreement on the current PR. Does anyone tagged here have interest in review / expressing support?

The text now describes that the origin is intended to be a cell corner and shows the equation to resolve cell centers.

We have settled on space delimited ascii for compatibility with existing implementations.

We are using a GeoTransform attribute on the grid_mapping variable, largely following CF and compatible with GDAL and rioxarray.

briannapagan commented 1 year ago

I would be willing to review if we can get an example file to test. @kapadia can you cross check what you posted here: https://github.com/zarr-developers/geozarr-spec/issues/10 versus this PR?

christophenoel commented 1 month ago

It seems to me that this PR aligns with the desire to support multiple coordinate encodings (not only coordinate arrays), which I fully support.

Note that for indicating the projection of a n-D data variable, the current convention is to define in the data array a property called grid_mapping, indicating the name of the auxiliary variable (a Zarr array whcih is not data, neither coordinate) that contains the attributes describing this grid mapping. The auxiliary variables that contains the "grid_mapping" is an empty array and contains the crs attributes (as shown in above example)

Therefore, we could also use as proposed this grid_mapping property for describing origin_offset coordinates (using the GeoTransform).

However, I think that for consistency, each dimension of a variable should map to a coordinate variable. For origin_offset instead of containing a 1D array of all coordinates values, the Zarr Array, would contain either:

To maximise/facilitate compatibility with map tools / generic client, I would suggest to preserve the grid_mapping when provided, but to impose the origin_offset tuple for each coordinate variable.

ashiklom commented 6 days ago

Here's a rough draft at a suggestion (assuming an extension of the Zarr v3 spec). The specific granule I am using here is HLS.S30.T11XNE.2024176T201849.v2.0.B01.tiff. Full output of gdalinfo on that file is in the "Details".

Something to consider.

import rasterio

rast = rasterio.open("sample-data/HLS.S30.T11XNE.2024176T201849.v2.0.B01.tiff")

# Creating a `zarr.json` dict "by hand".
zarr_meta = {
    "zarr_format": 3,
    "shape": rast.shape,  # (3660, 3660),
    # Other stuff...
    "chunk_grid": {
        "name": "raster",  # NOTE: custom grid type
        "configuration": {
            "crs": {
                # Either an EPSG code
                "epsg": rast.crs.to_epsg(),  # 32611
                # ...or a WKT string
                "wkt": rast.crs.to_wkt()    # 'PROJCS["WGS 84 / UTM zone 11N",GEOGCS[<etc...>]]'
            },
            "transform": list(rast.transform),   # [30.0, 0.0, 499980.0, 0.0, -30.0, 8500020.0, 0.0, 0.0, 1.0],

            # Is each pixel a point or an area? See https://docs.ogc.org/is/19-008r4/19-008r4.html#RasterSpace
            # See also https://docs.ogc.org/is/19-008r4/19-008r4.html#_cookbook_for_defining_transformations
            "area_or_point": rast.tags()["AREA_OR_POINT"]  # 'Area'
        }
    }
}

Some additional notes:

  1. A tricky thing about the affine transform is that if there is any rotation (which is very common!), we cannot use lat and lon (or whatever projected coordinates) as dimensions anymore (or, at least, x and y no longer map cleanly onto lon and lat, respectively). In other words: Let transform = [a, b, c, d, e, f, g, h]. In that case, lon = ax + by + c and lat = ex + fy + g (or something like that). For nonzero values of b and f, we now have to treat lat and lon as data, not dimensions, because they depend on both x and y. That means we have to think differently about how Xarray's dat.sel(lat = slice(...)) for this kind of dataset would work --- we can't just define lat and lon as dimensions and then proceed as usual.

    • That said, the whole point of an affine transform is that a .sel(lat = ...) operation doesn't have to do a lookup; it just has to solve the system of equations above for x and y. When both indices are specified (.sel(lon = n, lat = m, method = "nearest")), this has a unique solution --- you get one value back (can be solved with matrix algebra). When just one index is specified (.sel(lon = n, method = "nearest")), the solution is a list of x and y values.
    • Another school of thought is that .sel is fundamentally the wrong abstraction for this and, rather than trying to get GeoZarr working with vanilla xarray, we should get GeoZarr working with rioxarray and just use rioxarray's rio GDAL-based methods for these kinds of spatial subsets (maybe, with some extra syntactic sugar).
  2. The GeoTIFF area_or_point thing is roughly analogous to CF's idea of points vs. cells (https://cfconventions.org/cf-conventions/cf-conventions.html#_data_representative_of_cells). When area_or_point = "Point", that is analogous to the CF default --- the data are exactly at the coordinates specified. When area_or_point = "Area", that is analogous to a CF "bounds" situation --- the data fall between coordinates, in the middle of the cell. The relevant GeoTIFF convention is described here: https://docs.ogc.org/is/19-008r4/19-008r4.html#RasterSpace. The relevant GDAL documentation for GeoTIFF (where the AREA_OR_POINT variable is defined) is described here: https://gdal.org/drivers/raster/gtiff.html#metadata

``` # gdalinfo sample-data/HLS.S30.T11XNE.2024176T201849.v2.0.B01.tiff Driver: GTiff/GeoTIFF Files: sample-data/HLS.S30.T11XNE.2024176T201849.v2.0.B01.tiff Size is 3660, 3660 Coordinate System is: PROJCRS["WGS 84 / UTM zone 11N", BASEGEOGCRS["WGS 84", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",4326]], CONVERSION["UTM zone 11N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-117, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1]], USAGE[ SCOPE["Navigation and medium accuracy spatial referencing."], AREA["Between 120°W and 114°W, northern hemisphere between equator and 84°N, onshore and offshore. Canada - Alberta; British Columbia (BC); Northwest Territories (NWT); Nunavut. Mexico. United States (USA). "], BBOX[0,-120,84,-114]], ID["EPSG",32611]] Data axis to CRS axis mapping: 1,2 Origin = (499980.000000000000000,8500020.000000000000000) Pixel Size = (30.000000000000000,-30.000000000000000) Metadata: ACCODE=LaSRC add_offset=0.0 AREA_OR_POINT=Area arop_ave_xshift(meters)=0 arop_ave_yshift(meters)=0 arop_ncp=0 arop_rmse(meters)=0 arop_s2_refimg=NONE cloud_coverage=8 DATASTRIP_ID=S2B_OPER_MSI_L1C_DS_2BPS_20240624T223120_S20240624T201851_N05.10 HLS_PROCESSING_TIME=2024-06-26T13:18:32Z HORIZONTAL_CS_CODE=EPSG:32611 HORIZONTAL_CS_NAME=WGS84 / UTM zone 11N L1C_IMAGE_QUALITY=NONE L1_PROCESSING_TIME=2024-06-24T23:19:08.431601Z long_name=Coastal_Aerosol MEAN_SUN_AZIMUTH_ANGLE=191.826619464522 MEAN_SUN_ZENITH_ANGLE=52.9361257682694 MEAN_VIEW_AZIMUTH_ANGLE=125.768796718571 MEAN_VIEW_ZENITH_ANGLE=6.74140452066478 MSI band 01 bandpass adjustment slope and offset=0.995900, -0.000200 MSI band 02 bandpass adjustment slope and offset=0.977800, -0.004000 MSI band 03 bandpass adjustment slope and offset=1.007500, -0.000800 MSI band 04 bandpass adjustment slope and offset=0.976100, 0.001000 MSI band 11 bandpass adjustment slope and offset=1.000000, -0.000300 MSI band 12 bandpass adjustment slope and offset=0.986700, 0.000400 MSI band 8a bandpass adjustment slope and offset=0.996600, 0.000000 NBAR_SOLAR_ZENITH=53.0242290542587 NCOLS=3660 NROWS=3660 OVR_RESAMPLING_ALG=NEAREST PROCESSING_BASELINE=05.10 PRODUCT_URI=S2B_MSIL1C_20240624T201849_N0510_R071_T11XNE_20240624T223120.SAFE scale_factor=0.0001 SENSING_TIME=2024-06-24T20:23:29.449379Z SPACECRAFT_NAME=Sentinel-2B spatial_coverage=98 SPATIAL_RESOLUTION=30 TILE_ID=S2B_OPER_MSI_L1C_TL_2BPS_20240624T223120_A038135_T11XNE_N05.10 ULX=499980 ULY=8500020 _FillValue=-9999 Image Structure Metadata: COMPRESSION=DEFLATE INTERLEAVE=BAND PREDICTOR=2 Corner Coordinates: Upper Left ( 499980.000, 8500020.000) (117d 0' 2.78"W, 76d34'51.70"N) Lower Left ( 499980.000, 8390220.000) (117d 0' 2.59"W, 75d35'49.26"N) Upper Right ( 609780.000, 8500020.000) (112d46'11.37"W, 76d32'44.49"N) Lower Right ( 609780.000, 8390220.000) (113d 3' 8.26"W, 75d33'51.04"N) Center ( 554880.000, 8445120.000) (114d57'21.20"W, 76d 4'49.86"N) Band 1 Block=256x256 Type=Int16, ColorInterp=Gray Description = Coastal_Aerosol NoData Value=-9999 Overviews: 1830x1830, 915x915, 458x458, 229x229 Offset: 0, Scale:0.0001 ```
mdsumner commented 6 days ago

Where do you get that rotation in affine terms is common? It's extremely rare, and it's really better to work with an extent in most cases which is simpler to use and think about. No one uses a transform for specifying the output grid for regridding for example, though that's probably how the format will store it (an affine transform is great of course and a massive step forward for xarray and Zarr for when rectilinear is degenerate (which is extremely common) or curvilinear is degenerate (less common but out there and even more cryptic)).

christophenoel commented 5 days ago
zarr_meta = {
    "zarr_format": 3,

Hello,

This looks like an interesting idea. However, I am not yet familiar with the attributes in Zarr V3. Based on V2, in a dataset (as outlined in the current draft of the specification), either the Zarr group of the dataset or a child (empty) Zarr array (such as grid_mapping) would hold such metadata in the .zattrs file (zarr.json in V3).

ashiklom commented 5 days ago

Where do you get that rotation in affine terms is common?

@mdsumner I may be wrong! I thought it was fairly common for things like airborne and (lower-level) satellite data, where the rotation is oriented in the direction of the flight path / orbit and you can save some space by not having big corners of NaN in every scene. That said, my example above is harmonized Landsat-Sentinel (HLS), a classic example of this kind of data, and it's not rotated...so maybe this is rarer than I thought. If most data really are strictly north-south oriented, then great! That makes the problem much simpler. And maybe it's somewhat out of scope for GeoZarr to have really good support for rotated grids.

However, I am not yet familiar with the attributes in Zarr V3. Based on V2, in a dataset (as outlined in the current draft of the specification), either the Zarr group of the dataset or a child (empty) Zarr array (such as grid_mapping) would hold such metadata in the .zattrs file (zarr.json in V3).

@christophenoel Agreed that exactly where these metadata should live is a bit of an open question. I don't have strong opinions about it. I do think that we need to be proactive in supporting Zarr v3, though, so that GeoZarr doesn't immediately become outdated. If it's not too much work to also support v2, of course we should do that, too.

mdsumner commented 5 days ago

I think there are some satellite streams that have this rotation (you need to have it as capability!) but in 20 years I've only encountered it a couple of times. FWIW it's not about north-south orientation, consider a map of Antarctica in Polar Stereographic. It's just a basic offset and scale situation in units-of-the-projection (same as any other regular grid, in Mercator, LAEA, or longlat etc).

christophenoel commented 4 days ago

@christophenoel Agreed that exactly where these metadata should live is a bit of an open question. I don't have strong opinions about it. I do think that we need to be proactive in supporting Zarr v3, though, so that GeoZarr doesn't immediately become outdated. If it's not too much work to also support v2, of course we should do that, too.

This seems indeed totally feasible. I hope to study if supporting both would be straightforward in the next weeks.

mdsumner commented 4 days ago

It belongs in xarray (and hopefully the cross lang library that becomes ...)

They're working on it