zarr-developers / geozarr-spec

This document aims to provides a geospatial extension to the Zarr specification. Zarr specifies a protocol and format used for storing Zarr arrays, while the present extension defines conventions and recommendations for storing multidimensional georeferenced grid of geospatial observations (including rasters).
106 stars 10 forks source link

Enumerating tool chains for geozarr in different ecosystems #5

Open rabernat opened 1 year ago

rabernat commented 1 year ago

As we discussed on today's call, it will be very useful to enumerate different tool chains used to read Zarr in a geospatial context in different languages / package ecosystems. This will help us understand where GeoZarr actually needs to be implemented.

I'll go first in a reply to this issue.

rabernat commented 1 year ago

Toolchain: Python / Pangeo (Zarr + Dask + Xarray + MetPy)

Overview

The Pangeo ecosystem has done a lot to move Zarr forward as a cloud native format. Here's what our stack looks like for the common use case of reading Zarr from cloud storage.

If you want to parse CRS information out of your Xarray data that has been loaded in this way, I believe that your only option is MetPy's assign_crs function, but I could be wrong about that.

Dependency chain

flowchart TD
    s3fs --> zarr-python
    zarr-python --> Dask
    Dask --> Xarray
    zarr-python --> Xarray
    Xarray --> MetPy
christophenoel commented 1 year ago

I'm not sure to see what you mean with the CRS. You might parse it with MetPy (and I suppose with Rioxarry) if you first set it, right ?

rabernat commented 1 year ago

I'm not sure to see what you mean with the CRS. You might parse it with MetPy (and I suppose with Rioxarry) if you first set it, right ?

What I mean is that Xarray by itself has no inherent understanding of CRS. The metadata are there, but not useful. MetPy's parse_cf function attaches an actual cartopy CRS object to the dataset, which can then be used for plotting. Here's an example I saw on twitter today: https://twitter.com/at_dot_Py/status/1627206421950664705

rabernat commented 1 year ago

To follow up on this issue, at today's meeting, we played around with trying to parse data from Brianna's example Zarr dataset into rioxarray. Notebook here

https://gist.github.com/rabernat/8c53c380fbb38dcf556e85487960a847

TomAugspurger commented 1 year ago

Here's an example of rioxarray (building on rasterio & GDAL) understanding CRS information (via rioxarray's support for CF conventions I believe).

import xarray as xr
import rioxarray
import requests

token = requests.get(
    "https://planetarycomputer.microsoft.com/api/sas/v1/token/daymet-daily-hi"
).json()["token"]

ds = xr.open_dataset("abfs://daymet-zarr/daily/hi.zarr", engine="zarr", decode_coords="all", storage_options={"account_name": "daymeteuwest", "credential": token})
print(ds.rio.crs)
print("---")
print(ds.rio.transform())

which prints out

PROJCS["undefined",GEOGCS["undefined",DATUM["undefined",SPHEROID["undefined",6378137,298.257223563]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433]],PROJECTION["Lambert_Conformal_Conic_2SP"],PARAMETER["standard_parallel_1",25],PARAMETER["standard_parallel_2",60],PARAMETER["latitude_of_origin",42.5],PARAMETER["central_meridian",-100],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]
---
| 1000.00, 0.00,-5802750.00|
| 0.00,-1000.00,-38500.00|
| 0.00, 0.00, 1.00|

In that case, the .crs object is a rasterio.crs.CRS. I believe it used pyproj to parse the CF metadata into a CRS object.

It doesn't appear that odc.geo correctly reads the geospatial metadata for this dataset. I'll share an example later of odc.geo correctly interpreting geospatial metadata.

rabernat commented 1 year ago

Super useful Tom! You issue reveals the problem that different python libraries use different in-memory representation of CRS / transform. This is separate from, but closely related to, the on-disk representation (e.g. #12).

christophenoel commented 1 year ago

@TomAugspurger : so the source metadata only includes CF projection (grid mapping) and rioxarray handles that ? I don't remember that this was working last year on my side. Surprising.

TomAugspurger commented 1 year ago

You issue reveals the problem that different python libraries use different in-memory representation of CRS / transform.

Yes, at some point I want to dig into why that is. As long as they're all using the same "protocol" things aren't so bad, but still it'd be nicer to consolidate if possible.

@TomAugspurger : so the source metadata only includes CF projection (grid mapping) and rioxarray handles that ? I don't remember that this was working last year on my side. Surprising.

Yes. I think it's some combination of https://github.com/corteva/rioxarray/blob/34c64140ba1f38e020e6d4b2cd0f3e4bb4c36f62/rioxarray/rioxarray.py#L262-L277 and https://github.com/corteva/rioxarray/blob/34c64140ba1f38e020e6d4b2cd0f3e4bb4c36f62/rioxarray/rioxarray.py#L307-L310 in rioxarray. It is important to use decode_coords="all" in this specific example, otherwise rioxarray isn't able to infer what to use for the CRS. I haven't looked at why.