zarr-developers / geozarr-spec

This document aims to provides a geospatial extension to the Zarr specification. Zarr specifies a protocol and format used for storing Zarr arrays, while the present extension defines conventions and recommendations for storing multidimensional georeferenced grid of geospatial observations (including rasters).
106 stars 10 forks source link

Example Zarr file containing spatiotemporal remote sensing imagery #10

Closed kapadia closed 10 months ago

kapadia commented 1 year ago

Per request from our last meeting, I’m sharing a Zarr file containing spatiotemporal observations of remote sensing imagery. Following GDAL's Zarr documentation, this file contains a _CRS attribute in WKT format. Additionally, an affine transform is also appended allowing for pixel to world coordinate conversions. Following GDAL’s current behavior, these attributes are stored on the Zarr Array rather than the Zarr Group.

https://storage.googleapis.com/pl-amit-public/geozarr/planet-fusion.zarr.zip

Two issues were observed in this process:

If you'd like to observe the second issue, it can be reproduced using the following commands.

gdal_translate /vsicurl/https://github.com/rasterio/rasterio/blob/4c01600b0e8c8841c661cbaf1297ac52f601fc3a/tests/data/RGB.byte.tif?raw=true RGB.byte.zarr -of Zarr
python -c "import xarray; xarray.open_zarr('RGB.byte.zarr')"

I hope the file and these observations are useful to this group.

Edit:

This is the metadata associated with each array in the file above

<xarray.DataArray 'sr' (datetime: 365, band: 4, row: 256, col: 256)>
dask.array<open_dataset-e9e64732163986119bc61b23a21b10a4sr, shape=(365, 4, 256, 256), dtype=int16, chunksize=(365, 4, 32, 32), chunktype=numpy.ndarray>
Coordinates:
  * band      (band) <U5 'blue' 'green' 'red' 'nir'
  * datetime  (datetime) datetime64[ns] 2021-01-01 2021-01-02 ... 2021-12-31
Dimensions without coordinates: row, col
Attributes:
    _CRS:        PROJCS["WGS 84 / UTM zone 14N",GEOGCS["WGS 84",DATUM["WGS_19...
    _TRANSFORM:  [3.0, 0.0, 696000.0, 0.0, -3.0, 4536000.0]
rabernat commented 1 year ago

I was able to open this Zarr file successfully using Xarray, with no errors. I'm on Zarr v2.14.2 and Xarray v2023.2.0.

<xarray.Dataset>
Dimensions:  (Y: 718, X: 791)
Coordinates:
  * X        (X) float64 1.021e+05 1.024e+05 1.027e+05 ... 3.389e+05 3.392e+05
  * Y        (Y) float64 2.827e+06 2.826e+06 2.826e+06 ... 2.612e+06 2.612e+06
Data variables:
    Band1    (Y, X) float32 ...
    Band2    (Y, X) float32 ...
    Band3    (Y, X) float32 ...
kapadia commented 1 year ago

After discussion around this example, here is the code used to generate the example file:

import requests

import pandas as pd
import numpy as np
import dask.array as da
import dask
from dask.distributed import Client
import zarr
from numcodecs import Zstd
import xarray as xr
import rasterio as rio
from rasterio.windows import Window

def read_raster(uri, window):

    with rio.open(uri) as src:
        arr = src.read(window=window)

    return arr

def read_stac_item(uri):
    r = requests.get(uri)
    feature = r.json()

    datetime = pd.to_datetime(feature['properties']['datetime'])
    asset = feature['assets']['sr']
    bands = [item['common_name'] for item in asset['eo:bands']]
    band_count = len(bands)
    image_uri = asset['href']
    height = 256
    width = 256
    dtype = np.int16
    window = Window(col_off=0, row_off=0, width=width, height=height)

    image_arr = da.from_delayed(
        dask.delayed(read_raster)(image_uri, window),
        shape=(band_count, height, width),
        dtype=dtype)

    return xr.DataArray(image_arr[None, :],
        coords=dict(
            datetime=[datetime],
            band=bands
        ),
        dims=["datetime", "band", "row", "col"],
    )

def main():

    client = Client(processes=False)

    base_url = "https://www.planet.com/data/stac/fusion/14N/29E-188N"
    collection_url = f"{base_url}/collection.json"
    r = requests.get(collection_url)
    collection = r.json()
    item_uris = [
        f"{base_url}/{link['href']}" for link in collection['links'] if link['rel'] == 'item']

    # Probe one image to get CRS
    r_item = requests.get(item_uris[0])
    image_uri = r_item.json()['assets']['sr']['href']
    with rio.open(image_uri) as src:
        crs = src.crs.to_wkt()
        transform = list(src.transform)[0:6]

    futures = client.map(read_stac_item, item_uris)
    arrays = client.gather(futures)
    ds = xr.concat(arrays, dim='datetime', combine_attrs="identical").chunk(
        datetime=-1, band=-1, row=32, col=32)
    ds['datetime'] = pd.to_datetime(ds.datetime)
    ds.attrs = {
        **ds.attrs,
        "_CRS": crs,
        "_TRANSFORM": transform
    }

    dataset = xr.Dataset({
        "sr": ds
    })

    destination = "planet-fusion.zarr"
    dataset.to_zarr(
        destination,
        encoding={
            variable: {
                "compressor": zarr.Blosc(cname="zstd", clevel=3, shuffle=2)
            } for variable in dataset
    })

main()
kapadia commented 1 year ago

A new file is available that uses the proposal in #19.

https://storage.googleapis.com/pl-amit-public/geozarr/planet-fusion-20230510.zarr.zip

In [1]: import xarray as xr

In [2]: ds = xr.open_zarr('planet-fusion.zarr')

In [3]: ds
Out[3]:
<xarray.Dataset>
Dimensions:   (band: 4, datetime: 365, row: 256, col: 256)
Coordinates:
  * band      (band) <U5 'blue' 'green' 'red' 'nir'
  * datetime  (datetime) datetime64[ns] 2021-01-01 2021-01-02 ... 2021-12-31
Dimensions without coordinates: row, col
Data variables:
    sr        (datetime, band, row, col) int16 dask.array<chunksize=(365, 4, 32, 32), meta=np.ndarray>

In [4]: ds.sr
Out[4]:
<xarray.DataArray 'sr' (datetime: 365, band: 4, row: 256, col: 256)>
dask.array<open_dataset-78e64952e25134fc0fcda392329a7e9bsr, shape=(365, 4, 256, 256), dtype=int16, chunksize=(365, 4, 32, 32), chunktype=numpy.ndarray>
Coordinates:
  * band      (band) <U5 'blue' 'green' 'red' 'nir'
  * datetime  (datetime) datetime64[ns] 2021-01-01 2021-01-02 ... 2021-12-31
Dimensions without coordinates: row, col
Attributes:
    GeoTransform:  [3.0, 0.0, 696000.0, 0.0, -3.0, 4536000.0]
    _CRS:          PROJCS["WGS 84 / UTM zone 14N",GEOGCS["WGS 84",DATUM["WGS_...

In [5]: xr.__version__
Out[5]: '2023.4.2'

cc. @briannapagan