zarr-developers / geozarr-spec

This document aims to provide a geospatial extension to the Zarr specification. Zarr specifies a protocol and format used for storing Zarr arrays, while the present extension defines conventions and recommendations for storing multidimensional georeferenced grids of geospatial observations (including rasters).

Zarr Sprint Topics #33

Open briannapagan opened 5 months ago

briannapagan commented 5 months ago

Per our discussions in the bi-weekly GeoZarr SWG meeting, we identified a few focus tracks for the Zarr sprint coming up on February 7-8, 2024. In addition, I reviewed the original brainstorming ideas, first discussed a year ago and documented at https://hackmd.io/t2DWpX1iQEWMKx1Fi4Px7A?both#Let’s-brainstorm. Many of these ideas are captured by the proposed list we discussed on January 24th. Bidirectional interoperability with GDAL is another clear theme, although, as we discussed at the last SWG meeting, it would be very difficult to tackle in a single sprint and, more importantly, we may not have someone to lead it. Nevertheless, I am listing it as an option to see whether we can identify folks in the community to lead.

Here are the topics I have narrowed it down to:

  1. Pyramiding @maxrjones, virtual
  2. HTTP-browsable Zarr @rabernat @kbgg
  3. Zarr V3 for zarr-python @jhamman, virtual
  4. GDAL bidirectional interoperability (???)

Here is the proposed template that I ask the folks who are tagged as leading the tracks above to complete and share below.

As A Zarr Sprint track... 
Our focus is on <Outcome>
We believe it delivers <Impact> to <Whom>
This will be achieved when <some acceptance criteria>
The types of skills we need to complete this task are <some list>
We expect the level of difficulty to complete this to be <low, medium, high>

Topic leaders, if you can fill in the above template by Monday, January 29th, then as a community we can provide ranked responses by Wednesday, January 31st.

jhamman commented 5 months ago

As a Zarr Sprint track focused on enabling support for V3 in Zarr-Python, we are joining an ongoing effort working toward Zarr-Python version 3.0 (roadmap). Our focus is on closing outstanding issues on the roadmap and testing the development branch in common geospatial applications. Zarr-Python has traditionally been the canonical implementation of Zarr; therefore, we believe this effort delivers immediate impact to the largest swath of users, including those who use Zarr through downstream libraries (e.g. Xarray). This will be achieved when any of the roadmap issues are closed or some of the following objectives are completed:

The types of skills we need to complete this task are moderate to advanced familiarity with Python and Zarr. We expect the level of difficulty to complete this to be medium to high.

maxrjones commented 5 months ago

A Zarr Sprint track focused on geospatial multi-scales / pyramids.

Our focus is on identifying and addressing shortcomings of the ndpyramid utility, either through development in that library or by deciding where else development would need to happen.

ndpyramid is a utility for generating pyramids for Zarr datasets to enable performant visualization. The library was built specifically for use with the @carbonplan/maps toolkit and produces pyramids conforming to this schema. There has been persistent community interest in broader support and standards for pyramids in Zarr. The focus of this sprint is on establishing whether ndpyramid could provide the foundation for this support and, if so, developing towards that goal or, if not, establishing where else development would happen.
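For readers new to the term, here is a minimal sketch of what a "pyramid" means here: the same grid stored at successively coarser resolutions. It uses plain Python lists and a 2x2 block mean. The helper names `coarsen_2x` and `build_pyramid` are hypothetical, and this is not ndpyramid's API or its resampling method, which may differ (for example, reprojection-aware resampling).

```python
def coarsen_2x(grid):
    """Average non-overlapping 2x2 blocks; assumes even height and width."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def build_pyramid(grid, levels):
    """Return a list [full resolution, half resolution, ...] with `levels` entries."""
    pyramid = [grid]
    for _ in range(levels - 1):
        pyramid.append(coarsen_2x(pyramid[-1]))
    return pyramid


# A 4x4 grid holding the values 0..15, row by row.
base = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pyramid = build_pyramid(base, levels=3)
shapes = [(len(level), len(level[0])) for level in pyramid]
```

In a Zarr layout, each level would typically be written as its own array or group, so that a viewer can fetch only the resolution it needs for the current zoom.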

We anticipate that this sprint will progress development towards geospatial pyramids in Zarr that can be used broadly for dynamic client visualization approaches, tiling servers, and multi-scale analysis. This will serve data providers, front-end developers, and researchers.

This will be achieved when:

Some potentially more attainable goals for the short sprint:

The types of skills needed to complete this task are moderate familiarity with Python and Zarr. We would especially encourage participation from those familiar with geospatial projections, multiscale representations, and metadata conventions.

We expect the level of difficulty to complete this to be medium.

rabernat commented 5 months ago

Zarr Linked Hierarchy for HTTP-enabled Browsing

Focus and Outcomes

Our focus is on achieving the ability to explore nested Zarr groups over HTTP or other stores that do not provide a LIST-style operation.

This will enable

More Context

Zarr is not a file format; it is a specification for how to organize a nested hierarchy of numerical arrays and metadata in storage. In order to explore the contents of a Zarr hierarchy, clients generally need the ability to list the contents of directories in the storage layer. For filesystems or s3-compatible object storage, this is straightforward. However, most cloud-native geospatial data formats provide first-class read-only support via a vanilla HTTP protocol. To address this need, Zarr V2 implemented a somewhat hacky “consolidated metadata” approach, in which all the metadata from a hierarchy are condensed into a single json file. This approach does not scale to very large, deeply nested Zarr hierarchies. Now that Zarr V3 has been ratified, there is an opportunity to develop an extension that supports this HTTP-browsing use case in a more scalable and robust way. Specifically, we imagine developing a STAC-like mechanism for explicit links between parent and child groups that allow an HTTP client to quickly traverse a Zarr hierarchy.
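A minimal sketch of the linked-hierarchy idea, under loud assumptions: each group's `zarr.json` carries a hypothetical `links` attribute naming its children, and a dict stands in for an HTTP server that can only GET exact keys. The `links` name and layout are invented for illustration and are not part of the Zarr V3 spec or of any agreed extension. The array metadata is also trimmed to the fields the sketch needs.

```python
import json

# Stand-in for an HTTP server: GET on an exact key works, listing does not exist.
# The "links" attribute is hypothetical and only illustrates the idea.
FAKE_STORE = {
    "zarr.json": json.dumps(
        {"zarr_format": 3, "node_type": "group",
         "attributes": {"links": ["climate", "ocean"]}}
    ),
    "climate/zarr.json": json.dumps(
        {"zarr_format": 3, "node_type": "group",
         "attributes": {"links": ["temperature"]}}
    ),
    "climate/temperature/zarr.json": json.dumps(
        {"zarr_format": 3, "node_type": "array"}
    ),
    "ocean/zarr.json": json.dumps({"zarr_format": 3, "node_type": "array"}),
}


def http_get(key):
    """Return the document stored at `key`, or None (comparable to a 404)."""
    return FAKE_STORE.get(key)


def walk(prefix=""):
    """Yield (path, node_type) for every node reachable through explicit links."""
    key = f"{prefix}/zarr.json" if prefix else "zarr.json"
    raw = http_get(key)
    if raw is None:
        return
    meta = json.loads(raw)
    yield prefix or "/", meta["node_type"]
    for child in meta.get("attributes", {}).get("links", []):
        child_prefix = f"{prefix}/{child}" if prefix else child
        yield from walk(child_prefix)


nodes = list(walk())
```

Each group costs one request, so a client can stop descending at any depth, whereas a single consolidated file has to be downloaded in full before anything can be explored.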

Requirements

Non-goals:

Implementation Plan and Skills Needed

We will try to implement this capability in zarr-python on the V3 branch. Contributors should be intermediate Python programmers (understand best practices around Python objects, typing, and code structure). Familiarity with the Zarr code base is not required but helpful. Participants should review the V3 roadmap and design document.

I'm also open to implementing this first in a javascript library, rather than Python. For example, in the source.coop viewers package.

jhamman commented 5 months ago

Along the lines of my comment above, I have a concrete proposal that could be fun for someone (like @kylebarron 😉) to work on.

Zarr Python's V3 store interface is being redesigned to provide an all-async interface. The idea we have been discussing is to write a store on top of the Rust Object-Store crate. There are already Python bindings for this project but they are not async ready. If this particular plan is successful, it is possible this could become the core store in the zarr-python project.

martindurant commented 5 months ago

@jhamman, I've been reading https://docs.rs/pyo3-asyncio/latest/pyo3_asyncio/index.html very carefully. Given I already did this once in rfsspec, I am prepared to give it a go on top of object-store. rfsspec showed marginal benefits, so while it may be worthwhile, do not expect a big return for the probably substantial effort.

Note that using rust async (tokio) in python async (asyncio) requires two event loops on two threads; it isn't simple! We also want to enable dask-style access from multiple (python) threads, so... Also, python bytes objects are annoying in rust (numpy buffers would be better, even for bytes output).
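To make the threading concern concrete, here is a Python-only sketch of the usual pattern for serving several worker threads from one event loop that lives on its own thread. It does not involve Rust or tokio. The `fetch` coroutine is a hypothetical stand-in for an async store read, and the sketch only illustrates the Python half of the problem described above.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

# One dedicated asyncio event loop, running on a background thread.
loop = asyncio.new_event_loop()
loop_thread = threading.Thread(target=loop.run_forever, daemon=True)
loop_thread.start()


async def fetch(key):
    """Hypothetical async store read; the sleep stands in for network I/O."""
    await asyncio.sleep(0.01)
    return f"bytes-for-{key}".encode()


def fetch_sync(key):
    """Blocking wrapper that can be called from any worker thread."""
    future = asyncio.run_coroutine_threadsafe(fetch(key), loop)
    return future.result(timeout=5)


# Dask-style access: several Python threads share the single event loop.
keys = [f"c/{i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_sync, keys))

# Shut the loop down cleanly once all work is done.
loop.call_soon_threadsafe(loop.stop)
loop_thread.join(timeout=5)
loop.close()
```

Bridging to a Rust runtime would add a second loop on the Rust side, which is where the extra complexity mentioned above comes from.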

martindurant commented 5 months ago

(I would be interested in this, because a Rust-only Zarr and kerchunk solution is of very general interest to those who need a C-level API; however, if we don't use numpy as the storage, and we don't have numcodecs directly, it may raise more questions than it answers; cf. https://github.com/sci-rs/zarr ).

kylebarron commented 5 months ago

> Note that using rust async (tokio) in python async (asyncio) requires two event loops on two threads; it isn't simple

FWIW the next version of pyo3 is likely to have big progress in async handling, and it sounds like it might no longer need two event loops? https://github.com/PyO3/pyo3/issues/1632#issuecomment-1752582018

> Also, python bytes objects are annoying in rust (numpy buffers would be better, even for bytes output).

Why are they annoying? Is it because the memory is Python-allocated instead of Rust-allocated?

martindurant commented 5 months ago

> the memory is Python-allocated instead of Rust-allocated?

Yes, but also the internal immutability guarantee makes zero-copy handing the memory to/from rust hard. In rfsspec, I already wrote code around the python buffer protocol to cope with this, which appears to work but sidelines rust's memory protections.
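The immutability point can be seen from pure Python, without any Rust. This small sketch shows that a `bytes` object exposes a read-only buffer, while a `bytearray` (or, similarly, a numpy array) exposes a writable one that a native extension could fill in place. It is an illustration only and is not the rfsspec code.

```python
payload = b"chunk-data"

# bytes exposes the buffer protocol, but only as read-only memory.
view = memoryview(payload)
try:
    view[0] = 0
    wrote_to_bytes = True
except TypeError:
    wrote_to_bytes = False

# A bytearray exposes a writable buffer of the same size.
buf = bytearray(len(payload))
mv = memoryview(buf)
mv[:] = payload      # one explicit copy into the writable buffer
mv[0] = ord("C")     # in-place write, visible through `buf`

# Slicing a memoryview shares memory instead of copying it.
window = mv[6:]
window[0] = ord("D")
```

Code that receives `bytes` from a native layer therefore either pays for a copy or has to step around the immutability guarantee, which is the trade-off described above.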

jhamman commented 4 months ago

An additional v3 sprint topic idea, this one aimed at @TomNicholas: a manifest storage transformer. Specific goals for this sprint could be to:

  1. Evaluate the proposal - https://github.com/zarr-developers/zarr-specs/issues/287
  2. Hack together a small sample dataset using Kerchunk and bespoke translation code
  3. Break ground on Zarr-Python's first v3 storage transformer - the logic is actually quite easy but the internal hooks are not complete.
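As a rough sketch of the chunk-manifest idea, assume a manifest is a mapping from chunk key to a byte range inside an existing file. The field names `path`, `offset` and `length`, and the helper `read_chunk`, are assumptions made for this example; the format discussed in the linked proposal may differ, and this is not Zarr-Python's storage transformer interface.

```python
import os
import tempfile

# A stand-in "legacy" file: a 6-byte header followed by two chunks back to back.
tmpdir = tempfile.mkdtemp()
legacy_path = os.path.join(tmpdir, "legacy.bin")
with open(legacy_path, "wb") as f:
    f.write(b"HEADER" + b"AAAA" + b"BBBBBB")

# Hypothetical manifest: chunk key -> location of the bytes in the legacy file.
manifest = {
    "0.0": {"path": legacy_path, "offset": 6, "length": 4},
    "0.1": {"path": legacy_path, "offset": 10, "length": 6},
}


def read_chunk(manifest, chunk_key):
    """Resolve a chunk key through the manifest and read only that byte range."""
    entry = manifest.get(chunk_key)
    if entry is None:
        return None  # chunk not present, e.g. to be treated as fill value
    with open(entry["path"], "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["length"])


chunk_a = read_chunk(manifest, "0.0")
chunk_b = read_chunk(manifest, "0.1")
missing = read_chunk(manifest, "9.9")
```

The lookup itself is simple; the open part is where it plugs into the store and codec pipeline.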

maxrjones commented 4 months ago

In the Zarr pyramids breakout group, Thomas Maschler and I discussed the motivations for following the OGC TileMatrixSet 2.0 specification within the GeoZarr specification, which will be shared as a new issue to supersede https://github.com/zarr-developers/geozarr-spec/issues/30. We also discussed reading those TMS into rio-tiler using Xarray and started a refactor of ndpyramid to support the TMS specification.

jhamman commented 4 months ago

Zarr-Python post-sprint update

  1. @rabernat and @maxrjones worked on Zarr-Python's test environment and CI setup (https://github.com/zarr-developers/zarr-python/issues/1648)
  2. @d-v-b worked on removing the attrs dependency from Zarr-Python (https://github.com/zarr-developers/zarr-python/issues/1624)
  3. @kylebarron worked on a prototype store using new async Python bindings to Rust's object-store project

Thanks all!

wietzesuijker commented 4 months ago

I added two example scripts for interactive GeoZarr in QGIS.

  1. an attempt at a simple GeoZarr with overviews and the GeoZarr spec's metadata
  2. adding the GDAL-supported _CRS attribute for CRS detection in QGIS
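For orientation, a small sketch of what such an attribute might look like in a Zarr V2 `.zattrs` file. It assumes the `_CRS` dictionary uses a `url` key as described in the GDAL Zarr driver documentation, and `_ARRAY_DIMENSIONS` is the Xarray dimension-name convention. The array name `elevation` and the chosen CRS are made up, and this is not a copy of the linked scripts.

```python
import json
import os
import tempfile

# Hypothetical array directory; only the attributes file is written here.
array_dir = os.path.join(tempfile.mkdtemp(), "example.zarr", "elevation")
os.makedirs(array_dir)

attrs = {
    # Xarray convention for naming the array's dimensions.
    "_ARRAY_DIMENSIONS": ["y", "x"],
    # CRS hint read by GDAL; the "url" form is assumed from the GDAL Zarr driver docs.
    "_CRS": {"url": "http://www.opengis.net/def/crs/EPSG/0/4326"},
}

attrs_path = os.path.join(array_dir, ".zattrs")
with open(attrs_path, "w") as f:
    json.dump(attrs, f, indent=2)

# Read the file back to confirm it is valid JSON with the expected keys.
with open(attrs_path) as f:
    loaded = json.load(f)
```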

TomNicholas commented 4 months ago

In the "chunk manifest / virtual concatenation" group our main outcome was a long technical discussion, which I've written up in ZEP-like form here https://github.com/zarr-developers/zarr-specs/issues/288#issuecomment-1939265240