Open briannapagan opened 5 months ago
As a Zarr Sprint track focused on enabling support for V3 in Zarr-Python we are joining an ongoing effort working toward Zarr-Python version 3.0 (roadmap). Our focus is on closing outstanding issues on the roadmap and testing the development branch in common geospatial applications. Zarr-Python has traditionally been the canonical implementation of Zarr, so we believe this effort delivers immediate impact to the largest swath of users, including those that use Zarr through downstream libraries (e.g. Xarray). This will be achieved when any of the roadmap issues are closed or some of the following objectives are completed:
The types of skills we need to complete this task are moderate to advanced familiarity with Python and Zarr. We expect the level of difficulty to complete this to be medium to high.
A Zarr Sprint track focused on geospatial multi-scales / pyramids.
Our focus is on identifying and addressing shortcomings of the ndpyramid utility, either through development in that library or by deciding where else development would need to happen.
ndpyramid is a utility for generating pyramids for Zarr datasets to enable performant visualization. The library was built specifically for use with the @carbonplan/maps toolkit and produces pyramids conforming to this schema. There has been persistent community interest in broader support and standards for pyramids in Zarr. The focus of this sprint is on establishing whether ndpyramid could provide the foundation for this support, and, if so, develop towards that goal, or, if not, establish where else development would happen.
We anticipate that this sprint will progress development towards geospatial pyramids in Zarr that can be used broadly for dynamic client visualization approaches, tiling servers, and multi-scale analysis. This will serve data providers, front-end developers, and researchers.
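For anyone new to the idea, a multiscale pyramid is a stack of progressively coarser copies of the same array. Here is a minimal sketch using plain NumPy block averaging. It is not ndpyramid's API and does not follow any particular schema; `coarsen_2x`, `build_pyramid`, and the coarsest-first ordering are assumptions made for illustration.

```python
import numpy as np


def coarsen_2x(arr: np.ndarray) -> np.ndarray:
    """Halve both dimensions of a 2D array by averaging 2x2 blocks."""
    h, w = arr.shape
    # Trim odd edges so the array divides evenly into 2x2 blocks.
    arr = arr[: h - h % 2, : w - w % 2]
    blocks = arr.reshape(arr.shape[0] // 2, 2, arr.shape[1] // 2, 2)
    return blocks.mean(axis=(1, 3))


def build_pyramid(base: np.ndarray, levels: int) -> list:
    """Return `levels` arrays ordered coarsest first; the last one is `base`.

    Hypothetical helper: the ordering is an assumption, not a schema rule.
    """
    pyramid = [base]
    for _ in range(levels - 1):
        pyramid.append(coarsen_2x(pyramid[-1]))
    return pyramid[::-1]


base = np.arange(64, dtype="float64").reshape(8, 8)
pyramid = build_pyramid(base, levels=3)
shapes = [level.shape for level in pyramid]
```

Real geospatial pyramids also have to deal with reprojection, resampling choices, and per-level metadata, which is where most of the schema questions come from.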
This will be achieved when:
@carbonplan/maps, titiler-xarray, QGIS, Datashader, and a SRCNN, all sharing the same schema.
Some potentially more attainable goals for the short sprint:
The types of skills needed to complete this task are moderate familiarity with Python and Zarr. We would especially encourage participation from those familiar with geospatial projections, multiscale representations, and metadata conventions.
We expect the level of difficulty to complete this to be medium.
Our focus is on achieving the ability to explore nested Zarr groups over HTTP or other stores that do not provide a LIST-style operation.
This will enable
Non-goals:
We will try to implement this capability in zarr-python on the V3 branch. Contributors should be intermediate Python programmers (understand best practices around Python objects, typing, and code structure). Familiarity with the Zarr code base is not required but helpful. Participants should review the V3 roadmap and design document.
I'm also open to implementing this first in a JavaScript library, rather than Python. For example, in the source.coop viewers package.
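To make the problem concrete, here is a rough sketch of walking a hierarchy using only GET-style key lookups. It assumes Zarr V3 `zarr.json` metadata documents and, importantly, a hypothetical `children` entry in each group's attributes that names its members. That attribute is an assumption for illustration only, not part of the spec or of zarr-python. A plain dict stands in for an HTTP store.

```python
import json
from collections.abc import Mapping
from typing import Optional


def read_node(store: Mapping, path: str) -> Optional[dict]:
    """Fetch and parse `<path>/zarr.json` with a single key lookup."""
    key = f"{path}/zarr.json" if path else "zarr.json"
    raw = store.get(key)
    return None if raw is None else json.loads(raw)


def walk(store: Mapping, path: str = "") -> dict:
    """Map each reachable node path to its node_type without listing keys."""
    meta = read_node(store, path)
    if meta is None:
        return {}
    found = {path or "/": meta["node_type"]}
    # ASSUMPTION: groups advertise members in a hypothetical "children" attribute.
    for name in meta.get("attributes", {}).get("children", []):
        child = f"{path}/{name}" if path else name
        found.update(walk(store, child))
    return found


def group_doc(children):
    """Build a minimal V3 group metadata document (illustrative only)."""
    doc = {"zarr_format": 3, "node_type": "group", "attributes": {"children": children}}
    return json.dumps(doc).encode()


# The walker only ever calls store.get(key); it never iterates over the keys.
store = {
    "zarr.json": group_doc(["a"]),
    "a/zarr.json": group_doc(["b"]),
    "a/b/zarr.json": json.dumps({"zarr_format": 3, "node_type": "array"}).encode(),
}
tree = walk(store)
```

Where the member names should actually come from (group metadata, consolidated metadata, or something else) is exactly the open design question.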
Along the lines of my comment above, I have a concrete proposal that could be fun for someone (like @kylebarron 😉) to work on.
Zarr Python's V3 store interface is being redesigned to provide an all-async interface. The idea we have been discussing is to write a store on top of the Rust Object-Store crate. There are already Python bindings for this project but they are not async ready. If this particular plan is successful, it is possible this could become the core store in the zarr-python project.
@jhamman, I have been reading https://docs.rs/pyo3-asyncio/latest/pyo3_asyncio/index.html very carefully. Given that I already did this once in rfsspec, I am prepared to give it a go on top of object-store. rfsspec showed only marginal benefits, so while it may be worthwhile, do not expect a big return for what will probably be substantial effort.
Note that using rust async (tokio) in python async (asyncio) requires two event loops on two threads; it isn't simple! We also want to enable dask-style access from multiple (python) threads, so... Also, python bytes objects are annoying in rust (numpy buffers would be better, even for bytes output).
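The Python half of that pattern (one event loop on its own thread, with blocking callers on several other threads) can be sketched with the standard library alone. This says nothing about tokio or pyo3-asyncio; `BackgroundLoop` and `fetch` are hypothetical names, and `fetch` just simulates I/O.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class BackgroundLoop:
    """Run one asyncio event loop on a dedicated daemon thread."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        """Block the calling thread until `coro` finishes on the loop thread."""
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

    def close(self):
        """Stop the loop from outside its thread, then release it."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fetch(key: str) -> bytes:
    """Stand-in for an async store read (hypothetical; no real I/O)."""
    await asyncio.sleep(0.01)
    return key.encode()


bg = BackgroundLoop()
# Several worker threads (dask-style) share the single background loop.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda k: bg.run(fetch(k)), ["c/0", "c/1", "c/2"]))
bg.close()
```

Adding a second runtime with its own loop and thread on the Rust side is what makes the full picture harder than this.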
(I would be interested in this, because a rust-only zarr and kerchunk solution is of general interest to those that need a C-level API; however, if we don't use numpy as the storage, and we don't have numcodecs directly, it may raise more questions than it answers; cf https://github.com/sci-rs/zarr ).
Note that using rust async (tokio) in python async (asyncio) requires two event loops on two threads; it isn't simple
FWIW the next version of pyo3 is likely to have big progress in async handling, and it sounds like it might no longer need two event loops? https://github.com/PyO3/pyo3/issues/1632#issuecomment-1752582018
Also, python bytes objects are annoying in rust (numpy buffers would be better, even for bytes output).
Why are they annoying? Is it because the memory is Python-allocated instead of Rust-allocated?
the memory is Python-allocated instead of Rust-allocated?
Yes, but also the internal immutability guarantee makes zero-copy handoff of the memory to/from rust hard. In rfsspec, I already wrote code around the python buffer protocol to cope with this, which appears to work but sidesteps rust's memory protections.
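The immutability point is visible from pure Python through the buffer protocol. A small sketch, using only builtins (the variable names are illustrative):

```python
immutable = bytes(b"chunk-data")
mutable = bytearray(b"chunk-data")

# memoryview exposes an object's buffer without copying it.
view_ro = memoryview(immutable)
view_rw = memoryview(mutable)

# Slicing a memoryview shares memory, so this write lands in `mutable`.
window = view_rw[0:5]
window[:] = b"CHUNK"

# A bytes object only offers a read-only buffer, so in-place writes fail.
try:
    view_ro[0] = 0
    wrote_to_bytes = True
except TypeError:
    wrote_to_bytes = False
```

A native extension that wants to fill a `bytes` object in place has to work around that read-only contract, whereas a writable buffer (such as a `bytearray` or a NumPy array) can be handed over directly.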
An additional v3 sprint topic idea, this one aimed at @TomNicholas: a manifest storage transformer. Specific goals for this sprint could be to:
In the Zarr pyramids breakout group, Thomas Maschler and I discussed the motivations for following the OGC TileMatrixSet 2.0 specification within the GeoZarr specification, which will be shared as a new issue to supersede https://github.com/zarr-developers/geozarr-spec/issues/30. We also discussed reading those TMS into rio-tiler using Xarray and started a refactor of ndpyramid to support the TMS specification.
Thanks all!
I added two example scripts for interactive GeoZarr in QGIS.
In the "chunk manifest / virtual concatenation" group our main outcome was a long technical discussion, which I've written up in ZEP-like form here https://github.com/zarr-developers/zarr-specs/issues/288#issuecomment-1939265240
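As a rough illustration of the two ideas in that group's name: a chunk manifest maps each chunk key to a byte range in some existing file, and virtual concatenation merges manifests by shifting chunk indices instead of copying data. The sketch below assumes dotted chunk keys and `path` / `offset` / `length` fields; these names, `ChunkRef`, `read_chunk`, and `concat_axis0` are illustrative assumptions, not the format proposed in the write-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Pointer to one chunk's bytes inside an existing file (assumed fields)."""
    path: str
    offset: int
    length: int


def read_chunk(manifest: dict, files: dict, key: str) -> bytes:
    """Resolve a chunk key to bytes by slicing the referenced file."""
    ref = manifest[key]
    return files[ref.path][ref.offset : ref.offset + ref.length]


def concat_axis0(first: dict, second: dict, n_chunks_first: int) -> dict:
    """Virtually concatenate along axis 0 by shifting the second manifest's keys."""
    combined = dict(first)
    for key, ref in second.items():
        idx = [int(i) for i in key.split(".")]
        idx[0] += n_chunks_first
        combined[".".join(str(i) for i in idx)] = ref
    return combined


# In-memory byte strings stand in for files on disk or in object storage.
files = {"a.nc": b"HEADERaaaabbbb", "b.nc": b"HDRcccc"}
manifest_a = {"0": ChunkRef("a.nc", 6, 4), "1": ChunkRef("a.nc", 10, 4)}
manifest_b = {"0": ChunkRef("b.nc", 3, 4)}

combined = concat_axis0(manifest_a, manifest_b, n_chunks_first=2)
third_chunk = read_chunk(combined, files, "2")
```

No chunk bytes are moved here; only the mapping changes.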
Per our discussions in the bi-weekly GeoZarr SWG meeting, we identified a few focus tracks for the zarr sprint coming up on February 7/8th, 2024. In addition, I reviewed the original brainstorming ideas first discussed a year ago, documented at https://hackmd.io/t2DWpX1iQEWMKx1Fi4Px7A?both#Let’s-brainstorm. Many of these ideas are captured by the proposed list we discussed on January 24th. The topic of bidirectional interoperability with GDAL is another clear theme, although as we discussed at the last SWG meeting, this would be very difficult to tackle in a single sprint and, more importantly, we may not have someone to lead it. Nevertheless, I am listing it as an option to see if we could identify folks in the community to lead.
Here are the topics I have narrowed down to:
Here is the proposed template that I ask the folks who are tagged as leading the tracks above to complete and share below.
Topic leaders, if you can fill in the above template by Monday January 29th, then as a community we can provide ranked responses by Wednesday January 31st.