zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License
1.45k stars 273 forks

http zarr proxy #2033

Open d-v-b opened 1 month ago

d-v-b commented 1 month ago

I often find myself wanting a way to present non-zarr array data as if it were a zarr array, without going through the effort (and data duplication) required to convert the array to zarr. This often comes up when working with unchunked legacy file formats.

A simple way to do this would be via an HTTP server that converts incoming requests for chunk files into whatever operations are necessary to retrieve chunk data from the legacy format. In the simplest formulation, the server would be configured with a declaration of the data being "zarred", plus runtime options like compression.
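The core translation step such a proxy would need might be sketched as follows: turning an incoming chunk key (here, a zarr-v3-style key like `c/0/1`) into the array slices to read from the underlying legacy file. The key layout and function name here are illustrative, not an existing zarr-python API:

```python
def chunk_key_to_slices(key: str, chunk_shape: tuple[int, ...]) -> tuple[slice, ...]:
    """Map a zarr v3-style chunk key like "c/0/1" to slices into the source array."""
    parts = key.split("/")
    if parts[0] != "c" or len(parts) - 1 != len(chunk_shape):
        raise ValueError(f"not a chunk key for a {len(chunk_shape)}-D array: {key!r}")
    # each path component after "c" is a chunk coordinate along one dimension
    coords = [int(p) for p in parts[1:]]
    return tuple(
        slice(c * size, (c + 1) * size) for c, size in zip(coords, chunk_shape)
    )
```

The server would apply these slices to whatever reader the legacy format provides (e.g. a memory-mapped TIFF), then encode/compress the result and return the bytes.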

If the program were called zerve (for "zarr serve"), then invocation might look like this:

zerve path/to/array.tif --path my_array
   ┌──────────────────────────────────────────────────┐
   │                                                  │
   │   Serving!                                       │
   │                                                  │
   │   - Local:    http://localhost:33907/my_array    │
   │   - Network:  http://192.168.1.160:33907/my_array│
   │                                                  │
   │   This port was picked because 3000 is in use.   │
   │                                                  │
   │   Copied local address to clipboard!             │
   │                                                  │
   └──────────────────────────────────────────────────┘

I copied the terminal output from the Node.js program serve, which I often use for statically hosting zarr arrays. One can imagine a more elaborate JSON configuration for the server that would declare how to embed the array (or multiple arrays) in a virtual group.

The particular use case I described (wrapping a legacy format in a zarr API) is a specific instance of a more general pattern (wrapping X in a zarr API, where X is some arbitrary computation that produces an array). If designed in a modular way, the HTTP proxy I'm proposing would support this broader usage pattern. And we can also imagine this functionality being used as a python library, à la xpublish. In fact, I think xpublish basically does what I want here, so maybe the only work is to pull the zarr proxying out of xpublish?
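The "wrap X in a zarr API" pattern could look something like a read-only, mapping-like store whose chunk bytes are produced by an arbitrary function rather than read from disk. This is a sketch with made-up names, not a zarr-python interface:

```python
from collections.abc import Mapping


class ComputedStore(Mapping):
    """Read-only store that lazily computes the bytes for each key."""

    def __init__(self, keys, compute):
        self._keys = list(keys)
        self._compute = compute  # a function: key -> bytes

    def __getitem__(self, key):
        if key not in self._keys:
            raise KeyError(key)
        return self._compute(key)

    def __iter__(self):
        return iter(self._keys)

    def __len__(self):
        return len(self._keys)


# Example: chunk bytes derived from the key itself.
store = ComputedStore(["c/0", "c/1"], lambda k: k.encode() * 2)
```

The same interface covers both cases: the legacy-file proxy is just a `compute` function that reads and re-encodes slices of the source file.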

In terms of where this should live in zarr-python, i think this functionality would be very useful for testing http-based storage, so on those grounds I think it's in-scope for zarr-python, but not a pressing need.

cc @jhamman

joshmoore commented 1 month ago

see also https://github.com/manzt/simple-zarr-server (cc: @manzt)

manzt commented 1 month ago

The zerve name is much better than simple-zarr-server :) Happy to chat about upstreaming some of that work. At some point I'd looked into using package entrypoints to register napari-like plugins to map file arg -> zarr python store, which could then use the rest of the serving capabilities.

# pip install my-zarr-mapper-plugin
simple-zarr-server path/to/my/custom/format.foo --path=bar

That would let me expose those custom stores nicely over HTTP. A function that maps an arbitrary file to a zarr-python store interface (a MutableMapping-like thing) could be the entry point, since it's very straightforward to implement REST on top of the mutable-mapping interface.
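To illustrate how little is needed to put a mapping behind HTTP, here is a sketch using only the standard library, where `GET /<key>` returns the bytes stored under that key (simple-zarr-server itself uses starlette/uvicorn rather than `http.server`, and the store contents below are placeholders):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

STORE = {"zarr.json": b'{"zarr_format": 3}', "c/0": b"\x00\x01\x02"}


class StoreHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = self.path.lstrip("/")  # "/c/0" -> "c/0"
        if key in STORE:
            body = STORE[key]
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):  # suppress per-request logging
        pass


server = HTTPServer(("127.0.0.1", 0), StoreHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}"
```

Swapping the dict for any mutable-mapping-backed store (including a computed one) gives you the proxy for free.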

d-v-b commented 1 month ago

At some point I'd looked into using package entrypoints to register napari-like plugins to map file arg -> zarr python store, which could then use the rest of the serving capabilities

This is an awesome idea!

As for upstreaming, if we have enough developer energy I think we should have 2 instantiations of this tool, in different python packages: a minimal, dependency-light version living in zarr-python itself (e.g. for testing HTTP-based storage), and a fuller-featured standalone server package.

I think simple-zarr-server is closer to the second option. I'd be curious to see how much code it would take to convert it into the first option.

manzt commented 1 month ago

I think simple-zarr-server is closer to the second option. I'd be curious to see how much code it would take to convert it into the first option.

simple-zarr-server is pretty minimal. It basically prepares an ASGI app with starlette, and then uses uvicorn to run it.

We could maybe try to define the minimum ASGI builder without deps in zarr-python, and then all the standalone stuff could include tools to actually run/extend that "core" mapper. I've never looked into writing that part from scratch with the standard library, but I could take a look. Mapping paths to bytes shouldn't require many bells and whistles, I imagine. I chose starlette at the time because it was minimal and performant.