Open TomNicholas opened 2 months ago
What about "get_metadata"? That explicitly says what the function does, and it can live comfortably beside functions that read array data.
@douglatornell that certainly clearly describes what they do! I just don't know what the corresponding noun would be, vz.open_virtual_dataset(... metadata_getter=...)
hardly rolls off the tongue 😅
Perhaps I'm not understanding the nuances of the the API.
In vz.open_virtual_dataset(... metadata_getter=...)
, the metadata_getter=...
bit would be an item in reader_options
, correct? So, why not make it
vz.open_virtual_dataset(... metadata=vz.readers.get_kerchunk_metadata, ...)
for example?
But do you need to pass a function name at that level? Why not
vz.open_virtual_dataset(... metadata="kerchunk", ...)
and do the dispatching to the appropriate vz.readers.get_*_metadata()
function in vz.open_virtual_dataset()
where you are handling so many of the other details of different storage formats. I suppose doing that would preclude a user-defined get_*_metadata()
function, if that is a feature you want/need to support.
the
metadata_getter=...
bit would be an item inreader_options
, correct?
Not quite - reader_options
is passed on to kerchunk readers, but there are other readers that are not based on kerchunk (e.g. the dmr++ reader).
a user-defined
get_*_metadata()
function
I do think we want to support that. There are a few common file formats that are parseable to the zarr model, plus a long tail of niche formats that are also parseable to the zarr model (see https://github.com/zarr-developers/VirtualiZarr/issues/218). We should ship readers for the common ones bundled here, but allows users to plug their own readers in.
But do you need to pass a function name at that level?
In general I feel like we should be trying to move towards an entrypoint system, analogous to xarray.open_dataset
's BackendEntrypoint
class, which is what is accessed when you use the engine=...
kwarg to xr.open_dataset
.
I'm currently calling the functions that extract the metadata from a specific filetype (or many filetypes in kerchunk's case) "readers".
I feel like "reader" implies that they read actual array data into memory, but it doesn't do that. It's also confusing alongside Zarr "readers", which actually do read array data into memory
But is there a less overloaded term? e.g:
As of https://github.com/zarr-developers/VirtualiZarr/pull/231 these currently all live in a module called
virtualizarr.readers
.We also have
virtualizarr.writers
, but that's less of a problem given that (a) there will probably only ever be two of these, (b) there aren't existing xarray backends you could get this confused with, and (c) you are writing metadata, it's not a bad term for it.(Related to https://github.com/pydata/xarray/issues/9491, thought of this whilst proposing the API in #238)