"readers" or "parsers" or what should we call them?

zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax

https://virtualizarr.readthedocs.io/en/stable/api.html

Apache License 2.0

116 stars 22 forks source link

"readers" or "parsers" or what should we call them? #239

Open TomNicholas opened 2 months ago

TomNicholas commented 2 months ago

I'm currently calling the functions that extract the metadata from a specific filetype (or many filetypes in kerchunk's case) "readers".

I feel like "reader" implies that they read actual array data into memory, but it doesn't do that. It's also confusing alongside Zarr "readers", which actually do read array data into memory

But is there a less overloaded term? e.g:

"parsers" - because they usually involve scanning a file and "parsing" out the metadata to be separate from the array data
"scanner"?
"inspector"?
"extractor"?

As of https://github.com/zarr-developers/VirtualiZarr/pull/231 these currently all live in a module called virtualizarr.readers.

We also have virtualizarr.writers, but that's less of a problem given that (a) there will probably only ever be two of these, (b) there aren't existing xarray backends you could get this confused with, and (c) you are writing metadata, it's not a bad term for it.

(Related to https://github.com/pydata/xarray/issues/9491, thought of this whilst proposing the API in #238)

douglatornell commented 1 month ago

What about "get_metadata"? That explicitly says what the function does, and it can live comfortably beside functions that read array data.

TomNicholas commented 1 month ago

@douglatornell that certainly clearly describes what they do! I just don't know what the corresponding noun would be, vz.open_virtual_dataset(... metadata_getter=...) hardly rolls off the tongue 😅

douglatornell commented 1 month ago

Perhaps I'm not understanding the nuances of the the API.

In vz.open_virtual_dataset(... metadata_getter=...), the metadata_getter=... bit would be an item in reader_options, correct? So, why not make it

vz.open_virtual_dataset(... metadata=vz.readers.get_kerchunk_metadata, ...)

for example?

But do you need to pass a function name at that level? Why not

vz.open_virtual_dataset(... metadata="kerchunk", ...)

and do the dispatching to the appropriate vz.readers.get_*_metadata() function in vz.open_virtual_dataset() where you are handling so many of the other details of different storage formats. I suppose doing that would preclude a user-defined get_*_metadata() function, if that is a feature you want/need to support.

TomNicholas commented 1 month ago

the metadata_getter=... bit would be an item in reader_options, correct?

Not quite - reader_options is passed on to kerchunk readers, but there are other readers that are not based on kerchunk (e.g. the dmr++ reader).

a user-defined get_*_metadata() function

I do think we want to support that. There are a few common file formats that are parseable to the zarr model, plus a long tail of niche formats that are also parseable to the zarr model (see https://github.com/zarr-developers/VirtualiZarr/issues/218). We should ship readers for the common ones bundled here, but allows users to plug their own readers in.

But do you need to pass a function name at that level?

In general I feel like we should be trying to move towards an entrypoint system, analogous to xarray.open_dataset's BackendEntrypoint class, which is what is accessed when you use the engine=... kwarg to xr.open_dataset.