zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0

Reading from dmrpp index files? #85

Open TomNicholas opened 4 months ago

TomNicholas commented 4 months ago

@betolink suggested today the idea of reading from dmrpp (DMR++) files, apparently a sidecar filetype that often already ships alongside certain NASA datasets and is basically a chunk-references format.

Could we imagine ingesting these into VirtualiZarr as another way to read byte ranges? i.e. create a ChunkManifest/ManifestArray object from the DMR++. Or maybe even imagine writing out to this filetype?
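For concreteness, here is a minimal sketch (not an agreed design) of what that ingestion could look like, assuming a DMR++ parser has already produced per-chunk byte ranges; the ChunkManifest/ManifestArray/ZArray constructor arguments shown are based on my reading of the current VirtualiZarr internals and should be treated as assumptions:

import numpy as np
from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.zarr import ZArray

# hypothetical output of a DMR++ parser: one byte range per chunk of one variable
entries = {
    "0.0": {"path": "s3://bucket/granule.nc", "offset": 100, "length": 100},
    "0.1": {"path": "s3://bucket/granule.nc", "offset": 200, "length": 100},
}

manifest = ChunkManifest(entries=entries)
marr = ManifestArray(
    zarray=ZArray(shape=(10, 20), chunks=(10, 10), dtype=np.dtype("int32")),  # array metadata also read from the DMR++
    chunkmanifest=manifest,
)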

TomNicholas commented 4 months ago

The same thing was suggested on the kerchunk tracker https://github.com/fsspec/kerchunk/issues/221

TomNicholas commented 4 months ago

cc @jgallagher59701

jgallagher59701 commented 4 months ago

Short answer: Yes. DMR++ could be used to provide both the Variable data via chunk locations and also attribute name/type/value information. DMR++ supports almost all of HDF5 at this point, almost all of HDF4 and we're close to complete (but with significant optimization needed) HDF-EOS2.

Our work on DMR++ is heavily focused on NASA/ESDIS and Earth Data Cloud needs, hence the focus on HDF5, HDF4 and HDF-EOS2.

If you want to email me directly, jgallagher@opendap.org

ayushnag commented 3 months ago

I think this idea has a lot of potential, especially with integration into earthaccess. I can make an adapter from dmr++ to zarr metadata. The concept has already been done here in zarr-eosdis-store, but only for single-file references (so no concatenation logic). Single-file reference creation is the most time-consuming step, so skipping it makes earthaccess.consolidate_metadata() much more usable.

Vision: a user can find a NASA collection (or specific files) with earthaccess.search(), call earthaccess.consolidate_metadata() once, and have rapid access to the entire collection. The only issue with this currently is that they need to pass kerchunk_kwargs and know what to pass to MultiZarrToZarr (most scientists probably don't want to do this). Of course the function will also be much faster now, since we only need to read DMR++ files and convert them to something virtualizarr understands.

The virtualizarr workflow would be ideal since people can use the familiar xarray interface to concat and then virtualize.

betolink commented 3 months ago

@ayushnag what you described is a fantastic idea! A researcher could create an on-demand virtual dataset with 3 lines of code, and if we use the dmr++ it should be pretty fast, or at least faster than creating the kerchunk references directly.

In case it's useful, H. Joe Lee from the HDF Group was working on a white paper on kerchunk and created this utility for generating kerchunk references from dmr++ files, among other tools. Maybe the script requires some updates, but it can be done (a pure format-to-format translation).

If this gets integrated into VirtualiZarr, earthaccess could replace what we have in consolidate_metadata() with this and support use cases like "Create a virtual dataset for canopy heights over the Yucatan peninsula":

import earthaccess

# search with spatiotemporal constraints 
results = earthaccess.search_data(
    doi="10.3334/ORNLDAAC/2056",
    bounding_box=(-91.18, 17.43, -86.74, 21.65)
)

# will use Dask to translate the dmr++ references instead of kerchunk 
earthaccess.consolidate_metadata(results,  "gedi_yucatan.parquet", combine='by_coords', format="parquet")

# ... then we open it the same way 

We have bi-weekly hacking hours for earthaccess, maybe we could meet there and discuss this integration with VirtualiZarr and your ideas with the dmr++ files. Thanks for taking the lead on this!

betolink commented 3 months ago

@agoodm I know you're aware of this project; I think all the cool stuff you're prototyping would fit somewhere in VirtualiZarr. @TomNicholas and @ayushnag, Alex is working on creating logical cubes for the many netCDF files in the ITS_LIVE project so that we can load them efficiently in xarray (or other Zarr/parquet-compatible tools).

jgallagher59701 commented 3 months ago

I've been following along, but I admit only partly, because my knowledge of earthaccess and its capabilities is limited. However, @betolink, thanks for calling my/our attention to Joe Lee's work with DMR++ to Zarr.

If you need/want any information on DMR++, including our forthcoming support for HDF4, please let me know.

Also, I'm very interested in learning about this bit of code and what it implies:

earthaccess.consolidate_metadata(results, "gedi_yucatan.parquet", combine='by_coords', format="parquet")

That is, are you using parquet to store the kerchunk json?

Thanks!

jgallagher59701 commented 3 months ago

@TomNicholas One thing I should mention is that the DMR++ does not have to be co-located with the data it describes.

While NASA uses the DMR++ documents as sidecar files - they are objects with the same base name as the data files they describe and reside in the same S3 bucket - that is merely how NASA uses them. The DMR++ holds an href to the data it describes and, overriding that 'global' href, can include hrefs to the granularity of each individual chunk of data. Thus, the DMR++ does not have to be a true sidecar file and is not limited to data values from one particular source (or format, for that matter).

Our reader supports both hrefs for HTTP/S and the local file system.

TomNicholas commented 3 months ago

Also, I'm very interested in learning about this bit of code and what it implies:

earthaccess.consolidate_metadata(results,  "gedi_yucatan.parquet", combine='by_coords', format="parquet")

That is, are you using parquet to store the kerchunk json?

I'm assuming this snippet is aspirational, as it seems to mix something kerchunk can do (write references as parquet[^1]) with the syntax that VirtualiZarr is aiming for (combine='by_coords', like xarray.open_mfdataset).

The DMR++ holds an href to the data it describes and, overriding that 'global' href, can include hrefs to the granularity of each individual chunk of data.

That sounds very close to the chunk manifest (manifest.json) proposal for zarr (see https://github.com/zarr-developers/zarr-specs/issues/287), where you can store a different path (local or remote) for each chunk:

{
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100}
}

So hopefully the translation between DMR++ and zarr chunk manifests should be smooth.
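As a rough illustration of that translation, here is a sketch that walks a DMR++ document and builds manifest-style entries. The dmrpp namespace URI and the offset/nBytes/chunkPositionInArray/href attribute names are my recollection of the DMR++ schema, so they should be checked against real files:

import xml.etree.ElementTree as ET

DMRPP = "{http://xml.opendap.org/dap/dmrpp/1.0.0#}"  # assumed namespace URI

def dmrpp_chunks_to_entries(dmrpp_path, chunk_shape):
    # A real parser would group chunk elements per variable; this sketch just
    # walks every chunk element in the document.
    root = ET.parse(dmrpp_path).getroot()
    global_href = root.get(f"{DMRPP}href")  # the 'global' href to the data file
    entries = {}
    for chunk in root.iter(f"{DMRPP}chunk"):
        # chunkPositionInArray holds the chunk's starting element indices, e.g. "[0,17999]";
        # dividing by the chunk shape gives the zarr-style chunk key
        pos = [int(i) for i in chunk.get("chunkPositionInArray", "[0]").strip("[]").split(",")]
        key = ".".join(str(p // c) for p, c in zip(pos, chunk_shape))
        entries[key] = {
            "path": chunk.get("href", global_href),  # a per-chunk href overrides the global one
            "offset": int(chunk.get("offset")),
            "length": int(chunk.get("nBytes")),
        }
    return entries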

[^1]: VirtualiZarr could also write kerchunk-format references to parquet - the issue to track that feature request is #72.

betolink commented 3 months ago

Hi @jgallagher59701, Tom is right, the snippet is aspirational. Once these features get integrated into VirtualiZarr we will leverage them in earthaccess to create on-demand virtual datasets; I see this as having a client-side OPeNDAP server (in a way).

Supporting HDF-EOS / HDF4 will be fantastic! This is one of the main bottlenecks for a lot of workflows (MODIS, VIIRS, etc.), and yes, the serialization of these potentially big cubes will perform better in formats like Parquet, but the content will be the same as the kerchunk JSONs or zarr chunk manifests. I think we all should talk to expand this collaboration!

NOTE: we already have this in earthaccess, but we use kerchunk; with this dmr++ to chunk manifest translation, we will only use kerchunk if there is no dmr++ available for a given dataset.

jgallagher59701 commented 3 months ago

@betolink Should we focus on adding the ability to process dmr++ in earthaccess as an initial first step?

I don't have much in the way of cycles this quarter - we're down two developers, but I can answer questions about dmr++.

Also, as an additional thing to work on, we could investigate ways to store/access the dmr++ that would allow faster access to the needed information for a given variable (and possibly elide the bulky attribute information until/if it is needed). Doing that seems like it would provide one way to realize your aspirational API.

betolink commented 3 months ago

I think this adapter should live in VirtualiZarr; @ayushnag is planning on starting it soon I believe (thanks Ayush!!). From earthaccess we only need a simple method to verify that there is a dmr++ available for a dataset, then pass the list of files and the kind of logical concatenation we want to VirtualiZarr. I'm sure we'll have questions about dmr++!

I'm not sure what you mean by storing the dmr++ - are you talking about the resulting logical cubes or the original dmr++ files?

jgallagher59701 commented 3 months ago

WRT 'store the dmr++', I was thinking about the Achilles heel of dmr++ – the documents can be quite large. Currently, we move the whole thing out of S3 and into memory to extract information. One option I'd like to work on is sharding the dmr++.

There are several ways to determine if a dmr++ exists for a given file. One way is to ask CMR for information about the 'granule.' The dmr++ is not explicitly mentioned, but if a 'related URL' for OPeNDAP is present for data in the cloud, then a dmr++ can be found by taking the OPeNDAP related URL and appending '.dmrpp'. I can help with the specifics.
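For illustration, here is a sketch of that lookup on top of earthaccess search results. The UMM RelatedUrls structure and the "OPENDAP DATA" subtype are my assumptions about the CMR granule metadata; the '.dmrpp' suffix is as described above:

import earthaccess

# example query reused from earlier in the thread
results = earthaccess.search_data(doi="10.3334/ORNLDAAC/2056")

def dmrpp_url(granule):
    # assumes the raw UMM-G record is available under granule["umm"] and that
    # OPeNDAP endpoints are tagged with Subtype "OPENDAP DATA"
    for url in granule["umm"].get("RelatedUrls", []):
        if url.get("Subtype") == "OPENDAP DATA":
            return url["URL"] + ".dmrpp"
    return None  # no OPeNDAP related URL, so fall back to kerchunk

dmrpp_urls = [u for u in (dmrpp_url(g) for g in results) if u is not None]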

This might be tangential, but are you planning on attending ESIP this summer? The meeting is in late July; it might provide a good venue to talk about these ideas with many of the NASA EDC folks. I will be there as will @Mikejmnez.

betolink commented 3 months ago

Unfortunately I'm not going to ESIP this time, I'm sure someone from ITS_LIVE/ NASA Openscapes will be there. @jgallagher59701 Thanks for helping with this!

ayushnag commented 3 months ago

I have made a basic version of the dmr parser here. It can read the dimensions, coordinates, variables (including chunk manifest data, encoding, etc.), and attrs to create an xr.Dataset:

In [1]: from virtualizarr import open_virtual_dataset
In [2]: print(open_virtual_dataset("20210715090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.dmrpp", filetype='dmr++'))
Out [2]: <xarray.Dataset> Size: 6GB
Dimensions:           (time: 1, lat: 17999, lon: 36000)
Coordinates:
    time              (time) int32 4B ManifestArray<shape=(1,), dtype=int32, ...
    lat               (lat) float32 72kB ManifestArray<shape=(17999,), dtype=...
    lon               (lon) float32 144kB ManifestArray<shape=(36000,), dtype...
Data variables:
    mask              (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    sea_ice_fraction  (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    dt_1km_data       (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    analysed_sst      (time, lat, lon) int16 1GB ManifestArray<shape=(1, 1799...
    analysis_error    (time, lat, lon) int16 1GB ManifestArray<shape=(1, 1799...
    sst_anomaly       (time, lat, lon) int16 1GB ManifestArray<shape=(1, 1799...
Attributes: (2/47)
    Conventions:                CF-1.7
    title:                      Daily MUR SST, Final product

Some next steps:

TomNicholas commented 3 months ago

This is great @ayushnag ! If you open a PR we can add specific code comments there.

I didn't want to add a new dependency yet

We would add all the dependencies necessary to read dmr++ files as optional dependencies to virtualizarr anyway.

Support for netcdf groups and hdf5 files

  • Will probably be a bit later on since that will need some xarray-datatree integration

Reading from a specific group via a new group kwarg to open_virtual_dataset doesn't require datatree.
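Something like this hypothetical call, where both the group kwarg and the group path are illustrative only:

from virtualizarr import open_virtual_dataset

# hypothetical: open just one HDF5/netCDF group as a flat virtual dataset, no datatree needed
vds = open_virtual_dataset("granule.h5", group="/science_data")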

but still xarray tries to auto-create indexes

If you can reproduce this for me that would be super helpful. I'm happy to help with trying to make a reproducer, because I want to proactively find all the ways that xarray might try to create indexes.

Mikejmnez commented 3 months ago

This is great, thanks @ayushnag! @jgallagher59701 is out this week, but I can help answer some of these questions for now.

  • Check compatibility with potential previous dap and dmrpp versions

    @jgallagher59701 are there dmrpp files older than DAP4.0? If so, is there a schema or example file available for this? So far I have been using this spec and assuming that all the files we can use will be DAP4.0 and dmr1.0.

DAP2.0, the version before DAP4, did not use dmrpp. But DAP2 $\subset$ DAP4, and a dmr file can be created from the command line for each netcdf file even if it didn't have a dmrpp before (some older files may only have a dds, which holds the metadata info in DAP2). So there is no real need to check dmrpp files older than DAP4.0.

Do I need to consider spec changes with different dmrpp versions such as dmrpp:version="3.20.13-664" vs dmrpp:version="3.20.13-310"?

The short answer is yes, and dmrpp is actively being developed (I believe the newest versions, which have some optimization features, are not well documented either). That said, a lot of the new development in DMR++ is associated with hdf5, groups, etc. And then there's the idea of sharding dmrpp files. These are all things further down the development road for the various collective efforts going on, and I think we can revisit them later.

jgallagher59701 commented 3 months ago

As @Mikejmnez says, you can use the DMR++, via the Hyrax server, to build DAP2 responses. But, for many newer datasets, that's not really very useful since those datasets use data types not supported by DAP2 (the most common types 'missing' from DAP2 are Groups and Int64s). We are really trying to move beyond DAP2.

I should add to the above that many of the newer features in DMR++ are there to support HDF4 - yes, '4' - and that requires some hackery in the interpreter. Look at how a chunk element can now contain sub-elements. In HDF4, a 'chunk' is not necessarily atomic. Also complicating the development of an interpreter is the use of fill values in both HDF4 and HDF5, even for scalar variables. That said, we have a full interpreter in C++, which I realize is not exactly enticing for many ;-), but that means there is code for this, and this 'documentation' is 'verifiable' since it's running code.

jgallagher59701 commented 2 months ago

@ayushnag @betolink I mentioned a while back that I was going to ESIP this summer and could present on this - to increase visibility inside NASA, etc. Plus, it's really interesting. Here's the rub: they need slides this week - well, last week, but I got an extension. Do you have time to talk with me about this? Do you think a short (standard 15 min) talk on this is a good idea at this time, or is it premature? Let me know.

TomNicholas commented 2 months ago

@jgallagher59701 perhaps the slides I wrote on VirtualiZarr for the Pangeo showcase last week might be useful?

https://discourse.pangeo.io/t/pangeo-showcase-virtualizarr-create-virtual-zarr-stores-using-xarray-syntax/4127/2

Also unrelated but it would be great if one of you guys could fill out this survey in support of a grant application to NASA to support Zarr:

https://discourse.pangeo.io/t/nasa-zarr-survey/4254

ayushnag commented 2 months ago

@jgallagher59701 Sure I can meet this week and maybe we can figure out if there is enough content for a talk. I do have some results with the basic parser functionality and a simple earthaccess function integrated.

[Screenshot: results from the basic DMR++ parser integrated with a simple earthaccess function, 2024-05-22]

jgallagher59701 commented 2 months ago

@ayushnag @TomNicholas The slides are good and the screenshot looks very interesting. Let's talk tomorrow (May 30th). For this talk, I should probably try to explain DMR++ and how it fits into the VirtualiZarr work. But I also want to raise interest in VirtualiZarr within ESDIS (NASA's Earth Science Data and Information System). I'm free from 2pm MDT onward.

Mikejmnez commented 2 months ago

Let's talk tomorrow (May 30th). I'm free from 2pm MDT onward.

This sounds like a great meeting that I sadly won't be able to join - I will spend all day at NCAR. I will catch up afterwards!

betolink commented 2 months ago

Hi @jgallagher59701, unfortunately I have another meeting at that hour. I think it'll be great to show this at ESIP!

@ayushnag one thing I forgot to ask: what would happen if we don't want to combine the references in one go? I'm thinking this could be another workflow:

import earthaccess

results = earthaccess.search_data(**kwargs)
vds = earthaccess.open_virtual_dataset(results, concat_dim=None)
# Will vds be a list of individual references? if so, we could pre-process them and then use xarray for the concatenation
ds = xr.open_mfdataset(vds, preprocess=some_func, parallel=True)

EDIT: @ayushnag nvm, I think since you're already returning DataArrays, xr.open_mfdataset doesn't apply. So then my question is: can we just pass preprocess (or other xarray args) to the open_virtual_dataset() method? How is that handled? (probably you already explained this to me once 😄)

TomNicholas commented 2 months ago

I will spend all day at NCAR

At the Research Data Commons workshop? I'm there today so we should chat 😄

what would happen if we don't want to combine the references in one go?

I might be missing something about how DMR++ works, but I suggest trying to follow the pattern of open_virtual_dataset opening a single reference file (/granule??), and leaving any combining of multiple datasets to a second call to xr.concat. The two can be wrapped up together in a separate open_virtual_mfdataset, but I don't think a function named open_virtual_dataset should have kwargs like concat_dim that xr.open_dataset doesn't. Keeping this separation would also make it easier to write an xarray backend entrypoint (#35) if we go down that route.
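Concretely, the pattern being suggested, using virtualizarr's existing open_virtual_dataset (the filenames here are hypothetical; an earthaccess open_virtual_mfdataset would just wrap these two steps):

import xarray as xr
from virtualizarr import open_virtual_dataset

files = ["day1.nc.dmrpp", "day2.nc.dmrpp"]  # hypothetical per-granule DMR++ files

# one open_virtual_dataset call per reference file (indexes={} avoids building in-memory indexes)
vds_list = [open_virtual_dataset(f, filetype="dmr++", indexes={}) for f in files]

# combining is a separate, explicit xarray step
combined = xr.concat(vds_list, dim="time", coords="minimal", compat="override")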

betolink commented 2 months ago

I completely agree with your suggestion @TomNicholas

ayushnag commented 2 months ago

Thanks for the suggestions everyone! I originally wanted to test open_mfdataset functionality and just called the function open_virtual_dataset, but I totally agree there should be an open_virtual_dataset and an open_virtual_mfdataset. @TomNicholas, is it an issue that they are named similarly to virtualizarr's open_virtual_dataset, which has similar functionality but reads netCDFs and outputs ManifestArrays? The earthaccess version will not visit the netCDF and also does not output a dataset with ManifestArrays.

Also I will open an issue+PR in earthaccess to continue this discussion there.

TomNicholas commented 2 months ago

is it an issue that they are named similarly to virtualizarr's open_virtual_dataset, which has similar functionality but reads netCDFs and outputs ManifestArrays? The earthaccess version will not visit the netCDF and also does not output a dataset with ManifestArrays.

I'm now a bit confused about what these earthaccess functions actually do. Is this correct:

If this is the case then I would suggest you either rename the second function to simply open_dataset, or even better make an earthaccess xarray backend entrypoint, so that users can call xr.open_dataset(path, engine='earthaccess'). The point being that open_virtual_dataset is supposed to mean "open some archival files as a set of virtual ManifestArrays", not "open a set of virtual kerchunk-like reference files as numpy/dask arrays".
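A rough sketch of what such an entrypoint could look like; the class follows xarray's documented BackendEntrypoint API, while earthaccess_open is a purely hypothetical placeholder for whatever would turn a reference file into a lazily-loaded Dataset:

from xarray.backends import BackendEntrypoint

def earthaccess_open(path, **kwargs):
    # hypothetical: resolve DMR++/kerchunk-derived references and return an xarray.Dataset
    raise NotImplementedError

class EarthaccessBackendEntrypoint(BackendEntrypoint):
    def open_dataset(self, filename_or_obj, *, drop_variables=None, **kwargs):
        ds = earthaccess_open(filename_or_obj, **kwargs)
        if drop_variables:
            ds = ds.drop_vars(drop_variables)
        return ds

    def guess_can_open(self, filename_or_obj):
        return str(filename_or_obj).endswith(".dmrpp")

# once registered under the 'xarray.backends' entry-point group in the package metadata:
# ds = xr.open_dataset("granule.nc.dmrpp", engine="earthaccess")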

Also I will open an issue+PR in earthaccess to continue this discussion there.

👍

betolink commented 2 months ago

make an earthaccess xarray backend entrypoint

A while ago Ryan Abernathey suggested this; I like the idea, but it would be problematic as NASA datasets are not uniform, etc. Plus, I see a backend as a way to deal with file formats for our array info, and this would be more than that.

Going back to what would be nice to have as a top-level API, I think mirroring what we have in xarray would be good:

vds = earthaccess.open_virtual_dataset(cmr_result)

and for multiple results we could keep the current functionality with just a name change

vds = earthaccess.open_virtual_mfdataset(cmr_results, concat_dim="Time", coords="minimal", ...)

Mikejmnez commented 2 months ago

I will spend all day at NCAR

At the Research Data Commons workshop? I'm there today so we should chat 😄

@TomNicholas I'm glad we met in person :) and had some time to chat.