zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/stable/api.html
Apache License 2.0

Support HDF4? #216

Open TomNicholas opened 3 months ago

TomNicholas commented 3 months ago

Could we support generating chunk manifests pointing to HDF4 files too? I know nothing about this format, but in https://github.com/zarr-developers/VirtualiZarr/issues/85#issuecomment-2113222299 @jgallagher59701 mentioned that DMR++ can (or soon will) support it.

I should add to the above that many of the newer features in DMR++ are there to support HDF4 - yes, '4' - and that requires some hackery in the interpreter. Look at how chunk elements can now contain block elements. In HDF4, a 'chunk' is not necessarily atomic. Also complicating the development of an interpreter is the use of fill values in both HDF4 and HDF5, even for scalar variables. That said, we have a full interpreter in C++, which I realize is not exactly enticing for many ;-), but that means there is code for this and this 'documentation' is 'verifiable' since it's running code.
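To make "a 'chunk' is not necessarily atomic" concrete in chunk-manifest terms, here is a purely illustrative sketch (these dict layouts are made up for the example, not any library's actual format): an HDF5-style chunk is one byte range, while an HDF4 chunk may be split across several linked blocks.

    # Hypothetical structures, for illustration only.
    # A zarr chunk-manifest entry is a single (path, offset, length) range:
    hdf5_style_chunk = {"path": "file.h5", "offset": 4096, "length": 32768}

    # ...but an HDF4 "chunk" may be stored as several linked blocks,
    # so one logical chunk maps to multiple byte ranges:
    hdf4_style_chunk = {
        "path": "file.hdf",
        "blocks": [
            {"offset": 4096, "length": 16384},
            {"offset": 90112, "length": 16384},
        ],
    }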

If DMR++ can index HDF4, and DMR++ can be translated to zarr chunk manifests (see #85), then presumably a reader for HDF4 directly to chunk manifests would also be possible?

cc @ayushnag @betolink

jgallagher59701 commented 3 months ago

Could we support generating chunk manifests pointing to HDF4 files too?

We use the same code to interpret the DMR++ for HDF5 and HDF4.

If DMR++ can index HDF4, and DMR++ can be translated to zarr chunk manifests, then presumably a reader for HDF4 directly to chunk manifests would also be possible?

Yes.

There's quite a bit to HDF4, however, because it is a more complex format than HDF5. And NASA's HDF4 is not vanilla HDF4, so it has its own complexities on top of that. Bottom line: you will probably have to extend the interpreter you have, but it's certainly possible and there is lots of data in HDF4.

HTH, James


TomNicholas commented 3 months ago

@martindurant has an in-progress PR to kerchunk to add support for reading HDF4 directly. If that makes it in, we can just call it from vz.open_virtual_dataset, which would fully close this issue.

martindurant commented 3 months ago

I should warn you that I am working to match only specific NASA data (provided by @maxrjones), not HDF4 in general, and I suspect that the chunks in general may be tiny.

jgallagher59701 commented 3 months ago

Older data in HDF4/5 almost always has small chunks (spinning disks, low-latency, small block sizes). But that is not a big problem. Group the contiguous chunks and transfer them in a single I/O operation and then decompress them in parallel. We call these grouped chunks 'Super Chunks.' It is an optimization that Patrick Quinn first implemented and we stumbled on later. This is far more efficient than transferring the small chunks in parallel (in general, exceptions exist).
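A minimal sketch of that strategy (hypothetical helper functions, assuming (offset, length) chunk tuples and zlib-compressed data; not the actual Hyrax implementation):

    # Coalesce byte-adjacent chunks, read each group in one I/O operation,
    # then decompress the member chunks in parallel.
    from concurrent.futures import ThreadPoolExecutor
    import zlib

    def group_contiguous(chunks):
        # Merge (offset, length) chunks that lie right next to each other.
        groups = []
        for off, length in sorted(chunks):
            if groups and off == groups[-1][0] + groups[-1][1]:
                g_off, g_len, members = groups[-1]
                groups[-1] = (g_off, g_len + length, members + [(off, length)])
            else:
                groups.append((off, length, [(off, length)]))
        return groups

    def read_super_chunks(f, chunks):
        out = []
        for g_off, g_len, members in group_contiguous(chunks):
            f.seek(g_off)
            buf = f.read(g_len)  # one I/O operation per "super chunk"
            pieces = [buf[o - g_off : o - g_off + n] for o, n in members]
            with ThreadPoolExecutor() as pool:  # parallel decompression
                out.extend(pool.map(zlib.decompress, pieces))
        return out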

martindurant commented 3 months ago

Yes, kerchunk also joins near-contiguous chunks; the problem I actually see:

  • relatively small gains for reading only select chunks compared to grabbing the whole file every time.

TomNicholas commented 3 months ago

Group the contiguous chunks and transfer them in a single I/O operation and then decompress them in parallel.

Yes, kerchunk also joins near-contiguous chunks

This is something interesting that I've not heard about before. By "grouping" or "joining" do you mean literally concatenating the byte ranges together? Or something else?

jgallagher59701 commented 3 months ago

Group the contiguous chunks and transfer them in a single I/O operation and then decompress them in parallel.

Yes, kerchunk also joins near-contiguous chunks

This is something interesting that I've not heard about before. By "grouping" or "joining" do you mean literally concatenating the byte ranges together? Or something else?

I mean concatenating the byte ranges. Often in these files the chunks lie right next to each other (for a given array).

jgallagher59701 commented 3 months ago

Yes, kerchunk also joins near-contiguous chunks; the problem I actually see ...

  • relatively small gains for reading only select chunks compared to grabbing the whole file every time.

That's true for files with a small number of variables. Get the whole file. If there are O(10^2) variables and only 2-3 are needed, it's faster to get just those 2-3. Again, there are exceptions.

martindurant commented 3 months ago

In ReferenceFS, if you cat() with a number of references, those within a single file may be merged depending on the arguments:

        max_gap=64_000,
        max_block=256_000_000,

For example, for references [remote://file, 10, 10], [remote://file, 30, 10], the actual request will be bytes 10->40, if the gap is smaller than max_gap. The result is then sliced into two outputs. Naturally, if max_gap=0, only truly contiguous parts are merged, and a negative value disables merging entirely. The requests would still be concurrent, however.
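The same merging can be demonstrated with fsspec's range-merging helper (as far as I know, fsspec.utils.merge_offset_ranges is what ReferenceFS uses under the hood):

    from fsspec.utils import merge_offset_ranges

    # Two references into the same file: bytes 10->20 and bytes 30->40,
    # i.e. [remote://file, 10, 10] and [remote://file, 30, 10].
    paths = ["remote://file", "remote://file"]
    starts = [10, 30]
    ends = [20, 40]

    # The 10-byte gap is below max_gap, so one request covers bytes 10->40.
    print(merge_offset_ranges(paths, starts, ends, max_gap=64_000))
    # (['remote://file'], [10], [40])

    # With max_gap=0, only truly contiguous parts merge; both ranges survive.
    print(merge_offset_ranges(paths, starts, ends, max_gap=0))
    # (['remote://file', 'remote://file'], [10, 30], [20, 40])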

mdsumner commented 3 months ago

That said, we have a full interpreter in C++, which i realize is not exactly enticing for many

Why aren't we using DMR++? Is it not in good enough shape to bind to Python/R? Are there other challenges? There's plenty of C++ used seamlessly in Python, and calling out to the HDF5 libraries is doing that anyway.

That sounds like the cross-language solution already, no? I only have a few HDF4 stores of interest outside of NASA, and maybe only one.

mdsumner commented 3 months ago

There's something I'm missing given #113 🙏. I'll keep exploring; I keep finding new aspects 👌.

martindurant commented 3 months ago

I'm sorry if I have done some duplication of work. I think it may be worthwhile to have a pure-python solution too, though, for the case that no dmr++ index files exist for some HDF4. Also, it has been (so far) nerdy fun, definitely worth a blog post.

maxrjones commented 3 months ago

https://github.com/fhs/pyhdf/ also reads HDF4 and SatPy uses it to read MODIS. I'm wondering if it could be helpful for Kerchunk as well.
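For anyone curious what that looks like, basic pyhdf usage is roughly as follows (the file and variable names here are made up):

    from pyhdf.SD import SD, SDC

    f = SD("MOD14.hdf", SDC.READ)  # open the Scientific Data (SD) interface
    print(f.datasets())            # SDS names -> (dims, shape, type, index)
    sds = f.select("fire mask")    # pick one scientific dataset
    data = sds[:]                  # read it as a numpy array
    sds.endaccess()
    f.end()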

jgallagher59701 commented 2 months ago

I wonder if Ayush's work on VirtualiZarr has a DMR++ parser (pure python) you could use? The DMR++ builder is C++, but we actually have a DMR++ Builder web service that we can expose for HDF5 and could do the same thing for HDF4.

It would be interesting to see how close we could get to valid Kerchunk from DMR++ using a simple transform. Just a thought, I don't see myself having time for that any time soon...

ayushnag commented 2 months ago

My code mostly extracts the necessary zarr metadata and then builds it into a virtualizarr data structure at the end of each function. So by just modifying the last step, creating a kerchunk reader is definitely possible. Also, interestingly, you could go dmrpp --> virtualizarr --> kerchunk, since virtualizarr supports writing out to kerchunk.

However, I have only developed and tested for netcdf4 and hdf5, so there will certainly be some work needed to support hdf4
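For reference, the round trip described above would look something like this (a sketch only; the keyword arguments may differ between virtualizarr versions):

    from virtualizarr import open_virtual_dataset

    # Parse a DMR++ sidecar file into a virtual dataset of chunk manifests.
    vds = open_virtual_dataset("granule.nc4.dmrpp", filetype="dmrpp", indexes={})

    # Write the manifests back out as kerchunk-style JSON references.
    vds.virtualize.to_kerchunk("granule_refs.json", format="json")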

martindurant commented 2 months ago

I have only developed and tested for netcdf4 and hdf5, so there will certainly be some work needed to support hdf4

Is there no hdf4 work? It is very different.

ayushnag commented 2 months ago

No, there isn't any hdf4 work yet. However, it seems like the goal is to make the hdf4 dmrpp spec very similar to the hdf5 one, which means it will require some sort of extension (as opposed to a rewrite), as James mentioned above:

Bottom line, you will probably have to extend the interpreter you have

martindurant commented 2 months ago

My HDF4 branch in kerchunk is very nearly complete. Everyone welcome to look!

As for pyhdf, to use it you need a very deep understanding of the specifics of the conventions used in a given file (maybe possible for MODIS) and of how the C API works. If I can make my version work, I prefer pure-python.

betolink commented 2 months ago

Is this the code? https://github.com/martindurant/fsspec-reference-maker/blob/df61060869e367da9674d33962631d81ead76865/kerchunk/hdf.py#L697 Seeing terms like "SDD" gave me flashbacks to the first time I opened one of these files. Thanks for all the work! Can we just throw some examples at it?

martindurant commented 2 months ago

Yes, that code. Please do play with it, but of course there are no guarantees.

TomNicholas commented 4 days ago

Looks like @martindurant's kerchunk HDF4 reader is in kerchunk main - it's in the docs here, though perhaps not yet in a released version of kerchunk?

This means that someone could easily use it to add a VirtualiZarr HDF4 reader.
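In the meantime, one could presumably call the kerchunk reader directly; a sketch, assuming the merged code is exposed as kerchunk.hdf4.HDF4ToZarr (the import path and class name may differ in the released version):

    from kerchunk.hdf4 import HDF4ToZarr

    # Generate a kerchunk reference dict for an HDF4 file (filename made up).
    refs = HDF4ToZarr("MOD14.hdf").translate()
    # A VirtualiZarr HDF4 reader would then convert these references into
    # ManifestArrays, as the existing kerchunk-based readers do.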

martindurant commented 4 days ago

Correct, I will do a kerchunk release today.

-edit-

done

TomNicholas commented 4 days ago

Thanks @martindurant !

Does someone have a small example HDF4 file we could use in VirtualiZarr's tests? It doesn't look like either of the PRs ((1), (2)) adding the HDF4 reader to kerchunk contains any tests...

mdsumner commented 4 days ago

Maybe this one

https://github.com/OSGeo/gdal/tree/master/autotest/gdrivers/data/hdf4

But I'll look through our archives; there'll be something 🙏

mdsumner commented 4 days ago

These localised sea ice grids are pretty small:

https://data.seaice.uni-bremen.de/databrowser/#day=1&month=9&year=2024&img=%7B%22image%22%3A%22image-1%22%2C%22product%22%3A%22AMSR%22%2C%22type%22%3A%22visual%22%2C%22region%22%3A%22Arctichesky%22%7D

martindurant commented 1 day ago

The specific files used for development were behind a NASA signup and accept-conditions, so I don't think we can include them for tests here.

In addition, we don't really have a baseline expectation of what the output ought to look like - loading HDF4 into xarray requires a choice of "variable" and doesn't include the whole of the original datafile's contents.

TomNicholas commented 1 day ago

In addition, we don't really have a baseline expectation of what the output ought to look like - loading HDF4 into xarray requires a choice of "variable" and doesn't include the whole of the original datafile's contents.

Okay - in that case it would be good to better understand the relationship between the HDF4 data model and the xarray data model when creating this reader, otherwise we're going to end up with confusion similar to the TIFF case (see https://github.com/zarr-developers/VirtualiZarr/issues/291#issuecomment-2480025645).

Again it would be great if someone who actually uses HDF4 wanted to have a go at a PR for this.

martindurant commented 1 day ago

You may not find it satisfactory, but I think processing datasets that possibly don't fit neatly into xarray's model should be considered somewhat expert: the user needs to know some details of their data and how they expect it to turn out in zarr form. That would include needing, in some cases, to specify whether a thing is an array, an array with coordinates, a dataset, or a tree.

jgallagher59701 commented 11 hours ago

Hi, you might be interested in some of the work we're doing at the behest of NASA WRT HDF4 and HDF-EOS2. The DMR++ encoding and our interpreter now support HDF4/EOS2, at least as far as NASA has taken it (like HDF5, there's quite a bit to the HDF4 data model, as I'm sure you know).