Open TomNicholas opened 3 months ago
On Aug 7, 2024, at 13:39, Tom Nicholas @.***> wrote:
Could we support generating chunk manifests pointing to HDF4 files too? I know nothing about this format, but in #85 (comment) https://github.com/zarr-developers/VirtualiZarr/issues/85#issuecomment-2113222299 @jgallagher59701 https://github.com/jgallagher59701 mentioned that DMR++ can (or soon will) support it.
We use the same code to interpret the DMR++ for HDF5 and HDF4.
I should add to the above that many of the newer features in DMR++ are there to support HDF4 - yes, '4' - and that requires some hackery in the interpreter. Look at how can now contain elements. In HDF4, a 'chunk' is not necessarily atomic. Also complicating the development of an interpreter is the use of fill values in both HDF4 and HDF5, even for scalar variables. That said, we have a full interpreter in C++, which I realize is not exactly enticing for many ;-), but that means there is code for this and this 'documentation' is 'verifiable' since it's running code.
If DMR++ can index HDF4, and DMR++ can be translated to zarr chunk manifests, then presumably a reader for HDF4 directly to chunk manifests would also be possible?
Yes.
There’s quite a bit to HDF4, however, because it is a more complex format than HDF5. And, NASA’s HDF4 is not vanilla HDF4, so it has its own complexities on top of that. Bottom line, you will probably have to extend the interpreter you have, but it’s certainly possible and there is lots of data in HDF4.
HTH, James
cc @ayushnag https://github.com/ayushnag @betolink https://github.com/betolink
-- James Gallagher @.***
@martindurant has an in-progress PR to kerchunk to add support for reading HDF4 directly. If that makes it in we can just call it from vz.open_virtual_dataset, which would fully close this issue.
I should warn you that I am working to match only specific NASA data (provided by @maxrjones), not HDF4 in general, and I suspect that the chunks in general may be tiny.
Older data in HDF4/5 almost always has small chunks (spinning disks, low-latency, small block sizes). But that is not a big problem. Group the contiguous chunks and transfer them in a single I/O operation and then decompress them in parallel. We call these grouped chunks 'Super Chunks.' It is an optimization that Patrick Quinn first implemented and we stumbled on later. This is far more efficient than transferring the small chunks in parallel (in general, exceptions exist).
Yes, kerchunk also joins near-contiguous chunks; the problem I actually see
Group the contiguous chunks and transfer them in a single I/O operation and then decompress them in parallel.
Yes, kerchunk also joins near-contiguous chunks
This is something interesting that I've not heard about before. By "grouping" or "joining" do you mean literally concatenating the byte ranges together? Or something else?
I mean concatenating the byte ranges. Often in these files the chunks lie right next to each other (for a given array).
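A rough sketch of the 'Super Chunk' idea described above: fetch a run of adjacent chunks with one request, slice the result, and decompress the pieces in parallel. This is not the C++ interpreter's code. The `fetch` callable and the function name are hypothetical, and zlib (deflate) compression is assumed purely for illustration.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def read_super_chunk(fetch, chunks):
    """Fetch adjacent chunks with one request, then decompress them in parallel.

    fetch(start, end) is a hypothetical callable returning bytes [start, end).
    chunks is a list of (offset, length) pairs, sorted by offset.
    Assumes every chunk is zlib-compressed (an illustrative assumption).
    """
    start = chunks[0][0]
    end = chunks[-1][0] + chunks[-1][1]
    blob = fetch(start, end)  # a single I/O operation for the whole group
    # Slice the combined bytes back into the individual compressed chunks.
    pieces = [blob[off - start : off - start + n] for off, n in chunks]
    with ThreadPoolExecutor() as pool:
        # Decompress the pieces concurrently; results keep the chunk order.
        return list(pool.map(zlib.decompress, pieces))
```

For a local file, `fetch` could be as simple as seeking and reading; for object storage it would be one ranged GET.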
Yes, kerchunk also joins near-contiguous chunks; the problem I actually see ...
- relatively small gains for reading only select chunks compared to grabbing the whole file every time.
That's true for files with a small number of variables. Get the whole file. If there are O(10^2) variables and only 2-3 are needed, it's faster to get just those 2-3. Again, there are exceptions.
In ReferenceFS, if you cat() with a number of references, those within a single file may be merged depending on the arguments
max_gap=64_000,
max_block=256_000_000,
For example, for references [remote://file, 10, 10] , [remote://file, 30, 10], the actual request will be bytes 10->40, if the gap is smaller than max_gap. The result is sliced into two outputs. Naturally, if max_gap=0, only truly contiguous parts are merged, and <0 for no merge at all. The requests would still be concurrent, however.
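The merging rule can be sketched in plain Python. This is only an illustration of the behaviour described above, not fsspec's actual implementation; `merge_refs` and `split_result` are hypothetical names, and references are assumed to be `(url, offset, length)` tuples.

```python
def merge_refs(refs, max_gap=64_000, max_block=256_000_000):
    """Merge byte ranges that belong to the same URL.

    Returns a list of (url, start, end, members), where members holds the
    (start, end) of each original reference covered by the merged request.
    A negative max_gap disables merging entirely.
    """
    merged = []
    for url, offset, length in sorted(refs):
        start, end = offset, offset + length
        if merged and max_gap >= 0:
            m_url, m_start, m_end, members = merged[-1]
            same_file = m_url == url
            close_enough = start - m_end <= max_gap
            small_enough = max(end, m_end) - m_start <= max_block
            if same_file and close_enough and small_enough:
                # Extend the previous request to cover this reference too.
                merged[-1] = (m_url, m_start, max(m_end, end), members + [(start, end)])
                continue
        merged.append((url, start, end, [(start, end)]))
    return merged


def split_result(data, start, members):
    """Slice the bytes of one merged request back into the original pieces."""
    return [data[s - start : e - start] for s, e in members]
```

With the two references from the example, `merge_refs([("remote://file", 10, 10), ("remote://file", 30, 10)])` gives a single request covering bytes 10 to 40, while `max_gap=0` leaves them as two requests because they are not truly contiguous.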
That said, we have a full interpreter in C++, which i realize is not exactly enticing for many
Why aren't we using DMR++? Is it not in good enough shape to bind to Python/R? Are there other challenges? There's plenty of C++ used seamlessly in Python, and calling out to h5 libs is doing that anyway.
That sounds like the crosslang solution already ?? I only have a few HDF4 stores of interest outside of NASA, and maybe only one.
There's something I'm missing given #113 🙏 I'll keep exploring; I keep finding new aspects 👌.
I'm sorry if I have done some duplication of work. I think it may be worthwhile to have a pure-python solution too, though, for the case that no dmr++ index files exist for some HDF4. Also, it has been (so far) nerdy fun, definitely worth a blog post.
https://github.com/fhs/pyhdf/ also reads HDF4 and SatPy uses it to read MODIS. I'm wondering if it could be helpful for Kerchunk as well.
I wonder if Ayush's work on VirtualiZarr has a DMR++ parser (pure python) you could use? The DMR++ builder is C++ but we actually have a DMR++ Builder web service that we can expose for HDF5 and could do the same thing for HDF4.
It would be interesting to see how close we could get to valid Kerchunk from DMR++ using a simple transform. Just a thought, I don't see myself having time for that any time soon...
My code mostly extracts the necessary zarr metadata and then builds a virtualizarr data structure from it at the end of each function. So by modifying just that last step, creating a kerchunk reader is definitely possible. Also, interestingly, you could go dmrpp --> virtualizarr --> kerchunk, since virtualizarr supports writing out to kerchunk.
However I have only developed and tested for netcdf4 and hdf5 so there will certainly be some work needed to support hdf4
I have only developed and tested for netcdf4 and hdf5 so there will certainly be some work needed to support hdf4
Is there no hdf4 work? It is very different.
No there isn't any hdf4 work yet. However it seems like the goal is to make the hdf4 dmrpp spec very similar to the hdf5 one which means it will require some sort of extension (as opposed to a rewrite) as James mentioned above:
Bottom line, you will probably have to extend the interpreter you have
My HDF4 branch in kerchunk is very nearly complete. Everyone welcome to look!
As for pyhdf4..., to use it, you need to have a very deep understanding of the specifics of the conventions used in a given file (maybe possible for modis) and how the C API works. If I can make my version work, I prefer pure-python.
Is this the code? https://github.com/martindurant/fsspec-reference-maker/blob/df61060869e367da9674d33962631d81ead76865/kerchunk/hdf.py#L697 Seeing terms like "SDD" gave me flashbacks of the first time I opened one of these files. Thanks for all the work! Can we just throw some examples at it?
Yes, that code. Please do play with it, but of course there are no guarantees.
Looks like @martindurant's kerchunk HDF4 reader is in kerchunk main - it's in the docs here, though perhaps not yet in a released version of kerchunk?
This means that someone could easily use it to add a VirtualiZarr HDF4 reader.
Correct, I will do a kerchunk release today.
-edit-
done
Maybe this one
https://github.com/OSGeo/gdal/tree/master/autotest/gdrivers/data/hdf4
But I'll look through our archives there'll be something 🙏
The specific files used for development were behind a NASA signup and accept-conditions, so I don't think we can include them for tests here.
In addition, we don't really have a baseline expectation of what the output ought to look like - loading with hdf4 for xarray requires a choice of "variable" and doesn't include the whole of the original datafile's contents.
In addition, we don't really have a baseline expectation of what the output ought to look like - loading with hdf4 for xarray requires a choice of "variable" and doesn't include the whole of the original datafile's contents.
Okay - in that case it would be good to better understand the relationship between the HDF4 data model and the xarray data model when creating this reader, otherwise we're going to end up with confusions similar to the tiff case (see https://github.com/zarr-developers/VirtualiZarr/issues/291#issuecomment-2480025645).
Again it would be great if someone who actually uses HDF4 wanted to have a go at a PR for this.
You may not find it satisfactory, but I think processing datasets that possibly don't fit neatly into xarray's model should be considered somewhat expert: the user needs to know some details of their data and how they expect it to turn out in a zarr form. That would include needing in some cases to specify whether a thing is an array, array with coordinates, dataset or tree.
Hi, you might be interested in some of the work we're doing at the behest of NASA WRT HDF4 and HDF-EOS2. The DMR++ encoding and our interpreter now support HDF4/EOS2, at least as far as NASA has taken it (like HDF5, there's quite a bit to the HDF4 data model, as I'm sure you know).
Could we support generating chunk manifests pointing to HDF4 files too? I know nothing about this format, but in https://github.com/zarr-developers/VirtualiZarr/issues/85#issuecomment-2113222299 @jgallagher59701 mentioned that DMR++ can (or soon will) support it.
If DMR++ can index HDF4, and DMR++ can be translated to zarr chunk manifests (see #85), then presumably a reader for HDF4 directly to chunk manifests would also be possible?
cc @ayushnag @betolink