Support HDF4? #216
On Aug 7, 2024, at 13:39, Tom Nicholas ***@***.***> wrote:
Could we support generating chunk manifests pointing to HDF4 files too? I know nothing about this format, but in #85 (comment) @jgallagher59701 mentioned that DMR++ can (or soon will) support it.
We use the same code to interpret the DMR++ for HDF5 and HDF4.
I should add to the above that many of the newer features in DMR++ are there to support HDF4 - yes, '4' - and that requires some hackery in the interpreter. Look at how can now contain elements. In HDF4, a 'chunk' is not necessarily atomic. Also complicating the development of an interpreter is the use of fill values in both HDF4 and HDF5, even for scalar variables. That said, we have a full interpreter in C++, which I realize is not exactly enticing for many ;-), but that means there is code for this and this 'documentation' is 'verifiable' since it's running code.
If DMR++ can index HDF4, and DMR++ can be translated to zarr chunk manifests, then presumably a reader for HDF4 directly to chunk manifests would also be possible?
Yes.
There’s quite a bit to HDF4, however, because it is a more complex format than HDF5. And, NASA’s HDF4 is not vanilla HDF4, so it has its own complexities on top of that. Bottom line, you will probably have to extend the interpreter you have, but it’s certainly possible and there is lots of data in HDF4.
HTH,
James
… cc @ayushnag @betolink
--
James Gallagher
***@***.***
|
@martindurant has an in-progress PR to kerchunk to add support for reading HDF4 directly. If that makes it in we can just call it from |
I should warn you, that I am working to match only specific NASA data (provided by @maxrjones ), not HDF4 in general, and I suspect that the chunks in general may be tiny. |
Older data in HDF4/5 almost always has small chunks (spinning disks, low-latency, small block sizes). But that is not a big problem. Group the contiguous chunks and transfer them in a single I/O operation and then decompress them in parallel. We call these grouped chunks 'Super Chunks.' It is an optimization that Patrick Quinn first implemented and we stumbled on later. This is far more efficient than transferring the small chunks in parallel (in general, exceptions exist). |
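A minimal sketch of the 'Super Chunk' idea described above: find chunks whose byte ranges sit end to start, read each run with one I/O call, then split and decompress the pieces in parallel. The helper names (`group_contiguous`, `read_super_chunk`, `read_range`) are hypothetical, zlib stands in for whatever codec the file really uses, and an in-memory byte string stands in for the remote file. This is not the DMR++ interpreter's code.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def group_contiguous(chunks):
    """Group (offset, length) pairs whose byte ranges touch end to start."""
    groups = []
    for offset, length in sorted(chunks):
        last = groups[-1][-1] if groups else None
        if last is not None and last[0] + last[1] == offset:
            groups[-1].append((offset, length))
        else:
            groups.append([(offset, length)])
    return groups


def read_super_chunk(read_range, group):
    """One read for the whole group, then split and decompress in parallel."""
    start = group[0][0]
    end = group[-1][0] + group[-1][1]
    blob = read_range(start, end - start)
    pieces = [blob[o - start : o - start + n] for o, n in group]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(zlib.decompress, pieces))


# Build a fake "file": three compressed chunks back to back, then
# 50 unrelated bytes, then a fourth chunk.
raw = [bytes([i]) * 1000 for i in range(4)]
file_bytes = b""
chunks = []
for i, data in enumerate(raw):
    if i == 3:
        file_bytes += b"\x00" * 50
    compressed = zlib.compress(data)
    chunks.append((len(file_bytes), len(compressed)))
    file_bytes += compressed

calls = []  # records every simulated I/O request


def read_range(offset, length):
    calls.append((offset, length))
    return file_bytes[offset : offset + length]


groups = group_contiguous(chunks)
decoded = [d for g in groups for d in read_super_chunk(read_range, g)]
```

Here four chunks are served by two reads instead of four. Whether that wins in practice depends on chunk sizes and request latency, as the comment itself notes.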
Yes, kerchunk also joins near-contiguous chunks; the problem I actually see
|
This is something interesting that I've not heard about before. By "grouping" or "joining" do you mean literally concatenating the byte ranges together? Or something else? |
I mean concatenating the byte ranges. Often in these files the chunks lie right next to each other (for a given array). |
That's true for files with a small number of variables. Get the whole file. If there are O(10^2) variables and only 2-3 are needed, it's faster to get just those 2-3. Again, there are exceptions. |
In ReferenceFS, if you cat() with a number of references, those within a single file may be merged, depending on the arguments.
For example, for references [remote://file, 10, 10] , [remote://file, 30, 10], the actual request will be bytes 10->40, if the gap is smaller than max_gap. The result is sliced into two outputs. |
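The merging described above can be sketched in plain Python. This is an illustration of the idea only, not fsspec's implementation: `merge_ranges`, `cat_ranges`, and `fetch` are hypothetical names, and the strict "gap smaller than `max_gap`" comparison follows the wording of the comment rather than the library source.

```python
def merge_ranges(ranges, max_gap):
    """Merge (start, length) ranges whose gap is smaller than max_gap."""
    merged = []
    for start, length in sorted(ranges):
        end = start + length
        if merged and start - merged[-1][1] < max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def cat_ranges(fetch, ranges, max_gap):
    """Fetch merged ranges, then slice each original range back out."""
    requests = merge_ranges(ranges, max_gap)
    blobs = {(s, e): fetch(s, e) for s, e in requests}
    out = []
    for start, length in ranges:
        for (s, e), blob in blobs.items():
            if s <= start and start + length <= e:
                out.append(blob[start - s : start - s + length])
                break
    return requests, out


data = bytes(range(100))  # stand-in for the remote file


def fetch(start, end):
    return data[start:end]


# The two references from the example: [file, 10, 10] and [file, 30, 10].
refs = [(10, 10), (30, 10)]
merged_requests, merged_out = cat_ranges(fetch, refs, max_gap=16)
split_requests, split_out = cat_ranges(fetch, refs, max_gap=5)
```

With `max_gap=16` the gap of 10 bytes is bridged and a single request for bytes 10->40 is issued; with `max_gap=5` the two ranges are fetched separately. The sliced outputs are the same in both cases.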
Why aren't we using DMR++? Is it not in good enough shape to bind to Python/R? Are there other challenges? There's plenty of C++ used seamlessly in Python, and calling out to the h5 libs is doing that anyway. That sounds like the cross-language solution already. I only have a few HDF4 stores of interest outside of NASA, and maybe only one. |
There's something I'm missing given #113 🙏 I'll keep exploring; I keep finding new aspects 👌. |
I'm sorry if I have done some duplication of work. I think it may be worthwhile to have a pure-python solution too, though, for the case that no dmr++ index files exist for some HDF4. Also, it has been (so far) nerdy fun, definitely worth a blog post. |
https://github.com/fhs/pyhdf/ also reads HDF4 and SatPy uses it to read MODIS. I'm wondering if it could be helpful for Kerchunk as well. |
I wonder if Ayush's work on VirtualiZarr has a DMR++ parser (pure python) you could use? The DMR++ builder is C++ but we actually have a DMR++ Builder web service that we can expose for HDF5 and could do the same thing for HDF4. It would be interesting to see how close we could get to valid Kerchunk from DMR++ using a simple transform. Just a thought, I don't see myself having time for that any time soon... |
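As a rough illustration of what such a "simple transform" might look like, the sketch below turns the chunk entries of one variable into kerchunk-style `[url, offset, length]` references keyed by Zarr chunk index. Everything about the input is an assumption: the XML is a hand-written, simplified fragment modeled on DMR++ (the namespace URI, element names, and attribute names should be checked against a real file), the URL is a placeholder, and `chunk_refs` is a hypothetical helper. It covers only byte ranges; array metadata, compression, fill values, and the HDF4-specific cases mentioned earlier in the thread are not handled.

```python
import xml.etree.ElementTree as ET

# Assumed DMR++ namespace URI; verify against a real DMR++ document.
DMRPP_NS = "http://xml.opendap.org/dap/dmrpp/1.0.0#"

# Hand-written, simplified fragment for one variable (not from a real file).
SAMPLE = """\
<Float32 name="temp" xmlns:dmrpp="http://xml.opendap.org/dap/dmrpp/1.0.0#">
  <dmrpp:chunks compressionType="deflate">
    <dmrpp:chunkDimensionSizes>100 200</dmrpp:chunkDimensionSizes>
    <dmrpp:chunk offset="4096" nBytes="1500" chunkPositionInArray="[0,0]"/>
    <dmrpp:chunk offset="5596" nBytes="1480" chunkPositionInArray="[0,200]"/>
    <dmrpp:chunk offset="7076" nBytes="1510" chunkPositionInArray="[100,0]"/>
  </dmrpp:chunks>
</Float32>
"""


def chunk_refs(variable_xml, url):
    """Map each chunk of one variable to a kerchunk-style reference."""
    var = ET.fromstring(variable_xml)
    name = var.attrib["name"]
    chunks = var.find(f"{{{DMRPP_NS}}}chunks")
    sizes_text = chunks.find(f"{{{DMRPP_NS}}}chunkDimensionSizes").text
    sizes = [int(s) for s in sizes_text.split()]
    refs = {}
    for chunk in chunks.findall(f"{{{DMRPP_NS}}}chunk"):
        # Element position of the chunk origin, e.g. "[100,0]".
        position = chunk.attrib["chunkPositionInArray"].strip("[]").split(",")
        # Element position divided by chunk shape gives the chunk index.
        index = ".".join(str(int(p) // s) for p, s in zip(position, sizes))
        refs[f"{name}/{index}"] = [
            url,
            int(chunk.attrib["offset"]),
            int(chunk.attrib["nBytes"]),
        ]
    return refs


# Placeholder URL, purely for illustration.
refs = chunk_refs(SAMPLE, "s3://example-bucket/granule.hdf")
```

A usable reference set would also need the `.zarray` and `.zattrs` entries for each variable, which is where most of the real translation work lies.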
My code mostly extracts the necessary zarr metadata and then builds a virtualizarr data structure from it at the end of each function. So by modifying just that last step, creating a kerchunk reader is definitely possible. Also, interestingly, you could go dmrpp --> virtualizarr --> kerchunk, since virtualizarr supports writing out to kerchunk. However, I have only developed and tested for netcdf4 and hdf5, so there will certainly be some work needed to support hdf4. |
Is there no hdf4 work? It is very different. |
No there isn't any hdf4 work yet. However it seems like the goal is to make the hdf4 dmrpp spec very similar to the hdf5 one which means it will require some sort of extension (as opposed to a rewrite) as James mentioned above:
|
My HDF4 branch in kerchunk is very nearly complete. Everyone welcome to look! As for pyhdf4..., to use it, you need to have a very deep understanding of the specifics of the conventions used in a given file (maybe possible for modis) and how the C API works. If I can make my version work, I prefer pure-python. |
Is this the code? https://github.com/martindurant/fsspec-reference-maker/blob/df61060869e367da9674d33962631d81ead76865/kerchunk/hdf.py#L697 Seeing terms like "SDD" gave me flashbacks to the first time I opened one of these files. Thanks for all the work! Can we just throw some examples at it? |
Yes, that code. Please do play with it, but of course there are no guarantees. |
Looks like @martindurant's kerchunk HDF4 reader is in kerchunk. This means that someone could easily use it to add a VirtualiZarr HDF4 reader. |
Correct, I will do a kerchunk release today. -edit- done |
Thanks @martindurant ! Does someone have a small example HDF4 file we could use in VirtualiZarr's tests? It doesn't look like either of the PRs ((1), (2)) adding the HDF4 reader to kerchunk contain any tests... |
Could we support generating chunk manifests pointing to HDF4 files too? I know nothing about this format, but in #85 (comment) @jgallagher59701 mentioned that DMR++ can (or soon will) support it.
If DMR++ can index HDF4, and DMR++ can be translated to zarr chunk manifests (see #85), then presumably a reader for HDF4 directly to chunk manifests would also be possible?
cc @ayushnag @betolink