Reading from dmrpp index files? #85
The same thing was suggested on the kerchunk tracker: fsspec/kerchunk#221
Short answer: yes. DMR++ could be used to provide both the variable data, via chunk locations, and the attribute name/type/value information. DMR++ supports almost all of HDF5 at this point and almost all of HDF4, and we're close to complete (but with significant optimization needed) on HDF-EOS2. Our work on DMR++ is heavily focused on NASA/ESDIS and Earth Data Cloud needs, hence the focus on HDF5, HDF4 and HDF-EOS2. If you want to email me directly: [email protected]
I think this idea has a lot of potential, especially with integration into earthaccess. Vision: a user can find a NASA collection (or specific files) with … The …
@ayushnag what you described is a fantastic idea! A researcher could create an on-demand virtual dataset with 3 lines of code, and if we use the dmr++ it should be pretty fast, or at least faster than creating the kerchunk references directly. In case it's useful, H. Joe Lee from the HDF Group was working on a white paper on kerchunk and created this kerchunk generation from dmr++ reference files utility and other tools; maybe the script requires some updates, but it can be done (a pure format-to-format translation). If this gets integrated into earthaccess, the workflow could look like:

```python
import earthaccess

# search with spatiotemporal constraints
results = earthaccess.search_data(
    doi="10.3334/ORNLDAAC/2056",
    bounding_box=(-91.18, 17.43, -86.74, 21.65)
)

# will use Dask to translate the dmr++ references instead of kerchunk
earthaccess.consolidate_metadata(results, "gedi_yucatan.parquet", combine="by_coords", format="parquet")

# ... then we open it the same way
```

We have bi-weekly hacking hours for earthaccess; maybe we could meet there and discuss this integration with VirtualiZarr and your ideas for the dmr++ files. Thanks for taking the lead on this!
@agoodm I know you're aware of this project; I think all the cool stuff you're prototyping would fit somewhere in VirtualiZarr. @TomNicholas and @ayushnag, Alex is working on creating logical cubes for the many netCDF files in the ITS_LIVE project so that we can load them efficiently in xarray (or other Zarr/parquet-compatible tools).
I've been following along, but I admit only partly, because my knowledge of earthaccess and its capabilities is limited. However, @betolink, thanks for calling my/our attention to Joe Lee's work with DMR++ to Zarr. If you need/want any information about DMR++, including our forthcoming support for HDF4, please let me know. Also, I'm very interested in learning about this bit of code and what it implies: that is, are you using parquet to store the kerchunk JSON? Thanks!
@TomNicholas One thing I should mention is that the DMR++ does not have to be co-located with the data it describes. While NASA uses the DMR++ documents as sidecar files - they are objects with the same base name as the data files they describe and reside in the same S3 bucket - that is merely how NASA uses them. The DMR++ holds an href to the data it describes and, overriding that 'global' href, can include hrefs down to the granularity of each individual chunk of data. Thus, the DMR++ does not have to be a true sidecar file and is not limited to data values from one particular source (or format, for that matter). Our reader supports hrefs for both HTTP/S and the local file system.
I'm assuming this snippet is aspirational, as it seems to mix something kerchunk can do (write references as parquet) with the syntax that VirtualiZarr is aiming for.
That sounds very close to the chunk manifest (`ChunkManifest`) format that VirtualiZarr uses internally:

```json
{
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100}
}
```

So hopefully the translation between DMR++ and zarr chunk manifests should be smooth.
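For illustration, a minimal sketch of that DMR++-to-manifest translation (a hypothetical helper, not the parser being built in this thread; the `offset`, `nBytes`, and `chunkPositionInArray` attribute names follow the published DMR++ schema, but verify them against a real document):

```python
# Sketch: turn the dmrpp:chunk elements of a DMR++ document into a
# kerchunk/VirtualiZarr-style manifest dict. This collects every chunk in the
# document into one manifest; a real parser would do this per variable.
import xml.etree.ElementTree as ET

DMRPP_NS = "{http://xml.opendap.org/dap/dmrpp/1.0.0#}"

def chunks_to_manifest(dmrpp_path: str, data_href: str, chunk_shape: list[int]) -> dict:
    manifest = {}
    tree = ET.parse(dmrpp_path)
    for chunk in tree.iter(f"{DMRPP_NS}chunk"):
        # chunkPositionInArray holds element offsets like "[0,9000,0]";
        # divide by the chunk shape to get zarr-style chunk indices
        pos = chunk.get("chunkPositionInArray", "[0]").strip("[]").split(",")
        key = ".".join(str(int(p) // c) for p, c in zip(pos, chunk_shape))
        manifest[key] = {
            "path": data_href,
            "offset": int(chunk.get("offset")),
            "length": int(chunk.get("nBytes")),
        }
    return manifest
```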
Hi @jgallagher59701, Tom is right, the snippet is aspirational. Once these features get integrated into VirtualiZarr we will leverage them in earthaccess to create on-demand virtual datasets; I see this as having a client-side OPeNDAP server (in a way). Supporting HDF-EOS/HDF4 will be fantastic! This is one of the main bottlenecks for a lot of workflows (MODIS, VIIRS, etc.). NOTE: we already have this in earthaccess but we use kerchunk; with this dmr++ to chunk manifest translation we will only use kerchunk if there is no dmr++ available for a given dataset.
@betolink Should we focus on adding the ability to process dmr++ in earthaccess as an initial first step? I don't have much in the way of cycles this quarter - we're down two developers - but I can answer questions about dmr++. Also, as an additional thing to work on, we could investigate ways to store/access the dmr++ that would allow faster access to the needed information for a given variable (and possibly elide the bulky attribute information until/if it is needed). Doing that seems like it would provide one way to build your aspirational API.
I think this adapter should live in VirtualiZarr; @ayushnag is planning on starting it soon, I believe (thanks Ayush!!). I'm not sure what you mean by storing the dmr++ - are you talking about the resulting logical cubes or the original dmr++ files?
WRT 'store the dmr++': I was thinking about the Achilles heel of dmr++ - the documents can be quite large. Currently, we move the whole thing out of S3 and into memory to extract information. One option I'd like to work on is sharding the dmr++. There are several ways to determine if a dmr++ exists for a given file. One way is to ask CMR for information about the 'granule.' The dmr++ is not explicitly mentioned, but if a 'related URL' for OPeNDAP is present for data in the cloud, then a dmr++ can be found by taking the OPeNDAP related URL and appending '.dmrpp'. I can help with the specifics. This might be tangential, but are you planning on attending ESIP this summer? The meeting is in late July; it might provide a good venue to talk about these ideas with many of the NASA EDC folks. I will be there, as will @Mikejmnez.
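To make that lookup concrete, a hedged sketch (the CMR search endpoint and its feed/entry JSON layout are real; filtering the links on "opendap" is a heuristic of mine, not a documented contract):

```python
# Sketch: find a granule's OPeNDAP related URL via CMR and guess the dmr++
# location by appending ".dmrpp", as described above.
import requests

def guess_dmrpp_url(concept_id: str) -> str | None:
    resp = requests.get(
        "https://cmr.earthdata.nasa.gov/search/granules.json",
        params={"concept_id": concept_id},
        timeout=30,
    )
    resp.raise_for_status()
    entries = resp.json()["feed"]["entry"]
    if not entries:
        return None
    for link in entries[0].get("links", []):
        href = link.get("href", "")
        # Heuristic: NASA's OPeNDAP related URLs contain "opendap" in the URL
        if "opendap" in href.lower():
            return href + ".dmrpp"
    return None
```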
Unfortunately I'm not going to ESIP this time; I'm sure someone from ITS_LIVE/NASA Openscapes will be there. @jgallagher59701 thanks for helping with this!
I have made a basic version of the dmr++ parser here. It can read the DMR++ metadata and create a virtual xarray dataset:

```python
In [1]: from virtualizarr import open_virtual_dataset

In [2]: print(open_virtual_dataset("20210715090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.dmrpp", filetype='dmr++'))
Out[2]: <xarray.Dataset> Size: 6GB
Dimensions:           (time: 1, lat: 17999, lon: 36000)
Coordinates:
    time              (time) int32 4B ManifestArray<shape=(1,), dtype=int32, ...
    lat               (lat) float32 72kB ManifestArray<shape=(17999,), dtype=...
    lon               (lon) float32 144kB ManifestArray<shape=(36000,), dtype...
Data variables:
    mask              (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    sea_ice_fraction  (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    dt_1km_data       (time, lat, lon) int8 648MB ManifestArray<shape=(1, 179...
    analysed_sst      (time, lat, lon) int16 1GB ManifestArray<shape=(1, 1799...
    analysis_error    (time, lat, lon) int16 1GB ManifestArray<shape=(1, 1799...
    sst_anomaly       (time, lat, lon) int16 1GB ManifestArray<shape=(1, 1799...
Attributes: (2/47)
    Conventions:  CF-1.7
    title:        Daily MUR SST, Final product
```

Some next steps:
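As a usage note, once a few granules are virtualized this way they can be combined and serialized with VirtualiZarr's existing accessor (a sketch; the filenames are made up, and `to_kerchunk` is the accessor method VirtualiZarr documents for writing kerchunk-style references):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Hypothetical consecutive granules from the same collection
vds1 = open_virtual_dataset("day1.nc.dmrpp", filetype="dmr++")
vds2 = open_virtual_dataset("day2.nc.dmrpp", filetype="dmr++")

# Concatenate the manifests along time; compat="override" and coords="minimal"
# avoid comparing (and therefore loading) chunk data.
combined = xr.concat([vds1, vds2], dim="time", coords="minimal", compat="override")

# Persist as kerchunk-style references
combined.virtualize.to_kerchunk("combined.json", format="json")
```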
This is great @ayushnag! If you open a PR we can add specific code comments there.
We would add all the dependencies necessary to read dmr++ files as optional dependencies to virtualizarr anyway.
Reading from a specific group via a new group kwarg …

If you can reproduce this for me, that would be super helpful. I'm happy to help with trying to make a reproducer, because I want to proactively find all the ways that xarray might try to create indexes.
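On the indexes point, a small usage sketch; the `indexes={}` and `loadable_variables` kwargs are part of VirtualiZarr's `open_virtual_dataset`, though whether the dmr++ reader honors them is an assumption here, and the filename is hypothetical:

```python
from virtualizarr import open_virtual_dataset

# Open without creating any in-memory pandas indexes: every variable stays a
# ManifestArray, so nothing should trigger xarray's index machinery.
vds = open_virtual_dataset("file.nc.dmrpp", filetype="dmr++", indexes={})

# Alternatively, load just the 1D coordinates as real arrays (with indexes)
# while the data variables remain virtual.
vds = open_virtual_dataset(
    "file.nc.dmrpp",
    filetype="dmr++",
    loadable_variables=["time", "lat", "lon"],
)
```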
This is great, thanks @ayushnag! @jgallagher59701 is out this week, but I can help answer some of these questions for now.
DAP2.0, the previous version of DAP4, did not use dmrpp, but DAP2 responses can still be generated from the DMR++.

The short answer is: yes.
As @Mikejmnez says, you can use the DMR++, via the Hyrax server, to build DAP2 responses. But for many newer datasets that's not really very useful, since those datasets use data types not supported by DAP2 (the most common types 'missing' from DAP2 are Groups and Int64s). We are really trying to move beyond DAP2. I should add to the above that many of the newer features in DMR++ are there to support HDF4 - yes, '4' - and that requires some hackery in the interpreter. Look at how the chunk elements can now contain block sub-elements: in HDF4, a 'chunk' is not necessarily atomic. Also complicating the development of an interpreter is the use of fill values in both HDF4 and HDF5, even for scalar variables. That said, we have a full interpreter in C++, which I realize is not exactly enticing for many ;-), but that means there is code for this, and this 'documentation' is 'verifiable' since it's running code.
@ayushnag @betolink I mentioned a while back that I was going to ESIP this summer and could present on this - to increase visibility inside NASA, etc. Plus, it's really interesting. Here's the rub: they need slides this week - well, last week, but I got an extension. Do you have time to talk with me about this? Do you think a short (standard 15 min) talk on this is a good idea at this time, or is it premature? Let me know.
@jgallagher59701 perhaps the slides I wrote on VirtualiZarr for the Pangeo showcase last week might be useful? Also, unrelated, but it would be great if one of you could fill out this survey in support of a grant application to NASA to support Zarr:
@jgallagher59701 Sure, I can meet this week and maybe we can figure out if there is enough content for a talk. I do have some results with the basic parser functionality and a simple …
@ayushnag @TomNicholas The slides are good and the screenshot looks very interesting. Let's talk tomorrow (May 30th). For this talk, I should probably try to explain DMR++ and how it fits into the VirtualiZarr work. But I also want to raise interest in VirtualiZarr within ESDIS (NASA's Earth Science Data and Information System). I'm free from 2pm MDT onward.
This sounds like a great meeting that I sadly won't be able to join - I will spend all day at NCAR. I will catch up afterwards!
Hi @jgallagher59701, unfortunately I have another meeting at that hour; I think it'll be great to show this at ESIP! @ayushnag one thing I forgot to ask: what would happen if we don't want to combine the references in one go? I'm thinking this could be another workflow:

```python
import earthaccess

results = earthaccess.search_data(**kwargs)
vds = earthaccess.open_virtual_dataset(results, concat_dim=None)
# Will vds be a list of individual references? If so, we could pre-process
# them and then use xarray for the concatenation
ds = xr.open_mfdataset(vds, preprocess=some_func, parallel=True)
```

EDIT: @ayushnag nvm, I think since you're already returning DataArrays the …
At the Research Data Commons workshop? I'm there today so we should chat 😄
I might be missing something about how DMR++ works, but I suggest trying to follow the pattern of …
I completely agree with your suggestion @TomNicholas.
Thanks for the suggestions everyone! I originally wanted to test … Also I will open an issue+PR in …
I'm now a bit confused what these …
If this is the case, then I would suggest you either rename the second function to simply …
👍
A while ago Ryan Abernathey suggested this; I like the idea, but it would be problematic as NASA datasets are not uniform, and I see a backend as a way to deal with file formats for our array info - this would be more than that. Going back to what would be nice to have as a top-level API, I think mapping what we have to xarray would be nice:

```python
vds = earthaccess.open_virtual_dataset(cmr_result)
```

and for multiple results we could keep the current functionality with just a name change:

```python
vds = earthaccess.open_virtual_mfdataset(cmr_results, concat_dim="Time", coords="minimal", ...)
```
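A sketch of how the multi-result wrapper could be little more than the single-result opener plus a lazy concat (the names follow the proposal above and are hypothetical, not a released earthaccess API):

```python
import xarray as xr

def open_virtual_mfdataset(cmr_results, open_one, concat_dim="Time", coords="minimal"):
    # open_one is the single-result opener proposed above
    # (earthaccess.open_virtual_dataset); passed in to keep the sketch
    # self-contained.
    virtual_datasets = [open_one(r) for r in cmr_results]
    # compat="override" avoids comparing chunk data, keeping everything lazy
    return xr.concat(virtual_datasets, dim=concat_dim, coords=coords, compat="override")
```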
@TomNicholas I am glad we met in person :) and had some time to chat.
@betolink suggested today the idea of reading from dmrpp (DMR++) files, which are apparently a sidecar filetype that is often already included with certain NASA datasets and is basically a chunk-references format. Could we imagine ingesting these into VirtualiZarr as another way to read byte ranges? I.e. create a `ChunkManifest`/`ManifestArray` object from the DMR++? Or maybe even imagine writing out to this filetype?
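For reference, constructing those objects by hand from DMR++-style byte ranges might look roughly like this (the class names are from virtualizarr's internals; the exact import paths and constructor signatures are assumptions to check against the current codebase):

```python
# Sketch: build a ManifestArray directly from byte ranges like those a DMR++
# provides. Exact signatures may differ between virtualizarr versions.
import numpy as np
from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.zarr import ZArray

# Two chunks of a (5, 20) array, chunked as (5, 10), stored in one netCDF file
manifest = ChunkManifest(
    entries={
        "0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
        "0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    }
)
zarray = ZArray(shape=(5, 20), chunks=(5, 10), dtype=np.dtype("int32"))
marr = ManifestArray(zarray=zarray, chunkmanifest=manifest)
```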