Zarr extension for stacked / concatenated virtual views #288
Comments
I want to make a more concrete proposal of how a ZEP for concatenation might work, based on some discussions we had in-person during the NYC Zarr Sprint last week, and building off this suggestion #287 (comment).
Abstract
Provides a way to define arrays in a zarr store as concatenations of other arrays in the store.
Motivation and Scope
A common problem is providing a zarr-compliant interface to sets of existing files, often provided in some legacy file format (e.g. netCDF4). For instance imagine a series of netCDF4 files containing daily meteorological data, written out as one file per month. In general, these netCDF4 files might have different codecs (as compression schemes and parameters change) and variable lengths along the concatenation dimension.
The "chunk manifest" proposal (#287) deals with how we could provide access to the bytes in the legacy files from a zarr store interface, via defining a storage transformer for zarr arrays that redirects requests to fetch specific byte ranges within the netCDF files. (For access to cloud URLs we would also need ZEP008.)
However, with one chunk manifest per zarr array (as proposed in #287), and each zarr array conventionally having one set of codecs (and regular chunks), a single standard zarr array cannot represent all the chunks of one variable across the entire time dimension of this hypothetical dataset.
Instead, we propose defining a way to represent concatenation of a series of arrays containing the data in each file, and exposing that concatenated array as part of the store.
Usage and Impact
With this ZEP, implementations would be able to define concatenation methods for Zarr arrays (either lazy or eager), which would generate the definition of the new array in the store as a reference to multiple existing arrays.
Together with #287 (and an implementation of both in zarr-python), that would allow us to standardize the "
Implementations might then be able to write code like:
import zarr
a1 = zarr.open("array1.zarr") # zarr.Array
a2 = zarr.open("array2.zarr") # zarr.Array
# this would create a Zarr array with the correct extensions to support indirection to a1 and a2
# (need to figure out the right API)
a3 = zarr.concat([a1, a2], axis=0, store="array3.zarr")
a3[0] = 1 # write some data...automatically goes into the correct underlying array
With Array-like concatenation like this could then open the door to wrapping via more expressive higher-level APIs like xarray, allowing users to go from many netCDFs to a single all-encompassing Zarr store using the same API calls they currently use to open all those netCDFs together using xarray (see pydata/xarray#8699):
import xarray as xr
ds = xr.open_mfdataset(
'/my/files*.nc',
engine='zarrify', # an xarray IO backend that uses kerchunk to read byte ranges then returns lazy representations of zarr.Array objects with concat defined
combine='nested', # using 'by_coords' instead would actually check coordinate data
parallel=True, # would use dask.delayed to generate reference dicts for each file in parallel
)
ds # now wraps a bunch of lazily-concatenated zarr.Array objects
ds.zarrify.to_zarr(store='out.zarr') # xarray accessor that uses zarr-python to concatenate the lazy zarr arrays and writes the resulting zarr store out (which would conform to this ZEP)
Concatenation of arbitrary-length chunks along the concatenation axis might be another way to get the variable-length chunks functionality proposed in ZEP003 (or support for ZEP003 might be necessary for implementations of this ZEP to be able to concatenate variable-length files).
This proposal might find other uses outside of the legacy file context, if for some reason you wanted to create a zarr array that used different codecs for different parts of the array.
Backward Compatibility
This change is fully backwards-compatible: all old data would remain usable. However, zarr arrays written as concatenated arrays would only be readable using v3 implementations that support this ZEP.
Detailed description
This section should provide a detailed description of the proposed change. It should include examples of how the new functionality would be used, intended
A concatenated array
"name": "concatenated",
"concatenation": {
"axis": 0, # int or list of ints
"arrays": ["../a", "../b"] # list of relative paths to other arrays in store
}
Array-specific attributes (such as codecs) would not be specified, as they can be inferred by looking at the metadata of arrays
Cycles of references to arrays (e.g.
Multi-dimensional concatenation should be supported and one way would be by requiring
Q: Should
Diagram (3) on the right shows where the metadata for array
Related Work
Implementation
For an implementation to support reading array
Alternatives
Discussion
|
I'm not sure I understand this - does this happen before requesting the key from the store?
Not sure I understand this either - isn't this problem independent of this proposal, so long as we only allow concatenation of existing arrays within the same store? This seems more to do with how the chunk manifest or URL syntax would work.
Mentioned above. |
Is this question regarding URL syntax? If only arrays within a single store can be concatenated then URL syntax isn't needed. But concatenating arrays that are not in a single key-value store is an important use case. For example you might wish to combine some portion of a public dataset stored on s3 with something generated locally. I suppose this problem could be punted to the key-value store layer by just saying that you have to use a "url" key-value store where everything is part of a single store. But we still need to standardize on the URL syntax to allow different implementations to be compatible with concatenation arrays that specify arrays by URL. Currently nothing in zarr v3 relies on references to anything else in the key-value store so we haven't had to address any of these issues.
Yes the problem doesn't really apply if you can only reference arrays within the same store (especially if you prohibit "../" in the paths), but I imagine that users may often want to concatenate arrays that are not in the same store.
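To make the same-store restriction concrete, here is a minimal sketch of how an implementation might resolve a relative reference such as "../a" against the path of the concatenated array and refuse anything that would leave the store. The function name `resolve_component_path` and the rule of rejecting absolute references are assumptions for illustration, not part of any spec.

```python
import posixpath


def resolve_component_path(array_path: str, ref: str) -> str:
    """Resolve `ref` relative to the node at `array_path` within one store.

    Hypothetical helper: paths are store-relative keys such as "group/c".
    Raises ValueError if `ref` is absolute or would escape the store root.
    """
    if ref.startswith("/"):
        raise ValueError(f"absolute references are not allowed: {ref!r}")
    # Keep everything relative so that excess ".." segments survive normpath
    # and can be detected below.
    joined = posixpath.normpath(posixpath.join(array_path.strip("/"), ref))
    if joined == ".." or joined.startswith("../"):
        raise ValueError(f"{ref!r} escapes the store root")
    return joined


# The array "group/c" referring to its siblings, as in the proposal's example.
print(resolve_component_path("group/c", "../a"))  # group/a
```

An implementation that also wanted to prohibit "../" outright could reject any reference containing a ".." segment before resolving it.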
|
I think that initially scoping this to arrays within the same store is a great place to start in terms of prototyping. |
No I meant I don't understand the context of the coordinate transform stuff.
I hadn't really thought about this. But it's certainly easier to restrict scope for now as Ryan says.
If you used a chunk manifest to point to the public dataset then you could create a store that concatenated the remote data with the local data couldn't you? |
In tensorstore it is inferred, but for each array in the stack it is required to specify the domain as part of the JSON spec, so that the overall grid can be determined without doing any I/O. That means that the metadata for each component array (called a "layer" in tensorstore) can be opened lazily on demand. Therefore I would say that the JSON metadata should definitely be designed to allow the grid to be determined without having to read all of the component arrays, but I don't think the grid needs to be explicitly represented if it can be otherwise inferred (e.g. from a coordinate transform that is specified).
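As a rough illustration of determining the grid without I/O: if the metadata records each component's shape, the overall shape and the offset of each component along the concatenation axis follow from arithmetic alone. The names `concat_layout` and `locate` are made up for this sketch, and it covers only plain concatenation along one axis, not tensorstore's general domains.

```python
from bisect import bisect_right
from itertools import accumulate


def concat_layout(shapes, axis):
    """Return (overall_shape, offsets) for concatenating `shapes` along `axis`.

    `shapes` are the declared shapes of the component arrays; no array data
    or component metadata is read.
    """
    first = shapes[0]
    for shape in shapes:
        if len(shape) != len(first):
            raise ValueError("all components must have the same rank")
        for dim, (n, m) in enumerate(zip(shape, first)):
            if dim != axis and n != m:
                raise ValueError(f"shape mismatch on axis {dim}")
    # offsets[k] is the global start of component k; offsets[-1] is the total.
    offsets = [0, *accumulate(shape[axis] for shape in shapes)]
    overall = list(first)
    overall[axis] = offsets[-1]
    return tuple(overall), offsets


def locate(offsets, index):
    """Map a global index on the concat axis to (component, local index)."""
    if not 0 <= index < offsets[-1]:
        raise IndexError(index)
    k = bisect_right(offsets, index) - 1
    return k, index - offsets[k]


# Three monthly files of daily data with 31, 28 and 31 time steps.
shape, offsets = concat_layout([(31, 180, 360), (28, 180, 360), (31, 180, 360)], axis=0)
print(shape, offsets)       # (90, 180, 360) [0, 31, 59, 90]
print(locate(offsets, 31))  # (1, 0): first day of the second file
```

A read or write at a global index can then be routed to a single component, which is the only one whose metadata needs to be opened.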
I don't think it would be helpful to limit to a single group but confused deputy problem issues should definitely be considered.
I think it would definitely be desirable to do it all as a single virtual zarr array; having possibly many nested virtual zarr arrays to do one multidimensional concatenation/stack would be unfortunate. However, I would suggest considering the tensorstore approach of "layers" with arbitrary coordinate transformations instead of concatenation. Just plain concatenation does not permit a common use case of stacking, e.g. a bunch of 2-d arrays into a 3-d array.
Yes this would require variable-length data types, so we could defer this until we have standardized that, but should perhaps think about how it could work.
I don't think so.
I don't think this fits well as a storage transformer. To make it work with a storage transformer you would also need to make use of ZEP3 variable-size chunking and include the chunking of each component array, which would be annoying. |
I'm not sure how the concatenation would interact with storage transformers. But basically I'd say the goal should be to allow you to do any nested combination of np.stack or np.concatenate([arbitrary_indexing_op_0(array_0), arbitrary_indexing_op_1(array_1), ...]).
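For reference, this is what the eager NumPy equivalent looks like. It is only a small illustration with made-up arrays: it shows that stacking can be written as concatenation once an indexing operation that inserts a new axis is allowed, and that cropping is just another indexing operation applied before concatenating.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
b = np.arange(12, 24).reshape(3, 4)

# Stacking two 2-d arrays into a 3-d array is concatenation applied to
# views that each gained a new leading axis via indexing.
stacked = np.concatenate([a[np.newaxis], b[np.newaxis]], axis=0)
print(stacked.shape)  # (2, 3, 4)
print(np.array_equal(stacked, np.stack([a, b], axis=0)))  # True

# Cropping each input before concatenating is the same pattern with a
# different indexing operation per component.
cropped = np.concatenate([a[:2, :], b[1:, :]], axis=0)
print(cropped.shape)  # (4, 4)
```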
Yes but the same URL syntax issues apply to the chunk manifest proposal so if it has been addressed there then the same solution could equally well be used for concatenation. |
Thanks for all this quick feedback @jbms !
I agree about stacking - I completely forgot that was a separate function from
Huh, so the cropping you were referring to would be an example of an arbitrary indexing operation? That's a pretty big generalization...
They are just separate issues aren't they? You solve the URL problems in the chunk manifest, but the concatenated array doesn't care how exactly you solved them, only that there is an array with that name in the store. |
|
In tensorstore, index transforms provide a relatively compact json representation of any combination of:
This isn't tied to anything else in tensorstore and I do think it could make a lot of sense to just use the same json representation for a zarr concatenation/stack representation as well.
What I mean is that if we solve the issue for chunk manifest, then we will have a section in the zarr spec that describes how URLs are handled for that case. So we could simply refer to that for the concatenation case as well, and that would, in my opinion, be much more convenient than having to create a chunk manifest representation of an array just to concatenate it. After all, you could similarly restrict the chunk manifest feature to only work in the same store. In any case I agree that it would make sense to limit it to just the same store for now/for prototyping purposes, since the URL syntax is orthogonal to everything else. |
So when implemented, the original proposal above implies that implementations might want to provide a lazy serializable concatenatable array, and you're suggesting adding lazy serializable indexing to that too. The addition of indexing seems like a lot for one ZEP? Can we write this one in such a way that adding indexing later would still be compatible? Or you think it all needs to be done in one go? I'm just struggling to imagine what a JSON representation of the process of chaining that many operations on that many separate (and intermediate) arrays would be...
I was never suggesting that we needed a chunk manifest representation of an array just to concatenate it! I just want to be able to concatenate any array, no matter whether it used a chunk manifest or URL or inlined data or whatever.
I think we are on the same page about that. |
I think they could potentially be separated, but we should design this proposal to accommodate "inline" composition of virtual views, i.e. without having to actually store the intermediate arrays as separate metadata files. I do think that without support for coordinate transforms/indexing, a significant fraction of what could be use cases for this virtual concatenation view will not be possible. For example, to get the equivalent of
You can see the tensorstore documentation for one example: https://google.github.io/tensorstore/python/api/tensorstore.stack.html
|
It occurred to me that concatenation could be supported as an array -> bytes codec in conjunction with ZEP 3 (variable-size chunks): The codec (maybe called
I'm not sure if the codec approach is the right one but it may be worth considering. |
I agree that stacking is very important to support, but I don't understand why indexing/coordinate transforms are required to support stacking. Why can't we just have something like:
"name": "concatenated",
"stack": {
"axis": 1, # int position of new axis
"arrays": ["../a", "../b"] # list of relative paths to other arrays in store
}
and then the extra dim is just listed in the array's
Okay thank you, I think I'm beginning to understand what you've created in tensorstore with the layers. That's pretty cool! So basically we would need a JSON-serializable representation of chaining arbitrary operations, involving multiple intermediate arrays (which don't exist in the store). Is there an example of this kind of thing being done before? Ryan mentioned ncml, and you've mentioned tensorstore's layers, but are there other examples to copy?
This seems clever, but also rather magical... Also wouldn't it not leave any space for indexing and coordinate transforms? Also also wouldn't this break the idea that you can know the shape of all arrays in the store only by reading metadata? You would actually have to fetch chunk data, decode it, then go back to the store now you know which arrays it refers to and fetch their metadata... |
Another possible use case for concatenation that came up talking to someone today: padding with NaNs to make dimensions match. The use case was zarr-ifying satellite swaths that have dimension lengths that almost but don't quite align. I think we could use concatenation to fix this by creating a VirtualZarrArray which only contained a fill value. |
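A quick eager sketch of that padding idea in NumPy, where the NaN block plays the role of the fill-value-only virtual array. The helper name `pad_to_length` is made up, and it assumes a floating-point dtype, since NaN cannot be stored in integer arrays.

```python
import numpy as np


def pad_to_length(swath, target_len, axis=0):
    """Extend `swath` to `target_len` along `axis` by concatenating NaNs.

    Assumes a floating-point array; the filler block stands in for a
    virtual array that holds nothing but a fill value.
    """
    missing = target_len - swath.shape[axis]
    if missing < 0:
        raise ValueError("swath is already longer than target_len")
    pad_shape = list(swath.shape)
    pad_shape[axis] = missing
    filler = np.full(pad_shape, np.nan, dtype=swath.dtype)
    return np.concatenate([swath, filler], axis=axis)


swath = np.ones((3, 5))
padded = pad_to_length(swath, 4)
print(padded.shape)  # (4, 5)
```

In a lazy, metadata-only version nothing would be materialised: the filler would just be one more component in the list of concatenated arrays.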
Some things that came up in the ZEP meeting today:
|
I think we could have a concept of "virtual arrays" that exist purely as JSON metadata (this would be very similar to the tensorstore spec concept). We could then define a way to specify these purely as a self-contained JSON object that isn't necessarily stored anywhere. In the concatenation virtual array metadata, the component arrays could then be specified either as paths/urls to other stored arrays, or as inline JSON virtual arrays. This would provide a way to arbitrarily compose virtual arrays without creating separate files for storing them. I would suppose that in these inline JSON virtual arrays, any paths/urls would be interpreted relative to some "base url" implied by the context. |
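To sketch what that could look like, the example below nests one virtual array inline inside another and walks the structure to list the stored arrays it ultimately refers to, resolving each path against a base path. The key names reuse the "concatenation" layout from the proposal above, but the nesting rule and the `collect_stored_paths` helper are assumptions, not an agreed schema.

```python
import json
import posixpath


def collect_stored_paths(node, base):
    """Yield the store paths referenced by a (hypothetical) virtual-array spec.

    A component is either a string path to a stored array, resolved relative
    to `base`, or an inline dict describing another virtual array.
    """
    if isinstance(node, str):
        yield posixpath.normpath(posixpath.join(base, node))
    else:
        for child in node["concatenation"]["arrays"]:
            yield from collect_stored_paths(child, base)


spec = {
    "name": "concatenated",
    "concatenation": {
        "axis": 0,
        "arrays": [
            "../a",
            # Inline virtual array: no separate metadata file is stored for it.
            {
                "name": "concatenated",
                "concatenation": {"axis": 1, "arrays": ["../b", "../c"]},
            },
        ],
    },
}

# The whole composition is one self-contained JSON document.
assert json.loads(json.dumps(spec)) == spec
print(list(collect_stored_paths(spec, "group/view")))
# ['group/a', 'group/b', 'group/c']
```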
This sounds like a good direction! |
I'm not sure I understand - are you saying these JSON objects would live outside of a Zarr store entirely?? Or is this suggestion an idea for implementing "hidden arrays"? The arrays we want to "hide" are the ones that have the real data in them (or they point to the real data via chunk manifests). |
I imagined that the json object would be essentially interchangeable with a path/url to a stored zarr array, and could therefore be passed directly to e.g. |
Splitting out discussion from #287
See Ryan's comment here:
#287 (comment)
Stacking/virtual views are implemented in tensorstore here: https://google.github.io/tensorstore/driver/stack/index.html
A few points that I think need to be addressed: