Replace VirtualiZarr.ZArray with zarr ArrayMetadata #175
Just wanted to say that the Zarr V3 class behaviors are not set in stone yet. If you have feedback on the V3 API or find potential incompatibilities with V2, please report them as issues on Zarr-python: https://github.com/zarr-developers/zarr-python/issues
My sense is that in terms of file format both zarr v2 and zarr v3 should be supported, but in terms of library dependencies it would be fine to mandate that only zarr-python >= 3 is supported.
Thanks for taking the initiative here @ghidalgo3 !
I was actually hoping we could use some class from
I see two ways to do this:
I'm not sure which of these two is the best approach. I also completely agree with @jsignell - all of this should be supported through only a dependency on
To minimize the code changes needed in VirtualiZarr, I'll attempt the first option.
Adding a dependency on See the reorg that happened here about 3 months ago for ZarrV3: zarr-developers/zarr-python#1809 and some attempts in Kerchunk to support ZarrV3: Maybe the
@ghidalgo3 please please open a Zarr issue to track any API incompatibilities between V2 and V3. 🙏
@ghidalgo3 it might be a good idea to have a separate first PR that makes virtualizarr pass every test using zarr v3, and then come back to this PR once that compatibility can be relied upon. Otherwise this might become a rabbit hole.
That class is public API, but it might be changing soon, and while it is intended for this use case it also has some extra stuff that goes beyond "representing metadata". So it should do what you want, eventually :)
Great work so far @ghidalgo3 !
virtualizarr/manifests/array.py (Outdated)
    """
    Individual chunk size by number of elements.
    """
    if isinstance(self._zarray.chunk_grid, RegularChunkGrid):
This is a nice check, but I think we could actually perform it at ManifestArray construction time, because right now a lot of other things will break if the chunk grid is not regular.
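A minimal sketch of what checking at construction time could look like. `RegularChunkGrid` and `FakeMetadata` below are simplified stand-ins defined only for this example (they are not zarr's real classes), and `ManifestArraySketch` is a hypothetical name, not VirtualiZarr's actual `ManifestArray`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegularChunkGrid:
    """Simplified stand-in for a regular chunk grid (assumption, not zarr's class)."""
    chunk_shape: tuple


@dataclass(frozen=True)
class FakeMetadata:
    """Stand-in for array metadata; only the fields this sketch needs."""
    shape: tuple
    chunk_grid: object


class ManifestArraySketch:
    """Hypothetical wrapper that rejects irregular chunk grids up front."""

    def __init__(self, metadata: FakeMetadata) -> None:
        # Fail early, so later properties can assume a regular grid.
        if not isinstance(metadata.chunk_grid, RegularChunkGrid):
            raise NotImplementedError(
                "Only regular chunk grids are supported, got "
                f"{type(metadata.chunk_grid).__name__}"
            )
        self._metadata = metadata

    @property
    def chunks(self) -> tuple:
        # No isinstance check needed here: the constructor already enforced it.
        return self._metadata.chunk_grid.chunk_shape
```

With the check in the constructor, properties such as `chunks` no longer need their own branch for irregular grids.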
match zarray:
    case ArrayV2Metadata(compressor=compressor, filters=filters):
        return Codec(compressor=compressor, filters=filters)
    case ArrayV3Metadata(codecs=codecs):
        return codecs
    case _:
        raise ValueError("Unknown ArrayMetadata type")
Is there a reason to prefer this match...case over a standard if isinstance()...else syntax?
Mainly a style preference coming from languages with discriminated unions and compilers, but if you'd like the whole code base to be consistent I can rewrite it with isinstance.
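For comparison, the match...case dispatch above could be written with isinstance roughly as follows. `V2Meta` and `V3Meta` are hypothetical stand-ins for `ArrayV2Metadata` and `ArrayV3Metadata`, and a plain dict stands in for the `Codec` model, so that the sketch runs without zarr installed.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class V2Meta:
    """Hypothetical stand-in for ArrayV2Metadata."""
    compressor: Any = None
    filters: Any = None


@dataclass
class V3Meta:
    """Hypothetical stand-in for ArrayV3Metadata."""
    codecs: list = field(default_factory=list)


def get_codecs(zarray: Any) -> Any:
    """Same branching as the match statement, using isinstance."""
    if isinstance(zarray, V2Meta):
        # A dict stands in for Codec(compressor=..., filters=...).
        return {"compressor": zarray.compressor, "filters": zarray.filters}
    elif isinstance(zarray, V3Meta):
        return zarray.codecs
    else:
        raise ValueError("Unknown ArrayMetadata type")
```

Both forms behave the same here; the match version additionally destructures the attributes in the pattern itself.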
     chunks=chunks,
-    compressor="zlib",
+    compressor={"id": "zlib"},
Is the presence of "id" a v2 vs v3 difference? Because you don't seem to use it in the other tests.
In V3, the codecs array is made up of (name, configuration) objects, but in V2 the id field serves as the name. In this particular test, I just chose to convert the ZArray to an ArrayV2Metadata.
Actually, please ignore any of the changes in tests for now, because I will need to edit almost every test to excise ZArray :) The few changes I added were done up until I hit the kerchunk module import issue.
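To illustrate the structural difference described above, here is a small sketch that reshapes a V2-style compressor dict into a V3-style (name, configuration) object. The function name is hypothetical, and the mapping is purely structural: real V3 codec names and configuration keys do not necessarily match V2 ids one-to-one.

```python
from typing import Any


def v2_compressor_to_v3_style(compressor: dict) -> dict:
    """Reshape {"id": ..., **options} into {"name": ..., "configuration": {...}}.

    Structural illustration only; it does not translate codec names or options.
    """
    # Everything except "id" is treated as configuration.
    configuration: dict[str, Any] = {
        key: value for key, value in compressor.items() if key != "id"
    }
    return {"name": compressor["id"], "configuration": configuration}


v2_style = {"id": "zlib", "level": 1}
v3_style = v2_compressor_to_v3_style(v2_style)
```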
 class Codec(BaseModel):
-    compressor: str | None = None
+    """
+    ZarrayV2 codec definition.
+    """
+
+    compressor: str | dict[str, Any] | None = None
+    """
+    If it's a string, it's the compressor ID.
+    If it's a dict, it's the full compressor configuration.
+    """
     filters: list[dict] | None = None
+
+    def __repr__(self) -> str:
+        return f"Codec(compressor={self.compressor}, filters={self.filters})"
Should this class be moved upstream?
I'm not sure, because ZarrV3 now has a notion of codecs which is different from VirtualiZarr's notion of codecs. A VirtualiZarr codec is really the combination of filters and compressor from V2 metadata, which are easily transformable to V3 codecs.
Maybe a zarr.ArrayV2Metadata can have a property that projects the compressor and filters into V3 Codecs, @d-v-b thoughts on that? In that case, Codec could even be a property of the ABC ArrayMetadata.
I actually think that codecs: ListWithInternalStructure[Codec] (i.e., what zarr v3 metadata uses) is not the best interface. See a writeup on that point of view here.
But if we ignore that concern for a moment, I'm curious about @TomNicholas's question: what could we change in zarr-python so that you could just import a functionally equivalent class from there (or import a dataclass and wrap with pydantic)? I have a selfish interest in this question since I'm likely going to follow in your footsteps when I implement better codec support over in pydantic-zarr.
Thanks for forcing me to read the Zarr code more closely! I found VirtualiZarr could probably just use create_codec_pipeline to turn V2 or V3 metadata into a BatchedCodecPipeline, and then VirtualiZarr can just deal with CodecPipeline as the abstraction over V2 and V3. Additionally, VirtualiZarr could change its Codec to use V2Filters and V2Compressor from Zarr, or I might get rid of it entirely.
What does VirtualiZarr do with the codecs?
All virtualizarr currently needs to do with the codecs is:
- Check that any two arrays you ask it to concatenate have the same codecs (see Virtual concatenation of arrays with different codecs or dtypes #5)
- Make sure the codecs are recorded in the kerchunk references when they get written out to disk.
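The first requirement in the list above can be sketched as a plain equality check before concatenation. The function name and the dict representation of a codec are assumptions made for this example, not VirtualiZarr's actual implementation.

```python
def check_same_codecs(codecs: list) -> None:
    """Raise if the arrays to be concatenated do not all share one codec.

    `codecs` holds one codec description per array; any objects that
    support == work (plain dicts are used in this sketch).
    """
    if not codecs:
        return
    first, *rest = codecs
    for other in rest:
        if other != first:
            raise NotImplementedError(
                "Cannot concatenate arrays with different codecs: "
                f"{first!r} vs {other!r}"
            )


# Hypothetical codec descriptions for two arrays with identical settings.
zlib = {"compressor": {"id": "zlib", "level": 1}, "filters": None}
check_same_codecs([zlib, dict(zlib)])  # passes silently
```

Whatever class ends up representing codecs (a V2-style Codec or a V3 codec pipeline), this check only needs it to support a meaningful equality comparison.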
Before going further with these changes, I'd like to discuss whether this direction is correct. Based on #17, I tried using Zarr-V3's Array definition instead of VirtualiZarr's ZArray definition. Keeping the names straight is hard: ZArray is the current code and Array is zarr's type.
The problems I ran into were:
- kwargs for Array are different between v2 and v3 for zarr, but VirtualiZarr.ZArray I believe is first created with the v2 properties and then transformed to v3 format when zarr_v3_array_metadata is called. After the replacement, anywhere VirtualiZarr creates an Array, one must choose v2 or v3 kwargs, or again introduce something that abstracts over the versions. Unless VirtualiZarr chooses to support Zarr V3 only, which I believe is the intention.
- zarr.array.metadata because the internal zarr property names have changed. Probably there will be other code-level incompatibilities from metadata property name differences; I think "Is the compressor type right? #94" ran into this too.
This is just from a few hours of trying this out.
If VirtualiZarr needs to support both Zarr V2 and V3 stores, then this change needs to handle all the little differences in Zarr's API between 2 and 3. But if only V3 is supported, this becomes easier and replacing ZArray with Array should not be too difficult.
Thoughts?
Incompatibilities:
To be filled later:
docs/releases.rst
api.rst