Writing to parquet (following kerchunk format) #72
I'll see if I can find some time!
I may be interested, too. This seems similar to alxmrs/xarray-sql#4. Tom, can you help me understand this feature better? I'm not sure how kerchunk JSON references work today. Xarray also has a …
Nice to see you here @alxmrs !
I don't think this is the same thing. IIUC that issue suggests taking some existing data on-disk in a zarr store, and creating a virtual parquet file (or in-memory equivalent) to index into the zarr data on disk. So the user sees parquet. Whereas the feature in this issue only uses parquet as a means to an end, neither as the original data format nor as the format in which the data is presented to the user.
The kerchunk library can read netCDF files on disk and create an in-memory representation of the byte ranges into those files which you would need to read to fetch particular chunks. Given that representation (the "references"), the data can then be read as if it were a zarr store, and the references are normally cached to disk as JSON. If you have a massive amount of data, caching as JSON might become a bottleneck, which is why kerchunk also implements caching the byte-range references to disk as a parquet file.

All of this is still a way of reading array data in zarr-like form. The original data is in netCDF, not parquet, and the data is read like zarr, not like parquet. Only the list of byte ranges for each zarr chunk happens to be saved to disk as a parquet file, and that format is essentially an implementation detail.
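For concreteness, a minimal sketch of that workflow (assuming a local file `data.nc`; the exact kerchunk call signatures may differ between versions):

```python
import json

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

# Scan a netCDF4/HDF5 file and build the in-memory byte-range references
with fsspec.open("data.nc") as f:
    refs = SingleHdf5ToZarr(f, "data.nc").translate()

# The simple on-disk form is JSON...
with open("refs.json", "w") as out:
    json.dump(refs, out)

# ...but kerchunk can also persist the same references as parquet,
# which scales better for very large reference sets
from kerchunk.df import refs_to_dataframe
refs_to_dataframe(refs, "refs.parquet")
```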
Pardon my confusion; I was excited at the chance of an overlap. It's been great to follow along with this project so far! Thanks for the explanation and the link. TIL that kerchunk has parquet references in the first place. Raphael, I bet you're closer to this issue than I am.
I'm admittedly still getting my bearings, but my understanding was that the objective of this library is to store the references in an array format rather than a key-value store. To me that sounds like you would want to store the references themselves in zarr. I'm trying to understand how writing to parquet fits into this vision. Is this part of a goal to attain feature-completeness relative to kerchunk? Or am I missing something?
We need to distinguish between the in-memory representation of references and on-disk storage. For the in-memory representation the aim (which has been achieved) is to use arrays, i.e. the `ManifestArray` class.
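A conceptual sketch of what "references as arrays" means (a hypothetical layout; issue #39 discusses a numpy structured array along these lines, not necessarily this exact dtype):

```python
import numpy as np

# One entry per zarr chunk: where that chunk's bytes live on disk.
# (Hypothetical dtype, for illustration only.)
manifest_dtype = np.dtype(
    [("path", "O"), ("offset", np.uint64), ("length", np.uint64)]
)
manifest = np.empty((2, 2), dtype=manifest_dtype)  # a 2x2 grid of chunks
manifest[0, 0] = ("file1.nc", 0, 100)
manifest[0, 1] = ("file1.nc", 100, 100)
manifest[1, 0] = ("file2.nc", 0, 100)
manifest[1, 1] = ("file2.nc", 100, 100)
```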
This issue is just to track feature-completeness relative to kerchunk, exactly. Eventually we would like to use zarr v3 chunk manifests as the on-disk format instead. It would be great for that zarr format to have the same scalability advantages as writing kerchunk-specification parquet does. To that end a few people have suggested using a zarr array to store the references on-disk (see e.g. Ryan's comment here #33 (comment)). Until that's all available, writing to kerchunk parquet is still useful.
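For context, the chunk-manifest idea amounts to a document mapping each chunk key to a byte range in some other file, something like this hypothetical sketch (not a finalized zarr v3 format):

```python
# Hypothetical chunk manifest: chunk key -> byte range in an existing file.
# A zarr reader that understood this could fetch chunk "0.0" by reading
# bytes [0, 100) of file1.nc instead of a native zarr chunk.
manifest = {
    "0.0": {"path": "file1.nc", "offset": 0, "length": 100},
    "0.1": {"path": "file1.nc", "offset": 100, "length": 100},
    "1.0": {"path": "file2.nc", "offset": 0, "length": 100},
    "1.1": {"path": "file2.nc", "offset": 100, "length": 100},
}
```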
Ok yes on-disk vs in-memory. That is helpful to keep as a frame of reference.
Is it not in the scope of this library to implement zarr v3 chunk manifests and then upstream them into zarr proper?
Implementing chunk manifests as a V3 extension and upstreaming as much as possible is definitely the end goal, yes! But V3 etc. is not yet available, and this library was deliberately designed so that in the meantime we can still make the task of "kerchunking" complicated datasets easier, by writing out to kerchunk format. Support for writing to parquet is an optional but useful part of that latter aim.
Please speak to me about the specifics of what parquet can and can't do, and why kerchunk's parquet format was designed the way it is.
Hi @martindurant - I would welcome your input here. To be clear, are you suggesting we set up a call specifically to talk about parquet as a references format? Or just saying that you're happy to provide input on issues in general?
I can write it out, but probably a conversation is better. Firstly, I would say it would be a real shame if zarr were to depend on xarray, pandas or even arrow in the foreseeable future. Fastparquet is in the (slow) process of moving to pure numpy rather than interfacing with pandas.
I'm confused - as far as I know no-one is suggesting that zarr-python gain additional dependencies.
Happy to set up a call - I'll email you now.
There was talk in the related thread (I think) about xarray->pandas->parquet for manifests - unless I got confused.
That part was just a discussion about one idea for how best to implement writing kerchunk-formatted references to parquet within this VirtualiZarr package. It doesn't imply any change to zarr readers. (And this package already depends on xarray, which depends on pandas.)
I'm going to pick this one up. I'll be sure to reach out if I run into anything, Martin.
Any time :). The current kerchunk parquet format happened more or less organically, so there may well be a lot of room for improvement, and zarr may well yet win the day.
It would be great to add the ability to write kerchunk references to parquet files, not just to json.
This should be a nice self-contained feature for anyone who is interested in implementing it - it just goes in the xarray accessor. (@norlandrhagen ? 😁 )
For implementation I'm not very familiar with the options. I see there is fastparquet, but if we already have a complete in-memory ManifestArray object (which could well become a numpy structured array in #39), it looks like we could just write to parquet from a pandas dataframe?
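Something like this minimal sketch, say (column names are illustrative, not the actual kerchunk parquet schema):

```python
import pandas as pd

# Flatten the manifest into one row per chunk key
rows = [
    {"key": "var/0.0", "path": "file1.nc", "offset": 0, "length": 100},
    {"key": "var/0.1", "path": "file1.nc", "offset": 100, "length": 100},
]
df = pd.DataFrame(rows)
df.to_parquet("references.parquet")  # needs pyarrow or fastparquet installed
```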
In fact, could we even go from the `xr.Dataset` to the pandas dataframe directly using `Dataset.to_dataframe()`?? That would be super neat, but I don't understand the kerchunk parquet format well enough yet to know how easily that would work.
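As a rough illustration of that idea (hypothetical variable names; whether the result would match the kerchunk parquet layout is exactly the open question):

```python
import xarray as xr

# Suppose the byte-range references were stored as variables on a Dataset
ds = xr.Dataset(
    {
        "path": ("chunk", ["file1.nc", "file1.nc"]),
        "offset": ("chunk", [0, 100]),
        "length": ("chunk", [100, 100]),
    }
)
df = ds.to_dataframe()               # one row per chunk
df.to_parquet("references.parquet")  # needs pyarrow or fastparquet installed
```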