Writing to parquet (following kerchunk format) #72

Closed
TomNicholas opened this issue Apr 4, 2024 · 16 comments · Fixed by #110
Labels
enhancement New feature or request Kerchunk Relating to the kerchunk library / specification itself

Comments

@TomNicholas
Member

It would be great to add the ability to write kerchunk references to parquet files, not just to json.

This should be a nice self-contained feature for anyone who is interested in implementing it - it just goes in the xarray accessor. (@norlandrhagen ? 😁 )

For implementation I'm not very familiar with the options. I see there is fastparquet, but if we already have an in-memory complete ManifestArray object (which could well become a numpy structured array in #39), it looks like we could just write to parquet from a pandas dataframe?

In fact, could we even go from the xr.Dataset to the pandas dataframe directly using Dataset.to_dataframe()?? That would be super neat, but I don't understand the kerchunk parquet format well enough yet to know how easily that would work.

@norlandrhagen
Collaborator

I'll see if I can find some time!

@alxmrs

alxmrs commented Apr 10, 2024

I may be interested, too. This seems similar to alxmrs/xarray-sql#4

Tom, can you help me understand this feature better? I’m not sure how kerchunk json references work today.

Xarray also has a to_dask_dataframe() method.

@TomNicholas
Member Author

TomNicholas commented Apr 10, 2024

Nice to see you here @alxmrs !

> This seems similar to alxmrs/xarray-sql#4

I don't think this is the same thing. IIUC that issue suggests taking some existing data on disk in a zarr store, and creating a virtual parquet file (or in-memory equivalent) to index into the zarr data on disk. So the user sees parquet. Whereas the feature in this issue only uses parquet as a means to an end: parquet is neither the original data format nor the format in which the data is presented to the user.

> Tom, can you help me understand this feature better? I’m not sure how kerchunk json references work today.

The kerchunk library can read netCDF files on disk and create an in-memory representation of the byte ranges into those files which you would need to read to fetch particular chunks. Given that representation (the "references"), fsspec can read the actual bytes from the original files, as if it were reading from a zarr store. You can cache those references to disk as a json file (in the kerchunk format), and fsspec can understand that too.
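
For anyone following along, here is a minimal stdlib-only sketch of what such references look like and how a reader resolves them. It assumes the kerchunk version 1 JSON layout, where each zarr key maps either to an inline string or to a `[path, offset, length]` triple. The file `data.nc`, the variable name `air`, and the helper `read_chunk` are made up for illustration; a real reader would go through fsspec rather than this hand-rolled function.

```python
import json
import os
import tempfile

# Stand-in for a netCDF file: 16 bytes holding two 8-byte "chunks".
tmpdir = tempfile.mkdtemp()
data_path = os.path.join(tmpdir, "data.nc")
with open(data_path, "wb") as f:
    f.write(b"AAAAAAAABBBBBBBB")

# Kerchunk-style references: each zarr key maps either to inline
# content (a string) or to [path, offset, length] in the original file.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "air/0": [data_path, 0, 8],
        "air/1": [data_path, 8, 8],
    },
}


def read_chunk(references, key):
    """Resolve one zarr key to bytes (hypothetical helper, not a kerchunk API)."""
    entry = references["refs"][key]
    if isinstance(entry, str):
        # Inline content is stored directly in the references.
        return entry.encode()
    path, offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Cache the references to disk as JSON, then load them back and use them.
refs_json = os.path.join(tmpdir, "refs.json")
with open(refs_json, "w") as f:
    json.dump(refs, f)
with open(refs_json) as f:
    cached = json.load(f)

chunk = read_chunk(cached, "air/1")
```

The point of the sketch is that the JSON file holds only pointers: the chunk bytes never leave the original file.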

If you have a massive amount of data, caching as json might become a bottleneck, which is why kerchunk also implements caching the byte range references to disk as a parquet file (and fsspec can read that).

All of this is still a way of reading array data in zarr-like form. The original data is in netCDF, not parquet, and the data is read like zarr, not like parquet. Only the list of byte ranges for each zarr chunk happens to be saved to disk as a parquet file, the format of which is essentially an implementation detail.
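
To make the "implementation detail" concrete, here is a rough stdlib-only sketch of turning key-value references for one variable into a columnar layout, with one row per chunk. The column names `path`, `offset`, `size` and `raw` reflect my understanding of kerchunk's parquet layout and should be treated as an assumption, as should the helper name `refs_to_columns`. The real writer also splits rows into fixed-size record files, which is not shown here, and the actual parquet write is left out.

```python
def refs_to_columns(refs, variable, n_chunks):
    """Lay out one variable's references as equal-length columns.

    Row i describes chunk i. Hypothetical helper for illustration only.
    """
    columns = {"path": [], "offset": [], "size": [], "raw": []}
    for i in range(n_chunks):
        entry = refs.get(f"{variable}/{i}")
        # Defaults describe a missing chunk: no path and no inline bytes.
        path, offset, size, raw = None, 0, 0, None
        if isinstance(entry, str):
            # Inline content goes in the "raw" column.
            raw = entry.encode()
        elif entry is not None:
            # A [path, offset, length] triple pointing into the original file.
            path, offset, size = entry
        columns["path"].append(path)
        columns["offset"].append(offset)
        columns["size"].append(size)
        columns["raw"].append(raw)
    return columns


refs = {
    "air/0": ["s3://bucket/a.nc", 4096, 1200],
    "air/1": "inline-bytes",
    "air/2": ["s3://bucket/b.nc", 4096, 1180],
    # "air/3" is deliberately absent to show a missing chunk.
}
columns = refs_to_columns(refs, "air", n_chunks=4)
```

With pandas and a parquet engine installed, something like `pandas.DataFrame(columns).to_parquet(...)` could then write these columns out, though whether that matches kerchunk's files exactly is not something this sketch checks.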

@alxmrs

alxmrs commented Apr 16, 2024

Pardon my confusion; I was excited at the chance of an overlap. It's been great to follow along with this project so far!

Thanks for the explanation and the link. TIL that kerchunk has parquet references in the first place.

Raphael, I bet you're closer to this issue than I am.

@jsignell
Contributor

jsignell commented May 3, 2024

I'm admittedly still getting my bearings, but my understanding was that the objective of this library is to store the references in an array format rather than a key, value store. To me that sounds like you would want to store the references themselves in zarr.

I'm trying to understand how writing to parquet fits into this vision. Is this part of a goal to attain feature-completeness relative to kerchunk? Or am I missing something?

@TomNicholas
Member Author

> I'm admittedly still getting my bearings, but my understanding was that the objective of this library is to store the references in an array format rather than a key, value store. To me that sounds like you would want to store the references themselves in zarr.

We need to distinguish between in-memory representation of references and on-disk storage. For the in-memory representation the aim (which has been achieved) is to use arrays, i.e. the ManifestArray class. But this is a separate question from the on-disk storage format for the references.
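
As a rough illustration of the in-memory side, the sketch below turns a key-value manifest into three grid-shaped arrays (paths, offsets, lengths), one element per chunk. The entry layout (keys like `"0.1"`, fields `path`, `offset`, `length`) and the function `manifest_to_grids` are assumptions made for this example, not ManifestArray's actual internals, and nested lists stand in for real arrays.

```python
def manifest_to_grids(entries, shape):
    """Convert {"i.j": {"path", "offset", "length"}} into 2D grids.

    Hypothetical helper: each returned grid has the shape of the chunk grid.
    """
    rows, cols = shape
    paths = [[""] * cols for _ in range(rows)]
    offsets = [[0] * cols for _ in range(rows)]
    lengths = [[0] * cols for _ in range(rows)]
    for key, entry in entries.items():
        # The key encodes the chunk's position in the chunk grid.
        i, j = (int(part) for part in key.split("."))
        paths[i][j] = entry["path"]
        offsets[i][j] = entry["offset"]
        lengths[i][j] = entry["length"]
    return paths, offsets, lengths


entries = {
    "0.0": {"path": "a.nc", "offset": 100, "length": 50},
    "0.1": {"path": "a.nc", "offset": 150, "length": 50},
    "1.0": {"path": "b.nc", "offset": 100, "length": 50},
    "1.1": {"path": "b.nc", "offset": 150, "length": 50},
}
paths, offsets, lengths = manifest_to_grids(entries, shape=(2, 2))
```

Once the references are array-shaped like this, operations such as concatenating two datasets become array operations on the grids, independent of how the references are later serialized to disk.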

> I'm trying to understand how writing to parquet fits into this vision. Is this part of a goal to attain feature-completeness relative to kerchunk? Or am I missing something?

This issue is just to track feature-completeness relative to kerchunk, exactly. Eventually we would like to use zarr v3 chunk manifests as the on-disk format instead. It would be great for that zarr format to have the same scalability advantages as writing kerchunk-specification parquet does. To that end a few people have suggested using a zarr array to store the references on-disk (see e.g. Ryan's comment here #33 (comment)). Until that's all available, writing to kerchunk parquet is still useful.

@jsignell
Contributor

jsignell commented May 3, 2024

Ok yes on-disk vs in-memory. That is helpful to keep as a frame of reference.

> Eventually we would like to use zarr v3 chunk manifests as the on-disk format instead. It would be great for that zarr format to have the same scalability advantages as writing kerchunk-specification parquet does.

Is it not in the scope of this library to implement zarr v3 chunk manifests and then upstream them into zarr proper?

@TomNicholas
Member Author

TomNicholas commented May 3, 2024

Implementing chunk manifests as a V3 extension and upstreaming as much as possible is definitely the end goal, yes! But V3 etc. is not yet available, and this library was deliberately designed so that in the meantime we can still make the task of "kerchunking" complicated datasets easier, by writing out to kerchunk format. Support for writing to parquet is an (optional but useful) part of that latter aim.

@martindurant
Member

Please speak to me about the specifics of what parquet can and can't do, and why kerchunk's parquet format was designed the way it is.

@TomNicholas
Member Author

Hi @martindurant - I would welcome your input here. To be clear, are you suggesting we set up a call specifically to talk about parquet as a references format? Or just saying that you're happy to provide input on issues in general?

@martindurant
Member

I can write it out, but probably a conversation is better.

Firstly, I would say it would be a real shame if zarr were to depend on xarray, pandas or even arrow in the foreseeable future. Fastparquet is in the (slow) process of moving to pure-numpy rather than interfacing with pandas.

@TomNicholas
Member Author

TomNicholas commented May 6, 2024

> Firstly, I would say it would be a real shame if zarr were to depend on xarray, pandas or even arrow in the foreseeable future.

I'm confused - as far as I know no-one is suggesting that zarr-python gain additional dependencies.

> I can write it out, but probably a conversation is better.

Happy to set up a call - I'll email you now.

@martindurant
Member

There was talk in the related thread (I think) about xarray->pandas->parquet for manifests - unless I got confused.

@TomNicholas
Member Author

TomNicholas commented May 6, 2024

> There was talk in the related thread (I think) about xarray->pandas->parquet for manifests - unless I got confused.

That part was just a discussion about one idea for how best to implement writing kerchunk-formatted references to parquet within this VirtualiZarr package. It doesn't imply any change to zarr readers. (And this package already depends on xarray, which depends on pandas.)

@jsignell
Contributor

I'm going to pick this one up. I'll be sure to reach out if I run into anything, Martin.

@martindurant
Member

Any time :). The current kerchunk parquet format happened more or less organically, so there may well be a lot of room for improvement, and zarr may well yet win the day.
