Writing to parquet (following kerchunk format) #72

TomNicholas · 2024-04-04T18:58:33Z

It would be great to add the ability to write kerchunk references to parquet files, not just to json.

This should be a nice self-contained feature for anyone who is interested in implementing it - it just goes in the xarray accessor. (@norlandrhagen ? 😁 )

I'm not very familiar with the implementation options. I see there is fastparquet, but if we already have a complete in-memory ManifestArray object (which could well become a numpy structured array in #39), it looks like we could just write to parquet from a pandas dataframe?

In fact, could we even go from the xr.Dataset to the pandas dataframe directly using Dataset.to_dataframe()?? That would be super neat, but I don't understand the kerchunk parquet format well enough yet to know how easily that would work.

norlandrhagen · 2024-04-04T19:04:36Z

I'll see if I can find some time!

alxmrs · 2024-04-10T17:42:59Z

I may be interested, too. This seems similar to alxmrs/xarray-sql#4

Tom, can you help me understand this feature better? I’m not sure how kerchunk json references work today.

Xarray also has a to_dask_dataframe() method.

TomNicholas · 2024-04-10T18:46:14Z

Nice to see you here @alxmrs !

> This seems similar to alxmrs/xarray-sql#4

I don't think this is the same thing. IIUC that issue suggests taking some existing data on disk in a zarr store, and creating a virtual parquet file (or in-memory equivalent) to index into the zarr data on disk. So the user sees parquet. Whereas the feature in this issue only uses parquet as a means to an end: neither as the original data format nor as the format in which the data is presented to the user.

> Tom, can you help me understand this feature better? I’m not sure how kerchunk json references work today.

The kerchunk library can read netCDF files on disk and create an in-memory representation of the byte ranges within those files that you would need to read to fetch particular chunks. Given that representation (the "references"), fsspec can read the actual bytes from the original files, as if it were reading from a zarr store. You can cache those references to disk as a json file (in the kerchunk format), and fsspec can understand that too.
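To make the idea of "references" concrete, here is a minimal standard-library sketch. It is not kerchunk's or fsspec's code: the file name `data.nc`, the variable name `temp`, and the helper `read_chunk` are hypothetical, the "netCDF file" is just 16 dummy bytes, and the zarr array metadata is omitted for brevity. It only illustrates the shape of a version 1 reference set, where each chunk key maps to `[path, offset, length]`, and how following such a reference fetches the chunk's bytes.

```python
import json
import os
import tempfile

# Stand-in for an original data file: 16 dummy bytes treated as two 8-byte chunks.
payload = bytes(range(16))
tmpdir = tempfile.mkdtemp()
data_path = os.path.join(tmpdir, "data.nc")  # hypothetical file name
with open(data_path, "wb") as f:
    f.write(payload)

# Kerchunk-style reference set: metadata keys hold inline strings,
# chunk keys hold [path, offset, length] byte-range references.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/0": [data_path, 0, 8],
        "temp/1": [data_path, 8, 8],
    },
}


def read_chunk(reference_set, key):
    """Fetch the raw bytes of one chunk by following its byte-range reference."""
    path, offset, length = reference_set["refs"][key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Cache the references to disk as json, then load them back.
refs_path = os.path.join(tmpdir, "refs.json")
with open(refs_path, "w") as f:
    json.dump(refs, f)
with open(refs_path) as f:
    cached = json.load(f)

# The cached references still point at the right bytes in the original file.
second_chunk = read_chunk(cached, "temp/1")
```

In practice fsspec's reference filesystem plays the role of `read_chunk`, and the bytes it returns would still need decoding according to the zarr metadata.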

If you have a massive amount of data, caching as json might become a bottleneck, which is why kerchunk also implements caching the byte range references to disk as a parquet file (and fsspec can read that).
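As a rough illustration of why a tabular layout scales better, the sketch below lays out one variable's references as columns, one row per chunk. This is an assumption-laden toy, not kerchunk's writer: the column names, the URLs, the variable name `temp`, and the helpers `to_columns` and `lookup` are all hypothetical, and no parquet file is actually written. The point is that once rows are in a fixed chunk order, the chunk key no longer needs to be stored as a string, because the row position encodes it.

```python
# Hypothetical references for one 2x2-chunk variable: chunk key -> [path, offset, length].
refs = {
    "temp/0.0": ["s3://bucket/a.nc", 100, 40],
    "temp/0.1": ["s3://bucket/a.nc", 140, 40],
    "temp/1.0": ["s3://bucket/b.nc", 100, 40],
    "temp/1.1": ["s3://bucket/b.nc", 140, 40],
}


def to_columns(references, variable, chunk_grid):
    """Lay out one variable's references as columns, one row per chunk, in row-major order."""
    n_rows, n_cols = chunk_grid
    columns = {"path": [], "offset": [], "size": []}
    for i in range(n_rows):
        for j in range(n_cols):
            path, offset, size = references[f"{variable}/{i}.{j}"]
            columns["path"].append(path)
            columns["offset"].append(offset)
            columns["size"].append(size)
    return columns


def lookup(columns, chunk_grid, i, j):
    """Recover the reference for chunk (i, j) from its row position alone."""
    row = i * chunk_grid[1] + j
    return columns["path"][row], columns["offset"][row], columns["size"][row]


columns = to_columns(refs, "temp", (2, 2))
ref = lookup(columns, (2, 2), 1, 0)
```

Columns like these are what a dataframe or parquet library could then write to disk in a compressed, typed form.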

All of this is still a way of reading array data in zarr-like form. The original data is in netCDF, not parquet, and the data is read like zarr, not like parquet. Only the list of byte ranges for each zarr chunk happens to be saved to disk as a parquet file, and that file's format is essentially an implementation detail.

alxmrs · 2024-04-16T11:59:01Z

Pardon my confusion; I was excited at the chance of an overlap. It's been great to follow along with this project so far!

Thanks for the explanation and the link. TIL that kerchunk has parquet references in the first place.

Raphael, I bet you're closer to this issue than I am.

jsignell · 2024-05-03T18:06:37Z

I'm admittedly still getting my bearings, but my understanding was that the objective of this library is to store the references in an array format rather than a key-value store. To me that sounds like you would want to store the references themselves in zarr.

I'm trying to understand how writing to parquet fits into this vision. Is this part of a goal to attain feature-completeness relative to kerchunk? Or am I missing something?

TomNicholas · 2024-05-03T19:02:41Z

> I'm admittedly still getting my bearings, but my understanding was that the objective of this library is to store the references in an array format rather than a key-value store. To me that sounds like you would want to store the references themselves in zarr.

We need to distinguish between in-memory representation of references and on-disk storage. For the in-memory representation the aim (which has been achieved) is to use arrays, i.e. the ManifestArray class. But this is a separate question from the on-disk storage format for the references.

> I'm trying to understand how writing to parquet fits into this vision. Is this part of a goal to attain feature-completeness relative to kerchunk? Or am I missing something?

Exactly: this issue is just to track feature-completeness relative to kerchunk. Eventually we would like to use zarr v3 chunk manifests as the on-disk format instead. It would be great for that zarr format to have the same scalability advantages as writing kerchunk-specification parquet does. To that end a few people have suggested using a zarr array to store the references on disk (see e.g. Ryan's comment here #33 (comment)). Until that's all available, writing to kerchunk parquet is still useful.

jsignell · 2024-05-03T20:12:03Z

Ok yes on-disk vs in-memory. That is helpful to keep as a frame of reference.

> Eventually we would like to use zarr v3 chunk manifests as the on-disk format instead. It would be great for that zarr format to have the same scalability advantages as writing kerchunk-specification parquet does.

Is it not in the scope of this library to implement zarr v3 chunk manifests and then upstream them into zarr proper?

TomNicholas · 2024-05-03T21:25:12Z

Implementing chunk manifests as a V3 extension and upstreaming as much as possible is definitely the end goal, yes! But V3 etc. is not yet available, and this library was deliberately designed so that in the meantime we can still make the task of "kerchunking" complicated datasets easier, by writing out to kerchunk format. Support for writing to parquet is an (optional but useful) part of that latter aim.

martindurant · 2024-05-06T19:15:47Z

Please speak to me about the specifics of what parquet can and can't do, and why kerchunk's parquet format was designed the way it is.

TomNicholas · 2024-05-06T19:23:14Z

Hi @martindurant - I would welcome your input here. To be clear, are you suggesting we set up a call specifically to talk about parquet as a references format? Or just saying that you're happy to provide input on issues in general?

martindurant · 2024-05-06T19:25:16Z

I can write it out, but probably a conversation is better.

Firstly, I would say it would be a real shame if zarr were to depend on xarray, pandas or even arrow in the foreseeable future. Fastparquet is in the (slow) process of moving to pure-numpy rather than interfacing with pandas.

TomNicholas · 2024-05-06T19:27:52Z

> Firstly, I would say it would be a real shame if zarr were to depend on xarray, pandas or even arrow in the foreseeable future.

I'm confused - as far as I know no-one is suggesting that zarr-python gain additional dependencies.

> I can write it out, but probably a conversation is better.

Happy to set up a call - I'll email you now.

martindurant · 2024-05-06T19:29:26Z

There was talk in the related thread (I think) about xarray->pandas->parquet for manifests - unless I got confused.

TomNicholas · 2024-05-06T19:33:34Z

> There was talk in the related thread (I think) about xarray->pandas->parquet for manifests - unless I got confused.

That part was just a discussion about one idea for how best to implement writing kerchunk-formatted references to parquet within this VirtualiZarr package. It doesn't imply any change to zarr readers. (And this package already depends on xarray, which depends on pandas.)

jsignell · 2024-05-13T16:57:44Z

I'm going to pick this one up. I'll be sure to reach out if I run into anything, Martin.

martindurant · 2024-05-13T16:59:46Z

Any time :). The current kerchunk parquet format happened more or less organically, so there may well be a lot of room for improvement, and zarr may well yet win the day.

TomNicholas added the enhancement (New feature or request) and Kerchunk (Relating to the kerchunk library / specification itself) labels Apr 4, 2024

TomNicholas mentioned this issue Apr 4, 2024

Initial release checklist #2

Closed

15 tasks

TomNicholas mentioned this issue May 2, 2024

Reading from dmrcp index files? #85

Closed

jsignell mentioned this issue May 13, 2024

Write to parquet #110

Merged

TomNicholas closed this as completed in #110 May 15, 2024

TomNicholas mentioned this issue May 15, 2024

Error writing to parquet using kerchunk #115

Closed
