Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write to parquet #110

Merged
merged 7 commits into from
May 15, 2024
Merged

Write to parquet #110

merged 7 commits into from
May 15, 2024

Conversation

jsignell
Copy link
Contributor

@jsignell jsignell commented May 13, 2024

Closes #72

I'm pretty sure this is not what you had in mind, but wanted to put this up as at least an option. Even if we scrap the implementation, the test should stay the same.

I didn't include remote writing since it seemed like that would be something that we'd want to implement for all the file outputs not just for one.

1: "foo.nc",
},
"size": {0: 100, 1: 100},
"raw": {0: None, 1: None},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "raw" mean in this specification?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it means "inlined" in the kerchunk library tests this ref: "a/6": b"data", results in a non-null "raw"

https://github.com/fsspec/kerchunk/blob/a5eae4d45f7f601346d0fdc58c0864911330c5d6/kerchunk/tests/test_df.py#L62

Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jsignell ! That was quick 😁

I'm pretty sure this is not what you had in mind, but wanted to put this up as at least an option. Even if we scrap the implementation, the test should stay the same.

Agree that's useful already! I think this implementation is useful to start with, and tests ensure we can change it later without breaking behaviour.

I was wondering if there is a more performant way to go from ManifestArray to parquet, but that implementation will depend on how the references are stored in-memory, so should wait until #107 is merged.

I didn't include remote writing since it seemed like that would be something that we'd want to implement for all the file outputs not just for one.

👍

virtualizarr/xarray.py Outdated Show resolved Hide resolved
virtualizarr/xarray.py Show resolved Hide resolved
@TomNicholas TomNicholas added the references formats Storing byte range info on disk label May 13, 2024
@TomNicholas
Copy link
Member

@jsignell to merge this we should have some docs (on the usage page).

It also might be worth checking that round-tripping works - see the in-progress PR #42.

@jsignell
Copy link
Contributor Author

I just stole the easiest test off that PR for roundtripping. See what you think

@TomNicholas
Copy link
Member

This looks good to me! Let's get it in and iterate if necessary!

Thanks @jsignell !

@TomNicholas TomNicholas merged commit f9ca667 into zarr-developers:main May 15, 2024
5 checks passed
@jsignell jsignell deleted the parquet branch May 15, 2024 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
references formats Storing byte range info on disk
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Writing to parquet (following kerchunk format)
2 participants