Zarrquet: Create a reference-based filesystem to read virtual Parquet from Zarr. #4
cc: @cisaacstern
I’ve been doing some background reading on parquet. Here are some useful things I’ve learned:
The thrift definition is especially helpful to read. https://github.com/dask/fastparquet/blob/main/fastparquet/parquet.thrift Some initial thoughts:
Reading the Data Page V1 algorithm in these sources (https://github.com/dask/fastparquet/blob/main/fastparquet/core.py), I think most Zarr data would map to Parquet's PLAIN encoding.
I've just started chatting with @martindurant in a side channel (Martin, I'm happy to bring the discussion here). He has a few thoughts on the matter:
I'm really happy to have a vote of confidence on this approach from a domain expert!
I've discussed how to surface Zarrquet to databases with Martin a bit more. Here's the best path forward from the discussion:
So, hi all, let me jump in here, because I think we are going in the same direction... In Zarrquet we would store a chunk as a row and bands as columns, with the xarray data encoded. So the Parquet schema is sort of Zarr-compliant, and the contents of the band columns are Zarr arrays. Or, in the virtual approach, you are thinking of creating a "virtual" Parquet file with the contents of a GeoZarr. Then, using a BigQuery external table (virtual Parquet tables), you would end up with a virtual BigQuery table connected to a virtual Parquet file that is backed by GeoZarr stored in Cloud Storage?
Yes, this is totally doable, to make a python file-like where the buffers are fetched from zarr chunks, and the parquet thrift metadata are made dynamically at runtime (or, optionally, stored elsewhere, which might be the easier way to start).
How does one persuade bigquery to read from a python file? We can get pyarrow or even duckdb to do this, but ...
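The file-like idea might be sketched as follows, assuming that in a real implementation each buffer would be fetched lazily from a Zarr chunk and one buffer would hold the runtime-generated thrift footer; here the buffers are plain in-memory bytes, and the class name `VirtualFile` is made up for illustration:

```python
import io

class VirtualFile(io.RawIOBase):
    """A minimal read-only file-like view over a list of byte buffers.

    In a real Zarrquet, each buffer would be fetched lazily from a Zarr
    chunk in object storage, and the Parquet thrift footer would be one
    of the buffers, generated at runtime or loaded from a side file.
    This sketch just stitches in-memory buffers together.
    """

    def __init__(self, buffers):
        self._buffers = buffers
        self._offsets = []  # start offset of each buffer in the virtual file
        pos = 0
        for buf in buffers:
            self._offsets.append(pos)
            pos += len(buf)
        self._size = pos
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def read(self, n=-1):
        if n < 0:
            n = self._size - self._pos
        out = bytearray()
        while n > 0 and self._pos < self._size:
            # Find the buffer containing the current position and copy
            # as much of it as the request allows.
            for start, buf in zip(self._offsets, self._buffers):
                if start <= self._pos < start + len(buf):
                    local = self._pos - start
                    piece = buf[local:local + n]
                    out += piece
                    self._pos += len(piece)
                    n -= len(piece)
                    break
        return bytes(out)
```

An object like this could in principle be handed to `pyarrow.parquet.ParquetFile`, which only needs `read`/`seek` on its input.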
Meaning simple things like gzip/zstd compression?
Maybe you are thinking of hosting an HTTP server which pipes through virtual parquet bytes?
Yes, this is the end game.
Yes — and the big unknown for me is the cloud architecture question. How do you build a proxy GCS bucket? I was imagining an elastic load balancer would play a key role.
I am thinking less virtual, in a way; what about the other, non-dynamic approach: as a static file?
This idea has a lot of merit. We did something similar over here: https://github.com/google/weather-tools/blob/main/weather_mv/README.md The best place for the static approach (copying Zarr into Parquet) would be as an output sink. In fact, this repo was inspired by this weather tool. The goal here is to find a way to provide a similar experience to the one it offers.
Wait a second! I re-read this, and I think I may have misunderstood what you meant. Are you saying that we can accomplish the goal of no-copy Zarr/Parquet with a static file? I have begun thinking this is possible! It looks like there is room in Parquet's thrift format to point to Zarr array blobs as the backing data. I have some notes about this around here: #4 (comment)
You can store the thrift-serialised objects in file(s) rather than make them on the fly; but no, parquet does not let you point to binary buffers in some other file. It does in theory allow each "column chunk" (thrift header + repetition/definition levels + buffer) to be in a different file, but in practice every parquet file except …
Also, I was chatting about this yesterday: it is totally possible to store N-D arrays in parquet despite its columnar nature: a 3D array would be list(list(number)), and the required rep/def levels would be easy to generate, as they would be regularly repeating. In V2 pages, the buffer is compressed separately from the rep/defs, so streaming data from chunks would work fine. The question is, do any parquet frameworks expect this sort of data?
Also worth noting (sorry for the spamming): fastparquet allows you to directly instantiate and serialise thrift objects. I don't believe arrow has any such ability. python-thrift also allows this, but is much slower.
I am a bit lost here, since I thought we would be able to store the arrays as arrow. Not the same, but geoarrow is doing this for vector data. I assumed using nested arrays in arrow would be no problem.
What do you mean by "arrow" here? Parquet is not arrow; feather2 essentially is.
Yes, the parquet format can store nested data, which would include geo structures (like list of points -> polygon, for each row). But this is presumably not what you have in zarr, right? Zarr does not have a concept of variable-length structures at each array position, except for string, JSON and pickle encoding (i.e., python objects, not what you want). So: please show me some example zarr data you'd like to view, and I'll see if I can make a python file-like to hand to pyarrow. But I'll be using fastparquet to make the thrift, since I don't believe there's any other way.
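For reference, a dependency-free sketch of what such example data looks like on disk, assuming a Zarr-v2-style layout (C-order chunks, each compressed independently, JSON metadata; real stores keep a `.zarray` file next to keys like `0.0`). The point is that every array position holds exactly one fixed-width value, with no variable-length structure anywhere:

```python
import json
import zlib

import numpy as np

# Toy in-memory "store": a dict mapping keys to raw bytes, mimicking a
# Zarr-v2 directory store for a (4, 6) float64 array in (2, 3) chunks.
data = np.arange(24, dtype="<f8").reshape(4, 6)
chunks = (2, 3)

store = {".zarray": json.dumps({
    "shape": list(data.shape), "chunks": list(chunks),
    "dtype": "<f8", "order": "C"}).encode()}

# Each chunk is just its C-order bytes, compressed independently.
for i in range(0, 4, 2):
    for j in range(0, 6, 3):
        key = f"{i // 2}.{j // 3}"
        store[key] = zlib.compress(data[i:i + 2, j:j + 3].tobytes())

# Decoding a chunk recovers a fixed-shape block of float64 values --
# no variable-length structure, hence no need for nested Parquet types
# to represent the data itself.
chunk00 = np.frombuffer(zlib.decompress(store["0.0"]), "<f8").reshape(2, 3)
```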
Thanks Martin. That saves me from having to discover this first hand! Do you know of any format that would be a good fit, that allows for extra file storage? Feather, Avro, ORC, maybe?
Yes, this is really interesting! I agree, though, I don't see a use case for it for me in the near term.
In a larger context, I see this work as largely experimental. It very well may not pan out as I find more information! Take these plans with a grain of salt.
Sorry, no.
Given zarr-developers/zarr-specs#287, it looks like future versions of Zarr will allow an fsspec interface beyond Python. It seems like a proxy server won't be required.
Pathway to Postgres integration: https://github.com/paradedb/paradedb/tree/dev/pg_lakehouse#overview
You want Postgres to query Zarr data chunks through a dynamic, ephemeral Parquet intermediate representation? Actually, teaching DuckDB about Zarr might be simpler :) .
@TomNicholas I just saw your tweet related to “headless data” and Pangeo. https://x.com/TEGNicholasCode/status/1813299299201294583 I wanted to reply: I think something like “Zarrquet” could be a good fit for the SQL interoperability box for pangeo. |
https://arrow.apache.org/docs/python/filesystems.html
As with Kerchunk, it should be possible to create a virtual Parquet dataset from an existing Zarr store. Here, it would also provide a table-like view of a dense array, like the rest of qarray.
Ideally, this could be persisted in a file system without creating extra space (just metadata). Then, one could use tools like Pandas, DuckDB, Apache Arrow, etc., to query raster data as tables, maybe even outside of Python.
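A minimal illustration of the table-like view, assuming pandas and made-up `lat`/`lon` coordinates: a dense 2-D band flattens to one row per cell.

```python
import numpy as np
import pandas as pd

# A dense 2-D "raster" band with coordinate labels, and the table view of
# it that a SQL engine would see: one row per cell, coordinates as columns.
band = np.arange(6, dtype="f8").reshape(2, 3)
lat = [10.0, 20.0]
lon = [100.0, 110.0, 120.0]

idx = pd.MultiIndex.from_product([lat, lon], names=["lat", "lon"])
table = pd.DataFrame({"band": band.ravel()}, index=idx).reset_index()
# table has columns lat, lon, band with 2 * 3 = 6 rows.
```

In the virtual scheme, only this metadata-level mapping would be persisted; the cell values themselves would stay in the Zarr chunks.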
My bet here is that the LOC to prototype this will be small. Getting it performant is another matter. I wonder if this would be a justification for BigQuery or another similarly high-powered query engine? BQ reads Parquet!