Zarrquet: Create a referenced based filesystem to read virtual parquet from Zarr. #4

alxmrs · 2024-01-23T07:56:57Z

https://arrow.apache.org/docs/python/filesystems.html

Like Kerchunk, it should be possible to create a virtual parquet dataset from an existing Zarr store. Here, it would also provide a table-like view of a dense array — like the rest of qarray.

Ideally, this could be persisted in a file system without creating extra space (just metadata). Then, one could use tools like Pandas, DuckDB, Apache Arrow, etc., to query raster data as tables, maybe even outside of Python.

My bet here is that the LOC to prototype this is going to be small. Getting it performant is another matter. I wonder if this would be a justification for BigQuery or other similarly high powered query engine? BQ reads parquet!

alxmrs · 2024-01-23T07:57:13Z

cc: @cisaacstern

alxmrs · 2024-03-17T08:18:52Z

I’ve been doing some background reading on parquet. Here are some useful things I’ve learned:

Arrow is the in memory version of the file format.
The project is 11 or so years old and comes from the Hadoop ecosystem (i.e. is well battle tested)
A core feature is column oriented chunks. Like, the format groups rows together abstractly but puts each column of a set of rows in a logical, if not physical grouping (column chunks can be one or more file).
- It seems like Zarr chunks would map on to parquet well IFF they can be interpreted as column chunks. If I have the right mental model, I think unraveled Zarr would be a compatible with parquet.
Martin’s fastparquet is a good reference to understand the file format nitty-gritty: https://github.com/dask/fastparquet

alxmrs · 2024-03-17T09:03:59Z

The thrift definition is especially helpful to read.

https://github.com/dask/fastparquet/blob/main/fastparquet/parquet.thrift

Some initial thoughts:

is it possible that a Zarr chunk in a bucket could be the filepath of a column chunk?
If we can make calculating the metadata optional, maybe we could get slower, but readable zarrquet.

alxmrs · 2024-03-17T09:25:24Z

Reading the Data Page V1 algorithm in these sources (https://github.com/dask/fastparquet/blob/main/fastparquet/core.py), I think most Zarr would be represented in a Plain Thrift encoding.

alxmrs · 2024-03-18T05:20:22Z

I've just started chatting with @martindurant in a side channel (Martin, I'm happy to bring the discussion here). He has a few thoughts on the matter:

First, he's been interested in this kind of feature for a long time (Contributing a CSV module [RE: dask dataframe read_csv] fsspec/kerchunk#66). This doesn't surprise me at all :)
Feather is probably a better fit for mapping to Zarr than parquet, but parquet is more ubiquitous for representing tables
It is quite possible to represent Zarr as parquet, say, without nested columns or other corner case features.

I'm really happy to have a vote of confidence on this approach from a domain expert!

alxmrs · 2024-03-27T05:22:13Z

I've discussed how to surface Zarrquet to databases with Martin a bit more. Here's the best path forward from the discussion:

Create a PyArrow or fsspec FileSystem to read Zarrquet in Python
- Test it on DuckDB's Python interface
For all other DBs, we'll use this Python client as an adaptor to a Parquet file interface in a proxy server.
- TBD if this would only act as a local server or something friendly to distributed reads.
- A good examples to follow (thanks Martin): https://github.com/bluesky/tiled, WebHDFS.

jatorre · 2024-04-09T14:57:36Z

So, hi all, let me jump in here, because I think we are going on the same direction...

In Zarrquet we will store a chunk as a row, and bands as columns with xarrays data encoded. So the schema on Parquet is kind of Zarr compliance, and the content of the bands is zarrays.

Or in a virtual way, you are thinking on creating a "virtual" parquet file with the contents of a geozarr. And then BigQuery using a external table (parquet virtual tables), you end up creating a Virtual BigQuery table that is a connected to a Virtual Parquet File that is backed up by Geozarr stored in Cloud Storage?

martindurant · 2024-04-09T15:12:37Z

Yes, this is totally doable, to make a python file-like where the buffers are fetched from zarr chunks, and the parquet thrift metadata are made dynamically at runtime (or, optionally, stored elsewhere, which might be the easier way to start).

BigQuery using a external table

How does one persuade bigquery to read from a python file? We can get pyarrow or even duckdb to do this, but ...

xarrays data encoded.

Meaning simple things like gzip/zstd compression?

martindurant · 2024-04-09T15:14:22Z

Maybe you are thinking to host an HTTP server which pipes through virtual parquet bytes?

alxmrs · 2024-04-09T15:46:27Z

you end up creating a Virtual BigQuery table that is a connected to a Virtual Parquet File that is backed up by Geozarr stored in Cloud Storage?

Yes, this is the end game.

Maybe you are thinking to host an HTTP server which pipes through virtual parquet bytes?

Yes — and the big unknown for me is the cloud architecture question. How do you build a proxy GCS bucket? I was imagining an elastic load balancer would play a key role.

jatorre · 2024-04-09T17:15:07Z

I am thinking less virtual in a way, what about the other no dynamic way.

In Zarrquet we will store a chunk as a row, and bands as columns with xarrays data encoded. So the schema on Parquet is kind of Zarr compliance, and the content of the bands is zarrays.

As a static file.

alxmrs · 2024-04-10T04:04:46Z

This idea has a lot of merit. We did something similar over here: https://github.com/google/weather-tools/blob/main/weather_mv/README.md

The best place for the static approach (copying Zarr into Parquet) would be as an output sink for weather-mv.

In fact, this repo was inspired by this weather tool. The goal here is to find a way to provide a similar experience to the one offered by weather-mv bq without doubling storage or waiting for a large ETL job to finish.

alxmrs · 2024-04-10T04:10:54Z

So the schema on Parquet is kind of Zarr compliance, and the content of the bands is zarrays.

Wait a second! I re-read this — I think I may have misunderstood what you meant. Are you saying that we can accomplish the goal of no-copy Zarr/Parquet with a static file?

I have began thinking this is possible! It looks like there is room in Parquet’s thrift format to point to Zarr array blobs as the backing data.

I have some notes about this around here: #4 (comment)

martindurant · 2024-04-10T13:39:01Z

Wait a second! I re-read this — I think I may have misunderstood what you meant. Are you saying that we can accomplish the goal of no-copy Zarr/Parquet with a static file?

You can store the thrift serialised objects in file(s) rather than make them on the fly; but no, parquet does not let you point to binary buffers in some other file. It does in theory allow for each "column chunk" (thrift header + repetition/definition + buffer) to be in a different file, but in practice every parquet file except _metadata has a complete thrift footer and whole row groups.

martindurant · 2024-04-10T13:42:11Z

Also, I was chatting about this yesterday: it is totally possible to store N-D arrays in parquet despite its columnar nature: a 3D array would be list(list(number)), and the required rep/def levels would be easy to generate, as they would be regularly repeating. In V2 pages, the buffer is compressed separately from the rep/defs, so streaming data from chunks would work fine. The question is, do any parquet frameworks expect this sort of data?

martindurant · 2024-04-10T13:43:38Z

Also work noting (sorry for the spamming): fastparquet allows you to directly instantiate and serialised thrift objects. I don't believe arrow has any such ability. python-thrift also allows this, but is much slower.

jatorre · 2024-04-10T13:59:28Z

I am a bit lost here since I thought we would be able to store the arrays as arrow. Not the same but geoarrow is doing this for vector data. I assumed using nested arrays in arrow was going to be no problem.

…

On Wed, 10 Apr 2024 at 06:44, Martin Durant ***@***.***> wrote: Also work noting (sorry for the spamming): fastparquet allows you to directly instantiate and serialised thrift objects. I don't believe arrow has any such ability. python-thrift also allows this, but is much slower. — Reply to this email directly, view it on GitHub <#4 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAA7GO4PISGMYN65MOPZ2G3Y4U62BAVCNFSM6AAAAABCGQTZDWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANBXGU4DKMBQHE> . You are receiving this because you commented.Message ID: ***@***.***>

martindurant · 2024-04-10T14:04:56Z

I am a bit lost here since I thought we would be able to store the arrays
as arrow.

What do you mean by "arrow" here? Parquet is not arrow; feather2 essentially is.

Not the same but geoarrow is doing this for vector data. I assumed using
nested arrays in arrow was going to be no problem.

Yes, the parquet format can store nested data, which would include geo structures (like list of points -> polygon, for each row). But this is presumably not what you have in zarr, right? Zarr does not have a concept of variable-length structures in each array position, except for string, JSON and pickle encoding (i.e., python objects, not what you want)

So: please show me some example zarr data you'd like to view, and I'll see if I can make a python file-like to hand to pyarrow. But I'll be using fastparquet to make the thrift, since I don't believe there's any other way.

alxmrs · 2024-04-13T08:45:50Z

parquet does not let you point to binary buffers in some other file. It does in theory allow for each "column chunk" (thrift header + repetition/definition + buffer) to be in a different file, but in practice every parquet file except _metadata has a complete thrift footer and whole row groups.

Thanks Martin. That saves me from having to discover this first hand! Do you know of any format that would be a good fit, that allows for extra file storage? Feather, Avro, ORC, maybe?

The question is, do any parquet frameworks expect this sort of data?

Yes, this is really interesting! I agree, though, I don't see a use case for it for me in the near term.

I assumed using nested arrays in arrow was going to be no problem.

In a larger context, I see this work as largely experimental. It very well could not pan out as I find more information! Take these plans with a grain of salt.

martindurant · 2024-04-18T19:13:36Z

Do you know of any format that would be a good fit, that allows for extra file storage?

Sorry, no

alxmrs · 2024-05-26T11:58:57Z

Given zarr-developers/zarr-specs#287, it looks like future versions of Zarr will allow an fsspec interface beyond Python. It seems like a proxy server won’t be required.

alxmrs · 2024-06-27T21:36:57Z

Pathway to Postgres integration: https://github.com/paradedb/paradedb/tree/dev/pg_lakehouse#overview

martindurant · 2024-06-28T00:38:20Z

You want postgres to query zarr data chunks with a parquet dynamic ephemeral intermediate representation? Actually, teaching duckDB about zarr might be simpler :) .

alxmrs · 2024-07-17T18:06:18Z

@TomNicholas I just saw your tweet related to “headless data” and Pangeo. https://x.com/TEGNicholasCode/status/1813299299201294583

I wanted to reply: I think something like “Zarrquet” could be a good fit for the SQL interoperability box for pangeo.

TomNicholas · 2024-07-17T18:12:09Z

Thanks for tagging me @alxmrs - I think this also relates closely to something @manzt and I were talking about at SciPy. Basically using SQL tools to analyze Zarr stores on a chunk-level

This was referenced Feb 17, 2024

Fugue integration #13

Closed

Zarr to Datafusion ListingTable Adapter #20

Open

Apache Iceberg Adapter #21

Open

alxmrs changed the title ~~Create a referenced based filesystem to read virtual parquet from Zarr.~~ Zarrquet: Create a referenced based filesystem to read virtual parquet from Zarr. Mar 17, 2024

alxmrs mentioned this issue Apr 10, 2024

Writing to parquet (following kerchunk format) zarr-developers/VirtualiZarr#72

Closed

alxmrs mentioned this issue May 26, 2024

Milestone: Integration with DuckDB Spatial #55

Open

alxmrs mentioned this issue Oct 18, 2024

Xarray/GeoZarr integration ? geopolars/geopolars#213

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Zarrquet: Create a referenced based filesystem to read virtual parquet from Zarr. #4

Zarrquet: Create a referenced based filesystem to read virtual parquet from Zarr. #4

alxmrs commented Jan 23, 2024

alxmrs commented Jan 23, 2024

alxmrs commented Mar 17, 2024 •

edited

Loading

alxmrs commented Mar 17, 2024

alxmrs commented Mar 17, 2024

alxmrs commented Mar 18, 2024

alxmrs commented Mar 27, 2024

jatorre commented Apr 9, 2024

martindurant commented Apr 9, 2024

martindurant commented Apr 9, 2024

alxmrs commented Apr 9, 2024 •

edited

Loading

jatorre commented Apr 9, 2024

alxmrs commented Apr 10, 2024

alxmrs commented Apr 10, 2024

martindurant commented Apr 10, 2024

martindurant commented Apr 10, 2024

martindurant commented Apr 10, 2024

jatorre commented Apr 10, 2024 via email

martindurant commented Apr 10, 2024

alxmrs commented Apr 13, 2024

martindurant commented Apr 18, 2024

alxmrs commented May 26, 2024

alxmrs commented Jun 27, 2024

martindurant commented Jun 28, 2024

alxmrs commented Jul 17, 2024

TomNicholas commented Jul 17, 2024

Zarrquet: Create a referenced based filesystem to read virtual parquet from Zarr. #4

Zarrquet: Create a referenced based filesystem to read virtual parquet from Zarr. #4

Comments

alxmrs commented Jan 23, 2024

alxmrs commented Jan 23, 2024

alxmrs commented Mar 17, 2024 • edited Loading

alxmrs commented Mar 17, 2024

alxmrs commented Mar 17, 2024

alxmrs commented Mar 18, 2024

alxmrs commented Mar 27, 2024

jatorre commented Apr 9, 2024

martindurant commented Apr 9, 2024

martindurant commented Apr 9, 2024

alxmrs commented Apr 9, 2024 • edited Loading

jatorre commented Apr 9, 2024

alxmrs commented Apr 10, 2024

alxmrs commented Apr 10, 2024

martindurant commented Apr 10, 2024

martindurant commented Apr 10, 2024

martindurant commented Apr 10, 2024

jatorre commented Apr 10, 2024 via email

martindurant commented Apr 10, 2024

alxmrs commented Apr 13, 2024

martindurant commented Apr 18, 2024

alxmrs commented May 26, 2024

alxmrs commented Jun 27, 2024

martindurant commented Jun 28, 2024

alxmrs commented Jul 17, 2024

TomNicholas commented Jul 17, 2024

alxmrs commented Mar 17, 2024 •

edited

Loading

alxmrs commented Apr 9, 2024 •

edited

Loading