-
Fortunately, I didn't have to do that for the version we published through the Planetary Computer. The raw data was in some sort of Spark system the team creating this dataset uses, and they exported it to a partitioned parquet dataset, partitioned by quadkey.

If I did have to do this myself, I would use Dask. Roughly something like this:

import dask.dataframe as dd

df = dd.read_csv("data/*.csv", include_path_column=True)
# do some regex stuff to extract the S2 chunk ID out of the path
df.to_parquet("data/output.parquet", partition_on="s2_key")

but that would need some tweaking to get it to work. Happy to help out with this if needed. I think @kylebarron has some code at https://github.com/kylebarron/spatially-partitioned-geoparquet where he looked into this a bit.
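For the "regex stuff" step, a minimal sketch of what I mean (the filename pattern below is a guess at an S2-chunked layout, not the actual one):

# include_path_column=True adds a "path" column with each row's source file.
# Pull the S2 token out of a name like ".../0fd_buildings.csv" (pattern is hypothetical).
df["s2_key"] = df["path"].str.extract(r"([0-9a-f]+)_buildings\.csv", expand=False)
df = df.drop(columns="path")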
I agree this is the most convenient way to access the data (though slight nitpick: I'd say "single parquet dataset" rather than "single file", since it'll be partitioned into many files). Systems like Dask, Spark, Synapse, etc. will let you treat the partitioned dataset as a single logical table.
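To make that concrete, a rough sketch of the Dask side (the dataset root and filter value are placeholders, not the published paths):

import dask.dataframe as dd

# Point at the root of the partitioned dataset; filters on a partition column
# mean only the matching files actually get read.
df = dd.read_parquet(
    "path/to/ms-buildings.parquet/",
    filters=[("RegionName", "==", "Nigeria")],
)
df.head()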
I don't really understand delta or Synapse, but using it was pretty straightforward:

# raw data from my teammate
df = spark.read.parquet(
    "abfss://[email protected]/global/2023-04-25/ml-buildings.parquet/"
)

# rewritten as delta
df.write.format("delta").partitionBy(["RegionName", "quadkey"]).save(
    "/delta/2023-04-25/ml-buildings.parquet/"
)

That spit out the delta format files (which I think is just some kind of JSON metadata file at the root plus the parquet files).
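Reading it back is just as short (same paths as above, and only a sketch; a filter on the partition columns should prune down to the matching parquet files):

df = spark.read.format("delta").load("/delta/2023-04-25/ml-buildings.parquet/")
nigeria = df.where(df.RegionName == "Nigeria")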
We already had good partitioning (at least one file per RegionName / quadkey).
No thoughts on delta vs. iceberg (Synapse has good built-in support for delta, so I used that). As for delta / iceberg vs. plain parquet, I think that some kind of catalog system can be important: for systems like spark / dask, a catalog lets you address the whole partitioned dataset as a single table rather than listing files yourself.

Interestingly, you can also (or alternatively) make STAC items for the individual parquet files and get a similar access pattern (see the bottom of https://planetarycomputer.microsoft.com/dataset/ms-buildings#Example-Notebook), but that's another can of worms.

Finally, it's worth mentioning that because Synapse / spark / delta don't support writing geoparquet metadata (AFAIK), I have to add that manually. I've been meaning to publish this as a package, but this gist has a snippet for adding geoparquet metadata to a parquet file.
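That snippet isn't reproduced here, but the core of it with pyarrow looks roughly like this (the file name, geometry column, and geometry types are assumptions):

import json
import pyarrow.parquet as pq

table = pq.read_table("part-00000.parquet")  # one of the partitioned files

geo = {
    "version": "1.0.0",  # whatever GeoParquet spec version you're targeting
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Polygon", "MultiPolygon"],
        }
    },
}

# Merge a "geo" key into the existing schema metadata and rewrite the file.
metadata = dict(table.schema.metadata or {})
metadata[b"geo"] = json.dumps(geo).encode("utf-8")
pq.write_table(table.replace_schema_metadata(metadata), "part-00000.parquet")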
-
Ok, I've finally got something to show on this. Thanks for all the help @TomAugspurger and @Maxxen
The idea of naming them meaningfully appeals to me, so the folders could be used in 'download to desktop' type traditional workflows. I also put up a little tutorial on GeoParquet and DuckDB, as it made it quite easy to do things like download a whole country of Parquet files and put them in a common GIS format. The httpfs extension made it really easy for someone not all up on S3 to grab the files, and the spatial extension made it easy to write out to lots of different formats. GDAL/OGR seemed able to read the custom-named files as a partition just fine - I'd be curious if people could test with other tools. I also have no idea if the partitions actually help in the querying all that much.

If others are interested in experimenting with more partitions it should be a solid dataset to work with, at ~60 gigs. My next steps are to try out using iceberg or delta on top of the partitions, and to perhaps try out pure spatial partitioning using Sedona. Not sure when I'll find the time though, so others are more than welcome to try it out as well. If you get interesting results and want to host them on source.coop, let me know and I'm sure we can get it up there.
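For the curious, the core of that DuckDB workflow from Python is roughly the following (the bucket, partition layout, and geometry column name are placeholders, not the actual source.coop paths):

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")
con.sql("INSTALL spatial")
con.sql("LOAD spatial")

# Grab one country's worth of partitioned files and write a single file in a
# traditional GIS format (any GDAL vector driver works here).
con.sql("""
    COPY (
        SELECT * REPLACE (ST_GeomFromWKB(geometry) AS geometry)
        FROM read_parquet('s3://example-bucket/google-buildings/country=KEN/*.parquet')
    )
    TO 'kenya_buildings.fgb'
    WITH (FORMAT GDAL, DRIVER 'FlatGeobuf')
""")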
-
I've recently been working with Google's Open Buildings Dataset, exploring the creation of a more cloud-native distribution of it than what Google currently offers, likely to put up on Source Cooperative. I've got a PMTiles of the whole dataset, and then I'm also wanting to offer it as GeoParquet.
My ideal is to let people treat it as a single file, but then have it partitioned so people can also just download / work with, say, a single country. So I was looking for advice / best practices on how to split it up. I think the best example of this thus far is Microsoft Buildings on Planetary Computer, so I'm mostly wondering if you could share how that's done @TomAugspurger? How does the delta file work? How did you divide up the underlying files? How did you create the delta file? etc. Also curious about delta vs iceberg vs just using parquet?
I'm also happy for anyone else who has broken up large datasets to weigh in with what you did. Hopefully we can get some best practices and figure out how to make this easy for people.
The Google Buildings data is broken up into S2 chunks, but it does seem like it'd be more useful to have it based on countries. And then maybe break it up further if the amount of data in a country is still quite big? Like I think Nigeria is going to be huge.
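Roughly, I'm imagining something like this (Dask again; the column names are just placeholders, and a country code would need to be joined on first):

import dask.dataframe as dd

df = dd.read_parquet("google-buildings-raw.parquet/")

# Country first, then a finer key (an S2 token or quadkey) so that huge
# countries like Nigeria end up split across many manageable files.
df.to_parquet(
    "google-buildings-by-country.parquet/",
    partition_on=["country_iso", "s2_token"],
)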