-
Fortunately, I didn't have to do that for the version we published through the Planetary Computer. The raw data was in some sort of Spark system the team creating this dataset uses, and they exported it to a partitioned parquet dataset, partitioned by quadkey.

If I did have to do this myself, I would use Dask. Roughly something like this:

import dask.dataframe as dd

df = dd.read_csv("data/*.csv", include_path_column=True)
# do some regex stuff to extract the S2 chunk ID out of the path
df.to_parquet("data/output.parquet", partition_on="s2_key")

but that would need some tweaking to get it to work. Happy to help out with this if needed. I think @kylebarron has some code at https://github.com/kylebarron/spatially-partitioned-geoparquet where he looked into this a bit.
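For the "regex stuff" step, a minimal sketch of what I mean (the filename pattern below is a guess at an S2-chunked layout, not the actual one):

# include_path_column=True adds a "path" column with each row's source file.
# Pull the S2 token out of a name like ".../0fd_buildings.csv" (pattern is hypothetical).
df["s2_key"] = df["path"].str.extract(r"([0-9a-f]+)_buildings\.csv", expand=False)
df = df.drop(columns="path")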
I agree this is the most convenient way to access the data (though slight nitpick: I'd say "single parquet dataset" rather than "single file", since it'll be partitioned into many files). Systems like Dask, Spark, Synapse, etc. will let you treat the partitioned dataset as a single logical table.
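To make that concrete, a rough sketch of the Dask side (the dataset root and filter value are placeholders, not the published paths):

import dask.dataframe as dd

# Point at the root of the partitioned dataset; filters on a partition column
# mean only the matching files actually get read.
df = dd.read_parquet(
    "path/to/ms-buildings.parquet/",
    filters=[("RegionName", "==", "Nigeria")],
)
df.head()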
I don't really understand delta or Synapse, but using it was pretty straightforward:

# raw data from my teammate
df = spark.read.parquet(
    "abfss://[email protected]/global/2023-04-25/ml-buildings.parquet/"
)

# rewritten as delta
df.write.format("delta").partitionBy(["RegionName", "quadkey"]).save(
    "/delta/2023-04-25/ml-buildings.parquet/"
)

That spit out the delta format files (which I think is just some kind of JSON metadata file at the root plus the parquet files).
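Reading it back is just as short (same paths as above, and only a sketch; a filter on the partition columns should prune down to the matching parquet files):

df = spark.read.format("delta").load("/delta/2023-04-25/ml-buildings.parquet/")
nigeria = df.where(df.RegionName == "Nigeria")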
We already had good partitioning (at least one file per RegionName / quadkey).
No thoughts on delta vs. iceberg (Synapse has good built-in support for delta, so I used that). As for delta / iceberg vs. plain parquet, I think that some kind of catalog system can be important: for systems like spark / dask, a catalog lets you address the whole partitioned dataset as a single table rather than listing files yourself.

Interestingly, you can also (or alternatively) make STAC items for the individual parquet files and get a similar access pattern (see the bottom of https://planetarycomputer.microsoft.com/dataset/ms-buildings#Example-Notebook), but that's another can of worms.

Finally, it's worth mentioning that because Synapse / spark / delta don't support writing geoparquet metadata (AFAIK), I have to add that manually. I've been meaning to publish this as a package, but this gist has a snippet for adding geoparquet metadata to a parquet file.
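That snippet isn't reproduced here, but the core of it with pyarrow looks roughly like this (the file name, geometry column, and geometry types are assumptions):

import json
import pyarrow.parquet as pq

table = pq.read_table("part-00000.parquet")  # one of the partitioned files

geo = {
    "version": "1.0.0",  # whatever GeoParquet spec version you're targeting
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Polygon", "MultiPolygon"],
        }
    },
}

# Merge a "geo" key into the existing schema metadata and rewrite the file.
metadata = dict(table.schema.metadata or {})
metadata[b"geo"] = json.dumps(geo).encode("utf-8")
pq.write_table(table.replace_schema_metadata(metadata), "part-00000.parquet")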
-
Ok, I've finally got something to show on this. Thanks for all the help @TomAugspurger and @Maxxen
The idea of naming them meaningfully appeals to me, so the folders could be used in 'download to desktop' type traditional workflows. I also put up a little tutorial on GeoParquet and DuckDB, as it made it quite easy to do things like download a whole country of Parquet files and put them in a common GIS format. The httpfs extension made it really easy for someone not all up on S3 to grab the files, and the spatial extension made it easy to write out to lots of different formats. GDAL/OGR seemed able to read the custom-named files as a partition just fine - I'd be curious if people could test with other tools. I also have no idea if the partitions actually help in the querying all that much.

If others are interested in experimenting with more partitions it should be a solid dataset to work with, at ~60 gigs. My next steps are to try out using iceberg or delta on top of the partitions, and to perhaps try out pure spatial partitioning using Sedona. Not sure when I'll find the time though, so others are more than welcome to try it out as well. If you get interesting results and want to host them on source.coop, let me know and I'm sure we can get it up there.
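For the curious, the core of that DuckDB workflow from Python is roughly the following (the bucket, partition layout, and geometry column name are placeholders, not the actual source.coop paths):

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")
con.sql("INSTALL spatial")
con.sql("LOAD spatial")

# Grab one country's worth of partitioned files and write a single file in a
# traditional GIS format (any GDAL vector driver works here).
con.sql("""
    COPY (
        SELECT * REPLACE (ST_GeomFromWKB(geometry) AS geometry)
        FROM read_parquet('s3://example-bucket/google-buildings/country=KEN/*.parquet')
    )
    TO 'kenya_buildings.fgb'
    WITH (FORMAT GDAL, DRIVER 'FlatGeobuf')
""")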
-
I've recently been working with Google's Open Buildings Dataset, exploring the creation of a more cloud-native distribution of it than what Google currently offers, likely to put up on Source Cooperative. I've got a PMTiles of the whole dataset, and then I'm also wanting to offer it as GeoParquet.
My ideal is to let people treat it as a single file, but then have it partitioned so people can also just download / work with, say, a single country. So I was looking for advice / best practices on how to split it up. I think the best example of this thus far is Microsoft Buildings on Planetary Computer, so I'm mostly wondering if you could share how that's done @TomAugspurger? How does the delta file work? How did you divide up the underlying files? How did you create the delta file? etc. Also curious about delta vs iceberg vs just using parquet?
I'm also happy for anyone else who has broken up large datasets to weigh in with what you did. Hopefully we can get some best practices and figure out how to make this easy for people.
The Google Buildings data is broken up into S2 chunks, but it does seem like it'd be more useful to have it based on countries. And then maybe break it up further if the amount of data in a country is still quite big? Like I think Nigeria is going to be huge.
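Roughly, I'm imagining something like this (Dask again; the column names are just placeholders, and a country code would need to be joined on first):

import dask.dataframe as dd

df = dd.read_parquet("google-buildings-raw.parquet/")

# Country first, then a finer key (an S2 token or quadkey) so that huge
# countries like Nigeria end up split across many manageable files.
df.to_parquet(
    "google-buildings-by-country.parquet/",
    partition_on=["country_iso", "s2_token"],
)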