# Distributing GeoParquet

This guide aims to encapsulate a number of best practices that the community has
started to align on for making 'good' GeoParquet files, especially for distribution
of data. Parquet gives users lots of different options, and the defaults of various
libraries are different and usually not optimized for geospatial data.

## tl;dr Recommendations


<pictures>

GeoParquet itself does not have a specific spatial index like other formats (R-tree in GeoPackage, Packed Hilbert R-tree in
FlatGeobuf). Instead, data can be indexed in any way, and then Parquet's Row Group statistics will be used to speed up spatial
queries (when using the bbox covering or native Arrow types). Most tools that provide GeoParquet writers have some ability to apply a spatial index; the examples below will show how to do this for a few common tools.
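
As a minimal illustration (a sketch, not part of the original examples, assuming GeoPandas 1.0 or later for
`hilbert_distance` and `write_covering_bbox`), here is one way to spatially order a dataset and write it with a bbox covering:

```python
import geopandas as gpd

gdf = gpd.read_file("input.gpkg")  # placeholder input

# Order rows along a Hilbert curve so nearby features end up in the same row groups,
# which makes the per-row-group bbox statistics much more selective.
gdf = (
    gdf.assign(_hilbert=gdf.geometry.hilbert_distance())
    .sort_values("_hilbert")
    .drop(columns="_hilbert")
)

# write_covering_bbox adds the bbox column that readers use for spatial filtering.
gdf.to_parquet("sorted.parquet", write_covering_bbox=True)
```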

### Row Group Size
Getting the row group size right is important, since it will impact the performance of spatial queries. If the row groups are
too large the reader will not be able to 'skip' over large chunks of data, and if they are too small then the file metadata can
get quite large, which can really slow things down if there are a lot of files.

Unfortunately there's no single 'best' size for row groups, and it will depend on the size of the data and the access patterns.
And the community is still learning what works best, so there's no single recommendation - hopefully we'll learn more and update
this section in the future. But right now most of the larger global datasets are being distributed with row group sizes of 100,000 to 200,000 rows, so that's what we recommend as a starting point.
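
For example, with GeoPandas (a sketch, not the only way to do it; `row_group_size` is simply passed through to pyarrow,
and GDAL's Parquet driver exposes a similar `ROW_GROUP_SIZE` layer creation option):

```python
import geopandas as gpd

gdf = gpd.read_parquet("sorted.parquet")  # placeholder input

gdf.to_parquet(
    "distribution.parquet",
    compression="zstd",       # zstd is a common choice for distributed GeoParquet
    row_group_size=100_000,   # maximum rows per row group, per the starting point above
    write_covering_bbox=True,
)
```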

Most geospatial tools give you the ability to set the maximum number of rows per row group, but other tools may let you set
the byte size for the row group. The core thing that really matters is the byte size of the row group, since that is what
determines how much data a reader has to fetch and process at once.

### Spatial Partitioning

One of the coolest features of Parquet is the ability to partition a large dataset into multiple files, as most every reader
can be pointed at a folder of files and it will read them as a single dataset. The reader will use the row group statistics
to quickly figure out if a given file needs to be read, and multiple files can be read in parallel. So with spatial data,
where most every query contains a spatial filter, partitioning the data spatially can greatly accelerate the performance.
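
As a quick sketch of what that looks like for a reader (assuming GeoPandas 1.0 or later, files written with a bbox covering,
and placeholder paths and bounds):

```python
import geopandas as gpd

# Point the reader at the whole folder and let it skip files and row groups
# whose bbox statistics don't intersect the area of interest.
aoi = (4.8, 52.3, 5.0, 52.4)  # xmin, ymin, xmax, ymax
gdf = gpd.read_parquet("buildings_partitioned/", bbox=aoi)
```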

Similar to the row group size, the community is still figuring out the best way to spatially partition the data, and the
overall query performance will depend on both row group size and the size of the partitioned files, along with the nature of
the data. Hopefully someone will do a set of robust tests to help inform more definitive recommendations.

For now the recommendation is to spatially partition your data 'in some way', at least if the dataset is larger than a couple
gigabytes. If it's smaller than that then the additional overhead of splitting it up is likely not worth it. There was some
[great discussion](https://github.com/opengeospatial/geoparquet/discussions/251) on the topic, and a nice
[blog post](https://dewey.dunnington.ca/post/2024/partitioning-strategies-for-bigger-than-memory-spatial-data/) with some
further experimentation. The leading approach at the moment is to use a K-dimensional tree (KD-tree), which will enable
nice balancing of the file sizes, but sorts based on S2, GeoHash or R-tree can all work. And partitioning [based on admin
boundaries](https://medium.com/radiant-earth-insights/the-admin-partitioned-geoparquet-distribution-59f0ca1c6d96) is another
approach that works, used in the [Google-Microsoft-OSM Buildings - combined by VIDA](https://source.coop/repositories/vida/google-microsoft-osm-open-buildings/description)
dataset.
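
To make the KD-tree idea a bit more concrete, here is a rough sketch (not taken from any of the datasets above) using
GeoPandas: recursively split on the median centroid coordinate, alternating x and y, until each piece is below a target
row count, then write one file per leaf. The row threshold, file naming, and input path are illustrative assumptions.

```python
from pathlib import Path

import geopandas as gpd


def write_kd_partitions(gdf, out_dir, max_rows=200_000, depth=0, label="part"):
    """Recursively split on the median centroid coordinate and write the leaves."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    if len(gdf) <= max_rows:
        gdf.to_parquet(out_dir / f"{label}.parquet")
        return
    # Alternate the split axis with depth, KD-tree style.
    centroids = gdf.geometry.centroid
    coord = centroids.x if depth % 2 == 0 else centroids.y
    median = coord.median()
    left, right = gdf[coord <= median], gdf[coord > median]
    if len(left) == 0 or len(right) == 0:  # degenerate split, just write it out
        gdf.to_parquet(out_dir / f"{label}.parquet")
        return
    write_kd_partitions(left, out_dir, max_rows, depth + 1, label + "0")
    write_kd_partitions(right, out_dir, max_rows, depth + 1, label + "1")


gdf = gpd.read_parquet("buildings.parquet")  # placeholder input
write_kd_partitions(gdf, "buildings_partitioned/")
```

Each leaf file can then be written with the row group size and bbox covering recommendations above.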

### Use STAC metadata

These datasets are all 'good enough' to use, but don't quite follow all the recommendations above. Once they are updated we'll
move them up.

* The [Google-Microsoft-OSM Buildings - combined by VIDA](https://source.coop/repositories/vida/google-microsoft-osm-open-buildings/description) is a great example of a dataset that almost follows all the recommendations above. They did use snappy, and
their row group sizes are around 5,000 (which still gets reasonable performance). They distribute the data in 2 different
partition schemes. One is just by admin boundary, which leads to a few really large files (India, USA, etc). The other further
splits larger countries into smaller files, using S2 cells.

* [Planet Ag Field Boundaries over EU](https://source.coop/repositories/planet/eu-field-boundaries/description) - needs to be
spatially partitioned; its row group size is 25,000.

## Examples in common tools

TODO: This section should discuss what each tool does by default, and show any additional options needed to follow
the recommendations above. Likely will make sense to discuss spatial partitioning in a separate section, since right
now no tools do it out of the box.

### GDAL/OGR

### GeoPandas (Python)

### DuckDB

### Sedona

### GPQ (Go)

TODO: add more tools.


