Read GeoParquet files using parquet reader #6508
Let GeoParquet files with the file extension `*.geoparquet` or `*.gpq` be readable by the default parquet reader.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Cool! Do you mind writing a test using a geoparquet file? I'm not too familiar with geoparquet: does it use e.g. pyarrow extension types, or schema metadata? |
https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#geometry-columns It has metadata:
https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#metadata |
The specification is very short by the way: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md |
https://github.com/opengeospatial/geoparquet/blob/main/format-specs/compatible-parquet.md is also worth reading for this PR |
Yep, let me do that later today!
GeoParquet is a Parquet file with a 'geo' entry in its schema metadata. E.g. taking this file as an example, this is what the 'geo' schema metadata looks like:
import pyarrow.parquet as pq
schema = pq.read_schema(where="32VLM_v01.gpq")
print(schema.metadata[b"geo"])
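The value stored under the `b"geo"` key is a JSON document, so it can be decoded with the standard library once it has been read. The snippet below is a minimal sketch of that step: `SAMPLE_GEO` is a hand-written stand-in for the bytes returned by `schema.metadata[b"geo"]` (not the contents of the file above), and `summarize_geo_metadata` is a hypothetical helper name. The keys it reads (`version`, `primary_column`, `columns`, `encoding`) are the ones listed in the GeoParquet metadata specification.

```python
import json

# Hand-written stand-in for the bytes stored under the b"geo" schema key.
# This is an assumption for illustration, not the metadata of a real file.
SAMPLE_GEO = json.dumps(
    {
        "version": "1.0.0",
        "primary_column": "geometry",
        "columns": {
            "geometry": {"encoding": "WKB", "geometry_types": ["Polygon"]},
        },
    }
).encode("utf-8")


def summarize_geo_metadata(raw: bytes) -> dict:
    """Decode the 'geo' metadata bytes and pull out a few key fields."""
    geo = json.loads(raw.decode("utf-8"))
    primary = geo["primary_column"]
    return {
        "version": geo["version"],
        "primary_column": primary,
        "encoding": geo["columns"][primary]["encoding"],
    }


summary = summarize_geo_metadata(SAMPLE_GEO)
print(summary)
```

With a real file, the same helper could be called on `schema.metadata[b"geo"]` from the snippet above.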
We can continue the discussion on how to handle this extra 'geo' schema metadata in #6438. I'd like to keep this PR small by just piggy-backing off the default Parquet reader for now, which would just show the 'geometry' column as a binary column. |
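To make the "binary column" concrete: the geometry values are WKB (well-known binary) blobs, which the default Parquet reader leaves as raw bytes. The following is a rough sketch, using only the standard library, of how a 2D point is laid out in WKB and decoded back. The helper names are hypothetical, and only the point type is handled.

```python
import struct

WKB_POINT = 1  # WKB geometry type code for a 2D point


def encode_wkb_point(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (21 bytes)."""
    # 1 byte byte-order flag (1 = little endian), uint32 type, two doubles.
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def decode_wkb_point(wkb: bytes) -> tuple:
    """Decode a WKB blob that holds a single 2D point."""
    # The first byte says which byte order the rest of the blob uses.
    order = "<" if wkb[0] == 1 else ">"
    geom_type, x, y = struct.unpack(order + "Idd", wkb[1:21])
    if geom_type != WKB_POINT:
        raise ValueError(f"not a WKB point (type code {geom_type})")
    return (x, y)


blob = encode_wkb_point(1.5, -2.25)
print(len(blob), decode_wkb_point(blob))
```

Real GeoParquet files usually hold polygons or other geometry types, so in practice a geometry library would do this decoding.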
The test gets a sample GeoParquet file from https://github.com/opengeospatial/geoparquet/raw/v1.0.0/examples/example.parquet, saves it with a .geoparquet extension, and tries to load it back, checking that the column dtypes are correct. Also decided to put .geoparquet and .gpq in the _EXTENSION_TO_MODULE dictionary directly.
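The idea of routing the new extensions to the existing reader can be sketched as a plain dictionary lookup. This is a simplified stand-in, not the library's actual `_EXTENSION_TO_MODULE` definition: the dictionary contents, the value type, and the `infer_module` helper are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional

# Simplified stand-in for an extension-to-reader mapping (assumed layout).
EXTENSION_TO_MODULE = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    # The two GeoParquet extensions reuse the default parquet reader.
    ".geoparquet": "parquet",
    ".gpq": "parquet",
}


def infer_module(filename: str) -> Optional[str]:
    """Return the reader name for a file, or None if the extension is unknown."""
    suffix = PurePosixPath(filename).suffix.lower()
    return EXTENSION_TO_MODULE.get(suffix)


print(infer_module("32VLM_v01.gpq"))
print(infer_module("example.geoparquet"))
```

Adding the extensions to the mapping directly keeps the change small: no new reader is needed, and the geometry column simply comes through as binary.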
Thanks ! Also if you can make sure that doing It would also enable |
Hmm, it should be possible to let PyArrow save a Parquet file with a geometry WKB column, but saving the GeoParquet schema metadata won't be easy without introducing |
I see, then let's keep it like this for now. Feel free to fix your code formatting using
and we can merge this PR :) |
Cool, linted to remove the extra blank line at 7088f58. 🚀 |
The previous CI failure at https://github.com/huggingface/datasets/actions/runs/7482863299/job/20668381959#step:6:5299 says |
sorry, it took me some time to fix the CI on the will merge once it's green :) |
Let GeoParquet files with the file extension `*.geoparquet` or `*.gpq` be readable by the default parquet reader.
Those two file extensions are the ones most commonly used for GeoParquet files, and are included in the `gpq` validator tool at https://github.com/planetlabs/gpq/blob/e5576b4ee7306b4d2259d56c879465a9364dab90/cmd/gpq/command/convert.go#L73-L75
Addresses #6438