
Read GeoParquet files using parquet reader #6508

Merged: 8 commits merged into huggingface:main on Jan 26, 2024

Conversation

@weiji14 (Contributor) commented Dec 18, 2023

Let GeoParquet files with the file extension `*.geoparquet` or `*.gpq` be readable by the default parquet reader.

Those two file extensions are the ones most commonly used for GeoParquet files, and are the ones handled by the gpq validator tool at https://github.com/planetlabs/gpq/blob/e5576b4ee7306b4d2259d56c879465a9364dab90/cmd/gpq/command/convert.go#L73-L75

Addresses #6438
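
For context, a rough sketch of what this enables (the file and directory paths below are hypothetical, not code from this PR): once the extensions are mapped to the parquet module, data files named `*.geoparquet` or `*.gpq` can be loaded without naming the builder explicitly.

```python
# Sketch only; "data.geoparquet" and the directory path are hypothetical.
from datasets import load_dataset

# Explicitly naming the parquet builder already worked before this change:
ds = load_dataset("parquet", data_files="data.geoparquet", split="train")

# With the extension mapping, the parquet module is inferred automatically:
ds = load_dataset("path/to/directory_with_geoparquet_files", split="train")

print(ds.features)  # the WKB geometry column shows up as a plain binary column
```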

@weiji14 weiji14 marked this pull request as ready for review December 18, 2023 05:00
@weiji14 weiji14 mentioned this pull request Dec 18, 2023
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq (Member) commented Dec 21, 2023

Cool ! Do you mind writing a test using a geoparquet file in tests/io/test_parquet.py ?

I'm not too familiar with geoparquet, does it use e.g. pyarrow extension types ? or schema metadata ?

@severo (Collaborator) commented Dec 21, 2023

Geometry columns MUST be stored using the BYTE_ARRAY parquet type. They MUST be encoded as WKB.

https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#geometry-columns

It has metadata:

- File metadata indicating things like the version of this specification used
- Column metadata with additional metadata for each geometry column

https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#metadata
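
As a rough illustration of what a WKB-encoded geometry value looks like (shapely is used here only for the example; it is not a dependency of this PR):

```python
# Sketch only: decode a single WKB-encoded geometry value with shapely.
from shapely import wkb

wkb_bytes = bytes.fromhex("0101000000000000000000f03f0000000000000040")  # POINT (1 2)
geom = wkb.loads(wkb_bytes)
print(geom.geom_type, geom.x, geom.y)  # Point 1.0 2.0
```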

@severo (Collaborator) commented Dec 21, 2023

The specification is very short by the way:

https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md


@weiji14 (Contributor, Author) commented Dec 21, 2023

Cool ! Do you mind writing a test using a geoparquet file in tests/io/test_parquet.py ?

Yep, let me do that later today!

I'm not too familiar with geoparquet, does it use e.g. pyarrow extension types ? or schema metadata ?

GeoParquet is a Parquet file with a geometry column that is encoded in a Binary format (technically WKB as @severo mentioned above). It is not a pyarrow extension type (though that would be cool). Regular parquet readers such as pyarrow would thus see the column as a binary column, while libraries such as geopandas which implement a GeoParquet reader would look at the schema metadata.

E.g. taking this file as an example, this is what the 'geo' schema metadata looks like:

import pyarrow.parquet as pq

schema = pq.read_schema(where="32VLM_v01.gpq")
print(schema.metadata[b"geo"])  # a JSON string; shown pretty-printed below
{
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "crs": {
                "$schema": "https://proj.org/schemas/v0.7/projjson.schema.json",
                "type": "GeographicCRS",
                "name": "WGS 84 (CRS84)",
                "datum_ensemble": {
                    "name": "World Geodetic System 1984 ensemble",
                    "members": [
                        {"name": "World Geodetic System 1984 (Transit)"},
                        {"name": "World Geodetic System 1984 (G730)"},
                        {"name": "World Geodetic System 1984 (G873)"},
                        {"name": "World Geodetic System 1984 (G1150)"},
                        {"name": "World Geodetic System 1984 (G1674)"},
                        {"name": "World Geodetic System 1984 (G1762)"},
                        {"name": "World Geodetic System 1984 (G2139)"},
                    ],
                    "ellipsoid": {
                        "name": "WGS 84",
                        "semi_major_axis": 6378137,
                        "inverse_flattening": 298.257223563,
                    },
                    "accuracy": "2.0",
                    "id": {"authority": "EPSG", "code": 6326},
                },
                "coordinate_system": {
                    "subtype": "ellipsoidal",
                    "axis": [
                        {
                            "name": "Geodetic longitude",
                            "abbreviation": "Lon",
                            "direction": "east",
                            "unit": "degree",
                        },
                        {
                            "name": "Geodetic latitude",
                            "abbreviation": "Lat",
                            "direction": "north",
                            "unit": "degree",
                        },
                    ],
                },
                "scope": "Not known.",
                "area": "World.",
                "bbox": {
                    "south_latitude": -90,
                    "west_longitude": -180,
                    "north_latitude": 90,
                    "east_longitude": 180,
                },
                "id": {"authority": "OGC", "code": "CRS84"},
            },
            "geometry_types": ["Polygon"],
            "bbox": [
                5.370542846111244,
                59.42344573656881,
                7.368267282586697,
                60.42591328670696,
            ],
        }
    },
    "version": "1.0.0",
    "creator": {"library": "geopandas", "version": "0.14.1"},
}

We can continue the discussion on how to handle this extra 'geo' schema metadata in #6438. I'd like to keep this PR small by just piggy-backing off the default Parquet reader for now, which would just show the 'geometry' column as a binary column.

Getting a sample GeoParquet file from https://github.com/opengeospatial/geoparquet/raw/v1.0.0/examples/example.parquet, saving it with a .geoparquet extension, and trying to load it back, checking that the column dtypes are correct. Also decided to put .geoparquet and .gpq in the _EXTENSION_TO_MODULE dictionary directly.
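
The gist of that test, as a rough sketch (not the exact code added in the PR):

```python
# Rough sketch of the test described above; tmp_path is pytest's temporary-directory fixture.
import urllib.request

from datasets import load_dataset

URL = "https://github.com/opengeospatial/geoparquet/raw/v1.0.0/examples/example.parquet"


def test_read_geoparquet(tmp_path):
    path = str(tmp_path / "example.geoparquet")
    urllib.request.urlretrieve(URL, path)  # save the sample with a .geoparquet extension
    # Loading by directory relies on the extension -> parquet module mapping added here.
    ds = load_dataset(str(tmp_path), split="train")
    # The default parquet reader exposes the WKB geometry column as plain binary.
    assert ds.features["geometry"].dtype == "binary"
```
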
@lhoestq (Member) commented Jan 2, 2024

Thanks ! Also if you can make sure that doing ds.to_parquet("path/to/output.geoparquet") also saves a valid geoparquet file (including the schema metadata), that would be awesome.

It would also enable push_to_hub to save geoparquet files

@weiji14 (Contributor, Author) commented Jan 8, 2024

Thanks ! Also if you can make sure that doing ds.to_parquet("path/to/output.geoparquet") also saves a valid geoparquet file (including the schema metadata), that would be awesome.

It would also enable push_to_hub to save geoparquet files

Hmm, it should be possible to let PyArrow save a Parquet file with a geometry WKB column, but saving the GeoParquet schema metadata won't be easy without introducing geopandas as a dependency. Does this need to be done in this PR, or can it be a separate one?
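
To give a sense of what the manual route would involve, here is a hypothetical, minimal sketch using plain pyarrow; the hard part (computing `crs`, `bbox`, and `geometry_types` correctly) is what geopandas would normally handle and is hard-coded or omitted here:

```python
# Hypothetical sketch only: attach a minimal (and likely incomplete) "geo"
# metadata entry with plain pyarrow, without geopandas.
import json

import pyarrow as pa
import pyarrow.parquet as pq

# A single WKB point, POINT (1 2), as the geometry column.
wkb_point = bytes.fromhex("0101000000000000000000f03f0000000000000040")
table = pa.table({"geometry": pa.array([wkb_point], type=pa.binary())})

geo = {
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": []}},
}
metadata = {**(table.schema.metadata or {}), b"geo": json.dumps(geo).encode()}
pq.write_table(table.replace_schema_metadata(metadata), "output.geoparquet")
```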

@lhoestq (Member) commented Jan 9, 2024

I see, then let's keep it like this for now.
I just checked, and it would require adding support for keeping the schema metadata in Features anyway.

Feel free to fix your code formatting using

make style

and we can merge this PR :)

@weiji14 (Contributor, Author) commented Jan 11, 2024

Cool, linted to remove the extra blank line at 7088f58. 🚀

@weiji14 (Contributor, Author) commented Jan 25, 2024

The previous CI failure at https://github.com/huggingface/datasets/actions/runs/7482863299/job/20668381959#step:6:5299 says datasets.exceptions.DefunctDatasetError: Dataset 'eli5' is defunct and no longer accessible due to unavailability of the source data, which seems unrelated and might be due to #6605. I've updated the PR branch with changes from main again; could someone re-run the tests and merge if ok? Thanks!

@lhoestq (Member) commented Jan 26, 2024

sorry, it took me some time to fix the CI on the main branch

will merge once it's green :)

@lhoestq merged commit fabc2c8 into huggingface:main on Jan 26, 2024
12 checks passed
Show benchmarks

Benchmark results posted for PyArrow==8.0.0 and PyArrow==latest (benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, and benchmark_map_filter), reported as new / old (diff) timings.
