
Read GeoParquet files using parquet reader #6508

Merged: 8 commits merged into huggingface:main on Jan 26, 2024

Conversation

@weiji14 (Contributor) commented Dec 18, 2023

Let GeoParquet files with the file extension `*.geoparquet` or `*.gpq` be readable by the default parquet reader.

Those two file extensions are the ones most commonly used for GeoParquet files, and are the ones handled by the gpq validator tool at https://github.com/planetlabs/gpq/blob/e5576b4ee7306b4d2259d56c879465a9364dab90/cmd/gpq/command/convert.go#L73-L75

Addresses #6438
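
For context, a rough sketch of what this enables (the file and directory paths below are hypothetical, not code from this PR): once the extensions are mapped to the parquet module, data files named `*.geoparquet` or `*.gpq` can be loaded without naming the builder explicitly.

```python
# Sketch only; "data.geoparquet" and the directory path are hypothetical.
from datasets import load_dataset

# Explicitly naming the parquet builder already worked before this change:
ds = load_dataset("parquet", data_files="data.geoparquet", split="train")

# With the extension mapping, the parquet module is inferred automatically:
ds = load_dataset("path/to/directory_with_geoparquet_files", split="train")

print(ds.features)  # the WKB geometry column shows up as a plain binary column
```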

@weiji14 weiji14 marked this pull request as ready for review December 18, 2023 05:00
@weiji14 weiji14 mentioned this pull request Dec 18, 2023
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq (Member) commented Dec 21, 2023

Cool ! Do you mind writing a test using a geoparquet file in tests/io/test_parquet.py ?

I'm not too familiar with geoparquet, does it use e.g. pyarrow extension types ? or schema metadata ?

@severo (Collaborator) commented Dec 21, 2023

Geometry columns MUST be stored using the BYTE_ARRAY parquet type. They MUST be encoded as WKB.

https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#geometry-columns

It has metadata:

- File metadata indicating things like the version of this specification used
- Column metadata with additional metadata for each geometry column

https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#metadata
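
As a rough illustration of what a WKB-encoded geometry value looks like (shapely is used here only for the example; it is not a dependency of this PR):

```python
# Sketch only: decode a single WKB-encoded geometry value with shapely.
from shapely import wkb

wkb_bytes = bytes.fromhex("0101000000000000000000f03f0000000000000040")  # POINT (1 2)
geom = wkb.loads(wkb_bytes)
print(geom.geom_type, geom.x, geom.y)  # Point 1.0 2.0
```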

@severo (Collaborator) commented Dec 21, 2023

The specification is very short by the way:

https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md


@weiji14 (Contributor, Author) commented Dec 21, 2023

Cool ! Do you mind writing a test using a geoparquet file in tests/io/test_parquet.py ?

Yep, let me do that later today!

I'm not too familiar with geoparquet, does it use e.g. pyarrow extension types ? or schema metadata ?

GeoParquet is a Parquet file with a geometry column that is encoded in a Binary format (technically WKB as @severo mentioned above). It is not a pyarrow extension type (though that would be cool). Regular parquet readers such as pyarrow would thus see the column as a binary column, while libraries such as geopandas which implement a GeoParquet reader would look at the schema metadata.

E.g. taking this file as an example, this is what the 'geo' schema metadata looks like:

import pyarrow.parquet as pq

schema = pq.read_schema(where="32VLM_v01.gpq")
print(schema.metadata[b"geo"])  # a JSON string; shown pretty-printed below
{
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "crs": {
                "$schema": "https://proj.org/schemas/v0.7/projjson.schema.json",
                "type": "GeographicCRS",
                "name": "WGS 84 (CRS84)",
                "datum_ensemble": {
                    "name": "World Geodetic System 1984 ensemble",
                    "members": [
                        {"name": "World Geodetic System 1984 (Transit)"},
                        {"name": "World Geodetic System 1984 (G730)"},
                        {"name": "World Geodetic System 1984 (G873)"},
                        {"name": "World Geodetic System 1984 (G1150)"},
                        {"name": "World Geodetic System 1984 (G1674)"},
                        {"name": "World Geodetic System 1984 (G1762)"},
                        {"name": "World Geodetic System 1984 (G2139)"},
                    ],
                    "ellipsoid": {
                        "name": "WGS 84",
                        "semi_major_axis": 6378137,
                        "inverse_flattening": 298.257223563,
                    },
                    "accuracy": "2.0",
                    "id": {"authority": "EPSG", "code": 6326},
                },
                "coordinate_system": {
                    "subtype": "ellipsoidal",
                    "axis": [
                        {
                            "name": "Geodetic longitude",
                            "abbreviation": "Lon",
                            "direction": "east",
                            "unit": "degree",
                        },
                        {
                            "name": "Geodetic latitude",
                            "abbreviation": "Lat",
                            "direction": "north",
                            "unit": "degree",
                        },
                    ],
                },
                "scope": "Not known.",
                "area": "World.",
                "bbox": {
                    "south_latitude": -90,
                    "west_longitude": -180,
                    "north_latitude": 90,
                    "east_longitude": 180,
                },
                "id": {"authority": "OGC", "code": "CRS84"},
            },
            "geometry_types": ["Polygon"],
            "bbox": [
                5.370542846111244,
                59.42344573656881,
                7.368267282586697,
                60.42591328670696,
            ],
        }
    },
    "version": "1.0.0",
    "creator": {"library": "geopandas", "version": "0.14.1"},
}

We can continue the discussion on how to handle this extra 'geo' schema metadata in #6438. I'd like to keep this PR small by just piggy-backing off the default Parquet reader for now, which would just show the 'geometry' column as a binary column.

Getting a sample GeoParquet file from https://github.com/opengeospatial/geoparquet/raw/v1.0.0/examples/example.parquet, saving it with a .geoparquet extension, and trying to load it back, checking that the column dtypes are correct. Also decided to put .geoparquet and .gpq in the _EXTENSION_TO_MODULE dictionary directly.
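
The gist of that test, as a rough sketch (not the exact code added in the PR):

```python
# Rough sketch of the test described above; tmp_path is pytest's temporary-directory fixture.
import urllib.request

from datasets import load_dataset

URL = "https://github.com/opengeospatial/geoparquet/raw/v1.0.0/examples/example.parquet"


def test_read_geoparquet(tmp_path):
    path = str(tmp_path / "example.geoparquet")
    urllib.request.urlretrieve(URL, path)  # save the sample with a .geoparquet extension
    # Loading by directory relies on the extension -> parquet module mapping added here.
    ds = load_dataset(str(tmp_path), split="train")
    # The default parquet reader exposes the WKB geometry column as plain binary.
    assert ds.features["geometry"].dtype == "binary"
```
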
@lhoestq (Member) commented Jan 2, 2024

Thanks ! Also if you can make sure that doing ds.to_parquet("path/to/output.geoparquet") also saves a valid geoparquet file (including the schema metadata), that would be awesome.

It would also enable push_to_hub to save geoparquet files

@weiji14 (Contributor, Author) commented Jan 8, 2024

Thanks ! Also if you can make sure that doing ds.to_parquet("path/to/output.geoparquet") also saves a valid geoparquet file (including the schema metadata), that would be awesome.

It would also enable push_to_hub to save geoparquet files

Hmm, it should be possible to let PyArrow save a Parquet file with a geometry WKB column, but saving the GeoParquet schema metadata won't be easy without introducing geopandas as a dependency. Does this need to be done in this PR, or can it be a separate one?
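
To give a sense of what the manual route would involve, here is a hypothetical, minimal sketch using plain pyarrow; the hard part (computing `crs`, `bbox`, and `geometry_types` correctly) is what geopandas would normally handle and is hard-coded or omitted here:

```python
# Hypothetical sketch only: attach a minimal (and likely incomplete) "geo"
# metadata entry with plain pyarrow, without geopandas.
import json

import pyarrow as pa
import pyarrow.parquet as pq

# A single WKB point, POINT (1 2), as the geometry column.
wkb_point = bytes.fromhex("0101000000000000000000f03f0000000000000040")
table = pa.table({"geometry": pa.array([wkb_point], type=pa.binary())})

geo = {
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": []}},
}
metadata = {**(table.schema.metadata or {}), b"geo": json.dumps(geo).encode()}
pq.write_table(table.replace_schema_metadata(metadata), "output.geoparquet")
```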

@lhoestq (Member) commented Jan 9, 2024

I see, then let's keep it like this for now.
I just checked, and it would require adding support for keeping the schema metadata in Features anyway.

Feel free to fix your code formatting using

make style

and we can merge this PR :)

@weiji14 (Contributor, Author) commented Jan 11, 2024

Cool, linted to remove the extra blank line at 7088f58. 🚀

@weiji14 (Contributor, Author) commented Jan 25, 2024

The previous CI failure at https://github.com/huggingface/datasets/actions/runs/7482863299/job/20668381959#step:6:5299 says datasets.exceptions.DefunctDatasetError: Dataset 'eli5' is defunct and no longer accessible due to unavailability of the source data, which seems unrelated and might be due to #6605. I've updated the PR branch with changes from main again; could someone re-run the tests and merge if ok? Thanks!

@lhoestq (Member) commented Jan 26, 2024

sorry, it took me some time to fix the CI on the main branch

will merge once it's green :)

@lhoestq merged commit fabc2c8 into huggingface:main on Jan 26, 2024
12 checks passed
Show benchmarks

Benchmark results posted for PyArrow==8.0.0 and PyArrow==latest (benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, and benchmark_map_filter), reported as new / old (diff) timings.
