GH-45522: [Parquet][C++] Proof-of-concept Parquet GEOMETRY and GEOGRAPHY logical type implementations #45459

Open
wants to merge 107 commits into
base: main
3eb8351
update thrift
paleolimbot Jun 29, 2024
a5d9b0a
update thrift defs
paleolimbot Jun 29, 2024
ab1b55e
add stubs
paleolimbot Jun 30, 2024
c11ac8c
split methods out of line
paleolimbot Jun 30, 2024
00aceb0
maybe to/from thrift
paleolimbot Jul 9, 2024
afbe43c
a few more serializers
paleolimbot Jul 9, 2024
abcee37
add basic test for serialization
paleolimbot Aug 7, 2024
ea2b32b
add sort order check
paleolimbot Aug 7, 2024
9231e68
update thrift
paleolimbot Aug 12, 2024
b4f0cdc
start geom utiles
paleolimbot Aug 12, 2024
3f251ce
test roundtrip thrift cases
paleolimbot Aug 14, 2024
860a7cd
more geometry utils
paleolimbot Aug 14, 2024
0616c34
more
paleolimbot Aug 14, 2024
05dc071
bounder
paleolimbot Aug 14, 2024
c0adac1
add basic test
paleolimbot Aug 14, 2024
2816fb0
a few more strings
paleolimbot Aug 14, 2024
7aef800
test some bounding box things
paleolimbot Aug 14, 2024
d2610a5
more tests
paleolimbot Aug 14, 2024
17381a4
fix test
paleolimbot Aug 14, 2024
dd9a5d2
with passing tests
paleolimbot Aug 14, 2024
19e0d63
add in WKT equiv
paleolimbot Aug 14, 2024
0caf1ed
more tests
paleolimbot Aug 14, 2024
983aecd
start on stats
paleolimbot Aug 17, 2024
c20541c
implement update/merge for geometry statistics
paleolimbot Aug 19, 2024
0785d6b
more complete stats
paleolimbot Aug 19, 2024
a2370ab
start on factory methods
paleolimbot Aug 19, 2024
90e9067
more stats things
paleolimbot Aug 19, 2024
7dae969
maybe work with serde
paleolimbot Aug 19, 2024
4d37785
Update cpp/src/parquet/types.cc
paleolimbot Aug 19, 2024
953f912
Updated parquet.thrift and re-generated cpp sources
Kontinuation Sep 3, 2024
470724c
Geometry value writer could make use of the geometry statistics class to
Kontinuation Sep 3, 2024
6d5c810
Geometry column writer now populates correct statistics
Kontinuation Sep 4, 2024
92332ae
format/tidy
Kontinuation Sep 4, 2024
442daf0
Run clang-tidy
Kontinuation Sep 4, 2024
19a7e91
Added a test that writes and reads a parquet file containing a geomet…
Kontinuation Sep 5, 2024
d4b3d48
Remove redundant include
Kontinuation Sep 6, 2024
85240a3
Fix problems found by reviewers
Kontinuation Sep 6, 2024
41bd029
Try to make it build properly on other platforms
Kontinuation Sep 6, 2024
2cef359
Address review comments in https://github.com/apache/arrow/pull/43196
Kontinuation Sep 6, 2024
4d60bc8
Resolve compile errors for MSVC
Kontinuation Sep 6, 2024
013dd55
Expose getters in GeometryStatistics, Change geometry_types from
Kontinuation Sep 10, 2024
3846082
Add test case for UpdateSpaced, don't generate min/max stats for geom…
Kontinuation Sep 11, 2024
699812f
Support covering
Kontinuation Sep 11, 2024
46ff6da
MakeStatistics and Statistics::Make should not be a breaking change
Kontinuation Sep 12, 2024
c2edf01
ColumnIndex, as well as some other fixes and refacturings
Kontinuation Sep 12, 2024
7b7f47c
Fix compiler warnings on AMD platforms as well as sanitizer warnings
Kontinuation Sep 12, 2024
d5023ea
Remove all newly added include directives
Kontinuation Sep 12, 2024
68bf190
include cmath for std::isnan
Kontinuation Sep 12, 2024
4f9e96c
Test writing WKB encoded geometries using WriteArrow
Kontinuation Sep 16, 2024
e9fc02b
Change the sort order of geometry from unknown to unsigned; resolved …
Kontinuation Sep 19, 2024
afbbaf3
Add generate_covering_ member to be explicit that' we'll generate the…
Kontinuation Sep 19, 2024
f976b35
Refactor unscoped enums in geometry_util_internal to enum classes
Kontinuation Sep 19, 2024
2889b77
Revert more special case handling for unknown sort order
Kontinuation Sep 19, 2024
5b254cd
Fix WKB covering test to take native endianness into consideration
Kontinuation Sep 19, 2024
52ce32a
min/max of geometry columns are the WKB representation of lower-left …
Kontinuation Sep 19, 2024
261681d
Address latest review comments
Kontinuation Sep 20, 2024
e76238e
A better implementation of geometry min/max statistics
Kontinuation Sep 20, 2024
968df75
Update the code to accomodate the latest changes of the standard:
Kontinuation Oct 7, 2024
830e1fe
Fix problem decoding WKB geometries with more than 32 coordinates
Kontinuation Oct 15, 2024
d8145d1
Re-implemented geometry statistics according to the updated spec:
Kontinuation Oct 30, 2024
9c692d9
Revert some unnecessary changes
Kontinuation Oct 30, 2024
545a1cb
update so that it all builds
paleolimbot Feb 6, 2025
5387191
tests passing!
paleolimbot Feb 6, 2025
9db565e
remove change
paleolimbot Feb 7, 2025
8d67b38
update thrift
paleolimbot Feb 7, 2025
dfd3d78
cpp geometry type update
paleolimbot Feb 7, 2025
4601832
add geography type
paleolimbot Feb 7, 2025
d0ef7d2
handle renaming of the statistics and types
paleolimbot Feb 7, 2025
48a26b2
building tests
paleolimbot Feb 7, 2025
250dddd
passing tests!
paleolimbot Feb 7, 2025
633b2f4
add other edge algorithms
paleolimbot Feb 7, 2025
facc3c4
canonically export default spherical edges for geography
paleolimbot Feb 7, 2025
e7aa91c
edges to algorithm
paleolimbot Feb 7, 2025
c205fd4
clang-format
paleolimbot Feb 7, 2025
452af5a
undo page index change
paleolimbot Feb 7, 2025
2948893
attempt to fix windows CI error
paleolimbot Feb 7, 2025
9a3be69
maybe use the right header
paleolimbot Feb 7, 2025
9b9b78b
add arrow converter maybe
paleolimbot Feb 7, 2025
75d5fc2
add geoarrow read
paleolimbot Feb 7, 2025
ef95df8
maybe fix returning temporary object
paleolimbot Feb 7, 2025
b63ed04
fix build
paleolimbot Feb 7, 2025
ccd1528
test some CRS propagation
paleolimbot Feb 7, 2025
ea1a360
add some Cython Parquet definitions
paleolimbot Feb 7, 2025
87196d2
maybe pipe through objects
paleolimbot Feb 7, 2025
22d3e40
maybe expose geospatial stat items
paleolimbot Feb 7, 2025
20cc72b
add reprs and accessors to geospatial stats
paleolimbot Feb 7, 2025
b5dec3b
format
paleolimbot Feb 7, 2025
a7a635d
more format
paleolimbot Feb 7, 2025
bf3e8f3
comment out code for a second to check read
paleolimbot Feb 7, 2025
5d8d92d
geospatial types opt-in on write
paleolimbot Feb 10, 2025
74b48ba
check lonlat cases for geoarrow
paleolimbot Feb 10, 2025
acf1cf1
add srid case
paleolimbot Feb 10, 2025
2d61a4d
extension properties to python
paleolimbot Feb 10, 2025
92ea069
fix some typos in parquet args
paleolimbot Feb 10, 2025
6b99d14
fix signature of exported cython function
paleolimbot Feb 10, 2025
bb18779
format
paleolimbot Feb 10, 2025
0d364bb
maybe fix documentation lint error
paleolimbot Feb 11, 2025
31dc936
fix the order
paleolimbot Feb 11, 2025
2931a24
simplify geometry utility
paleolimbot Feb 13, 2025
35ec0c6
remove some unneded parts of geometry_util_internal
paleolimbot Feb 13, 2025
465007b
use status
paleolimbot Feb 13, 2025
8cb2868
maybe fix build
paleolimbot Feb 13, 2025
e65eb37
one more pipe through of extension option
paleolimbot Feb 13, 2025
0014da3
consolidate XYXX typedefs
paleolimbot Feb 13, 2025
16f2880
Consolidate cursor management in WKBBuffer
paleolimbot Feb 13, 2025
a07a795
reduce scope of the geometry type and dimensions classes
paleolimbot Feb 13, 2025
036258d
simplify bbox even more
paleolimbot Feb 13, 2025
3,613 changes: 2,537 additions & 1,076 deletions cpp/src/generated/parquet_types.cpp

Large diffs are not rendered by default.

1,597 changes: 588 additions & 1,009 deletions cpp/src/generated/parquet_types.h

Large diffs are not rendered by default.

941 changes: 706 additions & 235 deletions cpp/src/generated/parquet_types.tcc

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions cpp/src/parquet/CMakeLists.txt
@@ -171,6 +171,7 @@ set(PARQUET_SRCS
exception.cc
file_reader.cc
file_writer.cc
geometry_statistics.cc
level_comparison.cc
level_conversion.cc
metadata.cc
@@ -259,6 +260,10 @@ endif()
if(NOT PARQUET_MINIMAL_DEPENDENCY)
list(APPEND PARQUET_SHARED_LINK_LIBS arrow_shared)

# TODO(paleolimbot): Remove once sample files are generated
list(APPEND PARQUET_SHARED_LINK_LIBS RapidJSON)
list(APPEND PARQUET_STATIC_LINK_LIBS RapidJSON)

# These are libraries that we will link privately with parquet_shared (as they
# do not need to be linked transitively by other linkers)
list(APPEND PARQUET_SHARED_PRIVATE_LINK_LIBS thrift::thrift)
@@ -372,6 +377,7 @@ add_parquet_test(internals-test
statistics_test.cc
encoding_test.cc
metadata_test.cc
geometry_util_internal_test.cc
page_index_test.cc
public_api_test.cc
size_statistics_test.cc
1 change: 1 addition & 0 deletions cpp/src/parquet/api/reader.h
@@ -22,6 +22,7 @@
#include "parquet/column_scanner.h"
#include "parquet/exception.h"
#include "parquet/file_reader.h"
#include "parquet/geometry_statistics.h"
#include "parquet/metadata.h"
#include "parquet/platform.h"
#include "parquet/printer.h"
192 changes: 191 additions & 1 deletion cpp/src/parquet/arrow/arrow_schema_test.cc
@@ -34,6 +34,7 @@
#include "arrow/array.h"
#include "arrow/extension/json.h"
#include "arrow/ipc/writer.h"
#include "arrow/testing/extension_type.h"
#include "arrow/testing/gtest_util.h"
#include "arrow/type.h"
#include "arrow/util/base64.h"
@@ -69,6 +70,42 @@ const auto TIMESTAMP_NS = ::arrow::timestamp(TimeUnit::NANO);
const auto BINARY = ::arrow::binary();
const auto DECIMAL_8_4 = std::make_shared<::arrow::Decimal128Type>(8, 4);

// A minimal version of a geoarrow.wkb extension type to test interoperability
class GeoArrowWkbExtensionType : public ::arrow::ExtensionType {
 public:
  explicit GeoArrowWkbExtensionType(std::shared_ptr<::arrow::DataType> storage_type,
                                    std::string metadata)
      : ::arrow::ExtensionType(std::move(storage_type)), metadata_(std::move(metadata)) {}

  std::string extension_name() const override { return "geoarrow.wkb"; }

  std::string Serialize() const override { return metadata_; }

  ::arrow::Result<std::shared_ptr<::arrow::DataType>> Deserialize(
      std::shared_ptr<::arrow::DataType> storage_type,
      const std::string& serialized_data) const override {
    return std::make_shared<GeoArrowWkbExtensionType>(std::move(storage_type),
                                                      serialized_data);
  }

  std::shared_ptr<::arrow::Array> MakeArray(
      std::shared_ptr<::arrow::ArrayData> data) const override {
    return std::make_shared<::arrow::ExtensionArray>(data);
  }

  bool ExtensionEquals(const ExtensionType& other) const override {
    return other.extension_name() == extension_name() && other.Serialize() == Serialize();
  }

 private:
  std::string metadata_;
};

std::shared_ptr<::arrow::DataType> geoarrow_wkb(std::string metadata = "{}") {
  return std::make_shared<GeoArrowWkbExtensionType>(::arrow::binary(),
                                                    std::move(metadata));
}

class TestConvertParquetSchema : public ::testing::Test {
public:
virtual void SetUp() {}
@@ -236,6 +273,10 @@ TEST_F(TestConvertParquetSchema, ParquetAnnotatedFields) {
::arrow::int64()},
{"json", LogicalType::JSON(), ParquetType::BYTE_ARRAY, -1, ::arrow::utf8()},
{"bson", LogicalType::BSON(), ParquetType::BYTE_ARRAY, -1, ::arrow::binary()},
{"geometry", LogicalType::Geometry(), ParquetType::BYTE_ARRAY, -1,
::arrow::binary()},
{"geography", LogicalType::Geography(), ParquetType::BYTE_ARRAY, -1,
::arrow::binary()},
{"interval", LogicalType::Interval(), ParquetType::FIXED_LEN_BYTE_ARRAY, 12,
::arrow::fixed_size_binary(12)},
{"uuid", LogicalType::UUID(), ParquetType::FIXED_LEN_BYTE_ARRAY, 16,
@@ -948,6 +989,48 @@ TEST_F(TestConvertParquetSchema, ParquetSchemaArrowExtensions) {
}
}

TEST_F(TestConvertParquetSchema, ParquetSchemaGeoArrowExtensions) {
  std::vector<NodePtr> parquet_fields;
  parquet_fields.push_back(PrimitiveNode::Make("geometry", Repetition::OPTIONAL,
                                               LogicalType::Geometry(),
                                               ParquetType::BYTE_ARRAY));
  parquet_fields.push_back(PrimitiveNode::Make("geography", Repetition::OPTIONAL,
                                               LogicalType::Geography(),
                                               ParquetType::BYTE_ARRAY));

  {
    // Parquet file does not contain Arrow schema.
    // By default, both fields should be treated as binary() fields in Arrow.
    auto arrow_schema = ::arrow::schema({::arrow::field("geometry", BINARY, true),
                                         ::arrow::field("geography", BINARY, true)});
    std::shared_ptr<KeyValueMetadata> metadata{};
    ASSERT_OK(ConvertSchema(parquet_fields, metadata));
    CheckFlatSchema(arrow_schema);
  }

  {
    // Parquet file does not contain Arrow schema.
    // If Arrow extensions are enabled and extensions are registered,
    // fields will be interpreted as geoarrow_wkb(binary()) extension fields.
    ::arrow::ExtensionTypeGuard guard(geoarrow_wkb());

    ArrowReaderProperties props;
    props.set_arrow_extensions_enabled(true);
    auto arrow_schema = ::arrow::schema(
        {::arrow::field(
             "geometry",
             geoarrow_wkb(R"({"crs": "OGC:CRS84", "crs_type": "authority_code"})"), true),
         ::arrow::field(
             "geography",
             geoarrow_wkb(
                 R"({"crs": "OGC:CRS84", "crs_type": "authority_code", "edges": "spherical"})"),
             true)});
    std::shared_ptr<KeyValueMetadata> metadata{};
    ASSERT_OK(ConvertSchema(parquet_fields, metadata, props));
    CheckFlatSchema(arrow_schema);
  }
}
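
The test above expects default GEOMETRY and GEOGRAPHY columns to surface with two specific `geoarrow.wkb` metadata strings. As a rough sketch of that mapping only, the expected strings could be assembled as follows. The helper name `DefaultGeoArrowMetadata` is hypothetical and does not appear in this PR; the actual conversion code is not shown in this part of the diff.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the PR): builds the geoarrow.wkb extension
// metadata that the test expects for a Parquet column using the default CRS.
// Geography columns additionally carry the "spherical" edge interpretation.
std::string DefaultGeoArrowMetadata(bool is_geography) {
  std::string out = R"({"crs": "OGC:CRS84", "crs_type": "authority_code")";
  if (is_geography) {
    out += R"(, "edges": "spherical")";
  }
  out += "}";
  return out;
}
```

Calling it with `false` yields the metadata expected for the `geometry` field and with `true` the metadata expected for the `geography` field.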

class TestConvertArrowSchema : public ::testing::Test {
public:
virtual void SetUp() {}
@@ -963,7 +1046,8 @@ class TestConvertArrowSchema : public ::testing::Test {
for (int i = 0; i < expected_schema_node->field_count(); i++) {
auto lhs = result_schema_node->field(i);
auto rhs = expected_schema_node->field(i);
- EXPECT_TRUE(lhs->Equals(rhs.get()));
+ EXPECT_TRUE(lhs->Equals(rhs.get()))
+     << lhs->logical_type()->ToString() << " != " << rhs->logical_type()->ToString();
}
}

@@ -1201,6 +1285,112 @@ TEST_F(TestConvertArrowSchema, ParquetFlatPrimitivesAsDictionaries) {
ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(parquet_fields));
}

TEST_F(TestConvertArrowSchema, ParquetGeoArrowCrsLonLat) {
  // All the Arrow Schemas below should convert to the type defaults for GEOMETRY
  // and GEOGRAPHY when GeoArrow extension types are registered and the appropriate
  // writer option is set.
  ::arrow::ExtensionTypeGuard guard(geoarrow_wkb());

  ArrowWriterProperties::Builder builder;
  builder.write_geospatial_logical_types();
  auto arrow_properties = builder.build();

  std::vector<NodePtr> parquet_fields;
  parquet_fields.push_back(PrimitiveNode::Make("geometry", Repetition::OPTIONAL,
                                               LogicalType::Geometry(),
                                               ParquetType::BYTE_ARRAY));
  parquet_fields.push_back(PrimitiveNode::Make("geography", Repetition::OPTIONAL,
                                               LogicalType::Geography(),
                                               ParquetType::BYTE_ARRAY));

  // There are several ways that longitude/latitude could be specified when coming from
  // GeoArrow, which allows null, missing, arbitrary strings (e.g., Authority:Code), and
  // PROJJSON.
  std::vector<std::string> geoarrow_lonlat = {
      "null", R"("OGC:CRS84")", R"("EPSG:4326")",
      // Purely the parts of the PROJJSON that we inspect to check the lon/lat case
      R"({"id": {"authority": "OGC", "code": "CRS84"}})",
      R"({"id": {"authority": "EPSG", "code": 4326}})"};

  std::string geoarrow_lonlatish_crs = geoarrow_lonlat[0];
  for (const auto& geoarrow_lonlatish_crs : geoarrow_lonlat) {
    SCOPED_TRACE(geoarrow_lonlatish_crs);
    std::vector<std::shared_ptr<Field>> arrow_fields = {
        ::arrow::field("geometry",
                       geoarrow_wkb(R"({"crs": )" + geoarrow_lonlatish_crs + "}"), true),
        ::arrow::field("geography",
                       geoarrow_wkb(R"({"crs": )" + geoarrow_lonlatish_crs +
                                    R"(, "edges": "spherical"})"),
                       true)};

    ASSERT_OK(ConvertSchema(arrow_fields, arrow_properties));
    ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(parquet_fields));
  }
}
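
The test above lists several spellings of longitude/latitude that should all collapse to the Parquet type defaults. A minimal sketch of such a check is shown below, assuming the GeoArrow `crs` value has already been parsed into a small struct. `ParsedCrs` and `IsLonLat` are hypothetical names for illustration; the PR's own detection logic is not part of this excerpt and presumably works on the JSON directly.

```cpp
#include <cassert>
#include <string>

// Hypothetical, already-parsed view of a GeoArrow "crs" value. A real
// implementation would fill these fields from a JSON parser.
struct ParsedCrs {
  bool is_null = false;      // "crs": null, or the key is absent
  std::string text;          // set when the CRS is a plain string
  std::string id_authority;  // PROJJSON id.authority, if present
  std::string id_code;       // PROJJSON id.code rendered as text, if present
};

// Returns true for the lon/lat spellings exercised by the test:
// null/missing, "OGC:CRS84", "EPSG:4326", or a PROJJSON id naming either.
bool IsLonLat(const ParsedCrs& crs) {
  if (crs.is_null) return true;
  if (crs.text == "OGC:CRS84" || crs.text == "EPSG:4326") return true;
  if (crs.id_authority == "OGC" && crs.id_code == "CRS84") return true;
  if (crs.id_authority == "EPSG" && crs.id_code == "4326") return true;
  return false;
}
```

Rendering `id.code` as text sidesteps the fact that PROJJSON may carry the code either as a string (`"CRS84"`) or as a number (`4326`), as the two PROJJSON cases in the test show.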

TEST_F(TestConvertArrowSchema, ParquetGeoArrowCrsSrid) {
  // Checks the conversion between GeoArrow's crs_type: srid and Parquet's srid:XXX.
  // SRID (spatial reference identifier) is an opaque application specific identifier
  // that GeoArrow will transport but refuse to resolve if required for a spatial
  // operation.
  ::arrow::ExtensionTypeGuard guard(geoarrow_wkb());

  ArrowWriterProperties::Builder builder;
  builder.write_geospatial_logical_types();
  auto arrow_properties = builder.build();

  std::vector<NodePtr> parquet_fields;
  parquet_fields.push_back(PrimitiveNode::Make("geometry", Repetition::OPTIONAL,
                                               LogicalType::Geometry("srid:1234"),
                                               ParquetType::BYTE_ARRAY));
  parquet_fields.push_back(PrimitiveNode::Make("geography", Repetition::OPTIONAL,
                                               LogicalType::Geography("srid:5678"),
                                               ParquetType::BYTE_ARRAY));

  std::vector<std::shared_ptr<Field>> arrow_fields = {
      ::arrow::field("geometry", geoarrow_wkb(R"({"crs": "1234", "crs_type": "srid"})"),
                     true),
      ::arrow::field(
          "geography",
          geoarrow_wkb(R"({"crs": "5678", "crs_type": "srid", "edges": "spherical"})"),
          true)};

  ASSERT_OK(ConvertSchema(arrow_fields, arrow_properties));
  ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(parquet_fields));
}
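
The SRID test pairs GeoArrow metadata of the form `{"crs": "1234", "crs_type": "srid"}` with a Parquet CRS string of the form `srid:1234`. The string mapping itself can be sketched as below. Both function names are hypothetical, and returning an empty string for "not an SRID" is a simplification chosen for this illustration rather than anything the PR prescribes.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps a GeoArrow (crs, crs_type) pair to the Parquet
// "srid:XXX" spelling. Returns an empty string when the pair is not an SRID.
std::string ParquetCrsFromSrid(const std::string& crs, const std::string& crs_type) {
  if (crs_type != "srid" || crs.empty()) return "";
  return "srid:" + crs;
}

// Hypothetical inverse: extracts the opaque identifier from "srid:XXX".
// Returns an empty string when the input does not start with "srid:".
std::string SridFromParquetCrs(const std::string& parquet_crs) {
  const std::string prefix = "srid:";
  if (parquet_crs.compare(0, prefix.size(), prefix) != 0) return "";
  return parquet_crs.substr(prefix.size());
}
```

The identifier is passed through as an opaque string in both directions, which matches the test comment's description of an SRID as something that is transported but not resolved.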

TEST_F(TestConvertArrowSchema, ParquetGeoArrowCrsProjjson) {
  GTEST_SKIP() << "GeoArrow/PROJJSON support not yet implemented";

  // Checks the conversion from GeoArrow that contains non-lon/lat PROJJSON
  // to Parquet. Almost all GeoArrow types that arrive at the Parquet writer
  // will have their CRS expressed in this way.
  ::arrow::ExtensionTypeGuard guard(geoarrow_wkb());

  ArrowWriterProperties::Builder builder;
  builder.write_geospatial_logical_types();
  auto arrow_properties = builder.build();

  std::vector<NodePtr> parquet_fields;
  parquet_fields.push_back(PrimitiveNode::Make("geometry", Repetition::OPTIONAL,
                                               LogicalType::Geometry("projjson:1234"),
                                               ParquetType::BYTE_ARRAY));
  parquet_fields.push_back(PrimitiveNode::Make("geography", Repetition::OPTIONAL,
                                               LogicalType::Geography("projjson:5678"),
                                               ParquetType::BYTE_ARRAY));

  std::vector<std::shared_ptr<Field>> arrow_fields = {
      ::arrow::field("geometry", geoarrow_wkb(R"({"crs": "1234", "crs_type": "srid"})"),
                     true),
      ::arrow::field(
          "geography",
          geoarrow_wkb(R"({"crs": "5678", "crs_type": "srid", "edges": "spherical"})"),
          true)};

  ASSERT_OK(ConvertSchema(arrow_fields, arrow_properties));
  ASSERT_NO_FATAL_FAILURE(CheckFlatSchema(parquet_fields));
}

TEST_F(TestConvertArrowSchema, ParquetLists) {
std::vector<NodePtr> parquet_fields;
std::vector<std::shared_ptr<Field>> arrow_fields;