Skip to content

Commit

Permalink
apacheGH-39489: [C++][Parquet] ConvertedType TIME and TIMESTAMP shoul…
Browse files Browse the repository at this point in the history
…d imply adjustedToUtc (apache#39491)

### Rationale for this change

TLDR: When reading from a Parquet file with legacy timestamp not written by arrow, `isAdjustedToUTC` would be ignored during read. And when filtering a file like this, filtering would not work.

When casting from a "deprecated" parquet 1.0 ConvertedType, a timestamp should be force adjustedToUtc.

For the parquet standard part. Parquet has a ConvertedType for legacy timestamp, the legacy timestamp **do not** having a `adjustedToUtc` flag. So, for forward compatible, when reading it we need to regard it as `adjustedToUtc` ( A UTC Timestamp). See [1] [2].

In our implementation, we use a `(LogicalType, PhysicalType)` pair, `LogicalType` it's a wrapper for parquet thrift `ConvertedType` and `LogicalType`, the code is listed in [3] : 
1. During write, we always use `FieldToNode` and `TimestampLogicalTypeFromArrowTimestamp` [4] to generate a `TimestampLogicalType`, and generate `isAdjustedUtc` by `::arrow::TimestampType::timezone()`.  Also, the arrow type will be a `ARROW:schema` extended key-value, and written in parquet file. During write, we'll both set logicaltype and convertType. `arrow_properties.coerce_timestamps_enabled()` and `arrow_properties.support_deprecated_int96_timestamps()` control the behavior.
2. During read, we'll first parse parquet thrift and generate a `LogicalType`[5]. And then, `FromParquetSchema`[6] would be called for arrow type.

in 1-5, code works well, because LogicalType parsing works well. However, [6] might causing some problem:
1. If `ARROW:schema` extended is provided, nothing is wrong
2. **Otherwise, timestamp would be wrong because it ignore isAdjustedUtc!**. So when it read legacy Parquet file not written by arrow, it would not have right type.

Further more, this could happening when dataset filtering happens.

[1] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485
[2] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L308
[3] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L128-L283
[4] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L140
[5] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443
[6] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L1038

### What changes are included in this PR?

ConvertedType TIME and TIMESTAMP should imply adjustedToUtc when casting to arrow

### Are these changes tested?

Already

### Are there any user-facing changes?

User might got some behavior change during read the legacy timestamp. But most of time, arrow would casting it.

* Closes: apache#39489 

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
  • Loading branch information
mapleFU authored and zanmato1984 committed Feb 28, 2024
1 parent 3d04fe7 commit b531d44
Show file tree
Hide file tree
Showing 4 changed files with 20 additions and 9 deletions.
4 changes: 2 additions & 2 deletions cpp/src/parquet/arrow/arrow_schema_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -115,13 +115,13 @@ TEST_F(TestConvertParquetSchema, ParquetFlatPrimitives) {
ParquetType::INT64,
ConvertedType::TIMESTAMP_MILLIS));
arrow_fields.push_back(
::arrow::field("timestamp", ::arrow::timestamp(TimeUnit::MILLI), false));
::arrow::field("timestamp", ::arrow::timestamp(TimeUnit::MILLI, "UTC"), false));

parquet_fields.push_back(PrimitiveNode::Make("timestamp[us]", Repetition::REQUIRED,
ParquetType::INT64,
ConvertedType::TIMESTAMP_MICROS));
arrow_fields.push_back(
::arrow::field("timestamp[us]", ::arrow::timestamp(TimeUnit::MICRO), false));
::arrow::field("timestamp[us]", ::arrow::timestamp(TimeUnit::MICRO, "UTC"), false));

parquet_fields.push_back(PrimitiveNode::Make("date", Repetition::REQUIRED,
ParquetType::INT32, ConvertedType::DATE));
Expand Down
4 changes: 3 additions & 1 deletion cpp/src/parquet/arrow/schema.cc
Original file line number Diff line number Diff line change
Expand Up @@ -156,7 +156,9 @@ static std::shared_ptr<const LogicalType> TimestampLogicalTypeFromArrowTimestamp
/*is_from_converted_type=*/false,
/*force_set_converted_type=*/true);
case ::arrow::TimeUnit::NANO:
return LogicalType::Timestamp(utc, LogicalType::TimeUnit::NANOS);
return LogicalType::Timestamp(utc, LogicalType::TimeUnit::NANOS,
/*is_from_converted_type=*/false,
/*force_set_converted_type=*/false);
case ::arrow::TimeUnit::SECOND:
// No equivalent parquet logical type.
break;
Expand Down
3 changes: 1 addition & 2 deletions cpp/src/parquet/arrow/schema_internal.cc
Original file line number Diff line number Diff line change
Expand Up @@ -89,8 +89,7 @@ Result<std::shared_ptr<ArrowType>> MakeArrowTime64(const LogicalType& logical_ty

Result<std::shared_ptr<ArrowType>> MakeArrowTimestamp(const LogicalType& logical_type) {
const auto& timestamp = checked_cast<const TimestampLogicalType&>(logical_type);
const bool utc_normalized =
timestamp.is_from_converted_type() ? false : timestamp.is_adjusted_to_utc();
const bool utc_normalized = timestamp.is_adjusted_to_utc();
static const char* utc_timezone = "UTC";
switch (timestamp.time_unit()) {
case LogicalType::TimeUnit::MILLIS:
Expand Down
18 changes: 14 additions & 4 deletions cpp/src/parquet/types.cc
Original file line number Diff line number Diff line change
Expand Up @@ -345,15 +345,25 @@ std::shared_ptr<const LogicalType> LogicalType::FromConvertedType(
case ConvertedType::DATE:
return DateLogicalType::Make();
case ConvertedType::TIME_MILLIS:
return TimeLogicalType::Make(true, LogicalType::TimeUnit::MILLIS);
// ConvertedType::TIME_{*} are deprecated in favor of LogicalType::Time, the
// compatibility for ConvertedType::TIME_{*} are listed in
// https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-time-convertedtype
return TimeLogicalType::Make(/*is_adjusted_to_utc=*/true,
LogicalType::TimeUnit::MILLIS);
case ConvertedType::TIME_MICROS:
return TimeLogicalType::Make(true, LogicalType::TimeUnit::MICROS);
return TimeLogicalType::Make(/*is_adjusted_to_utc=*/true,
LogicalType::TimeUnit::MICROS);
case ConvertedType::TIMESTAMP_MILLIS:
return TimestampLogicalType::Make(true, LogicalType::TimeUnit::MILLIS,
// ConvertedType::TIMESTAMP_{*} are deprecated in favor of LogicalType::Timestamp,
// the compatibility for ConvertedType::TIMESTAMP_{*} are listed in
// https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-timestamp-convertedtype
return TimestampLogicalType::Make(/*is_adjusted_to_utc=*/true,
LogicalType::TimeUnit::MILLIS,
/*is_from_converted_type=*/true,
/*force_set_converted_type=*/false);
case ConvertedType::TIMESTAMP_MICROS:
return TimestampLogicalType::Make(true, LogicalType::TimeUnit::MICROS,
return TimestampLogicalType::Make(/*is_adjusted_to_utc=*/true,
LogicalType::TimeUnit::MICROS,
/*is_from_converted_type=*/true,
/*force_set_converted_type=*/false);
case ConvertedType::INTERVAL:
Expand Down

0 comments on commit b531d44

Please sign in to comment.