Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
apacheGH-39489: [C++][Parquet] ConvertedType TIME and TIMESTAMP shoul…
…d imply adjustedToUtc (apache#39491) ### Rationale for this change TLDR: When reading from a Parquet file with legacy timestamp not written by arrow, `isAdjustedToUTC` would be ignored during read. And when filtering a file like this, filtering would not work. When casting from a "deprecated" parquet 1.0 ConvertedType, a timestamp should be force adjustedToUtc. For the parquet standard part. Parquet has a ConvertedType for legacy timestamp, the legacy timestamp **do not** having a `adjustedToUtc` flag. So, for forward compatible, when reading it we need to regard it as `adjustedToUtc` ( A UTC Timestamp). See [1] [2]. In our implementation, we use a `(LogicalType, PhysicalType)` pair, `LogicalType` it's a wrapper for parquet thrift `ConvertedType` and `LogicalType`, the code is listed in [3] : 1. During write, we always use `FieldToNode` and `TimestampLogicalTypeFromArrowTimestamp` [4] to generate a `TimestampLogicalType`, and generate `isAdjustedUtc` by `::arrow::TimestampType::timezone()`. Also, the arrow type will be a `ARROW:schema` extended key-value, and written in parquet file. During write, we'll both set logicaltype and convertType. `arrow_properties.coerce_timestamps_enabled()` and `arrow_properties.support_deprecated_int96_timestamps()` control the behavior. 2. During read, we'll first parse parquet thrift and generate a `LogicalType`[5]. And then, `FromParquetSchema`[6] would be called for arrow type. in 1-5, code works well, because LogicalType parsing works well. However, [6] might causing some problem: 1. If `ARROW:schema` extended is provided, nothing is wrong 2. **Otherwise, timestamp would be wrong because it ignore isAdjustedUtc!**. So when it read legacy Parquet file not written by arrow, it would not have right type. Further more, this could happening when dataset filtering happens. [1] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485 [2] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L308 [3] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L128-L283 [4] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L140 [5] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443 [6] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L1038 ### What changes are included in this PR? ConvertedType TIME and TIMESTAMP should imply adjustedToUtc when casting to arrow ### Are these changes tested? Already ### Are there any user-facing changes? User might got some behavior change during read the legacy timestamp. But most of time, arrow would casting it. * Closes: apache#39489 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>
- Loading branch information