[C++][Parquet] Timestamp conversion from Parquet to Arrow does not follow compatibility guidelines for convertedType #39489
Thanks for pointing this out!
@joellubi I found this is only called in the code path below:
I guess this is not frequently called in this code path, because:
Some related discussions:
Now I suppose that, because of those mentioned issues, when converting a TimestampLogicalType to an Arrow timestamp type, we do ignore the
…y adjustedToUtc (#39491)
### Rationale for this change
TL;DR: When reading a Parquet file with a legacy timestamp not written by Arrow, `isAdjustedToUTC` is ignored during read, and filtering such a file does not work. When casting from a "deprecated" Parquet 1.0 ConvertedType, the timestamp should be forced to adjustedToUtc.

On the Parquet standard side: Parquet has a ConvertedType for legacy timestamps, and that legacy timestamp does **not** carry an `adjustedToUtc` flag. So, for forward compatibility, when reading it we need to regard it as `adjustedToUtc` (a UTC timestamp). See [1] [2].

In our implementation we use a `(LogicalType, PhysicalType)` pair, where `LogicalType` is a wrapper around the Parquet thrift `ConvertedType` and `LogicalType`; the code is listed in [3]:
1. During write, we always use `FieldToNode` and `TimestampLogicalTypeFromArrowTimestamp` [4] to generate a `TimestampLogicalType`, deriving `isAdjustedToUTC` from `::arrow::TimestampType::timezone()`. The Arrow type is also serialized into the `ARROW:schema` extended key-value metadata and written into the Parquet file. During write we set both the LogicalType and the ConvertedType; `arrow_properties.coerce_timestamps_enabled()` and `arrow_properties.support_deprecated_int96_timestamps()` control the behavior.
2. During read, we first parse the Parquet thrift and generate a `LogicalType` [5]. Then `FromParquetSchema` [6] is called to produce the Arrow type.

In [1]-[5] the code works well, because LogicalType parsing works well. However, [6] can cause problems:
1. If the `ARROW:schema` extended metadata is provided, nothing is wrong.
2. **Otherwise, the timestamp type is wrong because it ignores isAdjustedToUTC.** So when reading a legacy Parquet file not written by Arrow, we do not get the right type. Furthermore, this can also surface when dataset filtering happens.

[1] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485
[2] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L308
[3] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L128-L283
[4] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L140
[5] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443
[6] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L1038
### What changes are included in this PR?
ConvertedType TIME and TIMESTAMP should imply adjustedToUtc when casting to Arrow.
### Are these changes tested?
Covered by existing tests.
### Are there any user-facing changes?
Users might see a behavior change when reading legacy timestamps, but most of the time Arrow will cast them.
* Closes: #39489
Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
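The read-side rule this PR implements can be illustrated with a minimal sketch. This is plain Python, not the actual Arrow C++ code, and the function and dictionary names are hypothetical; it only shows the decision being described:

```python
# Sketch of the backward-compatibility rule: a legacy ConvertedType
# timestamp carries no isAdjustedToUTC flag, so per the Parquet format
# compatibility guidelines it is read as a UTC-adjusted (instant) type,
# while a modern LogicalType carries the flag explicitly.

LEGACY_TIMESTAMP_CONVERTED_TYPES = {
    "TIMESTAMP_MILLIS": "ms",
    "TIMESTAMP_MICROS": "us",
}

def arrow_timestamp_from_parquet(converted_type=None, logical_type=None):
    """Return a (unit, tz) pair describing the resulting Arrow type."""
    if logical_type is not None:
        # Modern annotation: honor the explicit flag.
        unit, is_adjusted_to_utc = logical_type
        return (unit, "UTC" if is_adjusted_to_utc else None)
    if converted_type in LEGACY_TIMESTAMP_CONVERTED_TYPES:
        # Legacy path: the spec says assume instant semantics (UTC).
        return (LEGACY_TIMESTAMP_CONVERTED_TYPES[converted_type], "UTC")
    raise ValueError("not a timestamp column")

# A legacy file with ConvertedType only is read as UTC-adjusted.
assert arrow_timestamp_from_parquet(converted_type="TIMESTAMP_MICROS") == ("us", "UTC")
# A modern file with isAdjustedToUTC=false stays tz-naive.
assert arrow_timestamp_from_parquet(logical_type=("us", False)) == ("us", None)
```

The second assertion is exactly the case the PR changes for the legacy path: before the change, the ConvertedType-only branch also returned a tz-naive type.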
@jorisvandenbossche Yes. As I mentioned, this might only happen when:
Actually, my comment above was wrong:
This is exactly what your PR changed, i.e. the conversion from a Parquet type to an Arrow type. And so the consequence of this is that reading an older file written by Arrow will now behave differently (and this compatibility was explicitly added in #4856). It's maybe OK to change this behaviour, giving up faithful roundtrip for those older files written by Arrow (in the end, they are old files) in favor of better compatibility with the Parquet spec and other producers, but we need to be explicit that this is actually a behaviour change for users.
Yes, we can document it before releasing 16.0.
To make it concrete: assume I have a file written with pyarrow 0.11 for data in Arrow with tz-naive timestamps (uploaded here as a zip: test_timestamp_pa011.zip):
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> pa.__version__
'0.11.1'
>>> table = pa.Table.from_arrays([pa.array([1, 2, 3], pa.timestamp("us"))], ['col'])
>>> pq.write_table(table, "test_timestamp_pa011.parquet")
and then, when we read it with the latest released pyarrow, we faithfully roundtrip the Arrow data:
>>> pa.__version__
'14.0.1'
>>> pq.read_metadata("test_timestamp_pa011.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7ff21be310c0>
required group field_id=-1 schema {
optional int64 field_id=-1 col (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false));
}
>>> pq.read_metadata("test_timestamp_pa011.parquet").schema.to_arrow_schema()
col: timestamp[us]
>>> pq.read_table("test_timestamp_pa011.parquet")
pyarrow.Table
col: timestamp[us]
----
col: [[1970-01-01 00:00:00.000001,1970-01-01 00:00:00.000002,1970-01-01 00:00:00.000003]]
But now with latest main (after #39491), the read data is no longer the same as the original data (the type now says that the data is in UTC, while the original data were local timestamps in an unknown timezone):
>>> pa.__version__
'15.0.0.dev402+gc752bdb08'
>>> pq.read_metadata("test_timestamp_pa011.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7f6969c44c80>
required group field_id=-1 schema {
optional int64 field_id=-1 col (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false));
}
>>> pq.read_metadata("test_timestamp_pa011.parquet").schema.to_arrow_schema()
col: timestamp[us, tz=UTC]
>>> pq.read_table("test_timestamp_pa011.parquet")
pyarrow.Table
col: timestamp[us, tz=UTC]
----
col: [[1970-01-01 00:00:00.000001Z,1970-01-01 00:00:00.000002Z,1970-01-01 00:00:00.000003Z]]
Yeah.
Hmm, regarding this: do you think we need an extra config, or should we just change this?
As I tried to explain: no, it's not a bug; it was an explicit design decision (the links I posted above, #22303 / #4856). (But of course we can now disagree with an old design decision ;)) When creating the ParquetSchema from parsing the Parquet file/thrift, we did set
Ah, I get your meaning. However, does this mean a legacy file cannot have any timezone? Feel free to revert the conversion, or let me add a configuration 🤔
Yes, reading such a legacy file would essentially never give you a timezone (but that was a generic limitation of the Parquet spec, not something Arrow could do anything about). Essentially, back then the choice needed to be made between "always returning no timezone" (and potentially losing some information, if the original data was tz-aware) or "always returning with timezone" (and potentially adding wrong information, if the original data was not in UTC).
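The two readings described above can be shown with a small stdlib sketch (illustration only, not pyarrow code): the same raw microsecond value either becomes a UTC instant or a wall-clock time in an unknown timezone.

```python
from datetime import datetime, timezone

# The same raw int64 microsecond value stored in a Parquet column...
raw_micros = 3

# "Always with timezone" reading: the value is an instant, adjusted to UTC.
as_instant = datetime.fromtimestamp(raw_micros / 1_000_000, tz=timezone.utc)

# "Always no timezone" reading: the same wall-clock fields, tz unknown.
as_naive = as_instant.replace(tzinfo=None)

print(as_instant.isoformat())  # 1970-01-01T00:00:00.000003+00:00
print(as_naive.isoformat())    # 1970-01-01T00:00:00.000003
```

The stored bytes are identical; only the type metadata (and thus any later timezone conversion or comparison) differs, which is why a reader-side default has to pick one interpretation.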
I'll draft another patch adding these notes to the docs, and revert the change there.
Thank you for adding those details @jorisvandenbossche, it really helped clarify the context of the previous decision. In thinking about this, it seems there may actually be two separate but related goals for the conversion:
Since the legacy Parquet types cannot distinguish between instant and local semantics, consumers have had to make an arbitrary choice between the two, as you described. It does seem, at least now, that the Parquet spec takes the explicit position that the deprecated convertedType timestamps always had instant semantics, and that local semantics were unsupported (source). This makes the current position clear but doesn't change the reality of real-world Parquet usage that has not followed these semantics. In the case that a Parquet file is being read and we don't know (or care) how it was produced, I do think it makes sense to follow the compatibility guidelines (as the recently-merged PR does). This does break roundtrip behavior (thank you for pointing this out), but I think Arrow already has an independent solution to this case: the metadata associated with the serialized Arrow schema. Desired behavior IMO:
>>> pq.read_metadata("test_timestamp_convertedtype_writer_unknown.parquet").schema.to_arrow_schema()
col: timestamp[us, tz=UTC]
>>> pq.read_metadata("test_timestamp_convertedtype_written_by_arrow_no_tz.parquet").schema.to_arrow_schema()
col: timestamp[us]
>>> pq.read_metadata("test_timestamp_convertedtype_written_by_arrow_with_tz.parquet").schema.to_arrow_schema()
col: timestamp[us, tz=UTC]
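The decision logic behind this desired behavior could be sketched as follows. The helper and parameter names here are hypothetical, not pyarrow's API; the sketch only encodes the proposed precedence:

```python
# Proposed reader precedence: when the serialized "ARROW:schema"
# key-value metadata is present, it can restore the writer's original
# Arrow type exactly; only in its absence does the ConvertedType
# fallback (assume UTC per the spec) need to apply.

def resolve_timestamp_tz(file_key_value_metadata, schema_tz=None):
    """file_key_value_metadata: dict of the file's key/value pairs.
    schema_tz: timezone recorded in the serialized Arrow schema, if any."""
    if "ARROW:schema" in file_key_value_metadata:
        return schema_tz        # faithful roundtrip for Arrow-written files
    return "UTC"                # spec-compatible fallback for unknown writers

# Unknown producer, ConvertedType only -> instant semantics (UTC).
assert resolve_timestamp_tz({}) == "UTC"
# Arrow-written, tz-naive -> stays naive.
assert resolve_timestamp_tz({"ARROW:schema": "..."}) is None
# Arrow-written, tz-aware -> original tz preserved.
assert resolve_timestamp_tz({"ARROW:schema": "..."}, "UTC") == "UTC"
```

As the following replies point out, this only works for files that actually carry the schema metadata, which the legacy files under discussion do not.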
Guess an explicit cast is required in this scenario...
That additional metadata, however, is not present in these older files. So while I certainly agree with your "desired behaviour" being desirable, it's not possible to distinguish with or without tz (even for files written by Arrow), because the only information there is in the file is the ConvertedType.
I see, indeed unfortunate. Would you be in favor of adding a field to
Comment from @pitrou on the PR at #39491 (comment):
We could indeed infer if the file was written by Arrow (the version shouldn't matter, I think, because in the end we care about whether the LogicalType is present or not, which is always the case starting from a certain version).
Above (#39489 (comment)) I posted a small example, and updated that comment to attach the generated file that I used. But, yes, in general we have a lack of testing for Parquet on this front: #22325
To be honest, I am not sure what would be the best option. Adding a way to control the behaviour of course makes this more flexible for the user/library using Arrow, but 1) is it still needed given it is for legacy files? (I don't know how many producers still generate them), and 2) we still need to choose a default in the Arrow libraries. Assuming we keep the default in Arrow as it was before, do you have applications that would set it differently? (Maybe yes, as that might have triggered this discussion?) And again: this whole change is maybe fine nowadays. I mostly wanted to point out this was not a bug, but a deliberate design decision in the past, and if we change it, we should do so deliberately and flag it as a breaking change.
I agree this could be confusing. For example, datalakes are often written to by many potentially different producers, so inconsistencies when reading could surface in this scenario.
I think that an option to control the behavior would be valuable, but a default that doesn't prompt users to update their files or application logic probably isn't helpful. Do we use warnings or any other way of deprecating certain usage in other parts of Arrow?
For what it's worth, I do not. I was having issues in an application using the Go Parquet implementation which had a bug different from the logic here. As part of updating the Go implementation, I referred to the Parquet format document as well as a few other existing implementations which is when I noticed this inconsistency here. |
Anyway, I think I can document the behavior here.
…al Semantics (#39467)
### Rationale for this change
Closes: #39466
### What changes are included in this PR?
- Update the logic for determining whether an Arrow Timestamp should have `isAdjustedToUTC=true` on conversion to Parquet.
- Update the conversion from Parquet Timestamp to Arrow Timestamp to align with the Parquet Format [backward-compatibility](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485) rules.
- Refactor Timestamp serialization methods to reduce duplicated code.
### Are these changes tested?
Yes:
- Logical type mapping in an existing test updated.
- New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled.
- New test to clarify the equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps.
### Are there any user-facing changes?
Yes: users of `pqarrow.FileWriter` will produce Parquet files in which the `TIMESTAMP` type is normalized to UTC iff the Arrow type provided has a timezone specified. This differs from the current Go behavior but is aligned with that of other implementations. The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format [document](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485). Rust already [implements](https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/parquet/src/arrow/schema/primitive.rs#L211-L239) the spec as described, and #39489 has been reported due to a mismatch in the handling of convertedTypes in C++.
* Closes: #39466
Authored-by: Joel Lubinitsky <joel@cherre.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Re-close this.
Describe the bug, including details regarding any error messages, version, and platform.
The Parquet format document indicates that converted types should have `isAdjustedToUtc=true` for backward compatibility, but the current C++ implementation sets this to `false` for converted types. The Rust implementation already follows the spec as described, and the Go implementation has #39467 open to correct this.
Component(s)
C++, Parquet