
[C++][Parquet] Timestamp conversion from Parquet to Arrow does not follow compatibility guidelines for convertedType #39489

Closed
joellubi opened this issue Jan 6, 2024 · 20 comments · Fixed by #39491

Comments

@joellubi
Member

joellubi commented Jan 6, 2024

Describe the bug, including details regarding any error messages, version, and platform.

The Parquet format document indicates that converted types should have isAdjustedToUtc=true for backward compatibility, but the current C++ implementation sets this to false for converted types.

The Rust implementation already follows the spec as described and the Go implementation has #39467 open to correct this.
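
For illustration, here is a hedged Python sketch of the rule (the helper is hypothetical, not a pyarrow API): a legacy ConvertedType timestamp carries no isAdjustedToUtc flag, so per the compatibility guidelines a reader should assume the values are UTC-normalized.

```python
import pyarrow as pa

# Hypothetical helper illustrating the backward-compatibility rule:
# legacy ConvertedType timestamps have no isAdjustedToUtc flag, so a
# compliant reader maps them to UTC-normalized Arrow types.
def arrow_type_from_converted(converted_type: str) -> pa.DataType:
    mapping = {
        "TIMESTAMP_MILLIS": pa.timestamp("ms", tz="UTC"),
        "TIMESTAMP_MICROS": pa.timestamp("us", tz="UTC"),
    }
    return mapping[converted_type]

assert arrow_type_from_converted("TIMESTAMP_MICROS") == pa.timestamp("us", tz="UTC")
```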

Component(s)

C++, Parquet

@mapleFU
Member

mapleFU commented Jan 7, 2024

Thanks for pointing this out!

@mapleFU
Member

mapleFU commented Jan 7, 2024

@joellubi I found this is only called in the code path below:

FromInt64Statistics/GetArrowType
- FromInt64
-- MakeArrowTimestamp

I guess this code path is not hit frequently, because:

  1. FromInt64Statistics is used for filtering, so this might make filtering produce invalid results (see the sketch below)
  2. The bug in GetArrowType is harmful, but if the type is annotated, this path might not be used?
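
For instance, the filtering path could be exercised like this (a hedged sketch, assuming a legacy Parquet file "legacy_timestamps.parquet" with a TIMESTAMP_MICROS ConvertedType column named col; the file name is hypothetical):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Row-group pruning compares the filter value against min/max column
# statistics typed through FromInt64Statistics/GetArrowType, so a
# wrongly typed timestamp can make that comparison, and therefore the
# pruning, incorrect.
dataset = ds.dataset("legacy_timestamps.parquet", format="parquet")
table = dataset.to_table(
    filter=ds.field("col") > pa.scalar(0, pa.timestamp("us"))
)
print(table)
```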

@jorisvandenbossche
Member

Some related discussions: #22303 / #4856

Now I suppose that because of those mentioned issues, when converting a TimestampLogicalType to an Arrow timestamp type, we do ignore the isAdjustedToUtc flag if the logical type has been annotated to come from a converted type. So then changing isAdjustedToUtc in those cases shouldn't actually impact any behaviour in the Arrow API layer of the Parquet library?

mapleFU added a commit that referenced this issue Jan 10, 2024
…y adjustedToUtc (#39491)

### Rationale for this change

TLDR: When reading a Parquet file containing a legacy timestamp not written by Arrow, `isAdjustedToUTC` was ignored during the read, so filtering such a file did not work correctly.

When casting from a "deprecated" Parquet 1.0 ConvertedType, a timestamp should be forced to adjustedToUtc.

On the Parquet standard side: Parquet has a ConvertedType for legacy timestamps, and that legacy timestamp **does not** carry an `adjustedToUtc` flag. So, for forward compatibility, we need to treat it as `adjustedToUtc` (a UTC timestamp) when reading. See [1] [2].

In our implementation, we use a `(LogicalType, PhysicalType)` pair, where `LogicalType` is a wrapper around the Parquet thrift `ConvertedType` and `LogicalType`; the code is listed in [3]:
1. During write, we always use `FieldToNode` and `TimestampLogicalTypeFromArrowTimestamp` [4] to generate a `TimestampLogicalType`, deriving `isAdjustedToUtc` from `::arrow::TimestampType::timezone()`. The Arrow type is also stored under the `ARROW:schema` extended key-value metadata written into the Parquet file. During write, we set both the LogicalType and the ConvertedType; `arrow_properties.coerce_timestamps_enabled()` and `arrow_properties.support_deprecated_int96_timestamps()` control the behavior.
2. During read, we first parse the Parquet thrift and generate a `LogicalType` [5]. Then `FromParquetSchema` [6] is called to produce the Arrow type.

Steps [1]-[5] work well, because LogicalType parsing works well. However, [6] can cause problems:
1. If the `ARROW:schema` extended metadata is provided, nothing is wrong.
2. **Otherwise, the timestamp type is wrong because it ignores isAdjustedToUtc!** So when reading a legacy Parquet file not written by Arrow, we do not get the right type.

Furthermore, this can also surface when dataset filtering happens.

[1] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485
[2] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L308
[3] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L128-L283
[4] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L140
[5] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443
[6] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L1038

### What changes are included in this PR?

ConvertedType TIME and TIMESTAMP now imply adjustedToUtc when casting to Arrow.

### Are these changes tested?

Yes, covered by existing tests.

### Are there any user-facing changes?

Users might see some behavior change when reading legacy timestamps, but most of the time Arrow will cast it.

* Closes: #39489 

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
@mapleFU mapleFU added this to the 16.0.0 milestone Jan 10, 2024
@mapleFU
Member

mapleFU commented Jan 10, 2024

Now I suppose that because of those mentioned issues, when converting a TimestampLogicalType to an Arrow timestamp type, we do ignore the isAdjustedToUtc flag if the logical type has been annotated to come from a converted type. So then changing isAdjustedToUtc in those cases shouldn't actually impact any behaviour in the Arrow API layer of the Parquet library?

@jorisvandenbossche Yes. As I mentioned, this might only happen when:

  1. Filtering (the statistics might get the wrong timestamp type)
  2. Reading a legacy file written by parquet-mr

@jorisvandenbossche
Member

Actually, my comment above was wrong:

when converting a TimestampLogicalType to an Arrow timestamp type, we do ignore the isAdjustedToUtc flag if the logical type has been annotated to come from a converted type. So then changing isAdjustedToUtc in those cases shouldn't actually impact any behaviour in the Arrow API layer of the Parquet library?

This is exactly what your PR changed, i.e. it's the conversion from a Parquet type to Arrow type (MakeArrowTimestamp) that now no longer ignores isAdjustedToUtc for converted types.
So of course this does change behaviour, also for the Arrow APIs.

And so the consequence is that an older file written by Arrow will now be read differently (and this compatibility was explicitly added in #4856).

It's maybe OK to change this behaviour, giving up faithful roundtrip for those older files written by Arrow (in the end, they are old files) in favor of better compatibility with the Parquet spec and other producers, but we need to be explicit that this is actually a behaviour change for users.

@mapleFU
Member

mapleFU commented Jan 10, 2024

Yes, we can document it before releasing 16.0.

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 10, 2024

To make it concrete: assume I have a file written with pyarrow 0.11 for data in Arrow with tz-naive timestamps (uploaded here as a zip: test_timestamp_pa011.zip):

>>> import pyarrow as pa 
>>> import pyarrow.parquet as pq  
>>> pa.__version__
'0.11.1'
>>> table = pa.Table.from_arrays([pa.array([1, 2, 3], pa.timestamp("us"))], ['col'])    
>>> pq.write_table(table, "test_timestamp_pa011.parquet")    

and then we read it with latest released pyarrow, we faithfully roundtrip the Arrow data:

>>> pa.__version__
'14.0.1'
>>> pq.read_metadata("test_timestamp_pa011.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7ff21be310c0>
required group field_id=-1 schema {
  optional int64 field_id=-1 col (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false));
}

>>> pq.read_metadata("test_timestamp_pa011.parquet").schema.to_arrow_schema()
col: timestamp[us]

>>> pq.read_table("test_timestamp_pa011.parquet")
pyarrow.Table
col: timestamp[us]
----
col: [[1970-01-01 00:00:00.000001,1970-01-01 00:00:00.000002,1970-01-01 00:00:00.000003]]

But now with latest main (after #39491), the read data is no longer the same as the original data (the type now says that the data is in UTC, while the original data were local timestamps in an unknown timezone):

>>> pa.__version__
'15.0.0.dev402+gc752bdb08'
>>> pq.read_metadata("test_timestamp_pa011.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7f6969c44c80>
required group field_id=-1 schema {
  optional int64 field_id=-1 col (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false));
}

>>> pq.read_metadata("test_timestamp_pa011.parquet").schema.to_arrow_schema()
col: timestamp[us, tz=UTC]

>>> pq.read_table("test_timestamp_pa011.parquet")
pyarrow.Table
col: timestamp[us, tz=UTC]
----
col: [[1970-01-01 00:00:00.000001Z,1970-01-01 00:00:00.000002Z,1970-01-01 00:00:00.000003Z]]

@mapleFU
Member

mapleFU commented Jan 10, 2024

Yeah.

<pyarrow._parquet.ParquetSchema object at 0x7ff21be310c0>
required group field_id=-1 schema {
  optional int64 field_id=-1 col (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false));
}

Hmmm, this reports isAdjustedToUTC=true, yet the column is col: timestamp[us]. So previously I thought it was a bug :-(

Do you think we need an extra config, or should we just change this?

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 10, 2024

Hmmm, this reports isAdjustedToUTC=true, yet the column is col: timestamp[us]. So previously I thought it was a bug :-(

As I tried to explain: no, it's not a bug, it was an explicit design decision (the links I posted above, #22303 / #4856). (but of course we can now disagree with an old design decision ;))

When creating the ParquetSchema by parsing the Parquet file/thrift, we did set isAdjustedToUTC=true, but we also added is_from_converted_type=true (if it was created from a legacy file with ConvertedTypes). So the ParquetSchema follows the Parquet spec, but on conversion of this ParquetSchema to an ArrowSchema, we explicitly had logic to ignore isAdjustedToUTC when is_from_converted_type=true (that's the line of code you changed in the PR).
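
Schematically, the pre-#39491 conversion behaved like this (a Python sketch of the logic, not the actual C++ MakeArrowTimestamp code):

```python
import pyarrow as pa

# Schematic of the old behavior: isAdjustedToUTC was ignored whenever
# the logical type came from a legacy ConvertedType.
def make_arrow_timestamp(unit, is_adjusted_to_utc, is_from_converted_type):
    tz = "UTC" if (is_adjusted_to_utc and not is_from_converted_type) else None
    return pa.timestamp(unit, tz=tz)

# Legacy ConvertedType file: the UTC flag is ignored -> tz-naive type.
assert make_arrow_timestamp("us", True, True) == pa.timestamp("us")
# Modern LogicalType file: the flag is honored.
assert make_arrow_timestamp("us", True, False) == pa.timestamp("us", tz="UTC")
```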

@mapleFU
Member

mapleFU commented Jan 10, 2024

Ah, I see what you mean. However, does this mean a legacy file cannot have any timezone?

So feel free to revert the conversion, or let me add a configuration 🤔

@jorisvandenbossche
Member

does this mean a legacy file cannot have any timezone?

Yes, reading such a legacy file would essentially never give you a timezone (but that was a generic limitation of the Parquet spec, not something Arrow could do anything about). Back then, the choice had to be made between "always returning no timezone" (and potentially losing some information, if the original data was tz-aware) or "always returning with timezone" (and potentially adding wrong information, if the original data was not in UTC).
And so we made what we thought was the "safest" choice: losing some information is better than adding wrong information.

@mapleFU
Member

mapleFU commented Jan 10, 2024

I'll draft another patch that documents this in docs/source/cpp/parquet.rst, and revert the change there.

@mapleFU mapleFU reopened this Jan 10, 2024
@joellubi
Member Author

Thank you for adding those details @jorisvandenbossche, it really helped clarify the context of the previous decision. In thinking about this, it seems there may actually be two separate but related goals for the conversion:

  1. Preservation of the original Parquet semantics when converting to Arrow
  2. Preservation of the original Arrow type when a roundtrip occurs through Parquet and back to Arrow

Since the legacy Parquet types cannot distinguish between instant and local semantics, consumers have had to make an arbitrary choice between the two as you described. It does seem, at least now, that the Parquet spec has an explicit position that the deprecated convertedType timestamps always had instant semantics, and that local semantics were unsupported (source). This helps make the current position clear but doesn't change the reality of real-world Parquet usage that has not followed these semantics.

In the case that a Parquet file is being read and we don't know (or care) how it was produced, I do think it makes sense to follow compatibility guidelines (as the recently-merged PR has). This does break roundtrip behavior (thank you for pointing this out), but I do think Arrow already has an independent solution to this case. The metadata associated with the ARROW:schema key should be able to preserve this information on roundtrip, and I was actually surprised that it didn't in your example. I found this relevant code for how we currently handle the restoration of the original type. It seems that we will restore a particular timezone if it was present in the original schema, but do not remove the timezone if one was not present. Perhaps by independently updating this logic we can get the desired behavior for both scenarios.

Desired behavior IMO:

>>> pq.read_metadata("test_timestamp_convertedtype_writer_unknown.parquet").schema.to_arrow_schema()
col: timestamp[us, tz=UTC]

>>> pq.read_metadata("test_timestamp_convertedtype_written_by_arrow_no_tz.parquet").schema.to_arrow_schema()
col: timestamp[us]

>>> pq.read_metadata("test_timestamp_convertedtype_written_by_arrow_with_tz.parquet").schema.to_arrow_schema()
col: timestamp[us, tz=UTC]
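
For what it's worth, whether a given file carries the ARROW:schema key at all is easy to check (a sketch assuming the test_timestamp_pa011.parquet file from the earlier comment):

```python
import pyarrow.parquet as pq

# When present, the ARROW:schema key-value entry stores the serialized
# original Arrow schema and allows restoring the exact timezone.
md = pq.read_metadata("test_timestamp_pa011.parquet")
kv = md.metadata or {}
print(b"ARROW:schema" in kv)  # presumably False for this pyarrow 0.11 file
```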

@mapleFU
Member

mapleFU commented Jan 10, 2024

Guess an explicit cast is required in this scenario...

@jorisvandenbossche
Member

I do think Arrow already has an independent solution to this case. The metadata associated with the ARROW:schema key should be able to preserve this information on roundtrip, and I was actually surprised that it didn't in your example.

This additional ARROW:schema metadata was only added around the same time that we started to support LogicalType (around Arrow 0.14.0), so unfortunately it doesn't help for this use case. When it is present, it does allow preserving the actual timezone (also for logical types, which otherwise only record whether the data is UTC or not).
But when ARROW:schema is present, the file also has LogicalType annotations, so this whole discussion only applies to older files that have ConvertedTypes alone.

So while I certainly agree with your "desired behaviour" being desired, it's not possible to distinguish with or without tz (even for files written by Arrow), because the only information there is in the file is the ConvertedType.

@joellubi
Member Author

This additional ARROW:schema metadata was only added around the same time that we started to support LogicalType (around Arrow 0.14.0), so unfortunately it doesn't help for this use case.

I see, indeed unfortunate. Would you be in favor of adding a field to ArrowReaderProperties to control this behavior?
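
Purely as an illustration of the shape such a knob could take (the keyword below is hypothetical and does not exist in pyarrow):

```python
import pyarrow.parquet as pq

# HYPOTHETICAL: an ArrowReaderProperties-backed option to opt back into
# the old "ignore isAdjustedToUtc for ConvertedType" behavior. The
# keyword name is invented for illustration only.
table = pq.read_table(
    "test_timestamp_pa011.parquet",
    # treat_converted_type_as_utc=False,  # hypothetical option
)
```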

mapleFU added a commit to mapleFU/arrow that referenced this issue Jan 11, 2024
mapleFU added a commit to mapleFU/arrow that referenced this issue Jan 11, 2024
mapleFU added a commit to mapleFU/arrow that referenced this issue Jan 11, 2024
@jorisvandenbossche
Member

Comment from @pitrou on the PR at #39491 (comment):

Nowadays it is maybe OK to make a different trade-off, and to prioritize the Parquet spec / non-Arrow legacy files above compatibility of reading Arrow legacy files, given that it is about files written by an Arrow version of several years old (Arrow < 0.15).

If we really care about this, we should perhaps detect the offending Arrow version, since the Parquet file records the producer name and version.

We could indeed infer if the file was written by Arrow (the version shouldn't matter, I think, because in the end we care about whether the LogicalType is present or not, which is always the case starting from a certain version).
I don't know if that wouldn't make things more complex / confusing for users though, that a similar file (just by a different producer) would give different results.
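
(For reference, the producer string is easy to inspect; a sketch using the file from my earlier comment:)

```python
import pyarrow.parquet as pq

# created_by records the writer name and version; old Arrow C++ writers
# report a parquet-cpp version string (exact contents vary by version).
print(pq.read_metadata("test_timestamp_pa011.parquet").created_by)
```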

But we would also need an actual file to test with.

Above (#39489 (comment)) I posted a small example, and updated that comment to attach the generated file that I used.
We do actually have a file in the pyarrow tests that covers this (python/pyarrow/tests/data/parquet/v0.7.1.column-metadata-handling.parquet), but the test didn't catch this, because it only tested the roundtrip to/from pandas, and for pandas it used the additional pandas metadata we already stored in the schema at the time, so that still roundtrips correctly.

But, yes, in general we have a lack of testing for Parquet on this front: #22325


Would you be in favor of adding a field to ArrowReaderProperties to control this behavior?

To be honest, I am not sure what would be the best option. Adding a way to control the behaviour of course makes this more flexible for the user/library using Arrow, but 1) is it still needed, given it is for legacy files? (I don't know how many producers still generate them), and 2) we still need to choose a default in the Arrow libraries.

Assuming we keep the default in Arrow as it was before, do you have applications that would set it differently? (Maybe yes, as that might have triggered this discussion?)

And again: this whole change is maybe fine nowadays. I mostly wanted to point out this was not a bug, but a deliberate design decision in the past, and if we change this, we should do it deliberately and flag it as a breaking change.

@joellubi
Member Author

I don't know if that wouldn't make things more complex / confusing for users though, that a similar file (just by a different producer) would give different results.

I agree this could be confusing. For example, data lakes are often written to by many different producers, so inconsistencies could surface when reading in this scenario.

To be honest, I am not sure what would be the best option. Adding a way to control the behaviour of course makes this more flexible for the user/library using Arrow, but 1) is it still needed, given it is for legacy files? (I don't know how many producers still generate them), and 2) we still need to choose a default in the Arrow libraries.

I think an option to control the behavior would be valuable, but a default that doesn't prompt users to update their files or application logic probably isn't helpful. Do we use warnings or other ways of deprecating certain usage elsewhere in Arrow?

Assume we would keep the default in Arrow as it was before, do you have applications that would set it differently? (maybe yes, as that might have triggered this discussion?)

For what it's worth, I do not. I was having issues in an application using the Go Parquet implementation, which had a bug different from the logic here. While updating the Go implementation, I referred to the Parquet format document as well as a few other existing implementations, which is when I noticed this inconsistency here.

@mapleFU
Member

mapleFU commented Jan 15, 2024

For what it's worth, I do not. I was having issues in an application using the Go Parquet implementation, which had a bug different from the logic here. While updating the Go implementation, I referred to the Parquet format document as well as a few other existing implementations, which is when I noticed this inconsistency here.

Anyway, I think I can document the behavior here.

zeroshade pushed a commit that referenced this issue Jan 18, 2024
…al Semantics (#39467)

### Rationale for this change

Closes: #39466 

### What changes are included in this PR?

- Update logic for determining whether an Arrow Timestamp should have `isAdjustedToUTC=true` on conversion to Parquet.
- Update conversion from Parquet Timestamp to Arrow Timestamp to align with the Parquet Format [backward-compatibility](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485) rules.
- Refactor Timestamp serialization methods to reduce duplicated code

### Are these changes tested?

Yes,
- Logical type mapping in existing test updated.
- New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled.
- New test to clarify equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps.

### Are there any user-facing changes?

Yes, users of `pqarrow.FileWriter` will produce Parquet files in which the `TIMESTAMP` type is normalized to UTC if and only if the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned with that of other implementations.

The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format [document](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485). Rust already [implements](https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/parquet/src/arrow/schema/primitive.rs#L211-L239) the spec as described and #39489 has been reported due to a mismatch in the handling of convertedTypes in C++.

* Closes: #39466

Authored-by: Joel Lubinitsky <joel@cherre.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
…d imply adjustedToUtc (apache#39491)

clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
…nt/Local Semantics (apache#39467)

dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…d imply adjustedToUtc (apache#39491)

dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…nt/Local Semantics (apache#39467)

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…d imply adjustedToUtc (apache#39491)

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…nt/Local Semantics (apache#39467)

thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…nt/Local Semantics (apache#39467)

@raulcd raulcd modified the milestones: 16.0.0, 17.0.0 Apr 8, 2024
@mapleFU
Member

mapleFU commented May 5, 2024

Re-closing this.

@mapleFU mapleFU closed this as completed May 5, 2024