
[C++][Parquet] Timestamp conversion from Parquet to Arrow does not follow compatibility guidelines for convertedType #39489

Closed
joellubi opened this issue Jan 6, 2024 · 20 comments · Fixed by #39491

Comments

@joellubi
Member

joellubi commented Jan 6, 2024

Describe the bug, including details regarding any error messages, version, and platform.

The Parquet format document indicates that converted types should have isAdjustedToUtc=true for backward compatibility, but the current C++ implementation sets this to false for converted types.

The Rust implementation already follows the spec as described and the Go implementation has #39467 open to correct this.
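
For illustration, here is a hedged Python sketch of the rule (the helper is hypothetical, not a pyarrow API): a legacy ConvertedType timestamp carries no isAdjustedToUtc flag, so per the compatibility guidelines a reader should assume the values are UTC-normalized.

```python
import pyarrow as pa

# Hypothetical helper illustrating the backward-compatibility rule:
# legacy ConvertedType timestamps have no isAdjustedToUtc flag, so a
# compliant reader maps them to UTC-normalized Arrow types.
def arrow_type_from_converted(converted_type: str) -> pa.DataType:
    mapping = {
        "TIMESTAMP_MILLIS": pa.timestamp("ms", tz="UTC"),
        "TIMESTAMP_MICROS": pa.timestamp("us", tz="UTC"),
    }
    return mapping[converted_type]

assert arrow_type_from_converted("TIMESTAMP_MICROS") == pa.timestamp("us", tz="UTC")
```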

Component(s)

C++, Parquet

@mapleFU
Member

mapleFU commented Jan 7, 2024

Thanks for pointing this out!

@mapleFU
Member

mapleFU commented Jan 7, 2024

@joellubi I found this is only called in the code path below:

FromInt64Statistics/GetArrowType
- FromInt64
-- MakeArrowTimestamp

I guess this code path is not hit frequently, because:

  1. FromInt64Statistics is used for filtering, so this might make filtering produce invalid results (see the sketch below)
  2. The bug in GetArrowType is harmful, but if the type is annotated, this path might not be used?
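
For instance, the filtering path could be exercised like this (a hedged sketch, assuming a legacy Parquet file "legacy_timestamps.parquet" with a TIMESTAMP_MICROS ConvertedType column named col; the file name is hypothetical):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Row-group pruning compares the filter value against min/max column
# statistics typed through FromInt64Statistics/GetArrowType, so a
# wrongly typed timestamp can make that comparison, and therefore the
# pruning, incorrect.
dataset = ds.dataset("legacy_timestamps.parquet", format="parquet")
table = dataset.to_table(
    filter=ds.field("col") > pa.scalar(0, pa.timestamp("us"))
)
print(table)
```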

@jorisvandenbossche
Member

Some related discussions: #22303 / #4856

Now I suppose that because of those mentioned issues, when converting a TimestampLogicalType to an Arrow timestamp type, we do ignore the isAdjustedToUtc flag if the logical type has been annotated to come from a converted type. So then changing isAdjustedToUtc in those cases shouldn't actually impact any behaviour in the Arrow API layer of the Parquet library?

mapleFU added a commit that referenced this issue Jan 10, 2024
…y adjustedToUtc (#39491)

### Rationale for this change

TLDR: When reading a Parquet file containing a legacy timestamp not written by Arrow, `isAdjustedToUTC` was ignored during the read, so filtering such a file did not work correctly.

When casting from a "deprecated" Parquet 1.0 ConvertedType, a timestamp should be forced to adjustedToUtc.

On the Parquet standard side: Parquet has a ConvertedType for legacy timestamps, and that legacy timestamp **does not** carry an `adjustedToUtc` flag. So, for forward compatibility, we need to treat it as `adjustedToUtc` (a UTC timestamp) when reading. See [1] [2].

In our implementation, we use a `(LogicalType, PhysicalType)` pair, where `LogicalType` is a wrapper around the Parquet thrift `ConvertedType` and `LogicalType`; the code is listed in [3]:
1. During write, we always use `FieldToNode` and `TimestampLogicalTypeFromArrowTimestamp` [4] to generate a `TimestampLogicalType`, deriving `isAdjustedToUtc` from `::arrow::TimestampType::timezone()`. The Arrow type is also stored under the `ARROW:schema` extended key-value metadata written into the Parquet file. During write, we set both the LogicalType and the ConvertedType; `arrow_properties.coerce_timestamps_enabled()` and `arrow_properties.support_deprecated_int96_timestamps()` control the behavior.
2. During read, we first parse the Parquet thrift and generate a `LogicalType` [5]. Then `FromParquetSchema` [6] is called to produce the Arrow type.

Steps [1]-[5] work well, because LogicalType parsing works well. However, [6] can cause problems:
1. If the `ARROW:schema` extended metadata is provided, nothing is wrong.
2. **Otherwise, the timestamp type is wrong because it ignores isAdjustedToUtc!** So when reading a legacy Parquet file not written by Arrow, we do not get the right type.

Furthermore, this can also surface when dataset filtering happens.

[1] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485
[2] https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L308
[3] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L128-L283
[4] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L140
[5] https://github.com/apache/arrow/blob/main/cpp/src/parquet/schema.cc#L443
[6] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L1038

### What changes are included in this PR?

ConvertedType TIME and TIMESTAMP now imply adjustedToUtc when casting to Arrow.

### Are these changes tested?

Yes, covered by existing tests.

### Are there any user-facing changes?

Users might see some behavior change when reading legacy timestamps, but most of the time Arrow will cast it.

* Closes: #39489 

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
@mapleFU mapleFU added this to the 16.0.0 milestone Jan 10, 2024
@mapleFU
Member

mapleFU commented Jan 10, 2024

Now I suppose that because of those mentioned issues, when converting a TimestampLogicalType to an Arrow timestamp type, we do ignore the isAdjustedToUtc flag if the logical type has been annotated to come from a converted type. So then changing isAdjustedToUtc in those cases shouldn't actually impact any behaviour in the Arrow API layer of the Parquet library?

@jorisvandenbossche Yes. As I mentioned, this might only happen when:

  1. Filtering (the statistics might get the wrong timestamp type)
  2. Reading a legacy file written by parquet-mr

@jorisvandenbossche
Member

Actually, my comment above was wrong:

when converting a TimestampLogicalType to an Arrow timestamp type, we do ignore the isAdjustedToUtc flag if the logical type has been annotated to come from a converted type. So then changing isAdjustedToUtc in those cases shouldn't actually impact any behaviour in the Arrow API layer of the Parquet library?

This is exactly what your PR changed, i.e. it's the conversion from a Parquet type to Arrow type (MakeArrowTimestamp) that now no longer ignores isAdjustedToUtc for converted types.
So of course this does change behaviour, also for the Arrow APIs.

And so the consequence is that an older file written by Arrow will now be read differently (and this compatibility was explicitly added in #4856).

It's maybe OK to change this behaviour, giving up faithful roundtrip for those older files written by Arrow (in the end, they are old files) in favor of better compatibility with the Parquet spec and other producers, but we need to be explicit that this is actually a behaviour change for users.

@mapleFU
Member

mapleFU commented Jan 10, 2024

Yes, we can document it before releasing 16.0.

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 10, 2024

To make it concrete: assume I have a file written with pyarrow 0.11 for data in Arrow with tz-naive timestamps (uploaded here as a zip: test_timestamp_pa011.zip):

>>> import pyarrow as pa 
>>> import pyarrow.parquet as pq  
>>> pa.__version__
'0.11.1'
>>> table = pa.Table.from_arrays([pa.array([1, 2, 3], pa.timestamp("us"))], ['col'])    
>>> pq.write_table(table, "test_timestamp_pa011.parquet")    

and then we read it with latest released pyarrow, we faithfully roundtrip the Arrow data:

>>> pa.__version__
'14.0.1'
>>> pq.read_metadata("test_timestamp_pa011.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7ff21be310c0>
required group field_id=-1 schema {
  optional int64 field_id=-1 col (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false));
}

>>> pq.read_metadata("test_timestamp_pa011.parquet").schema.to_arrow_schema()
col: timestamp[us]

>>> pq.read_table("test_timestamp_pa011.parquet")
pyarrow.Table
col: timestamp[us]
----
col: [[1970-01-01 00:00:00.000001,1970-01-01 00:00:00.000002,1970-01-01 00:00:00.000003]]

But now with latest main (after #39491), the read data is no longer the same as the original data (the type now says that the data is in UTC, while the original data were local timestamps in an unknown timezone):

>>> pa.__version__
'15.0.0.dev402+gc752bdb08'
>>> pq.read_metadata("test_timestamp_pa011.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7f6969c44c80>
required group field_id=-1 schema {
  optional int64 field_id=-1 col (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false));
}

>>> pq.read_metadata("test_timestamp_pa011.parquet").schema.to_arrow_schema()
col: timestamp[us, tz=UTC]

>>> pq.read_table("test_timestamp_pa011.parquet")
pyarrow.Table
col: timestamp[us, tz=UTC]
----
col: [[1970-01-01 00:00:00.000001Z,1970-01-01 00:00:00.000002Z,1970-01-01 00:00:00.000003Z]]

@mapleFU
Member

mapleFU commented Jan 10, 2024

Yeah.

<pyarrow._parquet.ParquetSchema object at 0x7ff21be310c0>
required group field_id=-1 schema {
  optional int64 field_id=-1 col (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false));
}

Hmmm, this reports isAdjustedToUTC=true, yet the column is col: timestamp[us]. So previously I thought it was a bug :-(

Do you think we need an extra config, or should we just change this?

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 10, 2024

Hmmm, this reports isAdjustedToUTC=true, yet the column is col: timestamp[us]. So previously I thought it was a bug :-(

As I tried to explain: no, it's not a bug, it was an explicit design decision (the links I posted above, #22303 / #4856). (but of course we can now disagree with an old design decision ;))

When creating the ParquetSchema by parsing the Parquet file/thrift, we did set isAdjustedToUTC=true, but we also added is_from_converted_type=true (if it was created from a legacy file with ConvertedTypes). So the ParquetSchema follows the Parquet spec, but on conversion of this ParquetSchema to an ArrowSchema, we explicitly had logic to ignore isAdjustedToUTC when is_from_converted_type=true (that's the line of code you changed in the PR).
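
Schematically, the pre-#39491 conversion behaved like this (a Python sketch of the logic, not the actual C++ MakeArrowTimestamp code):

```python
import pyarrow as pa

# Schematic of the old behavior: isAdjustedToUTC was ignored whenever
# the logical type came from a legacy ConvertedType.
def make_arrow_timestamp(unit, is_adjusted_to_utc, is_from_converted_type):
    tz = "UTC" if (is_adjusted_to_utc and not is_from_converted_type) else None
    return pa.timestamp(unit, tz=tz)

# Legacy ConvertedType file: the UTC flag is ignored -> tz-naive type.
assert make_arrow_timestamp("us", True, True) == pa.timestamp("us")
# Modern LogicalType file: the flag is honored.
assert make_arrow_timestamp("us", True, False) == pa.timestamp("us", tz="UTC")
```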

@mapleFU
Member

mapleFU commented Jan 10, 2024

Ah, I see what you mean. However, does this mean a legacy file cannot have any timezone?

So feel free to revert the conversion, or let me add a configuration 🤔

@jorisvandenbossche
Member

does this mean a legacy file cannot have any timezone?

Yes, reading such a legacy file would essentially never give you a timezone (but that was a generic limitation of the Parquet spec, not something Arrow could do anything about). Back then, the choice had to be made between "always returning no timezone" (and potentially losing some information, if the original data was tz-aware) or "always returning with timezone" (and potentially adding wrong information, if the original data was not in UTC).
And so we made what we thought was the "safest" choice: losing some information is better than adding wrong information.

@mapleFU
Member

mapleFU commented Jan 10, 2024

I'll draft another patch that documents this in docs/source/cpp/parquet.rst, and revert the change there.

@mapleFU mapleFU reopened this Jan 10, 2024
@joellubi
Member Author

Thank you for adding those details @jorisvandenbossche, it really helped clarify the context of the previous decision. In thinking about this, it seems there may actually be two separate but related goals for the conversion:

  1. Preservation of the original Parquet semantics when converting to Arrow
  2. Preservation of the original Arrow type when a roundtrip occurs through Parquet and back to Arrow

Since the legacy Parquet types cannot distinguish between instant and local semantics, consumers have had to make an arbitrary choice between the two as you described. It does seem, at least now, that the Parquet spec has an explicit position that the deprecated convertedType timestamps always had instant semantics, and that local semantics were unsupported (source). This helps make the current position clear but doesn't change the reality of real-world Parquet usage that has not followed these semantics.

In the case that a Parquet file is being read and we don't know (or care) how it was produced, I do think it makes sense to follow compatibility guidelines (as the recently-merged PR has). This does break roundtrip behavior (thank you for pointing this out), but I do think Arrow already has an independent solution to this case. The metadata associated with the ARROW:schema key should be able to preserve this information on roundtrip, and I was actually surprised that it didn't in your example. I found this relevant code for how we currently handle the restoration of the original type. It seems that we will restore a particular timezone if it was present in the original schema, but do not remove the timezone if one was not present. Perhaps by independently updating this logic we can get the desired behavior for both scenarios.

Desired behavior IMO:

>>> pq.read_metadata("test_timestamp_convertedtype_writer_unknown.parquet").schema.to_arrow_schema()
col: timestamp[us, tz=UTC]

>>> pq.read_metadata("test_timestamp_convertedtype_written_by_arrow_no_tz.parquet").schema.to_arrow_schema()
col: timestamp[us]

>>> pq.read_metadata("test_timestamp_convertedtype_written_by_arrow_with_tz.parquet").schema.to_arrow_schema()
col: timestamp[us, tz=UTC]
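
For what it's worth, whether a given file carries the ARROW:schema key at all is easy to check (a sketch assuming the test_timestamp_pa011.parquet file from the earlier comment):

```python
import pyarrow.parquet as pq

# When present, the ARROW:schema key-value entry stores the serialized
# original Arrow schema and allows restoring the exact timezone.
md = pq.read_metadata("test_timestamp_pa011.parquet")
kv = md.metadata or {}
print(b"ARROW:schema" in kv)  # presumably False for this pyarrow 0.11 file
```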

@mapleFU
Member

mapleFU commented Jan 10, 2024

Guess an explicit cast is required in this scenario...

@jorisvandenbossche
Member

I do think Arrow already has an independent solution to this case. The metadata associated with the ARROW:schema key should be able to preserve this information on roundtrip, and I was actually surprised that it didn't in your example.

This additional ARROW:schema metadata was only added around the same time that we started to support LogicalType (around Arrow 0.14.0), so unfortunately it doesn't help for this use case. When it is present, it does allow preserving the actual timezone (also for logical types, which otherwise only record whether the data is UTC or not).
But when ARROW:schema is present, the file also has LogicalType annotations, so this whole discussion only applies to older files that have ConvertedTypes alone.

So while I certainly agree with your "desired behaviour" being desired, it's not possible to distinguish with or without tz (even for files written by Arrow), because the only information there is in the file is the ConvertedType.

@joellubi
Member Author

This additional ARROW:schema metadata was only added around the same time that we started to support LogicalType (around Arrow 0.14.0), so unfortunately it doesn't help for this use case.

I see, indeed unfortunate. Would you be in favor of adding a field to ArrowReaderProperties to control this behavior?
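
Purely as an illustration of the shape such a knob could take (the keyword below is hypothetical and does not exist in pyarrow):

```python
import pyarrow.parquet as pq

# HYPOTHETICAL: an ArrowReaderProperties-backed option to opt back into
# the old "ignore isAdjustedToUtc for ConvertedType" behavior. The
# keyword name is invented for illustration only.
table = pq.read_table(
    "test_timestamp_pa011.parquet",
    # treat_converted_type_as_utc=False,  # hypothetical option
)
```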

mapleFU added a commit to mapleFU/arrow that referenced this issue Jan 11, 2024
mapleFU added a commit to mapleFU/arrow that referenced this issue Jan 11, 2024
mapleFU added a commit to mapleFU/arrow that referenced this issue Jan 11, 2024
@jorisvandenbossche
Member

Comment from @pitrou on the PR at #39491 (comment):

Nowadays it is maybe OK to make a different trade-off, and to prioritize the Parquet spec / non-Arrow legacy files above compatibility of reading Arrow legacy files, given that it is about files written by an Arrow version of several years old (Arrow < 0.15).

If we really care about this, we should perhaps detect the offending Arrow version, since the Parquet file records the producer name and version.

We could indeed infer if the file was written by Arrow (the version shouldn't matter, I think, because in the end we care about whether the LogicalType is present or not, which is always the case starting from a certain version).
I don't know if that wouldn't make things more complex / confusing for users though, that a similar file (just by a different producer) would give different results.
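
(For reference, the producer string is easy to inspect; a sketch using the file from my earlier comment:)

```python
import pyarrow.parquet as pq

# created_by records the writer name and version; old Arrow C++ writers
# report a parquet-cpp version string (exact contents vary by version).
print(pq.read_metadata("test_timestamp_pa011.parquet").created_by)
```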

But we would also need an actual file to test with.

Above (#39489 (comment)) I posted a small example, and updated that comment to attach the generated file that I used.
We do actually have a file in the pyarrow tests that covers this (python/pyarrow/tests/data/parquet/v0.7.1.column-metadata-handling.parquet), but the test didn't catch this, because it only tested the roundtrip to/from pandas, and for pandas it used the additional pandas metadata we already stored in the schema at the time, so that still roundtrips correctly.

But, yes, in general we have a lack of testing for Parquet on this front: #22325


Would you be in favor of adding a field to ArrowReaderProperties to control this behavior?

To be honest, I am not sure what would be the best option. Adding a way to control the behaviour of course makes this more flexible for the user/library using Arrow, but 1) is it still needed, given it is for legacy files? (I don't know how many producers still generate them), and 2) we still need to choose a default in the Arrow libraries.

Assuming we keep the default in Arrow as it was before, do you have applications that would set it differently? (Maybe yes, as that might have triggered this discussion?)

And again: this whole change is maybe fine nowadays. I mostly wanted to point out this was not a bug, but a deliberate design decision in the past, and if we change this, we should do it deliberately and flag it as a breaking change.

@joellubi
Member Author

I don't know if that wouldn't make things more complex / confusing for users though, that a similar file (just by a different producer) would give different results.

I agree this could be confusing. For example, data lakes are often written to by many different producers, so inconsistencies could surface when reading in this scenario.

To be honest, I am not sure what would be the best option. Adding a way to control the behaviour of course makes this more flexible for the user/library using Arrow, but 1) is it still needed, given it is for legacy files? (I don't know how many producers still generate them), and 2) we still need to choose a default in the Arrow libraries.

I think an option to control the behavior would be valuable, but a default that doesn't prompt users to update their files or application logic probably isn't helpful. Do we use warnings or other ways of deprecating certain usage elsewhere in Arrow?

Assume we would keep the default in Arrow as it was before, do you have applications that would set it differently? (maybe yes, as that might have triggered this discussion?)

For what it's worth, I do not. I was having issues in an application using the Go Parquet implementation, which had a bug different from the logic here. While updating the Go implementation, I referred to the Parquet format document as well as a few other existing implementations, which is when I noticed this inconsistency here.

@mapleFU
Member

mapleFU commented Jan 15, 2024

For what it's worth, I do not. I was having issues in an application using the Go Parquet implementation, which had a bug different from the logic here. While updating the Go implementation, I referred to the Parquet format document as well as a few other existing implementations, which is when I noticed this inconsistency here.

Anyway, I think I can document the behavior here.

zeroshade pushed a commit that referenced this issue Jan 18, 2024
…al Semantics (#39467)

### Rationale for this change

Closes: #39466 

### What changes are included in this PR?

- Update logic for determining whether an Arrow Timestamp should have `isAdjustedToUTC=true` on conversion to Parquet.
- Update conversion from Parquet Timestamp to Arrow Timestamp to align with the Parquet Format [backward-compatibility](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485) rules.
- Refactor Timestamp serialization methods to reduce duplicated code

### Are these changes tested?

Yes,
- Logical type mapping in existing test updated.
- New tests for roundtrip behavior of timestamps with various timezone settings, with/without store_schema enabled.
- New test to clarify equality behavior of timestamps with instant semantics, as well as Go-related quirks with timezone-unaware timestamps.

### Are there any user-facing changes?

Yes, users of `pqarrow.FileWriter` will produce Parquet files in which the `TIMESTAMP` type is normalized to UTC if and only if the Arrow type provided has a timezone specified. This is different from the current Go behavior but aligned with that of other implementations.

The conversion from Parquet to Arrow has been updated as well to reflect the Parquet format [document](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485). Rust already [implements](https://github.com/apache/arrow-rs/blob/a61e824abdd7b38ea214828480430ff2a13f2ead/parquet/src/arrow/schema/primitive.rs#L211-L239) the spec as described and #39489 has been reported due to a mismatch in the handling of convertedTypes in C++.

* Closes: #39466

Authored-by: Joel Lubinitsky <joel@cherre.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
…d imply adjustedToUtc (apache#39491)

clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
…nt/Local Semantics (apache#39467)

dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…d imply adjustedToUtc (apache#39491)

dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…nt/Local Semantics (apache#39467)

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…d imply adjustedToUtc (apache#39491)

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…nt/Local Semantics (apache#39467)

thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…nt/Local Semantics (apache#39467)

@raulcd raulcd modified the milestones: 16.0.0, 17.0.0 Apr 8, 2024
@mapleFU
Member

mapleFU commented May 5, 2024

Re-closing this.

@mapleFU mapleFU closed this as completed May 5, 2024