Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-5889: [C++][Parquet] Add property to indicate origin from converted type to TimestampLogicalType #4856

Closed
wants to merge 1 commit into from

Conversation

tpboudreau
Copy link
Contributor

This patch:

  • adds a new boolean member isFromConvertedType() to the TimestampLogicalType class that is set to "true" if the LogicalType was created from a converted type of TIMESTAMP_MILLIS or TIMESTAMP_MICROS
  • prevents writing the TimestampLogicalType in the Parquet schema if this new property is true
  • changes the Arrow reader to ignore the isAdjustedToUTC() property of the TimestampLogicalType if the type annotation came from a converted type

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Also tested it locally and can confirm this now properly reads older files in pyarrow.

@tpboudreau
Copy link
Contributor Author

Thanks @jorisvandenbossche for testing it out. When you say it works on older files, did you use the files attached to the Jira issue by the reporter? I've been considering adding a test using those as fixtures. Thanks again.

@jorisvandenbossche
Copy link
Member

Yes, I checked with those files (as well as something I wrote myself with an older version). I certainly think we should add such tests reading older files (I just opened https://issues.apache.org/jira/browse/ARROW-5915 for that).
But, if we do it like that, we should probably write a script that generates such files (so that it is reproducible) instead of adding the files attached to the issue. And also, we should discuss how to do a more fully integrated forward+backwards compatibility testing (although IMO that does not prevent adding some basic back compat tests)

@tpboudreau
Copy link
Contributor Author

Great. Thanks for testing the reporter's files.

Thanks also for opening that issue for compatibility testing -- I generally agree with you -- I'll comment further on that Jira issue when I get the chance.

@wesm
Copy link
Member

wesm commented Jul 12, 2019

@tpboudreau this needs to be rebased now, let me know if you need help

@tpboudreau
Copy link
Contributor Author

Rebased and RFR (I'll monitor the TravisCI build after it kicks off).

@codecov-io
Copy link

Codecov Report

Merging #4856 into master will decrease coverage by 11.39%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #4856       +/-   ##
===========================================
- Coverage   87.08%   75.68%    -11.4%     
===========================================
  Files         883       58      -825     
  Lines      116941     3735   -113206     
  Branches     1418        0     -1418     
===========================================
- Hits       101837     2827    -99010     
+ Misses      14742      908    -13834     
+ Partials      362        0      -362
Impacted Files Coverage Δ
cpp/src/arrow/csv/chunker-test.cc
cpp/src/parquet/column_page.h
cpp/src/parquet/bloom_filter-test.cc
cpp/src/arrow/array/builder_decimal.cc
go/arrow/memory/memory_avx2_amd64.go
cpp/src/arrow/io/test-common.h
cpp/src/arrow/util/int-util-test.cc
cpp/src/parquet/column_scanner.cc
...p/src/gandiva/precompiled/epoch_time_point_test.cc
cpp/src/parquet/thrift.h
... and 810 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3e70357...7b51bba. Read the comment docs.

bool force_set_converted_type = false);
bool is_adjusted_to_utc() const;
LogicalType::TimeUnit::unit time_unit() const;

/// \brief If true, will not set LogicalType in Thrift metadata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't true anymore -- I'll remove this comment and the is_serialized method

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, GitHub was hiding files from me so this is indeed what is going on in this patch. Will take a closer look

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I see that the is_serialized override just allows us to round-trip the metadata faithfully. If a file is read to Arrow, then written back, then the logical type will be written (I think)

@wesm wesm closed this in 99b9702 Jul 13, 2019
kszucs pushed a commit that referenced this pull request Jul 16, 2019
…rted type to TimestampLogicalType

This patch:
 - adds a new boolean member isFromConvertedType() to the TimestampLogicalType class that is set to "true" if the LogicalType was created from a converted type of TIMESTAMP_MILLIS or TIMESTAMP_MICROS
- prevents writing the TimestampLogicalType in the Parquet schema if this new property is true
- changes the Arrow reader to ignore the isAdjustedToUTC() property of the TimestampLogicalType if the type annotation came from a converted type

Author: TP Boudreau <tpboudreau@gmail.com>

Closes #4856 from tpboudreau/ARROW-5889 and squashes the following commits:

458f063 <TP Boudreau> Add property showing converted type origin to TimestampLogicalType
kszucs pushed a commit that referenced this pull request Jul 16, 2019
…rted type to TimestampLogicalType

This patch:
 - adds a new boolean member isFromConvertedType() to the TimestampLogicalType class that is set to "true" if the LogicalType was created from a converted type of TIMESTAMP_MILLIS or TIMESTAMP_MICROS
- prevents writing the TimestampLogicalType in the Parquet schema if this new property is true
- changes the Arrow reader to ignore the isAdjustedToUTC() property of the TimestampLogicalType if the type annotation came from a converted type

Author: TP Boudreau <tpboudreau@gmail.com>

Closes #4856 from tpboudreau/ARROW-5889 and squashes the following commits:

458f063 <TP Boudreau> Add property showing converted type origin to TimestampLogicalType
pprudhvi pushed a commit to pprudhvi/arrow that referenced this pull request Aug 11, 2019
…rted type to TimestampLogicalType

This patch:
 - adds a new boolean member isFromConvertedType() to the TimestampLogicalType class that is set to "true" if the LogicalType was created from a converted type of TIMESTAMP_MILLIS or TIMESTAMP_MICROS
- prevents writing the TimestampLogicalType in the Parquet schema if this new property is true
- changes the Arrow reader to ignore the isAdjustedToUTC() property of the TimestampLogicalType if the type annotation came from a converted type

Author: TP Boudreau <tpboudreau@gmail.com>

Closes apache#4856 from tpboudreau/ARROW-5889 and squashes the following commits:

458f063 <TP Boudreau> Add property showing converted type origin to TimestampLogicalType
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants