
PARQUET-138: Allow merging more restrictive field in less restrictive field #550

Merged: 2 commits merged into apache:master on Feb 1, 2019

Conversation

@ntrinquier (Contributor) commented Nov 14, 2018

@ntrinquier (Contributor, Author) commented Nov 14, 2018 via email

@rdblue (Contributor) commented Nov 14, 2018

I see no reason why union would not be symmetric. Otherwise, this is just changing the direction, which isn't very useful.

@ntrinquier (Contributor, Author)

I see no reason either, actually. Updated the PR to change that!
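
For illustration, here is a minimal sketch of the symmetric behaviour being discussed, using parquet-mr's schema API. The schema strings, the non-strict union call, and the expectation that both call orders succeed and resolve to the less restrictive repetition are assumptions for this example, not something stated in the PR.

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class UnionSymmetrySketch {
  public static void main(String[] args) {
    // For the same field, "required" is more restrictive than "optional".
    MessageType lessRestrictive = MessageTypeParser.parseMessageType(
        "message m { optional int32 id; }");
    MessageType moreRestrictive = MessageTypeParser.parseMessageType(
        "message m { required int32 id; }");

    // With a symmetric union, both call orders are expected to succeed
    // and to produce the same merged schema (assumed: optional int32 id).
    System.out.println(lessRestrictive.union(moreRestrictive, false));
    System.out.println(moreRestrictive.union(lessRestrictive, false));
  }
}
```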

@nandorKollar (Contributor) left a comment

LGTM

@rdblue (Contributor) commented Nov 20, 2018

@julienledem, can you think of any problems that may result from making this change? I think it just allows unioning more types together, but I'm not sure where exactly we rely on this and whether parts assume the old behavior.

@tpmusielak

Hey team,

I came across this issue in my project and was wondering whether you are likely to make this change anytime soon?

Thanks!

@rdblue (Contributor) commented Nov 26, 2018

@tpmusielak, we first need to make sure this behavior change won't cause problems. I'd like to see it in the next release.

@mccheah commented Nov 29, 2018

What's required to get this in for the next release? Do we need more tests to ensure we don't break existing behavior?

@ntrinquier (Contributor, Author)

Can we put this behind a flag to stay backward compatible, so that users can opt in to it?

@vinooganesh (Contributor)

Hey @julienledem @rdblue, friendly ping here; agreed, it would be nice to get this into the next release. If this looks good, can we merge?

@rdblue rdblue merged commit 51c4cc3 into apache:master Feb 1, 2019
@rdblue (Contributor) commented Feb 1, 2019

Thanks for the reminder, @vinooganesh. I see no reason why it would not be safe for union to return the same result when operand position is swapped.

@ntrinquier ntrinquier deleted the PARQUET-138 branch February 1, 2019 22:08
@boriskar commented Mar 6, 2019

Hey guys, in which pyspark version will this work? I have a problem recreating a summary file after writing a more restrictive schema into a less restrictive one.

@ntrinquier (Contributor, Author)

@boriskar I guess this is more of a question for apache/spark. This change is available in Parquet 1.11.0 or higher, so Spark needs to bump its version to at least that (it's currently on 1.10.1).
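
As a quick, hedged sketch (not part of this PR): one way to check which parquet-mr version a given Spark installation bundles is to print the version string from the parquet-common Version class on that classpath. The class and field come from parquet-mr, but the exact format of the printed string is an assumption.

```java
// Minimal sketch: print the parquet-mr version found on the classpath,
// e.g. to check whether a Spark installation already bundles 1.11.0+.
import org.apache.parquet.Version;

public class PrintParquetVersion {
  public static void main(String[] args) {
    // Assumed output format: something like "parquet-mr version 1.10.1 (build ...)"
    System.out.println(Version.FULL_VERSION);
  }
}
```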

shangxinli added a commit to shangxinli/parquet-mr that referenced this pull request Mar 1, 2023
Summary:
This is to merge from upstream
PARQUET-1305: Backward incompatible change introduced in 1.8 (apache#483)

PARQUET-1452: Deprecate old logical types API (apache#535)

PARQUET-1414: Simplify next row count check calculation (apache#537)

PARQUET-1435: Benchmark filtering column-indexes (apache#536)

PARQUET-1365: Don't write page level statistics (apache#549)

Page level statistics were never used in production and became pointless after adding column indexes.

PARQUET-1456: Use page index, ParquetFileReader throw ArrayIndexOutOfBoundsException (apache#548)

The usage of static caching in the page index implementation did not allow using multiple readers at the same time.

PARQUET-1407: Avro: Fix binary values returned from dictionary encoding (apache#552)

* PARQUET-1407: Add test case for PARQUET-1407 to demonstrate the issue
* PARQUET-1407: Fix binary values from dictionary encoding.

Closes apache#551.

PARQUET-1460: Fix javadoc errors and include javadoc checking in Travis checks (apache#554)

Experiment.

Revert "Experiment."

This reverts commit 97a880c.

PARQUET-1434: Update CHANGES.md for 1.11.0 release.

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

PARQUET-1461: Third party code does not compile after parquet-mr minor version update (apache#556)

PARQUET-1434: Update CHANGES.md for 1.11.0 release candidate 2.

PARQUET-1258: Update scm developer connection to github

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

PARQUET-1462: Allow specifying new development version in prepare-release.sh (apache#557)

Before this change, prepare-release.sh only took the release version as a
parameter; the new development version was asked for interactively for each
individual pom.xml file, which made answering the prompts tedious.

PARQUET-1472: Dictionary filter fails on FIXED_LEN_BYTE_ARRAY (apache#562)

PARQUET-1474: Less verbose and lower level logging for missing column/offset indexes (apache#563)

PARQUET-1476: Don't emit a warning message for files without new logical type (apache#577)

Update CHANGES.md for 1.11.0 release candidate 2.

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

PARQUET-1478: Can't read spec compliant, 3-level lists via parquet-proto (apache#578)

PARQUET-1489: Insufficient documentation for UserDefinedPredicate.keep(T) (apache#588)

PARQUET-1487: Do not write original type for timezone-agnostic timestamps (apache#585)

Update CHANGES.md for 1.11.0 release candidate 3.

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

PARQUET-1490: Add branch-specific Travis steps (apache#590)

The possibility of branch-specific scripts allows feature branches to build
SNAPSHOT versions of parquet-format (and depend on them in the POM files). Even
if such branch-specific scripts get merged into master accidentally, they will
not have any effect there.

The script for the main branch checks the POM files to make sure that SNAPSHOT
dependencies are not added to or merged into master accidentally.

PARQUET-1280: [parquet-protobuf] Use maven protoc plugin (apache#506)

PARQUET-1466: Upgrade to the latest guava 27.0-jre (apache#559)

PARQUET-1475: Fix lack of cause propagation in DirectCodecFactory.ParquetCompressionCodecException. (apache#564)

PARQUET-1492: Remove protobuf build (apache#592)

We do not need to build protobuf (protoc) ourselves since we rely on the maven protoc plugin to compile protobuf.
This should save about 10 minutes of Travis build time (the time for building protobuf itself).

PARQUET-1498: Add instructions to install thrift via homebrew (apache#595)

PARQUET-1502: Convert FIXED_LEN_BYTE_ARRAY to arrow type in logicalTypeAnnotation if it is not null (apache#593)

[PARQUET-1506] Migrate maven-thrift-plugin to thrift-maven-plugin (apache#600)

maven-thrift-plugin (Aug 13, 2013) https://mvnrepository.com/artifact/org.apache.thrift.tools/maven-thrift-plugin/0.1.11
thrift-maven-plugin (Jan 18, 2017) https://mvnrepository.com/artifact/org.apache.thrift/thrift-maven-plugin/0.10.0

The maven-thrift-plugin is the old one which has been migrated to the ASF
and continued as thrift-maven-plugin:
https://issues.apache.org/jira/browse/THRIFT-4083

[PARQUET-1500] Replace Closeables with try-with-resources (apache#597)

PARQUET-1503: Remove Ints Utility Class (apache#598)

PARQUET-1513: Update HiddenFileFilter to avoid extra startsWith (apache#606)

PARQUET-1504: Add an option to convert Int96 to Arrow Timestamp (apache#594)

PARQUET-1504: Add an option to convert Parquet Int96 to Arrow Timestamp

PARQUET-1509: Note Hive deprecation in README. (apache#602)

PARQUET-1510: Fix notEq for optional columns with null values. (apache#603)

Dictionaries cannot contain null values, so notEq filters cannot
conclude that a block cannot match using only the dictionary. Instead,
it must also check whether the block may have at least one null value.
If there are no null values, then the existing check is correct.

[PARQUET-1507] Bump Apache Thrift to 0.12.0 (apache#601)

PARQUET-1518: Use Jackson2 version 2.9.8 in parquet-cli (apache#609)

There are some vulnerabilities:
https://ossindex.sonatype.org/vuln/1205a1ec-0837-406f-b081-623b9fb02992
https://ossindex.sonatype.org/vuln/b85a00e3-7d9b-49cf-9b19-b73f8ee60275
https://ossindex.sonatype.org/vuln/4f7e98ad-2212-45d3-ac21-089b3b082e6c
https://ossindex.sonatype.org/vuln/ab9013f0-09a2-4f01-bce5-751dc7437494
https://ossindex.sonatype.org/vuln/3f596fc0-9615-4b93-b30a-d4e0532e667f
https://ossindex.sonatype.org/vuln/4f7e98ad-2212-45d3-ac21-089b3b082e6c

PARQUET-138: Allow merging more restrictive field in less restrictive field (apache#550)

* Allow merging more restrictive field in less restrictive field
* Make class and function names more explicit

Add javax.annotation-api dependency for JDK >= 9 (apache#604)

PARQUET-1470: Inputstream leakage in ParquetFileWriter.appendFile (apache#611)

PARQUET-1514: ParquetFileWriter Records Compressed Bytes instead of Uncompressed Bytes (apache#607)

PARQUET-1505: Use Java 7 NIO StandardCharsets (apache#599)

PARQUET-1480 INT96 to avro not yet implemented error should mention deprecation (apache#579)

PARQUET-1485: Fix Snappy direct memory leak (apache#581)

PARQUET-1527:  [parquet-tools] cat command throw java.lang.ClassCastException (apache#612)

PARQUET-1529: Shade fastutil in all modules where used (apache#617)

Update CHANGES.md for 1.11.0rc4

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

Merge from upstream

Test Plan: Testing in Spark/Hive/Presto will be performed before rolling out to production!

Reviewers: pavi, leisun

Reviewed By: leisun

Differential Revision: https://code.uberinternal.com/D2512639

Revert "PARQUET-1485: Fix Snappy direct memory leak (apache#581)"

This reverts commit 7dcdcdc.