PARQUET-1414: Limit page size based on maximum row count #531
Conversation
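For context on the feature under review: in released parquet-mr (1.11.0 and later) the limit introduced by this PR is exposed on the writer builders. Below is a hedged usage sketch; the Avro writer, output path, and toy schema are illustrative choices, not part of this PR.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class PageRowCountLimitExample {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("/tmp/example.parquet"))
        .withSchema(schema)
        // Cap the number of rows per data page; 20,000 is the default
        // in 1.11.0, smaller values make page-level filtering finer-grained.
        .withPageRowCountLimit(20_000)
        .build()) {
      GenericRecord rec = new GenericData.Record(schema);
      rec.put("id", 1L);
      writer.write(rec);
    }
  }
}
```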
@@ -183,6 +186,9 @@ public void close() {
  @Override
  public void endRecord() {
    ++rowCount;
    if (rowCount >= rowCountForNextRowCountCheck) {
This is in a tight loop, so I think we should be a bit more careful. This check could go inside the size-check if statement if we guarantee that rowCountForNextSizeCheck <= rowCountForNextRowCountCheck, by ensuring the size check is never scheduled later than the row count check. Then we would only need to do one test for every row.
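To make the suggestion concrete, here is a minimal sketch (not the actual parquet-mr code; field names follow the diff, estimateRowsUntilThreshold() is a hypothetical stand-in for the size-based estimate) of maintaining the invariant so that endRecord() performs a single comparison per row:

```java
// Sketch: schedule the size check so it never fires later than the page
// row-count limit would. With that invariant, endRecord() needs one branch.
abstract class SingleCheckSketch {
  long rowCount;
  long rowCountForNextSizeCheck = 100;             // hypothetical initial schedule
  static final long PAGE_ROW_COUNT_LIMIT = 20_000; // hypothetical limit

  void endRecord() {
    ++rowCount;
    // One test per row in the tight loop.
    if (rowCount >= rowCountForNextSizeCheck) {
      sizeCheck();
    }
  }

  void sizeCheck() {
    // The size-based estimate may suggest checking again far in the future...
    long nextSizeBasedCheck = rowCount + estimateRowsUntilThreshold();
    // ...but never schedule past the point where the row-count limit applies,
    // guaranteeing rowCountForNextSizeCheck <= rowCountForNextRowCountCheck.
    rowCountForNextSizeCheck =
        Math.min(nextSizeBasedCheck, rowCount + PAGE_ROW_COUNT_LIMIT);
  }

  // How many more rows are expected to fit before the page size threshold.
  abstract long estimateRowsUntilThreshold();
}
```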
@@ -190,13 +190,18 @@ public void endRecord() {

  private void sizeCheck() {
    long minRecordToWait = Long.MAX_VALUE;
    long maxUnwrittenRows = 0;
The logic using maxUnwrittenRows is confusing to me. What does it mean for a row to be "unwritten"?
I think this should calculate the next size check row count more directly.
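A hedged sketch of the more direct calculation this comment asks for (ColumnState and estimatedRowsUntilPageFull() are illustrative names, not the parquet-mr API): derive the next check row straight from each column's estimate instead of tracking "unwritten" rows.

```java
import java.util.List;

// Sketch: compute the row at which to run the next size check directly,
// without a "max unwritten rows" intermediate.
class NextCheckSketch {
  interface ColumnState {
    // Estimated rows this column can still buffer before its page is full.
    long estimatedRowsUntilPageFull();
  }

  static long nextSizeCheckRow(long rowCount, long pageRowCountLimit,
                               List<ColumnState> columns) {
    long rowsToNextCheck = pageRowCountLimit; // never later than the limit
    for (ColumnState column : columns) {
      // The earliest column to fill its page determines the next check.
      rowsToNextCheck = Math.min(rowsToNextCheck,
          column.estimatedRowsUntilPageFull());
    }
    return rowCount + rowsToNextCheck;
  }
}
```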
      writer.writePage();
      remainingMem = props.getPageSizeThreshold();
    } else {
      rowCountForNextRowCountCheck = min(rowCountForNextRowCountCheck, rowCount + (pageRowCountLimit - rows));
@gszadovszky: The threshold for the page row count limit should be writer.getRowsWrittenSoFar() + pageRowCountLimit. What you have here is an indirect way to calculate the same value, using the number of rows in this page as an intermediate. I find this confusing and would like to see a follow-up PR fix it.
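As an illustration of why the two expressions agree (the numbers and variable values below are made up for the example; only getRowsWrittenSoFar() comes from the comment above):

```java
// Tiny demo: the indirect expression in the diff equals the direct threshold
// writer.getRowsWrittenSoFar() + pageRowCountLimit proposed in the review.
public class ThresholdEquivalence {
  public static void main(String[] args) {
    long rowsWrittenSoFar = 40_000;  // rows flushed in earlier pages (assumed)
    long rowCount = 45_000;          // total rows seen so far (assumed)
    long pageRowCountLimit = 20_000; // configured per-page limit (assumed)

    long rows = rowCount - rowsWrittenSoFar;                // rows in this page
    long indirect = rowCount + (pageRowCountLimit - rows);  // as in the diff
    long direct = rowsWrittenSoFar + pageRowCountLimit;     // reviewer's form

    // rowCount + pageRowCountLimit - (rowCount - rowsWrittenSoFar)
    //   == rowsWrittenSoFar + pageRowCountLimit, so this prints true.
    System.out.println(indirect == direct);
  }
}
```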
@gszadovszky, @zivanfi, I understand going ahead with a commit when a reasonable amount of time has passed without one of the reviewers following up, but it has only been 2 days since the last update on this PR. Next time, please wait a little longer to give me a chance to review before committing with unfinished reviews.
@rdblue Sure, sorry about that. Next time we will allow more time for review.
…che#531)" * Revert "PARQUET-1414: Simplify next row count check calculation (apache#537)" This reverts commit 7f561b6. * Revert "PARQUET-1414: Limit page size based on maximum row count (apache#531)" This reverts commit 1e0760a.
…unt (apache#531)"" This reverts commit 22a26ec.
A later upstream-sync commit also referenced this pull request; its message lists, among many unrelated changes, PARQUET-1414: Limit page size based on maximum row count (apache#531) and the follow-up PARQUET-1414: Simplify next row count check calculation (apache#537).