PARQUET-1414: Limit page size based on maximum row count (apache#531)
Summary:
This is to merge from upstream
PARQUET-1305: Backward incompatible change introduced in 1.8 (apache#483)

PARQUET-1452: Deprecate old logical types API (apache#535)

PARQUET-1414: Simplify next row count check calculation (apache#537)

PARQUET-1435: Benchmark filtering column-indexes (apache#536)

PARQUET-1365: Don't write page level statistics (apache#549)

Page level statistics were never used in production and became pointless after adding column indexes.

PARQUET-1456: Use page index, ParquetFileReader throw ArrayIndexOutOfBoundsException (apache#548)

The usage of static caching in the page index implementation did not allow using multiple readers at the same time.

PARQUET-1407: Avro: Fix binary values returned from dictionary encoding (apache#552)

* PARQUET-1407: Add test case for PARQUET-1407 to demonstrate the issue
* PARQUET-1407: Fix binary values from dictionary encoding.

Closes apache#551.

PARQUET-1460: Fix javadoc errors and include javadoc checking in Travis checks (apache#554)

Experiment.

Revert "Experiment."

This reverts commit 97a880c.

PARQUET-1434: Update CHANGES.md for 1.11.0 release.

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

PARQUET-1461: Third party code does not compile after parquet-mr minor version update (apache#556)

PARQUET-1434: Update CHANGES.md for 1.11.0 release candidate 2.

PARQUET-1258: Update scm developer connection to github

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

PARQUET-1462: Allow specifying new development version in prepare-release.sh (apache#557)

Before this change, prepare-release.sh only took the release version as a
parameter, the new development version was asked interactively for each
individual pom.xml file, which made answering them tedious.

PARQUET-1472: Dictionary filter fails on FIXED_LEN_BYTE_ARRAY (apache#562)

PARQUET-1474: Less verbose and lower level logging for missing column/offset indexes (apache#563)

PARQUET-1476: Don't emit a warning message for files without new logical type (apache#577)

Update CHANGES.md for 1.11.0 release candidate 2.

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

PARQUET-1478: Can't read spec compliant, 3-level lists via parquet-proto (apache#578)

PARQUET-1489: Insufficient documentation for UserDefinedPredicate.keep(T) (apache#588)

PARQUET-1487: Do not write original type for timezone-agnostic timestamps (apache#585)

Update CHANGES.md for 1.11.0 release candidate 3.

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

PARQUET-1490: Add branch-specific Travis steps (apache#590)

The possibility of branch-specific scripts allows feature branches to build
SNAPSHOT versions of parquet-format (and depend on them in the POM files). Even
if such branch-specific scripts get merged into master accidentally, they will
not have any effect there.

The script for the main branch checks the POM files to make sure that SNAPSHOT
dependencies are not added to or merged into master accidentally.
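
The SNAPSHOT check described above can be sketched in Java. This is only an illustration under stated assumptions: the script in this commit uses xmlstarlet and grep, while the class and method names below (`SnapshotPomCheck`, `hasForbiddenSnapshot`) are hypothetical. The sketch ignores `project/version`, `project/parent/version` and comments, then looks for SNAPSHOT in the remaining element text. Unlike grep, it does not inspect attribute values.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class SnapshotPomCheck {

  // Returns true if the POM text mentions SNAPSHOT anywhere other than
  // project/version and project/parent/version (comments are ignored).
  static boolean hasForbiddenSnapshot(String pomXml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(pomXml.getBytes(StandardCharsets.UTF_8)));
      Element project = doc.getDocumentElement();
      removeChild(project, "version");
      Element parent = firstChild(project, "parent");
      if (parent != null) {
        removeChild(parent, "version");
      }
      // getTextContent() concatenates descendant text and skips comment nodes.
      return project.getTextContent().contains("SNAPSHOT");
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse POM", e);
    }
  }

  // Finds the first direct child element with the given tag name, or null.
  static Element firstChild(Element parent, String name) {
    NodeList children = parent.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().equals(name)) {
        return (Element) n;
      }
    }
    return null;
  }

  // Removes the first direct child element with the given tag name, if any.
  static void removeChild(Element parent, String name) {
    Element child = firstChild(parent, name);
    if (child != null) {
      parent.removeChild(child);
    }
  }
}
```

A POM whose only SNAPSHOT references are its own version and its parent's version passes this check; a SNAPSHOT dependency version does not.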

PARQUET-1280: [parquet-protobuf] Use maven protoc plugin (apache#506)

PARQUET-1466: Upgrade to the latest guava 27.0-jre (apache#559)

PARQUET-1475: Fix lack of cause propagation in DirectCodecFactory.ParquetCompressionCodecException. (apache#564)
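
As an illustration of what cause propagation means here, the following minimal sketch uses a hypothetical exception class (`CodecExceptionSketch`, not the real `ParquetCompressionCodecException`). A constructor that accepts a cause but calls only `super(message)` silently drops it; passing the cause through keeps the original stack trace reachable.

```java
class CodecExceptionSketch extends RuntimeException {

  // Forwards both arguments. A variant calling super(message) alone
  // would compile but lose the underlying cause.
  CodecExceptionSketch(String message, Throwable cause) {
    super(message, cause);
  }
}
```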

PARQUET-1492: Remove protobuf build (apache#592)

We do not need to build protobuf (protoc) ourselves since we rely on maven protoc plugin to compile protobuf.
This should save about 10 minutes travis build time (time for building protobuf itself).

PARQUET-1498: Add instructions to install thrift via homebrew (apache#595)

PARQUET-1502: Convert FIXED_LEN_BYTE_ARRAY to arrow type in logicalTypeAnnotation if it is not null (apache#593)

[PARQUET-1506] Migrate maven-thrift-plugin to thrift-maven-plugin (apache#600)

maven-thrift-plugin (Aug 13, 2013) https://mvnrepository.com/artifact/org.apache.thrift.tools/maven-thrift-plugin/0.1.11
thrift-maven-plugin (Jan 18, 2017) https://mvnrepository.com/artifact/org.apache.thrift/thrift-maven-plugin/0.10.0

The maven-thrift-plugin is the old one which has been migrated to the ASF
and continued as thrift-maven-plugin:
https://issues.apache.org/jira/browse/THRIFT-4083

[PARQUET-1500] Replace Closeables with try-with-resources (apache#597)
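
For readers unfamiliar with the pattern, here is a minimal sketch of try-with-resources replacing an explicit close in a `finally` block. The class and method names are hypothetical and not taken from the changed files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class TryWithResourcesSketch {

  // The reader is closed automatically when the try block exits,
  // whether normally or with an exception.
  static String firstLine(String text) {
    try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
      return reader.readLine();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```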

PARQUET-1503: Remove Ints Utility Class (apache#598)

PARQUET-1513: Update HiddenFileFilter to avoid extra startsWith (apache#606)

PARQUET-1504: Add an option to convert Int96 to Arrow Timestamp (apache#594)

PARQUET-1504: Add an option to convert Parquet Int96 to Arrow Timestamp
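
This change only affects the schema mapping, not how values are decoded. As background, the sketch below shows how a 12-byte INT96 value is commonly interpreted as a nanosecond timestamp, assuming the widely used Hive/Impala layout (8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes of Julian day number). The class name and the layout assumption are not part of this commit.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class Int96TimestampSketch {

  // Julian day number of 1970-01-01.
  static final long JULIAN_DAY_OF_EPOCH = 2_440_588L;
  static final long NANOS_PER_DAY = 86_400L * 1_000_000_000L;

  // Converts a 12-byte INT96 value to nanoseconds since the Unix epoch,
  // under the assumed layout described in the lead-in.
  static long toEpochNanos(byte[] int96) {
    if (int96.length != 12) {
      throw new IllegalArgumentException("INT96 values must be 12 bytes long");
    }
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();
    int julianDay = buf.getInt();
    return (julianDay - JULIAN_DAY_OF_EPOCH) * NANOS_PER_DAY + nanosOfDay;
  }
}
```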

PARQUET-1509: Note Hive deprecation in README. (apache#602)

PARQUET-1510: Fix notEq for optional columns with null values. (apache#603)

Dictionaries cannot contain null values, so notEq filters cannot
conclude that a block cannot match using only the dictionary. Instead,
it must also check whether the block may have at least one null value.
If there are no null values, then the existing check is correct.
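
The reasoning above can be sketched as a small decision function. This is a simplified illustration, not the filter code changed by the fix: the names (`NotEqDictionarySketch`, `canDropForNotEq`) are hypothetical, and the dictionary is modeled as a plain set of strings.

```java
import java.util.Set;

class NotEqDictionarySketch {

  // Returns true only when no row in the block can satisfy "column != value".
  // That requires every non-null value to equal `value` (the dictionary holds
  // just that value) and the block to contain no nulls, since a null row
  // would still match a not-equals predicate.
  static boolean canDropForNotEq(Set<String> dictionary, String value, boolean mayContainNulls) {
    boolean onlyValue = dictionary.size() == 1 && dictionary.contains(value);
    return onlyValue && !mayContainNulls;
  }
}
```

With a dictionary of `{"a"}` and the predicate `notEq("a")`, the block is dropped only if it is known to contain no nulls.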

[PARQUET-1507] Bump Apache Thrift to 0.12.0 (apache#601)

PARQUET-1518: Use Jackson2 version 2.9.8 in parquet-cli (apache#609)

There are some vulnerabilities:
https://ossindex.sonatype.org/vuln/1205a1ec-0837-406f-b081-623b9fb02992
https://ossindex.sonatype.org/vuln/b85a00e3-7d9b-49cf-9b19-b73f8ee60275
https://ossindex.sonatype.org/vuln/4f7e98ad-2212-45d3-ac21-089b3b082e6c
https://ossindex.sonatype.org/vuln/ab9013f0-09a2-4f01-bce5-751dc7437494
https://ossindex.sonatype.org/vuln/3f596fc0-9615-4b93-b30a-d4e0532e667f

PARQUET-138: Allow merging more restrictive field in less restrictive field (apache#550)

* Allow merging more restrictive field in less restrictive field
* Make class and function names more explicit

Add javax.annotation-api dependency for JDK >= 9 (apache#604)

PARQUET-1470: Inputstream leakage in ParquetFileWriter.appendFile (apache#611)

PARQUET-1514: ParquetFileWriter Records Compressed Bytes instead of Uncompressed Bytes (apache#607)

PARQUET-1505: Use Java 7 NIO StandardCharsets (apache#599)

PARQUET-1480: INT96 to avro not yet implemented error should mention deprecation (apache#579)

PARQUET-1485: Fix Snappy direct memory leak (apache#581)

PARQUET-1527: [parquet-tools] cat command throw java.lang.ClassCastException (apache#612)

PARQUET-1529: Shade fastutil in all modules where used (apache#617)

Update CHANGES.md for 1.11.0rc4

[maven-release-plugin] prepare release apache-parquet-1.11.0

[maven-release-plugin] prepare for next development iteration

Merge from upstream

Test Plan: Testing in Spark/Hive/Presto will be performed before roll out to production!

Reviewers: pavi, leisun

Reviewed By: leisun

Differential Revision: https://code.uberinternal.com/D2512639

Revert "PARQUET-1485: Fix Snappy direct memory leak (apache#581)"

This reverts commit 7dcdcdc.
shangxinli committed Feb 14, 2019
1 parent 0e4489e commit d0fd420
Showing 67 changed files with 586 additions and 254 deletions.
30 changes: 1 addition & 29 deletions .travis.yml
@@ -1,34 +1,6 @@
language: java
before_install:
- date
- sudo apt-get update -qq
- sudo apt-get install build-essential
- sudo apt-get install pv
- date
- mkdir protobuf_install
- pushd protobuf_install
- wget https://github.com/google/protobuf/archive/v3.5.1.tar.gz -O protobuf-3.5.1.tar.gz
- tar xzf protobuf-3.5.1.tar.gz
- cd protobuf-3.5.1
- sudo apt-get install autoconf automake libtool curl make g++ unzip
- ./autogen.sh
- ./configure
- make
- sudo make install
- sudo ldconfig
- protoc --version
- popd
- date
- pwd
- sudo apt-get install -qq libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev
- wget -nv http://archive.apache.org/dist/thrift/0.9.3/thrift-0.9.3.tar.gz
- tar zxf thrift-0.9.3.tar.gz
- cd thrift-0.9.3
- chmod +x ./configure
- ./configure --disable-gen-erl --disable-gen-hs --without-ruby --without-haskell --without-erlang --without-php --without-nodejs
- sudo make install
- cd ..
- date
- bash dev/travis-before_install.sh

env:
- HADOOP_PROFILE=default TEST_CODECS=uncompressed,brotli
44 changes: 41 additions & 3 deletions CHANGES.md
@@ -19,13 +19,14 @@

# Parquet #

### Version 1.11.0-encr ###
### Version 1.11.3-encr ###

Release Notes - Parquet - Version 1.11.0-encr
Release Notes - Parquet - Version 1.11.3-encr

#### Bug

* [PARQUET-1364](https://issues.apache.org/jira/browse/PARQUET-1364) - Column Indexes: Invalid row indexes for pages starting with nulls
* [PARQUET-138](https://issues.apache.org/jira/browse/PARQUET-138) - Parquet should allow a merge between required and optional schemas
* [PARQUET-952](https://issues.apache.org/jira/browse/PARQUET-952) - Avro union with single type fails with 'is not a group'
* [PARQUET-1128](https://issues.apache.org/jira/browse/PARQUET-1128) - \[Java\] Upgrade the Apache Arrow version to 0.8.0 for SchemaConverter
* [PARQUET-1285](https://issues.apache.org/jira/browse/PARQUET-1285) - \[Java\] SchemaConverter should not convert from TimeUnit.SECOND AND TimeUnit.NANOSECOND of Arrow
@@ -51,6 +52,17 @@ Release Notes - Parquet - Version 1.11.0-encr
* [PARQUET-1461](https://issues.apache.org/jira/browse/PARQUET-1461) - Third party code does not compile after parquet-mr minor version update
* [PARQUET-1472](https://issues.apache.org/jira/browse/PARQUET-1472) - Dictionary filter fails on FIXED\_LEN\_BYTE\_ARRAY
* [PARQUET-1478](https://issues.apache.org/jira/browse/PARQUET-1478) - Can't read spec compliant, 3-level lists via parquet-proto
* [PARQUET-1470](https://issues.apache.org/jira/browse/PARQUET-1470) - Inputstream leakage in ParquetFileWriter.appendFile
* [PARQUET-1472](https://issues.apache.org/jira/browse/PARQUET-1472) - Dictionary filter fails on FIXED\_LEN\_BYTE\_ARRAY
* [PARQUET-1475](https://issues.apache.org/jira/browse/PARQUET-1475) - DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause in one constructor
* [PARQUET-1478](https://issues.apache.org/jira/browse/PARQUET-1478) - Can't read spec compliant, 3-level lists via parquet-proto
* [PARQUET-1480](https://issues.apache.org/jira/browse/PARQUET-1480) - INT96 to avro not yet implemented error should mention deprecation
* [PARQUET-1485](https://issues.apache.org/jira/browse/PARQUET-1485) - Snappy Decompressor/Compressor may cause direct memory leak
* [PARQUET-1498](https://issues.apache.org/jira/browse/PARQUET-1498) - \[Java\] Add instructions to install thrift via homebrew
* [PARQUET-1510](https://issues.apache.org/jira/browse/PARQUET-1510) - Dictionary filter skips null values when evaluating not-equals.
* [PARQUET-1514](https://issues.apache.org/jira/browse/PARQUET-1514) - ParquetFileWriter Records Compressed Bytes instead of Uncompressed Bytes
* [PARQUET-1527](https://issues.apache.org/jira/browse/PARQUET-1527) - \[parquet-tools\] cat command throw java.lang.ClassCastException
* [PARQUET-1529](https://issues.apache.org/jira/browse/PARQUET-1529) - Shade fastutil in all modules where used

#### New Feature

@@ -61,6 +73,7 @@ Release Notes - Parquet - Version 1.11.0-encr

#### Improvement

* [PARQUET-1280](https://issues.apache.org/jira/browse/PARQUET-1280) - \[parquet-protobuf\] Use maven protoc plugin
* [PARQUET-1321](https://issues.apache.org/jira/browse/PARQUET-1321) - LogicalTypeAnnotation.LogicalTypeAnnotationVisitor#visit methods should have a return value
* [PARQUET-1335](https://issues.apache.org/jira/browse/PARQUET-1335) - Logical type names in parquet-mr are not consistent with parquet-format
* [PARQUET-1336](https://issues.apache.org/jira/browse/PARQUET-1336) - PrimitiveComparator should implements Serializable
@@ -73,17 +86,42 @@ Release Notes - Parquet - Version 1.11.0-encr
* [PARQUET-1418](https://issues.apache.org/jira/browse/PARQUET-1418) - Run integration tests in Travis
* [PARQUET-1435](https://issues.apache.org/jira/browse/PARQUET-1435) - Benchmark filtering column-indexes
* [PARQUET-1462](https://issues.apache.org/jira/browse/PARQUET-1462) - Allow specifying new development version in prepare-release.sh
* [PARQUET-1466](https://issues.apache.org/jira/browse/PARQUET-1466) - Upgrade to the latest guava 27.0-jre
* [PARQUET-1474](https://issues.apache.org/jira/browse/PARQUET-1474) - Less verbose and lower level logging for missing column/offset indexes
* [PARQUET-1476](https://issues.apache.org/jira/browse/PARQUET-1476) - Don't emit a warning message for files without new logical type
* [PARQUET-1487](https://issues.apache.org/jira/browse/PARQUET-1487) - Do not write original type for timezone-agnostic timestamps
* [PARQUET-1489](https://issues.apache.org/jira/browse/PARQUET-1489) - Insufficient documentation for UserDefinedPredicate.keep(T)

* [PARQUET-1490](https://issues.apache.org/jira/browse/PARQUET-1490) - Add branch-specific Travis steps
* [PARQUET-1492](https://issues.apache.org/jira/browse/PARQUET-1492) - Remove protobuf install in travis build
* [PARQUET-1500](https://issues.apache.org/jira/browse/PARQUET-1500) - Remove the Closables
* [PARQUET-1502](https://issues.apache.org/jira/browse/PARQUET-1502) - Convert FIXED\_LEN\_BYTE\_ARRAY to arrow type in
* [PARQUET-1503](https://issues.apache.org/jira/browse/PARQUET-1503) - Remove Ints Utility Class
* [PARQUET-1504](https://issues.apache.org/jira/browse/PARQUET-1504) - Add an option to convert Parquet Int96 to Arrow Timestamp
* [PARQUET-1505](https://issues.apache.org/jira/browse/PARQUET-1505) - Use Java 7 NIO StandardCharsets
* [PARQUET-1506](https://issues.apache.org/jira/browse/PARQUET-1506) - Migrate from maven-thrift-plugin to thrift-maven-plugin
* [PARQUET-1507](https://issues.apache.org/jira/browse/PARQUET-1507) - Bump Apache Thrift to 0.12.0
* [PARQUET-1509](https://issues.apache.org/jira/browse/PARQUET-1509) - Update Docs for Hive Deprecation
* [PARQUET-1513](https://issues.apache.org/jira/browse/PARQUET-1513) - HiddenFileFilter Streamline
* [PARQUET-1518](https://issues.apache.org/jira/browse/PARQUET-1518) - Bump Jackson2 version of parquet-cli

#### Task

* [PARQUET-968](https://issues.apache.org/jira/browse/PARQUET-968) - Add Hive/Presto support in ProtoParquet
* [PARQUET-1436](https://issues.apache.org/jira/browse/PARQUET-1436) - TimestampMicrosStringifier shows wrong microseconds for timestamps before 1970
* [PARQUET-1452](https://issues.apache.org/jira/browse/PARQUET-1452) - Deprecate old logical types API
* [PARQUET-1294](https://issues.apache.org/jira/browse/PARQUET-1294) - Update release scripts for the new Apache policy
* [PARQUET-1434](https://issues.apache.org/jira/browse/PARQUET-1434) - Release parquet-mr 1.11.0
* [PARQUET-1436](https://issues.apache.org/jira/browse/PARQUET-1436) - TimestampMicrosStringifier shows wrong microseconds for timestamps before 1970
* [PARQUET-1452](https://issues.apache.org/jira/browse/PARQUET-1452) - Deprecate old logical types API

### Version 1.10.1 ###

Release Notes - Parquet - Version 1.10.1

#### Bug

* [PARQUET-1510](https://issues.apache.org/jira/browse/PARQUET-1510) \- Dictionary filter skips null values when evaluating not-equals.
* [PARQUET-1309](https://issues.apache.org/jira/browse/PARQUET-1309) \- Parquet Java uses incorrect stats and dictionary filter properties

### Version 1.10.0 ###

33 changes: 14 additions & 19 deletions README.md
@@ -28,35 +28,28 @@ You can find some details about the format and intended use cases in our [Hadoop

## Building

Parquet-MR uses Maven to build and depends on both the thrift and protoc compilers.

### Install Protobuf

To build and install the protobuf compiler, run:

```
wget https://github.com/google/protobuf/archive/v3.5.1.tar.gz -O protobuf-3.5.1.tar.gz
tar xzf protobuf-3.5.1.tar.gz
cd protobuf-3.5.1
./configure
make
sudo make install
sudo ldconfig
```
Parquet-MR uses Maven to build and depends on the thrift compiler (protoc is now managed by maven plugin).

### Install Thrift

To build and install the thrift compiler, run:

```
wget -nv http://archive.apache.org/dist/thrift/0.9.3/thrift-0.9.3.tar.gz
tar xzf thrift-0.9.3.tar.gz
cd thrift-0.9.3
wget -nv http://archive.apache.org/dist/thrift/0.12.0/thrift-0.12.0.tar.gz
tar xzf thrift-0.12.0.tar.gz
cd thrift-0.12.0
chmod +x ./configure
./configure --disable-gen-erl --disable-gen-hs --without-ruby --without-haskell --without-erlang --without-php --without-nodejs
sudo make install
```

If you're on OSX and use homebrew, you can instead install Thrift 0.12.0 with `brew` and ensure that it comes first in your `PATH`.

```
brew install thrift@0.12.0
export PATH="/usr/local/opt/thrift@0.12.0/bin:$PATH"
```

### Build Parquet with Maven

Once protobuf and thrift are available in your path, you can build the project by running:
@@ -71,7 +64,7 @@ Parquet is a very active project, and new features are being added quickly. Here


* Type-specific encoding
* Hive integration
* Hive integration (deprecated)
* Pig integration
* Cascading integration
* Crunch integration
@@ -134,6 +127,8 @@ If the data was stored using Pig, things will "just work". If the data was store

Hive integration is provided via the [parquet-hive](https://github.com/apache/parquet-mr/tree/master/parquet-hive) sub-project.

Hive integration is now deprecated within the Parquet project. It is now maintained by Apache Hive.

## Build

To run the unit tests: `mvn test`
50 changes: 50 additions & 0 deletions dev/travis-before_install-master.sh
@@ -0,0 +1,50 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

################################################################################
# This is a branch-specific script that gets invoked at the end of
# travis-before_install.sh. It is run for the master branch only.
################################################################################

fail_the_build=
reduced_pom="$(tempfile)"
shopt -s globstar # Enables ** to match files in subdirectories recursively
for pom in **/pom.xml
do
# Removes the project/version and project/parent/version elements, because
# those are allowed to have SNAPSHOT in them. Also removes comments.
xmlstarlet ed -N pom='http://maven.apache.org/POM/4.0.0' \
-d '/pom:project/pom:version|/pom:project/pom:parent/pom:version|//comment()' "$pom" > "$reduced_pom"
if grep -q SNAPSHOT "$reduced_pom"
then
if [[ ! "$fail_the_build" ]]
then
printf "Error: POM files in the master branch can not refer to SNAPSHOT versions.\n"
fail_the_build=YES
fi
printf "\nOffending POM file: %s\nOffending content:\n" "$pom"
# Removes every element that does not have SNAPSHOT in it or its
# descendants. As a result, we get a skeleton of the POM file with only the
# offending parts.
xmlstarlet ed -d "//*[count((.|.//*)[contains(text(), 'SNAPSHOT')]) = 0]" "$reduced_pom"
fi
done
rm "$reduced_pom"
if [[ "$fail_the_build" ]]
then
exit 1
fi
44 changes: 44 additions & 0 deletions dev/travis-before_install.sh
@@ -0,0 +1,44 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

################################################################################
# This script gets invoked by .travis.yml in the before_install step
################################################################################

export THIFT_VERSION=0.12.0

set -e
date
sudo apt-get update -qq
sudo apt-get install -qq build-essential pv autoconf automake libtool curl make \
g++ unzip libboost-dev libboost-test-dev libboost-program-options-dev \
libevent-dev automake libtool flex bison pkg-config g++ libssl-dev xmlstarlet
date
pwd
wget -nv http://archive.apache.org/dist/thrift/${THIFT_VERSION}/thrift-${THIFT_VERSION}.tar.gz
tar zxf thrift-${THIFT_VERSION}.tar.gz
cd thrift-${THIFT_VERSION}
chmod +x ./configure
./configure --disable-gen-erl --disable-gen-hs --without-ruby --without-haskell --without-erlang --without-php --without-nodejs
sudo make install
cd ..
branch_specific_script="dev/travis-before_install-${TRAVIS_BRANCH}.sh"
if [[ -e "$branch_specific_script" ]]
then
. "$branch_specific_script"
fi
date
2 changes: 1 addition & 1 deletion parquet-arrow/pom.xml
@@ -21,7 +21,7 @@
<groupId>org.apache.parquet</groupId>
<artifactId>parquet</artifactId>
<relativePath>../pom.xml</relativePath>
<version>1.11.2-encr</version>
<version>1.11.3-encr</version>
</parent>

<modelVersion>4.0.0</modelVersion>
@@ -86,10 +86,19 @@
*/
public class SchemaConverter {

// Indicates if Int96 should be converted to Arrow Timestamp
private final boolean convertInt96ToArrowTimestamp;

/**
* For when we'll need this to be configurable
*/
public SchemaConverter() {
this(false);
}

// TODO(PARQUET-1511): pass the parameters in a configuration object
public SchemaConverter(final boolean convertInt96ToArrowTimestamp) {
this.convertInt96ToArrowTimestamp = convertInt96ToArrowTimestamp;
}

/**
@@ -492,13 +501,26 @@ private String getTimeZone(LogicalTypeAnnotation.TimestampLogicalTypeAnnotation

@Override
public TypeMapping convertINT96(PrimitiveTypeName primitiveTypeName) throws RuntimeException {
// Possibly timestamp
return field(new ArrowType.Binary());
if (convertInt96ToArrowTimestamp) {
return field(new ArrowType.Timestamp(TimeUnit.NANOSECOND, null));
} else {
return field(new ArrowType.Binary());
}
}

@Override
public TypeMapping convertFIXED_LEN_BYTE_ARRAY(PrimitiveTypeName primitiveTypeName) throws RuntimeException {
return field(new ArrowType.Binary());
LogicalTypeAnnotation logicalTypeAnnotation = type.getLogicalTypeAnnotation();
if (logicalTypeAnnotation == null) {
return field(new ArrowType.Binary());
}

return logicalTypeAnnotation.accept(new LogicalTypeAnnotation.LogicalTypeAnnotationVisitor<TypeMapping>() {
@Override
public Optional<TypeMapping> visit(LogicalTypeAnnotation.DecimalLogicalTypeAnnotation decimalLogicalType) {
return of(decimal(decimalLogicalType.getPrecision(), decimalLogicalType.getScale()));
}
}).orElseThrow(() -> new IllegalArgumentException("illegal type " + type));
}

@Override
@@ -47,6 +47,7 @@
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.FLOAT;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT32;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT96;

import java.io.IOException;
import java.util.List;
@@ -419,6 +420,47 @@ public void testParquetInt64TimeMicrosToArrow() {
Assert.assertEquals(expected, converter.fromParquet(parquet).getArrowSchema());
}

@Test
public void testParquetFixedBinaryToArrow() {
MessageType parquet = Types.buildMessage()
.addField(Types.optional(FIXED_LEN_BYTE_ARRAY).length(12).named("a")).named("root");
Schema expected = new Schema(asList(
field("a", new ArrowType.Binary())
));
Assert.assertEquals(expected, converter.fromParquet(parquet).getArrowSchema());
}

@Test
public void testParquetFixedBinaryToArrowDecimal() {
MessageType parquet = Types.buildMessage()
.addField(Types.optional(FIXED_LEN_BYTE_ARRAY).length(5).as(DECIMAL).precision(8).scale(2).named("a")).named("root");
Schema expected = new Schema(asList(
field("a", new ArrowType.Decimal(8, 2))
));
Assert.assertEquals(expected, converter.fromParquet(parquet).getArrowSchema());
}

@Test
public void testParquetInt96ToArrowBinary() {
MessageType parquet = Types.buildMessage()
.addField(Types.optional(INT96).named("a")).named("root");
Schema expected = new Schema(asList(
field("a", new ArrowType.Binary())
));
Assert.assertEquals(expected, converter.fromParquet(parquet).getArrowSchema());
}

@Test
public void testParquetInt96ToArrowTimestamp() {
final SchemaConverter converterInt96ToTimestamp = new SchemaConverter(true);
MessageType parquet = Types.buildMessage()
.addField(Types.optional(INT96).named("a")).named("root");
Schema expected = new Schema(asList(
field("a", new ArrowType.Timestamp(TimeUnit.NANOSECOND, null))
));
Assert.assertEquals(expected, converterInt96ToTimestamp.fromParquet(parquet).getArrowSchema());
}

@Test(expected = IllegalStateException.class)
public void testParquetInt64TimeMillisToArrow() {
converter.fromParquet(Types.buildMessage()