
[iceberg] Add UUID type support #23627

Open · wants to merge 3 commits into master
Conversation

@ZacBlanco (Contributor) commented Sep 11, 2024

Description

This PR adds support for reading and writing UUIDs on Iceberg tables with all available catalogs. In order to support this, we also needed improvements to the Parquet reader and writer for Parquet's UUID logical type.

Motivation and Context

Presto has type support for UUIDs. We should support reading and writing them in some of the connectors.

Impact

  • Iceberg tables can now be created with UUID types.

Test Plan

  • Basic tests inside of the Iceberg module for round-trip UUID reading and writing.
  • Additional tests in the parquet module for reading and writing UUID values.

I also added a benchmark for reading and writing UUID types and compared it to our current LongDecimal benchmark to see the performance difference for another type that uses FIXED_LEN_BYTE_ARRAY as the underlying physical type.
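For context, Parquet's UUID logical type annotates a FIXED_LEN_BYTE_ARRAY(16) value holding the UUID's 128 bits in big-endian order. The sketch below illustrates that byte layout only; it is not the PR's code, and the `UuidBytes` class name is hypothetical:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Illustrative sketch of how a UUID maps onto Parquet's 16-byte
// FIXED_LEN_BYTE_ARRAY representation: most-significant 64 bits
// first, big-endian (ByteBuffer's default byte order).
public final class UuidBytes
{
    private UuidBytes() {}

    public static byte[] toBytes(UUID uuid)
    {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    public static UUID fromBytes(byte[] bytes)
    {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        return new UUID(buffer.getLong(), buffer.getLong());
    }

    public static void main(String[] args)
    {
        UUID uuid = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        // Round-trips through the 16-byte representation.
        System.out.println(fromBytes(toBytes(uuid)).equals(uuid));
    }
}
```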

These were the microbenchmarks from my local machine on ARM, using a build of Corretto JDK 11 and with reader verification disabled:

Benchmark                       (batchReaderEnabled)  (parquetEncoding)   Mode  Cnt    Score   Error  Units
BenchmarkUuidColumnReader.read                  true              PLAIN  thrpt   20  265.542 ± 3.891  ops/s
BenchmarkUuidColumnReader.read                  true   DELTA_BYTE_ARRAY  thrpt   20   35.511 ± 0.598  ops/s
BenchmarkUuidColumnReader.read                 false              PLAIN  thrpt   20   27.316 ± 1.222  ops/s
BenchmarkUuidColumnReader.read                 false   DELTA_BYTE_ARRAY  thrpt   20   20.883 ± 0.521  ops/s


Benchmark                              (enableOptimizedReader)  (parquetEncoding)   Mode  Cnt   Score   Error  Units
BenchmarkLongDecimalColumnReader.read                     true              PLAIN  thrpt   20  10.926 ± 0.178  ops/s
BenchmarkLongDecimalColumnReader.read                     true   DELTA_BYTE_ARRAY  thrpt   20   8.948 ± 0.180  ops/s
BenchmarkLongDecimalColumnReader.read                    false              PLAIN  thrpt   20   8.632 ± 0.129  ops/s
BenchmarkLongDecimalColumnReader.read                    false   DELTA_BYTE_ARRAY  thrpt   20   7.771 ± 0.066  ops/s

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add UUID type support to the Parquet reader and writer. :pr:`23627`

Iceberg Connector Changes
* Add support for UUID-typed columns. :pr:`23627`

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-uuid branch 4 times, most recently from 49f7ab3 to 5343fb6 Compare September 13, 2024 20:36
@steveburnett (Contributor):

Nit suggestion for the release note entry to follow the Order of changes in the Release Notes Guidelines:

== RELEASE NOTES ==

General Changes
* Add UUID type support to the Parquet reader and writer. :pr:`23627`

Iceberg Connector Changes
* Add support for UUID-typed columns. :pr:`23627`

@ZacBlanco (Contributor Author):

Fixed, thanks Steve!

@ZacBlanco ZacBlanco marked this pull request as ready for review September 23, 2024 15:44
@Override
public int hashCode()
{
return UUID.hashCode();
Contributor:

So this is a singleton? If so, the equals method can be simpler. Otherwise, this hashCode method is incorrect.

Contributor Author:

TypeInfo is a Hive-specific class that I am implementing because Hive does not natively support UUIDs. This implementation follows the same primitive paradigm as the other PrimitiveTypeInfo classes that exist in Apache Hive's implementation:

https://github.com/apache/hive/blame/13ec7c3ab47001df25e7be87739731192075b0a7/serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/PrimitiveTypeInfo.java#L93-L112

Contributor:

If I'm reading this right the hashCode is a constant, which only makes sense if the object is a singleton. Am I missing something? (always possible)

@ZacBlanco (Contributor Author), Oct 15, 2024:

I am just matching the implementation used by the rest of the TypeInfo classes to maintain compatibility with Hive. I don't think we should deviate from their implementation style at the risk of introducing unknown issues.
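To illustrate the pattern under discussion, here is a hypothetical sketch (not Hive's or the PR's actual code; class and field names are illustrative). In the PrimitiveTypeInfo style, hashCode is derived from a constant type-name string, so every instance of the same type compares equal and the constant hash is consistent with equals:

```java
// Hypothetical sketch of the PrimitiveTypeInfo-style pattern discussed
// above; names are illustrative, not Hive's actual API.
public class UuidTypeInfo
{
    // If the UUID in the reviewed snippet is a String constant like this
    // (rather than java.util.UUID), a constant hashCode makes sense.
    private static final String UUID = "uuid";

    public String getTypeName()
    {
        return UUID;
    }

    @Override
    public boolean equals(Object other)
    {
        // All instances represent the same primitive type.
        return other instanceof UuidTypeInfo;
    }

    @Override
    public int hashCode()
    {
        // Constant, consistent with equals: equal objects, equal hashes.
        return UUID.hashCode();
    }
}
```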

long value = random.nextLong();
writer.writeLong(value);
addedValues.add(value);
value = random.nextLong();
Contributor:

No random values, please; pick an arbitrary one.

Contributor Author:

The entire test class revolves around generating random values for different primitive types. I don't think we should go against the grain here. Changing this entire class to not use random values is out of scope for this PR.

Contributor:

I'm not asking for the entire class to change. I'm asking that this PR not make the problem worse. Random values in unit tests are a major source of flakiness. Please don't do this.

Contributor Author:

I am not going to change this because it would require rewriting a lot of boilerplate code to perform the same assertions. I filed an issue for someone to fix this class in the future: #23840
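One common middle ground (a sketch, not part of this PR; the class name is hypothetical) is to keep the random-looking data but fix the seed, so every run exercises the same values and failures are reproducible:

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: deterministic "random" test data via a fixed seed.
public final class DeterministicValues
{
    // Fixed seed: the same sequence every run, so a failing value
    // can always be reproduced.
    private static final long SEED = 42L;

    private DeterministicValues() {}

    public static long[] testValues(int count)
    {
        Random random = new Random(SEED);
        long[] values = new long[count];
        for (int i = 0; i < count; i++) {
            values[i] = random.nextLong();
        }
        return values;
    }

    public static void main(String[] args)
    {
        // Two independent calls produce identical sequences.
        System.out.println(Arrays.equals(testValues(3), testValues(3)));
    }
}
```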

@hantangwangd (Member) left a comment:

Thanks for supporting the UUID type in the Iceberg connector. Should we add the UUID mapping between the PrestoDB type and the Iceberg type to the Type mapping chapter of the Iceberg documentation?

steveburnett previously approved these changes Sep 27, 2024
@steveburnett (Contributor) left a comment:

LGTM! (docs)

Pull branch, local doc build, looks good. Thanks!

@hantangwangd (Member) left a comment:

The change overall looks good to me. Some nits and small problems.

hantangwangd previously approved these changes Oct 3, 2024
@ZacBlanco (Contributor Author) commented Oct 7, 2024:

FYI @elharo: since #23699 was merged, I needed to add a profile which uses the previously removed Maven plugin to generate the batch readers, which is now reflected in the first commit of this PR. Leaving the templates in the repository is not enough to regenerate the code.

@@ -255,3 +255,4 @@ private void seek()
}
}
}

Contributor:

Adding these blank lines should be a separate PR, if it's needed at all. (If checkstyle doesn't complain, it probably isn't.)

@ZacBlanco (Contributor Author), Oct 7, 2024:

They are caught by checkstyle. They are added automatically by the templating tool which generates the files, but weren't caught before because they existed in generated-sources. I tried a large number of whitespace-control directives from FreeMarker but was unsuccessful in removing them. I can split these changes into a separate commit on this PR; the commit will be preserved when this PR merges.

Let me know if you think that is acceptable

hantangwangd previously approved these changes Oct 9, 2024
agrawalreetika previously approved these changes Oct 9, 2024
@yingsu00 (Contributor) left a comment:

Haven't finished reviewing, will continue later tonight.

Previously, the code generation step was removed in 7c814ae
However, this makes it more difficult to add new readers in the
future, such as for UUIDs. This change adds the code generation step
as an optional profile which can be enabled when building the
parquet module through -PgenerateParquetReaders
@ZacBlanco (Contributor Author) commented Oct 11, 2024:

Thanks for your detailed review, @yingsu00. Here are the new performance numbers from the benchmark:

original

Benchmark                       (enableOptimizedReader)  (parquetEncoding)   Mode  Cnt    Score   Error  Units
BenchmarkUuidColumnReader.read                     true              PLAIN  thrpt   10  158.193 ± 1.058  ops/s
BenchmarkUuidColumnReader.read                     true   DELTA_BYTE_ARRAY  thrpt   10   32.211 ± 0.345  ops/s
BenchmarkUuidColumnReader.read                    false              PLAIN  thrpt   10   13.030 ± 0.463  ops/s
BenchmarkUuidColumnReader.read                    false   DELTA_BYTE_ARRAY  thrpt   10   10.121 ± 0.377  ops/s

previous update (+50-100% in non-batch reader from original)

Benchmark                       (enableOptimizedReader)  (parquetEncoding)   Mode  Cnt    Score   Error  Units
BenchmarkUuidColumnReader.read                     true              PLAIN  thrpt   20  161.496 ± 1.250  ops/s
BenchmarkUuidColumnReader.read                     true   DELTA_BYTE_ARRAY  thrpt   20   33.104 ± 0.257  ops/s
BenchmarkUuidColumnReader.read                    false              PLAIN  thrpt   20   27.478 ± 1.132  ops/s
BenchmarkUuidColumnReader.read                    false   DELTA_BYTE_ARRAY  thrpt   20   20.779 ± 0.564  ops/s

newest update (+62% improvement in plain batch reader)

Benchmark                       (batchReaderEnabled)  (parquetEncoding)   Mode  Cnt    Score   Error  Units
BenchmarkUuidColumnReader.read                  true              PLAIN  thrpt   20  265.542 ± 3.891  ops/s
BenchmarkUuidColumnReader.read                  true   DELTA_BYTE_ARRAY  thrpt   20   35.511 ± 0.598  ops/s
BenchmarkUuidColumnReader.read                 false              PLAIN  thrpt   20   27.316 ± 1.222  ops/s
BenchmarkUuidColumnReader.read                 false   DELTA_BYTE_ARRAY  thrpt   20   20.883 ± 0.521  ops/s

The Iceberg spec lists uuid as a valid schema type. Presto supports
UUID types, but there was no support for reading or writing them
in the connector.

This commit makes the necessary changes in the connector to create
tables with UUID columns and adds support for UUIDs in the Parquet reader.
This includes an implementation for UUIDs in the batch reader.
@yingsu00 (Contributor) left a comment:

Looks good!
