
[iceberg] Add UUID type support #23627

Open · wants to merge 3 commits into master
Conversation

@ZacBlanco (Contributor) commented Sep 11, 2024

Description

This PR adds support for reading and writing UUIDs on Iceberg tables with all available catalogs. In order to support this, we also needed improvements to the Parquet reader and writer for Parquet's UUID logical type.

Motivation and Context

Presto has type support for UUIDs. We should support reading and writing them in some of the connectors.

Impact

  • Iceberg tables can now be created with UUID types.

Test Plan

  • Basic tests inside of the Iceberg module for round-trip UUID reading and writing.
  • Additional tests in the parquet module for reading and writing UUID values.

I also added a benchmark for reading and writing UUID types and compared it to our current LongDecimal benchmark to see the performance difference for another type that uses FIXED_LEN_BYTE_ARRAY as the underlying physical type.
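For context, Parquet's UUID logical type annotates a FIXED_LEN_BYTE_ARRAY(16) value holding the UUID's 128 bits in big-endian order. The sketch below illustrates that byte layout only; it is not the PR's code, and the `UuidBytes` class name is hypothetical:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Illustrative sketch of how a UUID maps onto Parquet's 16-byte
// FIXED_LEN_BYTE_ARRAY representation: most-significant 64 bits
// first, big-endian (ByteBuffer's default byte order).
public final class UuidBytes
{
    private UuidBytes() {}

    public static byte[] toBytes(UUID uuid)
    {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    public static UUID fromBytes(byte[] bytes)
    {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        return new UUID(buffer.getLong(), buffer.getLong());
    }

    public static void main(String[] args)
    {
        UUID uuid = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        // Round-trips through the 16-byte representation.
        System.out.println(fromBytes(toBytes(uuid)).equals(uuid));
    }
}
```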

These were the microbenchmarks from my local machine on ARM, using a build of Corretto JDK 11 and with reader verification disabled:

Benchmark                       (batchReaderEnabled)  (parquetEncoding)   Mode  Cnt    Score   Error  Units
BenchmarkUuidColumnReader.read                  true              PLAIN  thrpt   20  265.542 ± 3.891  ops/s
BenchmarkUuidColumnReader.read                  true   DELTA_BYTE_ARRAY  thrpt   20   35.511 ± 0.598  ops/s
BenchmarkUuidColumnReader.read                 false              PLAIN  thrpt   20   27.316 ± 1.222  ops/s
BenchmarkUuidColumnReader.read                 false   DELTA_BYTE_ARRAY  thrpt   20   20.883 ± 0.521  ops/s


Benchmark                              (enableOptimizedReader)  (parquetEncoding)   Mode  Cnt   Score   Error  Units
BenchmarkLongDecimalColumnReader.read                     true              PLAIN  thrpt   20  10.926 ± 0.178  ops/s
BenchmarkLongDecimalColumnReader.read                     true   DELTA_BYTE_ARRAY  thrpt   20   8.948 ± 0.180  ops/s
BenchmarkLongDecimalColumnReader.read                    false              PLAIN  thrpt   20   8.632 ± 0.129  ops/s
BenchmarkLongDecimalColumnReader.read                    false   DELTA_BYTE_ARRAY  thrpt   20   7.771 ± 0.066  ops/s

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add UUID type support to the Parquet reader and writer. :pr:`23627`

Iceberg Connector Changes
* Add support for UUID-typed columns. :pr:`23627`

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-uuid branch 4 times, most recently from 49f7ab3 to 5343fb6 Compare September 13, 2024 20:36
@steveburnett (Contributor):

Nit suggestion for the release note entry to follow the Order of changes in the Release Notes Guidelines:

== RELEASE NOTES ==

General Changes
* Add UUID type support to the Parquet reader and writer. :pr:`23627`

Iceberg Connector Changes
* Add support for UUID-typed columns. :pr:`23627`

@ZacBlanco (Contributor Author):

Fixed, thanks Steve!

@ZacBlanco ZacBlanco marked this pull request as ready for review September 23, 2024 15:44
@Override
public int hashCode()
{
return UUID.hashCode();
Contributor:

So this is a singleton? If so, the equals method can be simpler. Otherwise, this hashCode method is incorrect.

Contributor Author:

TypeInfo is a Hive-specific class that I am implementing because Hive does not natively support UUIDs. This implementation follows the same primitive paradigm as the other PrimitiveTypeInfo classes that exist in Apache Hive's implementation:

https://github.com/apache/hive/blame/13ec7c3ab47001df25e7be87739731192075b0a7/serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/PrimitiveTypeInfo.java#L93-L112

Contributor:

If I'm reading this right the hashCode is a constant, which only makes sense if the object is a singleton. Am I missing something? (always possible)

@ZacBlanco (Contributor Author), Oct 15, 2024:

I am just matching the implementation used by the rest of the TypeInfo classes to maintain compatibility with Hive. I don't think we should deviate from their implementation style at the risk of introducing unknown issues.
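To illustrate the pattern under discussion, here is a hypothetical sketch (not Hive's or the PR's actual code; class and field names are illustrative). In the PrimitiveTypeInfo style, hashCode is derived from a constant type-name string, so every instance of the same type compares equal and the constant hash is consistent with equals:

```java
// Hypothetical sketch of the PrimitiveTypeInfo-style pattern discussed
// above; names are illustrative, not Hive's actual API.
public class UuidTypeInfo
{
    // If the UUID in the reviewed snippet is a String constant like this
    // (rather than java.util.UUID), a constant hashCode makes sense.
    private static final String UUID = "uuid";

    public String getTypeName()
    {
        return UUID;
    }

    @Override
    public boolean equals(Object other)
    {
        // All instances represent the same primitive type.
        return other instanceof UuidTypeInfo;
    }

    @Override
    public int hashCode()
    {
        // Constant, consistent with equals: equal objects, equal hashes.
        return UUID.hashCode();
    }
}
```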

long value = random.nextLong();
writer.writeLong(value);
addedValues.add(value);
value = random.nextLong();
Contributor:

No random values, please; pick an arbitrary one.

Contributor Author:

The entire test class revolves around generating random values for different primitive types. I don't think we should go against the grain here. Changing this entire class to not use random values is out of scope for this PR.

Contributor:

I'm not asking for the entire class to change. I'm asking that this PR not make the problem worse. Random values in unit tests are a major source of flakiness. Please don't do this.

Contributor Author:

I am not going to change this because it would require rewriting a lot of boilerplate code to perform the same assertions. I filed an issue for someone to fix this class in the future: #23840
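One common middle ground (a sketch, not part of this PR; the class name is hypothetical) is to keep the random-looking data but fix the seed, so every run exercises the same values and failures are reproducible:

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: deterministic "random" test data via a fixed seed.
public final class DeterministicValues
{
    // Fixed seed: the same sequence every run, so a failing value
    // can always be reproduced.
    private static final long SEED = 42L;

    private DeterministicValues() {}

    public static long[] testValues(int count)
    {
        Random random = new Random(SEED);
        long[] values = new long[count];
        for (int i = 0; i < count; i++) {
            values[i] = random.nextLong();
        }
        return values;
    }

    public static void main(String[] args)
    {
        // Two independent calls produce identical sequences.
        System.out.println(Arrays.equals(testValues(3), testValues(3)));
    }
}
```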

@hantangwangd (Member) left a comment:

Thanks for supporting the UUID type in the Iceberg connector. Should we add the UUID mapping between the PrestoDB type and the Iceberg type to the Type mapping chapter of the Iceberg documentation?

steveburnett previously approved these changes Sep 27, 2024
@steveburnett (Contributor) left a comment:

LGTM! (docs)

Pull branch, local doc build, looks good. Thanks!

@hantangwangd (Member) left a comment:

The change overall looks good to me. Some nits and small problems.

hantangwangd previously approved these changes Oct 3, 2024
@ZacBlanco (Contributor Author) commented Oct 7, 2024:

FYI @elharo: since #23699 was merged, I needed to add a profile which uses the previously removed Maven plugin to generate the batch readers, which is now reflected in the first commit of this PR. Leaving the templates in the repository is not enough to regenerate the code.

@@ -255,3 +255,4 @@ private void seek()
}
}
}

Contributor:

Adding these blank lines should be a separate PR, if it's needed at all. (If checkstyle doesn't complain, it probably isn't.)

@ZacBlanco (Contributor Author), Oct 7, 2024:

They are caught by checkstyle. They are added automatically by the templating tool which generates the files, but weren't caught before because they existed in generated-sources. I tried a large number of whitespace-control directives from FreeMarker but was unsuccessful in removing them. I can split these changes into a separate commit on this PR; the commit will be preserved when this PR merges.

Let me know if you think that is acceptable

hantangwangd previously approved these changes Oct 9, 2024
agrawalreetika previously approved these changes Oct 9, 2024
@yingsu00 (Contributor) left a comment:

Haven't finished reviewing, will continue later tonight.

Previously, the code generation step was removed in 7c814ae
However, this makes it more difficult to add new readers in the
future, such as for UUIDs. This change adds the code generation step
as an optional profile which can be enabled when building the
parquet module through -PgenerateParquetReaders
@ZacBlanco (Contributor Author) commented Oct 11, 2024:

Thanks for your detailed review, @yingsu00. Here are the new performance numbers from the benchmark:

original

Benchmark                       (enableOptimizedReader)  (parquetEncoding)   Mode  Cnt    Score   Error  Units
BenchmarkUuidColumnReader.read                     true              PLAIN  thrpt   10  158.193 ± 1.058  ops/s
BenchmarkUuidColumnReader.read                     true   DELTA_BYTE_ARRAY  thrpt   10   32.211 ± 0.345  ops/s
BenchmarkUuidColumnReader.read                    false              PLAIN  thrpt   10   13.030 ± 0.463  ops/s
BenchmarkUuidColumnReader.read                    false   DELTA_BYTE_ARRAY  thrpt   10   10.121 ± 0.377  ops/s

previous update (+50-100% in non-batch reader from original)

Benchmark                       (enableOptimizedReader)  (parquetEncoding)   Mode  Cnt    Score   Error  Units
BenchmarkUuidColumnReader.read                     true              PLAIN  thrpt   20  161.496 ± 1.250  ops/s
BenchmarkUuidColumnReader.read                     true   DELTA_BYTE_ARRAY  thrpt   20   33.104 ± 0.257  ops/s
BenchmarkUuidColumnReader.read                    false              PLAIN  thrpt   20   27.478 ± 1.132  ops/s
BenchmarkUuidColumnReader.read                    false   DELTA_BYTE_ARRAY  thrpt   20   20.779 ± 0.564  ops/s

newest update (+62% improvement in plain batch reader)

Benchmark                       (batchReaderEnabled)  (parquetEncoding)   Mode  Cnt    Score   Error  Units
BenchmarkUuidColumnReader.read                  true              PLAIN  thrpt   20  265.542 ± 3.891  ops/s
BenchmarkUuidColumnReader.read                  true   DELTA_BYTE_ARRAY  thrpt   20   35.511 ± 0.598  ops/s
BenchmarkUuidColumnReader.read                 false              PLAIN  thrpt   20   27.316 ± 1.222  ops/s
BenchmarkUuidColumnReader.read                 false   DELTA_BYTE_ARRAY  thrpt   20   20.883 ± 0.521  ops/s

The Iceberg spec lists uuid as a valid schema type. Presto supports
UUID types, but there was no support for reading or writing them
in the connector.

This commit makes the necessary changes in the connector to create
tables with UUID columns and adds support for UUIDs in the Parquet reader.
This includes an implementation for UUIDs in the batch reader.
@yingsu00 (Contributor) left a comment:

Looks good!
