Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add iceberg table properties for maintenance of previous metadata files #23260

Merged

Conversation

hantangwangd
Copy link
Member

@hantangwangd hantangwangd commented Jul 19, 2024

Description

This PR enables to set Iceberg table properties metadata_previous_versions_max and metadata_delete_after_commit to specify the number of previous metadata files that need to be tracked, and whether to delete the metadata files that no longer be tracked after each commit.

Besides, this PR also enable our own version of HiveTableOperations to support deleting old metadata files after commit based on the table properties. This feature has been supported by Iceberg's native version of HiveTableOperations.

Motivation and Context

The metadata files of an iceberg table will grow to a very large amount, as almost each operation will lead in a new one.
So we need to provide a method to delete unnecessary metadata files.

Impact

When executing show create table ..., two additional table properties would be shown. Except that, there is no impact on existing use cases, as the default values are the same as before.

Test Plan

  • Test cases added to show the impact of new table properties

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Iceberg Connector Changes
* Add table properties `metadata_previous_versions_max` and `metadata_delete_after_commit` to maintain the previous metadata files :pr:`23260`
* Enable Iceberg with hive catalog to support deleting old metadata files after commit based on the table properties :pr:`23260`

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the doc! Suggested rephrasing, please adapt my suggestions to fit the column formatting, and let me know if my suggestions change the meaning in a way you're not comfortable with.

presto-docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
steveburnett
steveburnett previously approved these changes Jul 19, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Thanks for the update! Pulled updated branch, new local doc build, looks good.

ZacBlanco
ZacBlanco previously approved these changes Jul 19, 2024
Copy link
Contributor

@ZacBlanco ZacBlanco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor nits to make the core logic more understandable for future readers, otherwise LGTM

Comment on lines +480 to +483
// TableMetadata#addPreviousFile builds up the metadata log and uses
// TableProperties.METADATA_PREVIOUS_VERSIONS_MAX to determine how many files should stay in
// the log, thus we don't include metadata.previousFiles() for deletion - everything else can
// be removed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was confusing to me at first because TableMetadata#addPreviousFile isn't referenced in this logic, but rather indirectly when generating previousFiles. I think we can rewrite it to be a bit more clear

Suggested change
// TableMetadata#addPreviousFile builds up the metadata log and uses
// TableProperties.METADATA_PREVIOUS_VERSIONS_MAX to determine how many files should stay in
// the log, thus we don't include metadata.previousFiles() for deletion - everything else can
// be removed
// metadata.previousFiles() is populated using TableProperties.METADATA_PREVIOUS_VERSIONS_MAX
// thus we don't include metadata.previousFiles() for deletion - everything else can be removed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, it makes the meaning of the comment more clear. Fixed, please take a look when available!

ZacBlanco
ZacBlanco previously approved these changes Jul 20, 2024
agrawalreetika
agrawalreetika previously approved these changes Jul 20, 2024
Copy link
Member

@agrawalreetika agrawalreetika left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the addition. Have suggested small doc level changes, please have a look.
Otherwise LGTM

}

@Config("iceberg.metadata-previous-versions-max")
@ConfigDescription("The max number of previous metadata files exist in metadata log")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : Would it make more clear if we has a description The max number of old metadata files to keep since this is something now user can control with this config property.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it seems better, I have changed here as well as the corresponding description in IcebergTableProperties and docs. Please take a look when convenient. @agrawalreetika @steveburnett @ZacBlanco

@@ -355,6 +361,12 @@ Property Name Description
``delete_mode`` Optionally specifies the write delete mode of the Iceberg ``merge-on-read``
specification to use for new tables, either ``copy-on-write``
or ``merge-on-read``.

``metadata_previous_versions_max`` Optionally specifies the max number of old metadata files to ``100``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if metadata_delete_after_commit = false but metadata_previous_versions_max = 100? Would the files be deleted after 100? If not, shall we specify that metadata_previous_versions_max only works if metadata_delete_after_commit is set to true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metadata_delete_after_commit = false means that Iceberg wouldn't automatically delete the old and untracked metadata files after each commit. It wouldn't affect the functionality of metadata_previous_versions_max, which specify how many previous metadata files would be tracked in current metadata.

For example, if we set metadata_previous_versions_max = 1 and metadata_delete_after_commit = false on the creation of an Iceberg table, and execute several insert operations which will lead in several metadata files.
We will get only 1 previous metadata entry in current metadata's log ( through tableMetadata.previousFiles()). But find that none of the metadata files were actually deleted.

On the other way, if we set metadata_previous_versions_max = 1 and metadata_delete_after_commit = true and do the same things as above, we will find that only 2 metadata files remain.

@tdcmeehan tdcmeehan self-assigned this Jul 23, 2024
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull updated branch, new local docs build. Looks good, thanks!

@hantangwangd hantangwangd merged commit 28d7111 into prestodb:master Jul 29, 2024
57 checks passed
@hantangwangd hantangwangd deleted the add_iceberg_table_properties branch July 29, 2024 17:13
@tdcmeehan tdcmeehan mentioned this pull request Aug 23, 2024
34 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants