-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Perform S3 directory deletion with batch requests #13974
Conversation
ec46003
to
1e8e3a6
Compare
I'm not sure this logic totally makes sense, because an S3 object with Content-Type: "application/octet-stream", with 0 bytes and a key that doesn't end in Now with that said, it's important to remember that S3 is not a file system, it's an object store so it doesn't really have directories. So there might be a better option for how to handle
You could instead do something like (simplified to ignore details):
Which will be much more efficient, faster, and completely remove all S3 objects with actual keys starting with the prefix |
1e8e3a6
to
c31e12b
Compare
@pettyjamesm I have addressed your comment. As you mentioned offline there is definitely room for improvement in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added review comments, I think we probably want to reorganize the logic to check recursive
explicitly instead of handling that by falling through, since that's now a much more significant factor in the implementation logic. Recursive deletes for "implied directories" (ie: key does not exist but is a prefix for keys that do) and for "actual directories" (key exists, but has content-type DIRECTORY_MEDIA_TYPE
) can proceed, but "empty objects" must be considered files and should not allow recursive deletes.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
// check if this path is a directory | ||
Iterator<LocatedFileStatus> iterator = listPath(path, OptionalInt.of(1), ListingMode.SHALLOW_ALL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this check will succeed for non-directory objects, since listing a prefix for any key that exists will return the object at that key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do note that listPath uses path
+ /
as prefix of the listing and not path
.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java
Outdated
Show resolved
Hide resolved
c31e12b
to
cc229b8
Compare
I agree. The current state of the code for the @pettyjamesm I've added a set of tests against AWS S3 object storage in order to be able to test the assumptions from the code. |
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java
Outdated
Show resolved
Hide resolved
We’re planning to replace |
@electrum I understand the direction of replacing |
plugin/trino-hive/src/test/java/io/trino/plugin/hive/s3/TestTrinoS3FileSystemAwsS3.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/test/java/io/trino/plugin/hive/s3/TestTrinoS3FileSystemAwsS3.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/test/java/io/trino/plugin/hive/s3/TestTrinoS3FileSystemAwsS3.java
Outdated
Show resolved
Hide resolved
cc229b8
to
facfc1c
Compare
facfc1c
to
ead70f9
Compare
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java
Show resolved
Hide resolved
ead70f9
to
a2f8c13
Compare
plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java
Outdated
Show resolved
Hide resolved
3c95581
to
12265ea
Compare
/test-with-secrets sha=12265ea06d6ac9e006bb56973cb23cc54a8e9d2d |
12265ea
to
07e689f
Compare
(rebased to fix test-with-secrets job) |
/test-with-secrets sha=07e689f46a2a40fe6e618b02a0eae4a2ff492b76 |
Run with secrets: https://github.com/trinodb/trino/actions/runs/3591234020 |
The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/3591234020 |
07e689f
to
1de03f0
Compare
Delete all the objects which are found under the specified path prefix in batches in order to improve the performance of the operation. This operation is customised for the needs of Trino and does intentionally not intend to add unnecessary complexity for dealing with the various quirks of s3 compatible object storage systems that are not relevant for the use cases of Trino.
1de03f0
to
9cabfc0
Compare
@findinpath what were the changes in the last two force-pushes (#13974 (comment), #13974 (comment))? |
I have changed the |
/test-with-secrets sha=ba2d0b67d0890174483733fe48012373c7478ead |
The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/3623643550 |
/test-with-secrets sha=62a44129acd1d4db0e8630062e3c01901bac0882 |
CI Hit #13199 |
The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/3657194620 |
@nineinchnick do you understand the |
Description
Speed up the deletion of an S3 "directory" ( a path prefix which corresponds to multiple S3 objects) by using batch deletion requests.
Fix
Hive, Delta, Iceberg (Lakehouse connectors)
Speed up the deletion of an S3 "directory" ( a path prefix which corresponds to multiple S3 objects) by using batch deletion requests.
Related issues, pull requests, and links
Fixes #13017
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
(x) Release notes entries required with the following suggested text: