[SPARK-37894][SQL] Add trash feature to FileCommitProtocol.deleteWithJob #35188

Closed
wants to merge 2 commits into from

Conversation

sleep1661 (Contributor)

What changes were proposed in this pull request?

Add a trash feature to FileCommitProtocol.deleteWithJob. When trash is enabled, use Trash.moveToAppropriateTrash instead of fs.delete.
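For illustration, here is a minimal Scala sketch of the proposed behavior as it would sit inside FileCommitProtocol, assuming the new code falls back to fs.delete when the trash is disabled or the move does not apply (the diff quoted further down only shows the first lines of the actual change):

```scala
import org.apache.hadoop.fs.{FileSystem, Path, Trash}

// Minimal sketch (not the exact patch): when fs.trash.interval > 0, try to move
// the path into the user's trash via Hadoop's Trash utility; otherwise, or if the
// move is not applicable, fall back to deleting the path directly.
def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean = {
  if (fs.getConf.getInt("fs.trash.interval", 0) > 0 &&
      Trash.moveToAppropriateTrash(fs, path, fs.getConf)) {
    true
  } else {
    fs.delete(path, recursive)
  }
}
```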

Why are the changes needed?

In some cases (e.g. to analyze or roll back data) we need to keep the old data in the trash.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests.

github-actions bot added the CORE label on Jan 13, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@sleep1661 (Contributor, Author)

@dongjoon-hyun Could you help take a look when you have time? Thanks.

@sleep1661 (Contributor, Author)

cc @cloud-fan @mridulm

@@ -178,7 +178,12 @@ abstract class FileCommitProtocol extends Logging {
    * implementation deletes the file immediately.
    */
   def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean = {
-    fs.delete(path, recursive)
+    if (fs.getConf.getInt("fs.trash.interval", 0) > 0 &&

@cloud-fan (Contributor), Feb 18, 2022

Where can I find the documentation for fs.trash.interval? It would be better to add a comment linking to the Hadoop conf doc page, to explain that this conf indicates enabling the trash feature.
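For context, fs.trash.interval is a standard Hadoop configuration documented in core-default.xml: a value greater than 0 enables the trash (the value is the interval in minutes after which trash checkpoints are deleted), while 0 disables it. A code comment of the kind requested might look like this (hypothetical wording):

```scala
// `fs.trash.interval` is a Hadoop configuration: a value > 0 enables the trash and
// controls how long (in minutes) trash checkpoints are kept; 0 disables the trash.
// See Hadoop's core-default.xml documentation:
// https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml
```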


@sleep1661 (Contributor, Author)

I have added a comment linking to the Hadoop conf doc page. Thanks.

sleep1661 requested a review from cloud-fan on February 19, 2022 05:59
    */
   def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean = {
-    fs.delete(path, recursive)
+    if (fs.getConf.getInt("fs.trash.interval", 0) > 0 &&
+      Trash.moveToAppropriateTrash(fs, path, fs.getConf)) {

Contributor

Will this function call fail? If it can, it would be better to wrap it in a try-catch, log the error, and fall back to the non-trash behavior.
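A minimal sketch of what that fallback could look like (hypothetical names, assuming Hadoop's Trash API and Spark's internal Logging trait; not the actual patch):

```scala
import org.apache.hadoop.fs.{FileSystem, Path, Trash}
import org.apache.spark.internal.Logging

// Hypothetical helper illustrating the suggestion: wrap the trash move in a
// try-catch, log the failure, and fall back to the plain (non-trash) delete.
object TrashAwareDelete extends Logging {
  def deleteWithTrashFallback(fs: FileSystem, path: Path, recursive: Boolean): Boolean = {
    val movedToTrash =
      try {
        fs.getConf.getInt("fs.trash.interval", 0) > 0 &&
          Trash.moveToAppropriateTrash(fs, path, fs.getConf)
      } catch {
        case e: Exception =>
          logWarning(s"Failed to move $path to trash; falling back to fs.delete", e)
          false
      }
    // If the path was not moved to trash (disabled or failed), delete it directly.
    movedToTrash || fs.delete(path, recursive)
  }
}
```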

@cloud-fan (Contributor)

cc @HyukjinKwon @MaxGekk

    */
   def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean = {
-    fs.delete(path, recursive)
+    if (fs.getConf.getInt("fs.trash.interval", 0) > 0 &&
+      Trash.moveToAppropriateTrash(fs, path, fs.getConf)) {

Member

This reminds me of #29552. I think at least the concerns raised there should be addressed here.


Member

Yes, historically the trash feature was merged and then reverted in the Apache Spark repository.


Contributor

Trash is a feature of the Hadoop file system, and we need to think about how to integrate it with Spark. This may not be the only place where the trash feature should be considered.

This pull request was closed.