-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle interruption in RetryDriver and TaskRunner #18964
Handle interruption in RetryDriver and TaskRunner #18964
Conversation
8df783b
to
e05a499
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments
presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java
Outdated
Show resolved
Hide resolved
presto-hive-metastore/src/main/java/com/facebook/presto/hive/RetryDriver.java
Show resolved
Hide resolved
presto-main/src/main/java/com/facebook/presto/execution/executor/TaskExecutor.java
Show resolved
Hide resolved
2285985
to
394f561
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
When a thread is interrupted during an S3 client operation the SDK client internally catches the InterruptedException, sets the thread back to interrupted, and throws an AbortedException instead of InterruptedException. When this occurs, the RetryDriver should stop attempting retries. This change adds AbortedException to the stopOn list for all retry drivers in PrestoS3FileSystem and S3SelectLineRecordReader
Stop attempting retries in Hive's RetryDriver if an InterruptedException is caught or when the current thread is interrupted and immediately throw the current exception. Otherwise, RetryDriver might interfere with drivers terminating in a timely manner after tasks are terminated via a cancel, failure, or abort.
Drivers may leave the TaskRunner thread's interrupt flag set during the course of processing, but doing so should not result in the TaskRunner terminating its own processing loop until the TaskExecutor is closed. Instead of allowing the interrupt to terminate the current task runner's loop and re-creating a new runner to replace the interrupted one, we can clear the interrupt flag after each iteration. Otherwise, TaskRunners that were interrupted would have to replace themselves and end up creating unnecessary threads in the cached threadpool executor in the process.
394f561
to
db99445
Compare
@@ -139,6 +139,11 @@ public <V> V run(String callableName, Callable<V> callable) | |||
return callable.call(); | |||
} | |||
catch (Exception e) { | |||
// Immediately stop retry attempts once an interrupt has been received | |||
if (e instanceof InterruptedException || Thread.currentThread().isInterrupted()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late comment, Checking the isInterrupted will clear the interrupt flag on the thread, is that expected ? I am not sure who is responsible for clearing the interrupt flag in the Presto codebase.
Sleep on line 163, tries to rightly set the interrupted flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, isInterrupted does not clear the flag, so it is good. but on handling the interruptedException, there are still differences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We’re rethrowing InterruptedException (or whatever other exception was thrown if the current thread is interrupted) but not “handing” it other than to ensure that we stop retrying and let the exception propagate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this makes sense
Cross port of trinodb/trino#15803 with one additional commit to fix handling of S3 client exceptions when interrupted. At a high level, this PR:
AbortedException
to the list ofRetryDriver#stopOn
exceptions for all S3 client operations inPrestoS3FileSystem
andS3SelectLineRecordReader
. When the S3 client receives anInterruptedException
internally during an API call operation, it re-sets the current threads interrupted flag, but re-throws anAbortedException
. When this occurs, the retry driver should stop attempting retries.RetryDriver
exception handling to stop retries when anInterruptedException
is caught orThread.currentThread().isInterrupted()
. Failing to do so could result in drivers that were interrupted as part of task cancellation running significantly longer than necessary by proceeding to retry / backoff instead of exiting.TaskRunner
to continue processing new splits instead of terminating when the current thread was interrupted as a result of task cancellation interrupting the current driver being processed withoutTaskExecutor
having been shut down. Before this change, those interruptions would cause theTaskRunner
to stop processing new splits and submit a newTaskRunner
into the cached threadpool executor, which could needlessly create new worker threads.