-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Fix inconsistent workflow status caused by upload payload exception d… #2491
Conversation
…uring workflow termination
|
||
try { | ||
deciderService.updateWorkflowOutput(workflow, null); | ||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this exception be more specific TerminateWorkflowException
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The verifyAndUpload
method could also throw ApplicationException
: https://github.com/Netflix/conductor/blob/main/core/src/main/java/com/netflix/conductor/core/utils/ExternalPayloadStorageUtils.java#L87 So I use the generic catch of all exceptions here as I feel no matter what exception we got at this step, it shouldn't prevent us from continuing the termination execution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree.
LOGGER.error(errorMsg); | ||
throw new TerminateWorkflowException(errorMsg); | ||
workflow.setReasonForIncompletion(errorMsg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines L200- L206 does not look necessary as the catch block in WorkflowExecutor.decide and WorkflowExecutor.terminateWorkflow set the status and reasonForIncompletion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to remove that. But then on a second thought, I think if we get any exception from uploading outputdata to external storage during the terminate step, the final status should be failed
instead of terminated
with specific error message reflects that. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the final status should be TERMINATED
. A workflow is marked as FAILED, because of a Task failure or it timing out, not because of internal errors during the workflow evaluation. The reasonForCompletion
should also be the original error that triggered the termination, not the internal error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure ya updated
throw new TerminateWorkflowException(errorMsg); | ||
workflow.setReasonForIncompletion(errorMsg); | ||
workflow.setStatus(Workflow.WorkflowStatus.FAILED); | ||
if (payloadType == PayloadType.TASK_INPUT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From where the failWorkflow
is called, the PayloadType
should always be WORKFLOW_INPUT
or WORKFLOW_OUTPUT
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah the payloadType
is passed into the verifyAndUpload
method: https://github.com/Netflix/conductor/blob/main/core/src/main/java/com/netflix/conductor/core/utils/ExternalPayloadStorageUtils.java#L87 And we always verify both input and output data: https://github.com/Netflix/conductor/blob/main/core/src/main/java/com/netflix/conductor/core/execution/DeciderService.java#L566
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. ExternalPayloadStorageUtils.verifyAndUpload()
is never called with a Workflow
object and PayloadType
as TASK_INPUT
or TASK_OUTPUT
. Workflow's input/output should not be manipulated here since the caller might be already manipulating it.
For instance, if the workflow definition has a failure_workflow
defined, the workflow output is manipulated and setting it to null
here would cause a NullPointerException
there.
Long story short, i don't see a reason to update the failWorkflow
method to fix this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the logic to setting workflow output or input to null
in the failWorkflow
method because in normal circustances (i.e. when input or output size don't exceed threshold), we always do set the input/output to null
: https://github.com/Netflix/conductor/blob/main/core/src/main/java/com/netflix/conductor/core/utils/ExternalPayloadStorageUtils.java#L162 So to be consistent, I think it makes sense to have the same logic added for failWorkflow
: https://github.com/Netflix/conductor/blob/main/core/src/main/java/com/netflix/conductor/core/utils/ExternalPayloadStorageUtils.java#L135
LOGGER.error(errorMsg); | ||
throw new TerminateWorkflowException(errorMsg); | ||
workflow.setReasonForIncompletion(errorMsg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the final status should be TERMINATED
. A workflow is marked as FAILED, because of a Task failure or it timing out, not because of internal errors during the workflow evaluation. The reasonForCompletion
should also be the original error that triggered the termination, not the internal error.
throw new TerminateWorkflowException(errorMsg); | ||
workflow.setReasonForIncompletion(errorMsg); | ||
workflow.setStatus(Workflow.WorkflowStatus.FAILED); | ||
if (payloadType == PayloadType.TASK_INPUT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. ExternalPayloadStorageUtils.verifyAndUpload()
is never called with a Workflow
object and PayloadType
as TASK_INPUT
or TASK_OUTPUT
. Workflow's input/output should not be manipulated here since the caller might be already manipulating it.
For instance, if the workflow definition has a failure_workflow
defined, the workflow output is manipulated and setting it to null
here would cause a NullPointerException
there.
Long story short, i don't see a reason to update the failWorkflow
method to fix this issue.
…oid NullPointerException
…uring workflow termination
Pull Request type
Changes in this PR
Catch exception of external storage upload failures during workflow termination and continue the execution to persist workflow status, etc.
Describe the new behavior from this PR, and why it's needed
Issue #2323
Alternatives considered
Describe alternative implementation you have considered