Skip to content
This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

Fix inconsistent workflow status caused by upload payload exception d… #2491

Merged
merged 4 commits into from
Oct 7, 2021

Conversation

jxu-nflx
Copy link
Contributor

@jxu-nflx jxu-nflx commented Sep 28, 2021

…uring workflow termination

Pull Request type

  • [x ] Bugfix
  • Feature
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • Other (please describe):

Changes in this PR

Catch exception of external storage upload failures during workflow termination and continue the execution to persist workflow status, etc.
Describe the new behavior from this PR, and why it's needed
Issue #2323

Alternatives considered

Describe alternative implementation you have considered

@jxu-nflx jxu-nflx requested a review from aravindanr September 28, 2021 18:35
@github-actions
Copy link
Contributor

github-actions bot commented Sep 28, 2021

Unit Test Results

   154 files  ±0     154 suites  ±0   8m 34s ⏱️ +11s
1 129 tests +3  1 129 ✔️ +3  0 💤 ±0  0 ±0 

Results for commit a02ea86. ± Comparison against base commit 3d716f9.

♻️ This comment has been updated with latest results.


try {
deciderService.updateWorkflowOutput(workflow, null);
} catch (Exception e) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this exception be more specific TerminateWorkflowException?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The verifyAndUpload method could also throw ApplicationException: https://github.com/Netflix/conductor/blob/main/core/src/main/java/com/netflix/conductor/core/utils/ExternalPayloadStorageUtils.java#L87 So I use the generic catch of all exceptions here as I feel no matter what exception we got at this step, it shouldn't prevent us from continuing the termination execution

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree.

LOGGER.error(errorMsg);
throw new TerminateWorkflowException(errorMsg);
workflow.setReasonForIncompletion(errorMsg);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines L200- L206 does not look necessary as the catch block in WorkflowExecutor.decide and WorkflowExecutor.terminateWorkflow set the status and reasonForIncompletion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to remove that. But then on a second thought, I think if we get any exception from uploading outputdata to external storage during the terminate step, the final status should be failed instead of terminated with specific error message reflects that. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the final status should be TERMINATED. A workflow is marked as FAILED, because of a Task failure or it timing out, not because of internal errors during the workflow evaluation. The reasonForCompletion should also be the original error that triggered the termination, not the internal error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure ya updated

throw new TerminateWorkflowException(errorMsg);
workflow.setReasonForIncompletion(errorMsg);
workflow.setStatus(Workflow.WorkflowStatus.FAILED);
if (payloadType == PayloadType.TASK_INPUT) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From where the failWorkflow is called, the PayloadType should always be WORKFLOW_INPUT or WORKFLOW_OUTPUT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. ExternalPayloadStorageUtils.verifyAndUpload() is never called with a Workflow object and PayloadType as TASK_INPUT or TASK_OUTPUT. Workflow's input/output should not be manipulated here since the caller might be already manipulating it.

For instance, if the workflow definition has a failure_workflow defined, the workflow output is manipulated and setting it to null here would cause a NullPointerException there.

Long story short, i don't see a reason to update the failWorkflow method to fix this issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the logic to setting workflow output or input to null in the failWorkflow method because in normal circustances (i.e. when input or output size don't exceed threshold), we always do set the input/output to null: https://github.com/Netflix/conductor/blob/main/core/src/main/java/com/netflix/conductor/core/utils/ExternalPayloadStorageUtils.java#L162 So to be consistent, I think it makes sense to have the same logic added for failWorkflow: https://github.com/Netflix/conductor/blob/main/core/src/main/java/com/netflix/conductor/core/utils/ExternalPayloadStorageUtils.java#L135

LOGGER.error(errorMsg);
throw new TerminateWorkflowException(errorMsg);
workflow.setReasonForIncompletion(errorMsg);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the final status should be TERMINATED. A workflow is marked as FAILED, because of a Task failure or it timing out, not because of internal errors during the workflow evaluation. The reasonForCompletion should also be the original error that triggered the termination, not the internal error.

throw new TerminateWorkflowException(errorMsg);
workflow.setReasonForIncompletion(errorMsg);
workflow.setStatus(Workflow.WorkflowStatus.FAILED);
if (payloadType == PayloadType.TASK_INPUT) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. ExternalPayloadStorageUtils.verifyAndUpload() is never called with a Workflow object and PayloadType as TASK_INPUT or TASK_OUTPUT. Workflow's input/output should not be manipulated here since the caller might be already manipulating it.

For instance, if the workflow definition has a failure_workflow defined, the workflow output is manipulated and setting it to null here would cause a NullPointerException there.

Long story short, i don't see a reason to update the failWorkflow method to fix this issue.

@aravindanr aravindanr added the type: bug bugs/ bug fixes label Oct 6, 2021
@aravindanr aravindanr merged commit ab746d1 into main Oct 7, 2021
@aravindanr aravindanr deleted the payload-exceed-maxThreshold-fix branch October 7, 2021 16:58
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
type: bug bugs/ bug fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants