Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove tasks with cleanup logic instead of marking them as failed #152841

Merged
merged 29 commits into from
Mar 27, 2023

Conversation

mikecote
Copy link
Contributor

@mikecote mikecote commented Mar 7, 2023

Part of #79977 (step 1 and 3).

In this PR, I'm making Task Manager remove tasks instead of updating them with status: failed whenever a task is out of attempts. I've also added an optional cleanup hook to the task runner that can be defined if additional cleanup is necessary whenever a task has been deleted (ex: delete action_task_params).

To verify an ad-hoc task that always fails

  1. With this PR codebase, modify an action to always throw an error
  2. Create an alerting rule that will invoke the action once
  3. See the action fail three times
  4. Observe the task SO is deleted (search by task type / action type) alongside the action_task_params SO

To verify Kibana crashing on the last ad-hoc task attempt

  1. With this PR codebase, modify an action to always throw an error (similar to scenario above) but also add a delay of 10s before the error is thrown (await new Promise((resolve) => setTimeout(resolve, 10000)); and a log message before the delay begins
  2. Create an alerting rule that will invoke the action once
  3. See the action fail twice
  4. On the third run, crash Kibana while the action is waiting for the 10s delay, this will cause the action to still be marked as running while it no longer is
  5. Restart Kibana
  6. Wait 5-10m until the task's retryAt is overdue
  7. Observe the task getting deleted and the action_task_params getting deleted

To verify recurring tasks that continuously fail

  1. With this PR codebase, modify a rule type to always throw an error when it runs
  2. Create an alerting rule of that type (with a short interval)
  3. Observe the rule continuously running and not getting trapped into the PR changes

Flaky test runner: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2036

@mikecote mikecote added release_note:skip Skip the PR/issue when compiling release notes Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.8.0 labels Mar 7, 2023
@mikecote mikecote self-assigned this Mar 7, 2023
@mikecote
Copy link
Contributor Author

@elasticmachine merge upstream

@mikecote
Copy link
Contributor Author

@elasticmachine merge upstream

@mikecote
Copy link
Contributor Author

@elasticmachine merge upstream

@@ -894,38 +889,6 @@ export default function ({ getService }: FtrProviderContext) {
});
});

it('should allow a failed task to be rerun using runSoon', async () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allowing failed ad-hoc tasks to be re-run was a feature that was never used, so it no longer works after this PR.

if (ctx._source.task.schedule != null || ctx._source.task.attempts < params.taskMaxAttempts[ctx._source.task.taskType]) {
${setScheduledAtAndMarkAsClaimed}
} else {
ctx._source.task.status = "failed";
}
${setScheduledAtAndMarkAsClaimed}
Copy link
Contributor Author

@mikecote mikecote Mar 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic to mark tasks as failed is moved into x-pack/plugins/task_manager/server/polling_lifecycle.ts so the cleanup hook can also get called when necessary.

@@ -275,7 +271,7 @@ export class TaskManagerPlugin
taskStore.aggregate(opts),
get: (id: string) => taskStore.get(id),
remove: (id: string) => taskStore.remove(id),
bulkRemoveIfExist: (ids: string[]) => bulkRemoveIfExist(taskStore, ids),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing bulkRemoveIfExists with bulkRemove since the error path within x-pack/plugins/task_manager/server/lib/bulk_remove_if_exist.ts would never get reached in a bulk request (you have to loop the responses, no errors thrown in 404 scenario). I added code to handle 404 within x-pack/plugins/alerting/server/rules_client/common/try_to_remove_tasks.ts.

Comment on lines -50 to +53
getUnsecuredSavedObjectsClient: (request: KibanaRequest) => SavedObjectsClientContract;
savedObjectsRepository: ISavedObjectsRepository;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Task runner factory now works with the savedObjectsRepository given we no longer have a request object within the cleanup function. There are comments about having this operation secured but, after thinking of it, it's an implementation detail the system should manage (no need for RBAC, RBAC is on the action SO logic).

@@ -115,12 +118,6 @@ export class TaskRunnerFactory {
const request = getFakeRequest(apiKey);
basePathService.set(request, path);

// TM will treat a task as a failure if `attempts >= maxAttempts`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Task runner no longer cares if the task is retried or not. It will throw the appropriate error, log errors messages and leave it up to Task Manager to determine if it's done retrying or not.

@mikecote
Copy link
Contributor Author

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id before after diff
securitySolution 433 436 +3

Total ESLint disabled count

id before after diff
securitySolution 513 516 +3

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @mikecote

Comment on lines +25 to +27
// Not 100% sure why, seems the rules need to be loaded separately to avoid the task
// failing to load the rule during execution and deleting itself. Otherwise
// we have flakiness
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikecote mikecote marked this pull request as ready for review March 24, 2023 15:09
@mikecote mikecote requested a review from a team as a code owner March 24, 2023 15:09
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Verified everything works as expected.

@mikecote mikecote merged commit 676aec7 into elastic:main Mar 27, 2023
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Mar 27, 2023
mikecote added a commit that referenced this pull request Mar 29, 2023
…together (#153803)

Resolves #153800
Resolves #142704
Resolves #153801
Resolves #142947
Resolves #140867

Similar to
#152841 (comment),
the rule and tasks archives don't seem to play nicely when combined. The
flakiness goes away when loading the rules then the tasks in sequence.
Otherwise, the tasks sometimes run before it can find the rule, causing
the task to delete itself.

I took a look at why the task would run an not be able to find the rule.
My best guess after looking at a failing flaky test is that the task
manager migration completes before the .kibana. And while .kibana
migrates, the task runs and fails to load the task because the .kibana
index is in an interim state.

Flaky test runner:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2045

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
mikecote added a commit that referenced this pull request Mar 29, 2023
Part of #79977 (step 2).
Resolves #79977.

In this PR, I'm removing the recurring task defined by the actions
plugin that removes unused `action_task_params` SOs. With the
#152841 PR, tasks will no longer
get marked as failed and we have a migration script (`excludeOnUpgrade`)
that removes all tasks and action_task_params that are leftover during
the migration
https://github.com/elastic/kibana/blob/main/x-pack/plugins/actions/server/saved_objects/index.ts#L81-L94.

~~NOTE: I will hold off merging this PR until
#152841 is merged.~~ (merged)

## To verify

Not much to test here, but on a Kibana from `main` there will be this
task type running in the background and moving to this PR will cause the
task to get deleted because it is part of the `REMOVED_TYPES` array in
Task Manager.

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
jgowdyelastic pushed a commit to jgowdyelastic/kibana that referenced this pull request Mar 30, 2023
…together (elastic#153803)

Resolves elastic#153800
Resolves elastic#142704
Resolves elastic#153801
Resolves elastic#142947
Resolves elastic#140867

Similar to
elastic#152841 (comment),
the rule and tasks archives don't seem to play nicely when combined. The
flakiness goes away when loading the rules then the tasks in sequence.
Otherwise, the tasks sometimes run before it can find the rule, causing
the task to delete itself.

I took a look at why the task would run an not be able to find the rule.
My best guess after looking at a failing flaky test is that the task
manager migration completes before the .kibana. And while .kibana
migrates, the task runs and fails to load the task because the .kibana
index is in an interim state.

Flaky test runner:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2045

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
jgowdyelastic pushed a commit to jgowdyelastic/kibana that referenced this pull request Mar 30, 2023
…1873)

Part of elastic#79977 (step 2).
Resolves elastic#79977.

In this PR, I'm removing the recurring task defined by the actions
plugin that removes unused `action_task_params` SOs. With the
elastic#152841 PR, tasks will no longer
get marked as failed and we have a migration script (`excludeOnUpgrade`)
that removes all tasks and action_task_params that are leftover during
the migration
https://github.com/elastic/kibana/blob/main/x-pack/plugins/actions/server/saved_objects/index.ts#L81-L94.

~~NOTE: I will hold off merging this PR until
elastic#152841 is merged.~~ (merged)

## To verify

Not much to test here, but on a Kibana from `main` there will be this
task type running in the background and moving to this PR will cause the
task to get deleted because it is part of the `REMOVED_TYPES` array in
Task Manager.

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Task Manager release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.8.0
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

5 participants