Remove tasks with cleanup logic instead of marking them as failed #152841

mikecote · 2023-03-07T17:39:57Z

Part of #79977 (step 1 and 3).

In this PR, I'm making Task Manager remove tasks instead of updating them with `status: failed` whenever a task is out of attempts. I've also added an optional cleanup hook to the task runner that can be defined if additional cleanup is necessary whenever a task has been deleted (e.g. deleting `action_task_params`).

To verify an ad-hoc task that always fails

1. With this PR codebase, modify an action to always throw an error
2. Create an alerting rule that will invoke the action once
3. See the action fail three times
4. Observe the task SO is deleted (search by task type / action type) alongside the action_task_params SO

To verify Kibana crashing on the last ad-hoc task attempt

1. With this PR codebase, modify an action to always throw an error (similar to the scenario above), but also add a delay of 10s before the error is thrown (`await new Promise((resolve) => setTimeout(resolve, 10000));`) and a log message before the delay begins
2. Create an alerting rule that will invoke the action once
3. See the action fail twice
4. On the third run, crash Kibana while the action is waiting for the 10s delay. This will cause the action to still be marked as running while it no longer is
5. Restart Kibana
6. Wait 5-10m until the task's retryAt is overdue
7. Observe the task getting deleted and the action_task_params getting deleted
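
The modification in step 1 could look roughly like the sketch below. This is only an illustration: `alwaysFailingExecutor`, the `Logger` shape, and the `delayMs` parameter are hypothetical names, not part of the actions plugin.

```typescript
// Hypothetical stand-in for a logger; not the real Kibana Logger interface.
type Logger = { info: (message: string) => void };

// Resolves after `ms` milliseconds.
const delay = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Logs, waits, then always throws so that every attempt fails.
// The log line marks the start of the window in which Kibana can be crashed.
async function alwaysFailingExecutor(logger: Logger, delayMs = 10000): Promise<never> {
  logger.info(`Action started, failing in ${delayMs}ms`);
  await delay(delayMs);
  throw new Error('Simulated action failure');
}
```

Placing the log message before the delay makes it easy to see in the Kibana logs when the third attempt has started.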

To verify recurring tasks that continuously fail

1. With this PR codebase, modify a rule type to always throw an error when it runs
2. Create an alerting rule of that type (with a short interval)
3. Observe the rule continuously running and not being removed by the changes in this PR

Flaky test runner: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2036

mikecote · 2023-03-22T19:01:37Z

@elasticmachine merge upstream

mikecote · 2023-03-22T21:42:38Z

@elasticmachine merge upstream

mikecote · 2023-03-23T10:29:31Z

@elasticmachine merge upstream

mikecote · 2023-03-23T10:31:22Z

x-pack/test/plugin_api_integration/test_suites/task_manager/task_management.ts

@@ -894,38 +889,6 @@ export default function ({ getService }: FtrProviderContext) {
      });
    });

-    it('should allow a failed task to be rerun using runSoon', async () => {


Allowing failed ad-hoc tasks to be re-run was a feature that was never used, so it no longer works after this PR.

mikecote · 2023-03-23T10:37:11Z

x-pack/plugins/task_manager/server/queries/mark_available_tasks_as_claimed.ts

-      if (ctx._source.task.schedule != null || ctx._source.task.attempts < params.taskMaxAttempts[ctx._source.task.taskType]) {
-        ${setScheduledAtAndMarkAsClaimed}
-      } else {
-        ctx._source.task.status = "failed";
-      }
+      ${setScheduledAtAndMarkAsClaimed}


Logic to mark tasks as failed is moved into x-pack/plugins/task_manager/server/polling_lifecycle.ts so the cleanup hook can also get called when necessary.
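
As a rough sketch of the decision that moved out of the claiming script, assuming heavily simplified shapes: `TaskInstance`, `TaskDefinition`, `TaskStoreLike`, and `removeIfOutOfAttempts` below are hypothetical and do not mirror the actual code in polling_lifecycle.ts.

```typescript
// Hypothetical, simplified shapes; the real Task Manager types differ.
interface TaskInstance {
  id: string;
  taskType: string;
  attempts: number;
  schedule?: { interval: string };
}

interface TaskDefinition {
  maxAttempts: number;
  // Optional hook invoked after the task has been removed.
  cleanup?: (task: TaskInstance) => Promise<void>;
}

interface TaskStoreLike {
  remove: (id: string) => Promise<void>;
}

// Inverse of the condition in the removed script: recurring tasks are always
// claimed again, ad-hoc tasks only while they have attempts left.
function isOutOfAttempts(task: TaskInstance, definition: TaskDefinition): boolean {
  return task.schedule == null && task.attempts >= definition.maxAttempts;
}

// Removes an exhausted ad-hoc task and runs its cleanup hook if one is defined.
// Returns true when the task was removed, false when it should still run.
async function removeIfOutOfAttempts(
  task: TaskInstance,
  definition: TaskDefinition,
  store: TaskStoreLike
): Promise<boolean> {
  if (!isOutOfAttempts(task, definition)) {
    return false;
  }
  await store.remove(task.id);
  if (definition.cleanup) {
    await definition.cleanup(task);
  }
  return true;
}
```

Doing this in application code rather than in the update script is what makes it possible to call a per-task-type hook at the moment of removal.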

mikecote · 2023-03-23T10:40:10Z

x-pack/plugins/task_manager/server/plugin.ts

@@ -275,7 +271,7 @@ export class TaskManagerPlugin
        taskStore.aggregate(opts),
      get: (id: string) => taskStore.get(id),
      remove: (id: string) => taskStore.remove(id),
-      bulkRemoveIfExist: (ids: string[]) => bulkRemoveIfExist(taskStore, ids),


Replacing bulkRemoveIfExist with bulkRemove since the error path within x-pack/plugins/task_manager/server/lib/bulk_remove_if_exist.ts would never be reached in a bulk request: a bulk request doesn't throw in the 404 scenario, so you have to loop through the individual responses. I added code to handle 404s within x-pack/plugins/alerting/server/rules_client/common/try_to_remove_tasks.ts.
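
To illustrate the point about looping through the responses, here is a minimal sketch. The `BulkDeleteStatus` shape and `getUnexpectedFailures` are assumptions made for the example, not the real bulk response type or the code in try_to_remove_tasks.ts.

```typescript
// Hypothetical per-item result of a bulk delete; an assumption, not the real
// saved objects bulk response type.
interface BulkDeleteStatus {
  id: string;
  success: boolean;
  error?: { statusCode: number; message: string };
}

// A bulk request resolves even when some items are missing, so each status
// has to be inspected. Missing tasks (404) are treated as already removed.
function getUnexpectedFailures(statuses: BulkDeleteStatus[]): BulkDeleteStatus[] {
  return statuses.filter(
    (status) => !status.success && status.error?.statusCode !== 404
  );
}
```

A caller could then log or rethrow only the entries returned by this filter and ignore tasks that were already gone.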

mikecote · 2023-03-23T10:44:34Z

x-pack/plugins/actions/server/lib/task_runner_factory.ts

-  getUnsecuredSavedObjectsClient: (request: KibanaRequest) => SavedObjectsClientContract;
+  savedObjectsRepository: ISavedObjectsRepository;


The task runner factory now works with the savedObjectsRepository, given we no longer have a request object within the cleanup function. There are comments about having this operation secured but, after thinking about it, it's an implementation detail the system should manage (no need for RBAC here; RBAC is enforced in the action SO logic).
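
A cleanup function that needs no request object might look like the sketch below. `RepositoryLike` is a minimal stand-in for the part of ISavedObjectsRepository used here, and `ActionTaskParams`, `spaceIdToNamespace`, and `cleanupActionTaskParams` are hypothetical local definitions for illustration only.

```typescript
// Minimal stand-in for the repository's delete method (an assumption).
interface RepositoryLike {
  delete: (type: string, id: string, options?: { namespace?: string }) => Promise<unknown>;
}

// Hypothetical task params identifying the action_task_params SO of a task.
interface ActionTaskParams {
  actionTaskParamsId: string;
  spaceId: string;
}

// Local helper: the default space maps to no namespace.
function spaceIdToNamespace(spaceId: string): string | undefined {
  return spaceId === 'default' ? undefined : spaceId;
}

// Deletes the action_task_params SO directly through the repository.
// Without a request there is no scoped client, so the namespace is passed explicitly.
async function cleanupActionTaskParams(
  repository: RepositoryLike,
  params: ActionTaskParams
): Promise<void> {
  await repository.delete('action_task_params', params.actionTaskParamsId, {
    namespace: spaceIdToNamespace(params.spaceId),
  });
}
```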

mikecote · 2023-03-23T11:12:34Z

x-pack/plugins/actions/server/lib/task_runner_factory.ts

@@ -115,12 +118,6 @@ export class TaskRunnerFactory {
        const request = getFakeRequest(apiKey);
        basePathService.set(request, path);

-        // TM will treat a task as a failure if `attempts >= maxAttempts`


The task runner no longer cares whether the task is retried or not. It will throw the appropriate error, log error messages, and leave it up to Task Manager to determine whether it's done retrying.

mikecote · 2023-03-24T11:12:59Z

@elasticmachine merge upstream

kibana-ci · 2023-03-24T15:04:08Z

💚 Build Succeeded

Buildkite Build
Commit: 07aed82

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id	before	after	diff
`securitySolution`	433	436	+3

Total ESLint disabled count

id	before	after	diff
`securitySolution`	513	516	+3

History

💚 Build #115265 succeeded d3658df
💔 Build #114945 failed 78846e4
💔 Build #114826 failed 3e7a773
💛 Build #114787 was flaky b112705
💛 Build #114614 was flaky 80b8e4f
💛 Build #114336 was flaky 4e296e9

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @mikecote

mikecote · 2023-03-24T15:08:39Z

x-pack/test/alerting_api_integration/spaces_only/tests/alerting/group4/run_soon.ts

+      // Not 100% sure why, seems the rules need to be loaded separately to avoid the task
+      // failing to load the rule during execution and deleting itself. Otherwise
+      // we have flakiness


Flaky test runner happy with this change: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2036

Flaky test runner not happy without this change: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2030

elasticmachine · 2023-03-24T15:09:07Z

Pinging @elastic/response-ops (Team:ResponseOps)

ymao1

LGTM! Verified everything works as expected.

…together (#153803)

Resolves #153800
Resolves #142704
Resolves #153801
Resolves #142947
Resolves #140867

Similar to #152841 (comment), the rule and tasks archives don't seem to play nicely when combined. The flakiness goes away when loading the rules then the tasks in sequence. Otherwise, the tasks sometimes run before they can find the rule, causing the task to delete itself.

I took a look at why the task would run and not be able to find the rule. My best guess after looking at a failing flaky test is that the task manager migration completes before the .kibana migration. And while .kibana migrates, the task runs and fails to load the task because the .kibana index is in an interim state.

Flaky test runner: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2045

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Part of #79977 (step 2).

Resolves #79977.

In this PR, I'm removing the recurring task defined by the actions plugin that removes unused `action_task_params` SOs. With the #152841 PR, tasks will no longer get marked as failed and we have a migration script (`excludeOnUpgrade`) that removes all tasks and action_task_params that are leftover during the migration https://github.com/elastic/kibana/blob/main/x-pack/plugins/actions/server/saved_objects/index.ts#L81-L94.

~~NOTE: I will hold off merging this PR until #152841 is merged.~~ (merged)

## To verify

Not much to test here, but on a Kibana from `main` there will be this task type running in the background and moving to this PR will cause the task to get deleted because it is part of the `REMOVED_TYPES` array in Task Manager.

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

mikecote added 4 commits February 22, 2023 09:27

- b95bccd Remove tasks instead of marking them as failed
- b991e07 Merge branch 'main' of github.com:elastic/kibana into task-manager/remove-failed-tasks
- b583f34 Merge branch 'main' of github.com:elastic/kibana into task-manager/remove-failed-tasks
- 9840419 Initial commit

mikecote added the release_note:skip, Feature:Task Manager, Team:ResponseOps, and v8.8.0 labels Mar 7, 2023

mikecote self-assigned this Mar 7, 2023

mikecote and others added 14 commits March 8, 2023 14:41

- 4de2462 Merge with main
- 934674a Add cleanup hook and handle tasks that are timed out and out of attempts
- 63a24fd Merge with main
- 2c58dc3 Provide namespace to repository when deleting action_task_params
- 7448a6f Merge branch 'main' of github.com:elastic/kibana into task-manager/handle-failed-tasks
- b530399 Fix failing tests and cleanup bulkRemoveIfExists
- 0524509 Fix failing test
- 9ac4f15 [CI] Auto-commit changed files from 'node scripts/precommit_hook.js --ref HEAD~1..HEAD --fix'
- 0e19d9a Fix typechecks
- 4fbf8bd Merge branch 'task-manager/handle-failed-tasks' of github.com:mikecote/kibana into task-manager/handle-failed-tasks
- 4e296e9 Added some unit tests
- d3c5884 Merge branch 'main' of github.com:elastic/kibana into task-manager/handle-failed-tasks
- f4965f4 Try different archives for rules and tasks
- 80b8e4f Remove exclusive test
- b112705 Merge branch 'main' into task-manager/handle-failed-tasks
- 3e7a773 Merge branch 'main' into task-manager/handle-failed-tasks
- 77fd8f7 Merge branch 'main' into task-manager/handle-failed-tasks

mikecote added 2 commits March 23, 2023 07:03

- b0b3d68 Use two different data archives
- 704892e Keep logged errors

mikecote added 2 commits March 23, 2023 07:14

- 82376d6 Add logger assertions back into unit tests
- 78846e4 Speed up polling cycle

mikecote mentioned this pull request Mar 23, 2023

Remove task to cleanup action_task_params of failed tasks #151873

Merged

kibanamachine and others added 4 commits March 24, 2023 07:13

- d3658df Merge branch 'main' into task-manager/handle-failed-tasks
- c7b4e7c Revert "Use two different data archives" (This reverts commit b0b3d68.)
- 74c2a70 Add comments
- 07aed82 Merge branch 'task-manager/handle-failed-tasks' of github.com:mikecote/kibana into task-manager/handle-failed-tasks

mikecote marked this pull request as ready for review March 24, 2023 15:09

mikecote requested a review from a team as a code owner March 24, 2023 15:09

ymao1 approved these changes Mar 27, 2023

mikecote merged commit 676aec7 into elastic:main Mar 27, 2023

kibanamachine added the backport:skip label Mar 27, 2023

mikecote mentioned this pull request Mar 28, 2023

Fix flaky RBAC Legacy test due to archive containing tasks and rules together #153803

Merged

pmuellr mentioned this pull request Jul 29, 2024

[ResponseOps][TaskManager] fix limited concurrency starvation in mget task claimer #187809

Merged

2 tasks
