[Security Solution][Detections] Signals Migration API #84721

rylnd · 2020-12-02T02:16:42Z

Summary

This implements the initial API to drive migration of signals indices.

It's composed of three main endpoints:

- endpoint to retrieve migration statuses
  - receives a date representing how far back you want signals to be upgraded
  - returns a list of concrete indexes and their relevant migration info
- endpoint to "migrate" a given index(es)
  - kicks off a reindex task for each index
  - returns migration token along with debugging info (e.g. indexes, underlying task ID)
- endpoint to "finalize" the migration of a given index
  - performs validations on the reindex task and the resulting index
  - if valid & complete, updates aliases and applies a deletion policy to the old index

There are also dev scripts corresponding to each of these endpoints.

Additionally, this PR adds a `signal._meta.version` field to signals documents, which represents the version of the detection engine that generated the signal. New signals are written with the value from `SIGNALS_TEMPLATE_VERSION`, and old signals have this field populated when they are migrated.

For review purposes, I would suggest starting with the API integration tests as they provide the best overview of this work.

Outstanding questions/notes:

- Signals vs. alerts: the API is a bit behind the UI and still includes references to signals in e.g. the API path. Do we continue referring to signals in the API, or should we take steps to migrate to alerts here?
- Migration task deletion:
  - Retrieving tasks programmatically is imperfect; we can query on a task's action and description, but we have no way to distinguish a migration-initiated reindex from one that a user initiated manually.
  - We could minimize the chance of deleting the latter by specifying e.g. `task.description: "to [${signalsAlias}-*-r*]"`, but even that is no guarantee.
  - UPDATE: This will be handled in a followup PR via persisting a Saved Object.
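To illustrate why matching on the task description is only a heuristic, here is a minimal sketch. The helper name `isMigrationTaskDescription` is hypothetical, and it assumes that a reindex task's description contains the destination in the form `to [<index>]`; neither is taken from this PR's code.

```typescript
// Escape characters that have special meaning in a regular expression.
const escapeRegExp = (value: string): string => value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Hypothetical helper: does this task description look like a reindex into a
// migration index (`<alias>-*-r*`) of the given signals alias?
// Assumption: the description includes "to [<destination index>]".
export const isMigrationTaskDescription = (signalsAlias: string, description: string): boolean => {
  const pattern = new RegExp(`to \\[${escapeRegExp(signalsAlias)}-.+-r.+\\]`);
  return pattern.test(description);
};

const alias = '.siem-signals-default';

// A reindex started by the migration API matches...
const fromMigration = isMigrationTaskDescription(
  alias,
  'reindex from [.siem-signals-default-000001] to [.siem-signals-default-000001-r000005]'
);

// ...but so does a manual reindex that happens to use a similar destination name,
// which is why the description alone is no guarantee.
const fromManualReindex = isMigrationTaskDescription(
  alias,
  'reindex from [.siem-signals-default-000001] to [.siem-signals-default-000001-r1]'
);

// An unrelated reindex does not match.
const unrelated = isMigrationTaskDescription(alias, 'reindex from [logs-1] to [logs-2]');
```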

TODO

- fix folder structure (things live in a bunch of places)
- function documentation
- integration tests
- unit tests
- endpoint to apply 30d deletion policy to any dangling signals indexes (followup PR)

Checklist

Delete any items that are not applicable to this PR.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios

For maintainers

This was checked for breaking API changes and was labeled appropriately

rylnd · 2020-12-07T21:42:52Z

...s/security_solution/server/lib/detection_engine/migrations/create_signals_migration_index.ts

+  version: number;
+}): Promise<string> => {
+  const paddedVersion = `${version}`.padStart(6, '0');
+  const destinationIndexName = `${index}-r${paddedVersion}`;


Here's how we're naming our destination indexes, so e.g. migrating `.siem-signals-default-000002` to template 14 would reindex into `.siem-signals-default-000002-r000014`.

Note that this is additive as well, so if on a subsequent upgrade the user migrated this index again, it'd be something like `.siem-signals-default-000002-r000014-r000021` etc. This allows us to track all the migrations that have been performed on a given index, but it also places a limit on the number of times they can migrate before hitting the index name limit.
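The naming scheme from the diff above can be exercised in isolation. This is a minimal sketch: the function name `getMigrationIndexName` is hypothetical, and only the two lines that build the name come from the diff.

```typescript
// Hypothetical standalone version of the naming logic shown in the diff:
// the template version is zero-padded to six digits and appended as `-r<version>`.
export const getMigrationIndexName = (index: string, version: number): string => {
  const paddedVersion = `${version}`.padStart(6, '0');
  return `${index}-r${paddedVersion}`;
};

// Migrating once to template 14...
const firstMigration = getMigrationIndexName('.siem-signals-default-000002', 14);

// ...and migrating the result again to template 21 appends a second suffix.
const secondMigration = getMigrationIndexName(firstMigration, 21);
```

Because each call only appends to its input, the full migration history stays readable in the name.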

Do we need the 0 padding on the version numbers? Could reduce some noise and allow them to migrate more before hitting the name length limit.

The padding was added to assist sorting these indexes by name. It's not strictly necessary and not being relied upon anywhere, so we could definitely remove it.
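For a rough sense of the trade-off being discussed, the arithmetic can be sketched as follows. It assumes Elasticsearch's documented 255-byte limit on index names, ASCII-only names (so one character is one byte), and, for the unpadded case, two-digit template versions; the helper name is hypothetical.

```typescript
// Assumption: Elasticsearch rejects index names longer than 255 bytes.
const MAX_INDEX_NAME_BYTES = 255;

// Hypothetical helper: how many `-r<version>` suffixes of a given length fit
// after the base index name before the limit is exceeded?
export const maxMigrationCount = (baseIndexName: string, suffixLength: number): number =>
  Math.floor((MAX_INDEX_NAME_BYTES - baseIndexName.length) / suffixLength);

const baseName = '.siem-signals-default-000002'; // 28 characters

// Padded suffix, e.g. "-r000014": 2 + 6 = 8 characters per migration.
const paddedLimit = maxMigrationCount(baseName, '-r000014'.length);

// Unpadded suffix with a two-digit version, e.g. "-r14": 4 characters per migration.
const unpaddedLimit = maxMigrationCount(baseName, '-r14'.length);
```

Under these assumptions the padded scheme allows 28 migrations of this index and the unpadded one 56, so dropping the padding roughly doubles the headroom.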

marshallmain

Looks good overall. Left a couple comments about edge cases.

One extra API we may need is an API to list all in-progress migration tasks. It seems brittle if the only way to get the migration token necessary to finalize the migration is from the create migration API. That puts more burden on API users and the eventual frontend code to store any migration tokens until completion.

marshallmain · 2020-12-08T05:35:27Z

x-pack/plugins/security_solution/server/lib/detection_engine/migrations/helpers.ts

+    return (
+      signalVersion.doc_count > 0 && isOutdated({ current: signalVersion.key, target: version })
+    );
+  });


Do we ever expect to find signals with version < the index version during normal operation (i.e. users haven't gone in and messed with the index versions)?

I believe that this could happen as part of a multi-Kibana setup, where one instance has been upgraded but another is still generating signals on an old version (and both share the same signals index).

In those situations, I believe the recommendation will be to perform the migration on one instance after disabling jobs in all other instances, and then to iteratively upgrade and re-enable jobs on each subsequent instance.

Regardless of how outdated signals make it into one of these indexes, they have the potential to break certain features, and the check is straightforward, so I decided to notify the user of that situation.
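The check in the diff above can be sketched on its own. This is an illustration only: the bucket shape mirrors a terms aggregation on the signal version, `isOutdated` is assumed to be a plain "less than" comparison, and the wrapper `hasOutdatedSignals` is a hypothetical name.

```typescript
// One bucket per signal version found in an index (shape of a terms aggregation bucket).
interface SignalVersionBucket {
  key: number; // the signal version
  doc_count: number; // how many signals carry that version
}

// Assumption: a version is outdated when it is strictly lower than the target.
export const isOutdated = ({ current, target }: { current: number; target: number }): boolean =>
  current < target;

// Hypothetical wrapper: true if any non-empty bucket holds signals older than `version`.
export const hasOutdatedSignals = (buckets: SignalVersionBucket[], version: number): boolean =>
  buckets.some(
    (signalVersion) =>
      signalVersion.doc_count > 0 && isOutdated({ current: signalVersion.key, target: version })
  );

// A shared index where an un-upgraded Kibana instance is still writing version 13 signals.
const mixedIndex = hasOutdatedSignals(
  [
    { key: 13, doc_count: 5 },
    { key: 14, doc_count: 10 },
  ],
  14
);

// An index containing only current signals.
const currentIndex = hasOutdatedSignals([{ key: 14, doc_count: 3 }], 14);
```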

x-pack/plugins/security_solution/server/lib/detection_engine/migrations/migrate_signals.ts

...curity_solution/server/lib/detection_engine/routes/signals/create_signals_migration_route.ts

...rity_solution/server/lib/detection_engine/routes/signals/finalize_signals_migration_route.ts

kibanamachine · 2020-12-09T04:42:26Z

💚 Build Succeeded

continuous-integration/kibana-ci/pull-request
Commit: 6762afa

Metrics [docs]

Distributable file count

id	before	after	diff
`default`	46965	47745	+780

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id	before	after	diff
`securitySolution`	210.8KB	211.1KB	+326.0B

History

💚 Build #92938 succeeded f954218
💔 Build #92889 failed 45a3af7
💚 Build #92542 succeeded 0701382
💚 Build #92505 succeeded 63fb0e9
💔 Build #92419 failed 70031ad

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

marshallmain

LGTM

We may need to discuss with the Elasticsearch team how best to manage tasks. It appears that completed tasks aren't returned in the _tasks API unless you provide the specific task ID, which adds some more complexity to cleaning up orphaned tasks; it may not be worth the effort. An additional API that lists in-progress signals reindexing tasks would be convenient in the future, but is not essential right away.

rylnd · 2020-12-10T17:46:23Z

> It appears that completed tasks aren't returned in the _tasks API unless you provide the specific task ID

Yep, persisting migrations as Saved Objects fixes this. I'll reference that one here when it's up 👍

* WIP: basic reindexing works, lots of edge cases and TODOs to tackle
* Add note
* Add version metadata to signals documents
* WIP: Starting over from the ground up
  * Removes obsolete endpoints/functions
  * Adds endpoint for checking the migration status of signals indices
  * Adds helper functions to represent the logical pieces of answering that question
* Fleshing out upgrade of signals
  * triggers reindex for each index
  * starts implementing followup endpoint to "finalize" after reindexing is finished
* Fleshing out more of the upgrade path
  Still moving logic around a bunch.
* Pad the version number of our destination migration index
  Instead of e.g. `.siem-signals-default-000001-r5`, this will generate `.siem-signals-default-000001-r000005`. This shouldn't matter much, but it may make it easier for users at a glance to see the story of each index.
* Fleshing out more upgrade finalization
  * Verifies that task matches the specified parameters
  * Verifies that document counts are the same
  * updates aliases
  * finalization endpoint requires both source/dest indexes since we can't determine that from the task itself.
* Ensure that new signals are generated with an appropriate schema_version
* Apply migration cleanup policy to obsolete signals indexes
  After upgrading a particular signals index, we're left with both the old and new copies of the index. While the former is unlinked, it's still taking up disk space; this ensures that it will eventually be deleted, but gives users enough time to recover data if necessary. This also ensures that, as with the normal signals ILM policy, it is present during our normal sanity checks.
* Move more logic into component functions
* Fix type errors
* Refactor to make things a little more organized
  * Moves migration-related routes under signals/ to match their routing
  * Generalizes migration-agnostic helpers, moves them to appropriate folders (namely index/)
  * Inlined getMigrationStatusInRange, a hyper-specific function with limited utility elsewhere
* Add some JSDoc comments around our new functions
  This is as much to get my thoughts in order as it is for posterity. Next: tests!
* Adds integration tests around migration status route
  * Adds io-ts schema for route params
  * Adds es_archiver data to represent an outdated signals index
* Adds API integration tests for our signals upgrade endpoint
  * Adds io-ts schema for route params
  * Adds second signals index archive, updates docs
  * Adds test helper to wait for a given index to have documents
  * Adds test helper to retrieve the relevant index name from a call to esArchive.load
* WIP: Fleshing out finalization tests
* Consolidate terminology around a migration
  We're no longer making a distinction between an upgrade vs. an update vs. a migration vs. a reindex: a migration is the concept that encompasses this work. Both an index and individual documents can require a migration, but both follow the same code path to migrate.
* Implement encoding of migration details
  This will be a slightly better API: rather than having to pass all three fields to finalize the migration, API users can instead send the token.
* Better transformation of errors thrown from the elasticsearch client
  These often contain detailed information that we were previously dropping. This will give better info on the migration finalization endpoint, but should give more information across all detection_engine endpoints in the case of an es client error.
* Finishing integration tests around finalization endpoint
  This led to a few changes in the responses from our different endpoints; mainly, we pass both the migration token AND its constituent parts to aid in debugging.
* Test an error case due to a reindexing failure
  This would be really hard to reproduce with an integration test since we'd need to generate a specific reindex failure. Much easier to stub some ES calls to exercise that code in a unit test.
* Remove unnecessary version info from signals documents
  We now record a single document-level version field. This represents the version of the document's _source, which is generated by our rule execution. When either a mapping _or_ a transformation is added, this version will be bumped such that new signals will contain the newest version, while the index itself may still contain the old mappings. The transformation pipeline will use the signal version to short-circuit unnecessary transformations.
* Migrate an index relative to the ACTUAL template version
  This handles the case where a user is attempting to migrate, but has not yet rolled over to the newest template. Running rules may insert "new" signals into an "old" index, but from the perspective of the app no migration is necessary in that case. If/when they roll over, the aforementioned index (and possibly older ones) will be qualified as outdated, and can be migrated.
* Enrich our migration_status endpoint with an is_outdated qualification
  This can be determined programmatically, but for users manually interpreting this response, the qualification will help.
* Update migration scripts
* More uniform version checking
  * getIndexVersion always returns a number
  * version comparisons use isOutdated
* Fix signal generation unit tests
  We now generate a version field to indicate the version under which the signal was created/migrated.
* Support reindex options to be sent to create_migration endpoint
  Rather than having to perform a manual reindex, this should give API users some control over the performance of their automated migration.
* Fix signal generation integration tests
  These were failing on our new signal field.
* Add unit tests for getMigrationStatus
* Add a basic test for getSignalsIndicesInRange
  Since this is ultimately just an aggregation query there's not much else to test.
* Add unit test for the naming of our destination migration index
* Handle write indices in our migration logic
  * Treat write indices as any other index in migration status endpoint
  * Migration API rejects requests containing write indices
  * Migration API rejects requests containing unknown/non-signals indices
* Add original hot phase to migration cleanup policy
  Without this phase, ILM gets confused as it tries to move to the delete phase and fails.
* Update old comment
  The referenced field has changed.
* Delete task document as part of finalization
* Accurately report recoverable errors on create_signals_migration route
  If we have a recoverable error: e.g. the destination index already exists, or a specified index is a write index, we now report those errors as part of the normal 200 response as these do not preclude other specified indices from being migrated. However, if non-signals indices are specified, we do continue to reject the entire request, as that's indicative of misuse of the endpoint.
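The error handling described in the final commit message above (recoverable errors reported per index, non-signals indices rejecting the whole request) might look roughly like the sketch below. All names here (`planMigrations`, `ClusterState`, the `migration_index` and `error` fields) are hypothetical stand-ins rather than the PR's actual types, and the cluster state is passed in as plain arrays instead of being fetched from Elasticsearch.

```typescript
// Hypothetical per-index result: either a destination index or a recoverable error.
interface MigrationResult {
  index: string;
  migration_index: string | null;
  error?: { message: string };
}

// Hypothetical snapshot of what the route would otherwise look up in Elasticsearch.
interface ClusterState {
  signalsIndices: string[]; // concrete indices behind the signals alias
  writeIndices: string[]; // current write indices of the alias
  existingIndices: string[]; // all index names that already exist
}

export const planMigrations = (
  requested: string[],
  state: ClusterState,
  version: number
): MigrationResult[] => {
  // Non-signals indices indicate misuse of the endpoint: reject the entire request.
  const unknown = requested.filter((index) => state.signalsIndices.indexOf(index) === -1);
  if (unknown.length > 0) {
    throw new Error(`The following indices are not signals indices: [${unknown.join(', ')}]`);
  }

  // Everything else is handled per index, so one failure does not block the others.
  return requested.map((index): MigrationResult => {
    if (state.writeIndices.indexOf(index) !== -1) {
      return { index, migration_index: null, error: { message: 'The index is a write index' } };
    }
    const destination = `${index}-r${`${version}`.padStart(6, '0')}`;
    if (state.existingIndices.indexOf(destination) !== -1) {
      return { index, migration_index: null, error: { message: 'The destination index already exists' } };
    }
    return { index, migration_index: destination };
  });
};

const exampleState: ClusterState = {
  signalsIndices: ['.siem-signals-default-000001', '.siem-signals-default-000002'],
  writeIndices: ['.siem-signals-default-000002'],
  existingIndices: ['.siem-signals-default-000001', '.siem-signals-default-000002'],
};

// The first index can be migrated; the second is reported as a recoverable error.
const plan = planMigrations(exampleState.signalsIndices, exampleState, 14);
```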

rylnd added release_note:enhancement v8.0.0 v7.11.0 Team:Detections and Resp Security Detection Response Team labels Dec 2, 2020

rylnd self-assigned this Dec 2, 2020

rylnd force-pushed the signals_migration branch 2 times, most recently from 1263fd1 to accc933 Compare December 4, 2020 19:45

rylnd added 23 commits December 7, 2020 10:49

WIP: basic reindexing works, lots of edge cases and TODOs to tackle

916268e

Add note

e7abd9e

Add version metadata to signals documents

6cc36ae

WIP: Starting over from the ground up

8982776

* Removes obsolete endpoints/functions
* Adds endpoint for checking the migration status of signals indices
* Adds helper functions to represent the logical pieces of answering that question

Fleshing out upgrade of signals

3f9cef0

* triggers reindex for each index
* starts implementing followup endpoint to "finalize" after reindexing is finished

Fleshing out more of the upgrade path

f415f9a

Still moving logic around a bunch.

Pad the version number of our destination migration index

5e49142

Instead of e.g. `.siem-signals-default-000001-r5`, this will generate `.siem-signals-default-000001-r000005`. This shouldn't matter much, but it may make it easier for users at a glance to see the story of each index.

Fleshing out more upgrade finalization

5bf2897

* Verifies that task matches the specified parameters
* Verifies that document counts are the same
* updates aliases
* finalization endpoint requires both source/dest indexes since we can't determine that from the task itself.
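The finalization checks listed above can be sketched as a pure function. This is only an illustration under assumptions: the types and the name `checkMigration` are hypothetical, and the task details and document counts are passed in directly rather than fetched from Elasticsearch.

```typescript
// Hypothetical summary of the reindex task backing a migration.
interface ReindexTask {
  completed: boolean;
  sourceIndex: string;
  destinationIndex: string;
  failures: string[];
}

interface FinalizationInput {
  sourceIndex: string;
  destinationIndex: string;
  task: ReindexTask;
  sourceDocCount: number;
  destinationDocCount: number;
}

interface FinalizationCheck {
  completed: boolean; // has the reindex task finished?
  errors: string[]; // empty when the migration is safe to finalize
}

export const checkMigration = (input: FinalizationInput): FinalizationCheck => {
  const { task, sourceIndex, destinationIndex } = input;

  // Nothing to validate until the reindex task has finished.
  if (!task.completed) {
    return { completed: false, errors: [] };
  }

  const errors = task.failures.slice();
  // The task must describe the same source/destination pair the caller specified.
  if (task.sourceIndex !== sourceIndex || task.destinationIndex !== destinationIndex) {
    errors.push('The reindex task does not match the specified indexes');
  }
  // Every document in the old index must have made it into the new one.
  if (input.sourceDocCount !== input.destinationDocCount) {
    errors.push(
      `Document counts differ: source has ${input.sourceDocCount}, destination has ${input.destinationDocCount}`
    );
  }
  return { completed: true, errors };
};

const finishedTask: ReindexTask = {
  completed: true,
  sourceIndex: '.siem-signals-default-000001',
  destinationIndex: '.siem-signals-default-000001-r000014',
  failures: [],
};

const validCheck = checkMigration({
  sourceIndex: '.siem-signals-default-000001',
  destinationIndex: '.siem-signals-default-000001-r000014',
  task: finishedTask,
  sourceDocCount: 100,
  destinationDocCount: 100,
});
```

Only when `completed` is true and `errors` is empty would the caller go on to update aliases and apply the deletion policy to the old index.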

Ensure that new signals are generated with an appropriate schema_version

36bf30f

Move more logic into component functions

7fad15f

Fix type errors

3be0e86

Add some JSDoc comments around our new functions

d6ce594

This is as much to get my thoughts in order as it is for posterity. Next: tests!

Adds integration tests around migration status route

8719981

* Adds io-ts schema for route params
* Adds es_archiver data to represent an outdated signals index

WIP: Fleshing out finalization tests

3e2e765

Implement encoding of migration details

dd0d6c3

This will be a slightly better API: rather than having to pass all three fields to finalize the migration, API users can instead send the token.
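One way to picture the token: the three fields are serialized into a single string that the client hands back when finalizing. The sketch below assumes the three fields are the source index, destination index, and task ID, and uses URI-encoded JSON purely for illustration; the actual encoding in this PR may differ (base64 would work equally well), and the function names are hypothetical.

```typescript
// Assumption: these are the three fields the token stands in for.
interface MigrationDetails {
  destinationIndex: string;
  sourceIndex: string;
  taskId: string;
}

// Serialize the details into a single string token (illustrative encoding only).
export const encodeMigrationToken = (details: MigrationDetails): string =>
  encodeURIComponent(
    JSON.stringify({
      destinationIndex: details.destinationIndex,
      sourceIndex: details.sourceIndex,
      taskId: details.taskId,
    })
  );

const isRecord = (value: unknown): value is Record<string, unknown> =>
  typeof value === 'object' && value !== null;

// Recover the details, rejecting tokens that do not contain all three string fields.
export const decodeMigrationToken = (token: string): MigrationDetails => {
  const parsed: unknown = JSON.parse(decodeURIComponent(token));
  if (
    !isRecord(parsed) ||
    typeof parsed.destinationIndex !== 'string' ||
    typeof parsed.sourceIndex !== 'string' ||
    typeof parsed.taskId !== 'string'
  ) {
    throw new Error('The migration token is invalid');
  }
  return {
    destinationIndex: parsed.destinationIndex,
    sourceIndex: parsed.sourceIndex,
    taskId: parsed.taskId,
  };
};

const exampleToken = encodeMigrationToken({
  destinationIndex: '.siem-signals-default-000001-r000014',
  sourceIndex: '.siem-signals-default-000001',
  taskId: 'node-1:12345', // hypothetical task ID
});
const decodedDetails = decodeMigrationToken(exampleToken);
```

Returning the constituent parts alongside the token keeps responses debuggable while the token itself stays the only thing a client must store.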

Finishing integration tests around finalization endpoint

00aef89

This led to a few changes in the responses from our different endpoints; mainly, we pass both the migration token AND its constituent parts to aid in debugging.

Test an error case due to a reindexing failure

024bed2

This would be really hard to reproduce with an integration test since we'd need to generate a specific reindex failure. Much easier to stub some ES calls to exercise that code in a unit test.

rylnd added 8 commits December 7, 2020 10:49

Enrich our migration_status endpoint with an is_outdated qualification

bc151de

This can be determined programmatically, but for users manually interpreting this response, the qualification will help.

Update migration scripts

12d1d0f

More uniform version checking

8474920

* getIndexVersion always returns a number
* version comparisons use isOutdated

Fix signal generation unit tests

3a590fc

We now generate a version field to indicate the version under which the signal was created/migrated.

Support reindex options to be sent to create_migration endpoint

d31c13e

Rather than having to perform a manual reindex, this should give API users some control over the performance of their automated migration.
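As a sketch of what passing reindex options through could mean, the snippet below builds reindex request parameters from a small options object. The option names follow the Elasticsearch reindex API (`requests_per_second` and `slices` as request parameters, `size` as the batch size under `source`); which options this endpoint actually accepts, and the helper name `buildReindexRequest`, are assumptions.

```typescript
// Assumption: the subset of reindex tuning options a caller may send.
interface ReindexOptions {
  requests_per_second?: number; // throttle, in sub-requests per second
  size?: number; // batch size read from the source index
  slices?: number; // number of parallel slices
}

interface ReindexRequest {
  wait_for_completion: false;
  requests_per_second?: number;
  slices?: number;
  body: {
    source: { index: string; size?: number };
    dest: { index: string };
  };
}

// Hypothetical helper: only include options the caller actually specified,
// so Elasticsearch defaults apply to everything else.
export const buildReindexRequest = (
  sourceIndex: string,
  destinationIndex: string,
  options: ReindexOptions = {}
): ReindexRequest => ({
  // Run as a background task; the response then carries a task ID instead of results.
  wait_for_completion: false,
  ...(options.requests_per_second !== undefined
    ? { requests_per_second: options.requests_per_second }
    : {}),
  ...(options.slices !== undefined ? { slices: options.slices } : {}),
  body: {
    source: {
      index: sourceIndex,
      ...(options.size !== undefined ? { size: options.size } : {}),
    },
    dest: { index: destinationIndex },
  },
});

const throttledRequest = buildReindexRequest(
  '.siem-signals-default-000001',
  '.siem-signals-default-000001-r000014',
  { requests_per_second: 500, size: 1000 }
);
```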

Fix signal generation integration tests

095cce5

These were failing on our new signal field.

Add unit tests for getMigrationStatus

2df6e24

Add a basic test for getSignalsIndicesInRange

70031ad

Since this is ultimately just an aggregation query there's not much else to test.

rylnd force-pushed the signals_migration branch from 1d1d674 to 70031ad Compare December 7, 2020 16:50

rylnd added 2 commits December 7, 2020 13:43

Add unit test for the naming of our destination migration index

d0c2634

Merge branch 'master' into signals_migration

63fb0e9

Conflicts: x-pack/plugins/security_solution/server/lib/detection_engine/routes/index/get_signals_template.ts

rylnd marked this pull request as ready for review December 7, 2020 20:05

rylnd requested review from a team as code owners December 7, 2020 20:05

Handle write indices in our migration logic

0701382

* Treat write indices as any other index in migration status endpoint
* Migration API rejects requests containing write indices
* Migration API rejects requests containing unknown/non-signals indices

rylnd commented Dec 7, 2020

marshallmain reviewed Dec 8, 2020

rylnd added 5 commits December 8, 2020 13:27

Add original hot phase to migration cleanup policy

3816f9f

Without this phase, ILM gets confused as it tries to move to the delete phase and fails.
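For orientation, a cleanup policy with both phases might be shaped like the object below. The 30-day deletion comes from the PR description; the contents of the hot phase (the rollover thresholds) are an assumption about what the "original" signals policy contains, and the constant name is hypothetical.

```typescript
// Sketch of an ILM policy body for obsolete (already migrated) signals indexes.
export const migrationCleanupPolicy = {
  policy: {
    phases: {
      // Assumption: copied from the regular signals policy so that ILM can
      // transition from the phase the index is already in.
      hot: {
        min_age: '0ms',
        actions: {
          rollover: { max_size: '50gb', max_age: '30d' },
        },
      },
      // Delete the old index 30 days later, leaving time to recover data.
      delete: {
        min_age: '30d',
        actions: {
          delete: {},
        },
      },
    },
  },
};
```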

Update old comment

1fcb1d1

The referenced field has changed.

Delete task document as part of finalization

45a3af7

Merge branch 'master' into signals_migration

6762afa

marshallmain approved these changes Dec 10, 2020

rylnd merged commit fbe4822 into elastic:master Dec 10, 2020

rylnd deleted the signals_migration branch December 10, 2020 19:12

rylnd mentioned this pull request Dec 10, 2020

[7.x] [Security Solution][Detections] Signals Migration API (#84721) #85624

Merged

rylnd mentioned this pull request Dec 11, 2020

[SecuritySolution][Detections] Adds SavedObject persistence to Signals Migrations #85690

Merged
