[SECURITY SOLUTION] [Detections] Increase lookback when gap is detected #68339

dhurley14 · 2020-06-05T01:11:47Z

Summary

Gap detection remediation strategy option 1 from this discussion

When gaps occur between rule executions, we need to make sure we still generate ~~signals~~ alerts for those gaps, if such alerts exist. To do this, I have added a function that, given a gap duration, determines how many discrete time periods (none exceeding the rule interval) to search. We then iterate over these time-period 'tuples', doing an exclusive search and bulk create over each gapped rule interval. My hope is that this resolves the issue of alerts not being created for a gap period even when we search over it.

This implementation also caps the number of full rule interval gaps at 4. If the gap is equivalent to more than 4 missed rule executions, the rule executor will not search for events that occurred more than 4 rule intervals back.

For example, if a rule runs every 5 minutes but has not run in the past 25 minutes, we will only look at the last 20 minutes. Capping the lookback limits how much data we search and is an attempt to prevent search timeouts. This method is still open to discussion if people have proposals or addenda they think would be good.
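
To make the windowing concrete, here is a minimal TypeScript sketch of how a gap could be split into discrete search windows. It is not the code from this PR: `getSearchWindows`, `SearchWindow`, and `maxIntervals` are hypothetical names, and it assumes the cap applies to the total lookback (regular interval plus gap), which is one reading of the 25-minute example above.

```typescript
// Hypothetical sketch; names and shapes are illustrative, not taken from the PR.
interface SearchWindow {
  from: Date; // inclusive start of the window
  to: Date; // exclusive end of the window
}

function getSearchWindows(
  now: Date,
  intervalMs: number,
  gapMs: number,
  maxIntervals: number = 4 // assumed cap on total lookback, in rule intervals
): SearchWindow[] {
  if (intervalMs <= 0) {
    throw new Error('intervalMs must be positive');
  }
  // Total lookback is the regular interval plus the gap, capped at maxIntervals.
  const lookbackMs = Math.min(intervalMs + Math.max(gapMs, 0), maxIntervals * intervalMs);

  const windows: SearchWindow[] = [];
  let to = now.getTime();
  let remaining = lookbackMs;
  // Walk backwards from "now", carving off windows no longer than one interval.
  while (remaining > 0) {
    const size = Math.min(intervalMs, remaining);
    windows.unshift({ from: new Date(to - size), to: new Date(to) });
    to -= size;
    remaining -= size;
  }
  return windows; // oldest window first
}
```

Under these assumptions, a 5 minute interval with a 20 minute gap yields four 5 minute windows covering the last 20 minutes, and a 1 minute gap yields a 1 minute window followed by the regular 5 minute window.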

Because we still prevent the executor from searching a gap greater than 4 full rule intervals, it is still possible that alerts will be missed. So we keep the alert in the "failed" state when a gap is detected, to notify the analyst that the gap exists despite our attempts at resolving it. From there the analyst can determine whether task manager is overextended or whether they just need to adjust their cluster settings manually.

Checklist

Delete any items that are not applicable to this PR.

Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenario

For maintainers

This was checked for breaking API changes and was labeled appropriately

MikePaquette · 2020-06-09T12:21:42Z

@dhurley Great to see this!

Just to confirm, this change is not performing any "chunking" of queries during the "catch-up" period, correct?

Thus the adjustment to max_signals is calculated as an average across the catch-up period. In your example, a 1 minute gap on a 5 minute interval will result in a 6-minute query with a max_signals that is 1.2x the original value.

Likewise, a 15-minute gap on a 5-minute interval would result in a 20-minute query with a max_signals that is 4x its original value.
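
The arithmetic in these two examples can be written as a small helper. This is only a sketch of the scaling as described in this comment: `scaledMaxSignals` is a hypothetical name, and rounding up to a whole number is an assumption.

```typescript
// Hypothetical helper; the name and the rounding choice are assumptions.
function scaledMaxSignals(
  maxSignals: number,
  intervalMinutes: number,
  gapMinutes: number
): number {
  // The catch-up query spans the regular interval plus the gap.
  const queryMinutes = intervalMinutes + Math.max(gapMinutes, 0);
  // Scale max_signals by (query length / interval length), rounding up.
  return Math.ceil((maxSignals * queryMinutes) / intervalMinutes);
}

console.log(scaledMaxSignals(100, 5, 1)); // 6-minute query: 120 (1.2x)
console.log(scaledMaxSignals(100, 5, 15)); // 20-minute query: 400 (4x)
```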

However, when the 20-minute query runs, the max_signals is not spread across the 4 intervals, but is applied based on the query results from newest to oldest, and if the 4x value of max_signals is reached before the end of, say, the first interval, then the user might see 4x max_signals signals in one time interval, correct?

While this is not ideal, it is OK, since our first priority is to eliminate gaps that could cause missed detections, and in the case where there are a large number of detections, we have accomplished that.

Are there plans to take any specific action if the catch-up query times out or otherwise fails?

dhurley14 · 2020-06-09T16:43:05Z

@MikePaquette currently we do “chunk” by 100 events per search, but you are correct in that the implementation in this PR could hit the new calculated max signals before reaching the events which occurred within the “gap” period, despite the changes made to perform a search over the gapped period. So with that case in mind, a new time-based “chunking” could be written to modify not only the from parameter in the search_after but the to as well, to ensure a search is explicitly performed on the missed gap(s).

An example of this new time-based chunking (for this example assume rule interval every 5 minutes and 1 minute gap occurs) would perform two searches:

one with 'from': 'now-6m' and 'to': 'now-5m' (with a max_signals of 100 for this search), which covers the 1 minute "gap",
and then the normal interval search for the example rule, with 'from': 'now-5m' and 'to': 'now'.

Juxtapose this with the current implementation in this PR which would yield search from and to parameters of 'from': 'now-6m' and 'to':'now'.
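
A small in-memory sketch of the difference between the two approaches, under stated assumptions: `RuleEvent`, `TimeRange`, `searchNewestFirst`, and `searchInChunks` are hypothetical names, event ages stand in for timestamps, and the array filter stands in for the real Elasticsearch query.

```typescript
// Illustrative only: an in-memory stand-in for the real search, not code from this PR.
interface RuleEvent {
  id: string;
  minutesAgo: number; // event age relative to "now"
}

interface TimeRange {
  fromMinutesAgo: number; // older bound, inclusive (like 'from': 'now-6m')
  toMinutesAgo: number; // newer bound, exclusive (like 'to': 'now-5m')
}

// Return up to `limit` events inside the range, newest first.
function searchNewestFirst(
  events: RuleEvent[],
  range: TimeRange,
  limit: number
): RuleEvent[] {
  return events
    .filter((e) => e.minutesAgo <= range.fromMinutesAgo && e.minutesAgo > range.toMinutesAgo)
    .sort((a, b) => a.minutesAgo - b.minutesAgo)
    .slice(0, limit);
}

// Search each range separately so every range gets its own max_signals budget.
function searchInChunks(
  events: RuleEvent[],
  ranges: TimeRange[],
  maxSignalsPerRange: number
): RuleEvent[] {
  const found: RuleEvent[] = [];
  for (const range of ranges) {
    found.push(...searchNewestFirst(events, range, maxSignalsPerRange));
  }
  return found;
}
```

For instance, with events 1, 2, and 3 minutes old plus one event 5.5 minutes old, a single search over the last 6 minutes with a budget of 3 returns only the three newest events and never reaches the one in the gap. Searching the gap range (6 to 5 minutes ago) and the regular range (5 to 0 minutes ago) separately, each with a budget of 2, returns the gap event plus the two newest events.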

I think this would solve the case you mention above but let me know if I'm misunderstanding.

As for timeouts caused by the search taking too long, I don't think there is much I can do about that situation. I think the only thing we have right now is setting the rule status to ‘failed’ with an error message saying that the search timed out.

MikePaquette · 2020-06-09T18:34:18Z

@dhurley14 yes, what you proposed above would address the concern. If it is not too difficult or risky, then I'd vote to go with the updated approach. Thanks!

elasticmachine · 2020-06-24T13:47:13Z

Pinging @elastic/siem (Team:SIEM)

dhurley14 · 2020-06-26T13:12:38Z

jenkins test this

rylnd

@dhurley14 I checked out behavior by changing newFrom as advised in the comments. I did not test intricacies involved with gap detection other than to verify rules continue to run and generate signals in the case of a gap. Unit tests look good though so 👍 ; let me know if there are interesting cases you’d like specific review on.

Regarding rule status in the case of a gap error, I was reminded of #62383 which explains our current behavior there 😉

rylnd · 2020-06-29T16:49:21Z

...ck/plugins/security_solution/server/lib/detection_engine/signals/search_after_bulk_create.ts

+      try {
+        logger.debug(`sortIds: ${sortId}`);
+        const {
+          // @ts-ignore https://github.com/microsoft/TypeScript/issues/35546


I'm not getting an error if I remove this line; what are you seeing?

rylnd · 2020-06-29T16:52:57Z

...ck/plugins/security_solution/server/lib/detection_engine/signals/search_after_bulk_create.ts

+          searchDuration,
+        }: { searchResult: SignalSearchResponse; searchDuration: string } = await singleSearchAfter(
+          {
+            // @ts-ignore we are using sortId before being assigned but that's ok.


Ditto here; I'm not seeing an error.

dhurley14 · 2020-06-30

Mm, yeah, this might have been left over from when I was testing things. It was complaining that I was using sortId before assignment, but that is no longer the case. Thanks!

rylnd · 2020-06-29T16:56:54Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/filter_events_with_list.ts

@@ -28,7 +28,9 @@ export const filterEventsAgainstList = async ({
  eventSearchResult,
 }: FilterEventsAgainstList): Promise<SignalSearchResponse> => {
  try {
+    logger.debug(`exceptionsList: ${JSON.stringify(exceptionsList, null, 4)}`);


Nit: two spaces for indentation

kibanamachine · 2020-06-30T20:04:28Z

💚 Build Succeeded

continuous-integration/kibana-ci/pull-request
Commit: 0d36ebe

Build metrics

✅ unchanged

History

💚 Build #57556 succeeded 7a6ada6
💚 Build #56662 succeeded 660fa98
💔 Build #56456 failed 660fa98
💔 Build #56400 failed 56a933b
💔 Build #56354 failed 7bbb620

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

…ed (elastic#68339)

* add POC logic to modify the 'from' param in the search
* fixes formatting for appending gap diff to from
* computes new max signals based on how many intervals of rule runs were missed when gap in consecutive rule runs is detected
* adds logging, fixes bug where we could end up with negative values for diff, adds calculatedFrom to the search after query
* remove console.log and for some reason two eslint disables were added so i removed one of them
* rename variables, add test based on log message - need to figure out a better way to test this
* remove unused import
* fully re-worked the algorithm for searching discrete time periods, still need search_after because a user could submit a rule with a custom maxSignals so that would still serve a purpose. This needs heavy refactoring though, and tests.
* updated loop to include maxSignals per time interval tuple, this way we guarantee maxSignals per full rule interval. Needs some refactoring though.
* move logic into utils function, utils function still needs refactoring
* adds unit tests and cleans up new util function for determining time intervals for searching to occur
* more code cleanup
* remove more logging statements
* fix type errors
* updates unit tests and fixes bug where search result would return 0 hits but we were accessing property on non-existent hit item
* fix rebase conflict
* fixes a bug where a negative gap could exist if a rule ran before the lookback time, also fixes a bug where the search and bulk loop would return false when successful.
* gap is a duration, not a number.
* remove logging variable
* remove logging function from test
* fix type import from rebase with master
* updates missed test when rebased with master, removes unused import
* modify log statements to include meta information for logged rule events, adds tests
* remove unnecessary ts-ignores
* indentation on stringify
* adds a test to ensure we are parsing the elapsed time correctly

…detected (#68339) (#70371)

elasticmachine · 2021-09-23T14:33:11Z

Pinging @elastic/security-solution (Team: SecuritySolution)

dhurley14 force-pushed the gap-inc-look-back branch 2 times, most recently from db56b9a to 19c9f35 Compare June 24, 2020 01:49

dhurley14 self-assigned this Jun 24, 2020

dhurley14 added release_note:enhancement review Team:SIEM v7.9.0 v8.0.0 labels Jun 24, 2020

dhurley14 marked this pull request as ready for review June 24, 2020 13:47

dhurley14 requested review from a team as code owners June 24, 2020 13:47

dhurley14 force-pushed the gap-inc-look-back branch 2 times, most recently from 00fd85a to 56a933b Compare June 25, 2020 16:44

rylnd approved these changes Jun 29, 2020

dhurley14 added 11 commits June 30, 2020 11:17

add POC logic to modify the 'from' param in the search

a023280

fixes formatting for appending gap diff to from

8ce0f95

computes new max signals based on how many intervals of rule runs were missed when gap in consecutive rule runs is detected

e0db4e5

adds logging, fixes bug where we could end up with negative values for diff, adds calculatedFrom to the search after query

4a85ba0

remove console.log and for some reason two eslint disables were added so i removed one of them

9c0e825

rename variables, add test based on log message - need to figure out a better way to test this

4670f6b

remove unused import

4ce4a25

fully re-worked the algorithm for searching discrete time periods, still need search_after because a user could submit a rule with a custom maxSignals so that would still serve a purpose. This needs heavy refactoring though, and tests.

3c9587c

updated loop to include maxSignals per time interval tuple, this way we guarantee maxSignals per full rule interval. Needs some refactoring though.

889382a

move logic into utils function, utils function still needs refactoring

b336202

adds unit tests and cleans up new util function for determining time intervals for searching to occur

f7b0318

dhurley14 added 12 commits June 30, 2020 11:17

more code cleanup

9072963

remove more logging statements

dfecacd

fix type errors

f5111ed

updates unit tests and fixes bug where search result would return 0 hits but we were accessing property on non-existent hit item

28fd2fe

fix rebase conflict

b47318a

fixes a bug where a negative gap could exist if a rule ran before the lookback time, also fixes a bug where the search and bulk loop would return false when successful.

59a254a

gap is a duration, not a number.

2c201f6

remove logging variable

61e3fae

remove logging function from test

c681b09

fix type import from rebase with master

3898a80

updates missed test when rebased with master, removes unused import

c1b44da

modify log statements to include meta information for logged rule events, adds tests

eadef46

dhurley14 force-pushed the gap-inc-look-back branch from 06dd84e to eadef46 Compare June 30, 2020 15:24

dhurley14 added 3 commits June 30, 2020 11:35

remove unnecessary ts-ignores

7a6ada6

indentation on stringify

453f514

adds a test to ensure we are parsing the elapsed time correctly

0d36ebe

dhurley14 merged commit 432f93a into elastic:master Jun 30, 2020

dhurley14 deleted the gap-inc-look-back branch June 30, 2020 20:43

dhurley14 mentioned this pull request Jun 30, 2020

[7.x] [SECURITY SOLUTION] [Detections] Increase lookback when gap is detected (#68339) #70371

Merged

dhurley14 mentioned this pull request Jul 2, 2020

[SIEM] [Detections] Gap detection mitigation and remediation summary #63290

Closed

dhurley14 mentioned this pull request Jul 16, 2020

[SIEM] [Detections] Fixes faulty circuit breaker #71999

Merged

3 tasks

MindyRS added the Team: SecuritySolution label Sep 23, 2021

spong mentioned this pull request Feb 10, 2023

Add details around rule execution expectations when performing upgrades to Upgrade Elastic Security docs elastic/security-docs#2964

Open
