-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SECURITY SOLUTION] [Detections] Increase lookback when gap is detected #68339
Conversation
@dhurley Great to see this! Just to confirm, this change is not performing any "chunking" of queries during the "catch-up" period, correct? Thus the adjustment to Likewise, a 15-minute gap on a 5-minute interval would result in a 20-minute query with a However, when the 20-minute query runs, the While this is not ideal, it is OK, since our first priority is to eliminate gaps that could cause missed detections, and in the case where there are large number of detections, we have accomplished that. Are there plans to take any specific action if the catch-up query times out or otherwise fails? |
@MikePaquette currently we do “chunk” by 100 events per search, but you are correct in that the implementation in this PR could hit the new calculated max signals before reaching the events which occurred within the “gap” period, despite the changes made to perform a search over the gapped period. So with that case in mind, a new time-based “chunking” could be written to modify not only the An example of this new time-based chunking (for this example assume rule interval every 5 minutes and 1 minute gap occurs) would perform two searches:
Juxtapose this with the current implementation in this PR which would yield search from and to parameters of I think this would solve the case you mention above but let me know if I'm misunderstanding. As far as timeouts related to the search taking too long, I don't think there is much in the way I can do about that situation. I think the only thing we have right now is setting the rule status state to ‘failed’ with the error message that the search timed out. |
@dhurley14 yes, what you proposed above would address the concern. If it is not too difficult or risky, then I'd vote to go with the updated approach. Thanks! |
db56b9a
to
19c9f35
Compare
Pinging @elastic/siem (Team:SIEM) |
00fd85a
to
56a933b
Compare
jenkins test this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dhurley14 I checked out behavior by changing newFrom
as advised in the comments. I did not test intricacies involved with gap detection other than to verify rules continue to run and generate signals in the case of a gap. Unit tests look good though so 👍 ; let me know if there are interesting cases you’d like specific review on.
Regarding rule status in the case of a gap error, I was reminded of #62383 which explains our current behavior there 😉
try { | ||
logger.debug(`sortIds: ${sortId}`); | ||
const { | ||
// @ts-ignore https://github.com/microsoft/TypeScript/issues/35546 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not getting an error if I remove this line; what are you seeing?
searchDuration, | ||
}: { searchResult: SignalSearchResponse; searchDuration: string } = await singleSearchAfter( | ||
{ | ||
// @ts-ignore we are using sortId before being assigned but that's ok. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto here; I'm not seeing an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mm yeah this might have been leftover from when I was testing things. It was complaining I was using sortId before assignment but that is not the case anymore. thanks!
@@ -28,7 +28,9 @@ export const filterEventsAgainstList = async ({ | |||
eventSearchResult, | |||
}: FilterEventsAgainstList): Promise<SignalSearchResponse> => { | |||
try { | |||
logger.debug(`exceptionsList: ${JSON.stringify(exceptionsList, null, 4)}`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: two spaces for indentation
…e missed when gap in consecutive rule runs is detected
…r diff, adds calculatedFrom to the search after query
… so i removed one of them
…a better way to test this
…ill need search_after because a user could submit a rule with a custom maxSignals so that would still serve a purpose. This needs heavy refactoring though, and tests.
…we guarantee maxSignals per full rule interval. Needs some refactoring though.
…intervals for searching to occur
…its but we were accessing property on non-existent hit item
… lookback time, also fixes a bug where the search and bulk loop would return false when successful.
06dd84e
to
eadef46
Compare
💚 Build SucceededBuild metrics
History
To update your PR or re-run it, just comment with: |
…ed (elastic#68339) * add POC logic to modify the 'from' param in the search * fixes formatting for appending gap diff to from * computes new max signals based on how many intervals of rule runs were missed when gap in consecutive rule runs is detected * adds logging, fixes bug where we could end up with negative values for diff, adds calculatedFrom to the search after query * remove console.log and for some reason two eslint disables were added so i removed one of them * rename variables, add test based on log message - need to figure out a better way to test this * remove unused import * fully re-worked the algorithm for searching discrete time periods, still need search_after because a user could submit a rule with a custom maxSignals so that would still serve a purpose. This needs heavy refactoring though, and tests. * updated loop to include maxSignals per time interval tuple, this way we guarantee maxSignals per full rule interval. Needs some refactoring though. * move logic into utils function, utils function still needs refactoring * adds unit tests and cleans up new util function for determining time intervals for searching to occur * more code cleanup * remove more logging statements * fix type errors * updates unit tests and fixes bug where search result would return 0 hits but we were accessing property on non-existent hit item * fix rebase conflict * fixes a bug where a negative gap could exist if a rule ran before the lookback time, also fixes a bug where the search and bulk loop would return false when successful. * gap is a duration, not a number. * remove logging variable * remove logging function from test * fix type import from rebase with master * updates missed test when rebased with master, removes unused import * modify log statements to include meta information for logged rule events, adds tests * remove unnecessary ts-ignores * indentation on stringify * adds a test to ensure we are parsing the elapsed time correctly
…detected (#68339) (#70371) * add POC logic to modify the 'from' param in the search * fixes formatting for appending gap diff to from * computes new max signals based on how many intervals of rule runs were missed when gap in consecutive rule runs is detected * adds logging, fixes bug where we could end up with negative values for diff, adds calculatedFrom to the search after query * remove console.log and for some reason two eslint disables were added so i removed one of them * rename variables, add test based on log message - need to figure out a better way to test this * remove unused import * fully re-worked the algorithm for searching discrete time periods, still need search_after because a user could submit a rule with a custom maxSignals so that would still serve a purpose. This needs heavy refactoring though, and tests. * updated loop to include maxSignals per time interval tuple, this way we guarantee maxSignals per full rule interval. Needs some refactoring though. * move logic into utils function, utils function still needs refactoring * adds unit tests and cleans up new util function for determining time intervals for searching to occur * more code cleanup * remove more logging statements * fix type errors * updates unit tests and fixes bug where search result would return 0 hits but we were accessing property on non-existent hit item * fix rebase conflict * fixes a bug where a negative gap could exist if a rule ran before the lookback time, also fixes a bug where the search and bulk loop would return false when successful. * gap is a duration, not a number. * remove logging variable * remove logging function from test * fix type import from rebase with master * updates missed test when rebased with master, removes unused import * modify log statements to include meta information for logged rule events, adds tests * remove unnecessary ts-ignores * indentation on stringify * adds a test to ensure we are parsing the elapsed time correctly
Pinging @elastic/security-solution (Team: SecuritySolution) |
Summary
Gap detection remediation strategy option 1 from this discussion
When gaps occur in detections we need to make sure that we generate
signalsalerts during those gaps, if such alerts exist. In order to do this I have added a function to determine how many discrete time periods (not exceeding the rule interval) to search, given a gap duration. We then iterate over these time period 'tuples' to do an exclusive search and bulk create over each gapped rule interval. My hope is this will resolve the issue of not creating alerts during a gap period even if we search over it.This implementation also caps the number of full rule interval gaps at 4. If a gap equivalent to more than 4 missed rule executions is present, the rule executor will not look at events that occurred more than 4 rule intervals back in search.
For example if a rule runs every 5 minutes, but has not run in the past 25 minutes, we will only look at the last 20 minutes so as to cap how much data we are looking back so we attempt to prevent search timeouts from occurring. Obviously this method is still open to discussion if there are proposals / addendums people think would be good.
Because we are still preventing the executor from searching a gap greater than 4 full rule interval runs, it is still possible even more alerts will be missed. So we are keeping the alert in the "failed" state when a gap is detected in order to alert the analyst of the existence of the gap, despite our attempts at resolving the gap. From there the analyst could determine if task manager is overextended or if they just need to manually adjust their cluster settings.
Checklist
Delete any items that are not applicable to this PR.
For maintainers