Refactor invocation of Action listeners in correlations #880

Merged: 3 commits merged into opensearch-project:main on Mar 6, 2024

goyamegh (Collaborator) commented Mar 4, 2024

Description

This PR fixes the hanging tasks observed in _cat/tasks. It refactors the correlation workflow so that parent action listeners are always closed, both on successful completion and when an exception is encountered, and it consolidates the exception-handling logic into a single centralized function. The aim is to stop correlation tasks from lingering and to make the workflow more reliable.

The change was tested against a high indexing workload (approx. 1M docs/minute) on a cluster of 3 or more data nodes, where the issue was most prominent. Findings were generated by a CloudTrail logs detector running all 32 pre-packaged rules at a frequency of 1 minute. Correlations were then generated with a single rule on the same log type, with findings produced at a rate of 1~2k per minute.

Issues Resolved

#879

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov bot commented Mar 4, 2024

Codecov Report

Attention: Patch coverage is 0%, with 648 lines in your changes missing coverage. Please review.

Project coverage is 25.04%. Comparing base (172d58d) to head (612b1ca).
Report is 1 commit behind head on main.

Files                                                    Patch %   Lines
...yanalytics/correlation/VectorEmbeddingsEngine.java    0.00%     282 Missing ⚠️
...arch/securityanalytics/correlation/JoinEngine.java    0.00%     191 Missing ⚠️
...ics/transport/TransportCorrelateFindingAction.java    0.00%     173 Missing ⚠️
...arch/securityanalytics/logtype/LogTypeService.java    0.00%     0 Missing and 1 partial ⚠️
...rch/securityanalytics/util/CorrelationIndices.java    0.00%     1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #880      +/-   ##
============================================
+ Coverage     24.79%   25.04%   +0.25%     
- Complexity     1026     1029       +3     
============================================
  Files           277      277              
  Lines         12702    12579     -123     
  Branches       1394     1373      -21     
============================================
+ Hits           3149     3151       +2     
+ Misses         9288     9164     -124     
+ Partials        265      264       -1     


Iterator<SearchHit> hits = response.getHits().iterator();
List<CorrelationRule> correlationRules = new ArrayList<>();
while (hits.hasNext()) {
try {
Member

nit: The try-catch is redundant because we are using ActionListener.wrap(), which catches generic Exception.

Collaborator Author

Removed.
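
For context on the nit above: a listener built with ActionListener.wrap() runs the response handler inside its own try/catch and forwards anything it throws to the failure handler. The sketch below is a simplified stand-in written for illustration, not the OpenSearch class itself; `MiniListener`, `CheckedConsumer`, `wrap` and `WrapDemo` are hypothetical names.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class WrapDemo {
    // Response handler that is allowed to throw a checked exception.
    interface CheckedConsumer<T> {
        void accept(T value) throws Exception;
    }

    // Hypothetical, simplified stand-in for a listener created with ActionListener.wrap().
    interface MiniListener<T> {
        void onResponse(T response);

        void onFailure(Exception e);
    }

    static <T> MiniListener<T> wrap(CheckedConsumer<T> responseHandler, Consumer<Exception> failureHandler) {
        return new MiniListener<T>() {
            @Override
            public void onResponse(T response) {
                try {
                    responseHandler.accept(response);
                } catch (Exception e) {
                    // Anything thrown by the response handler is routed to the failure handler.
                    failureHandler.accept(e);
                }
            }

            @Override
            public void onFailure(Exception e) {
                failureHandler.accept(e);
            }
        };
    }

    // Runs a response handler that throws and returns the failure messages that were recorded.
    static List<String> run() {
        List<String> failures = new ArrayList<>();
        MiniListener<String> listener = wrap(
            response -> { throw new IOException("parse failed for " + response); },
            e -> failures.add(e.getMessage()));
        listener.onResponse("rule-1"); // no extra try-catch around the handler body
        return failures;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [parse failed for rule-1]
    }
}
```

Under this assumption, an inner try-catch that only forwards to the failure handler duplicates what the wrapper already does.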

CorrelationRule rule = CorrelationRule.parse(xcp, hit.getId(), hit.getVersion());
correlationRules.add(rule);
} catch (IOException e) {
onFailure(e);
Member

This catch block is misplaced: after onFailure() we will continue iterating the loop. Please remove this try-catch.

Collaborator Author

Umm, that's not true. Once an exception is raised, the loop will break for this thread. Since we are closing the parent listener, we should be good.

Anyway, I removed it.

Member

The try-catch was within the for loop, so the flow of control would have returned to the loop on this thread.
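
The disagreement above comes down to plain Java control flow: a catch inside the loop body handles the exception and lets the loop move on to the next hit, so the failure callback can fire more than once. A minimal sketch, in which `parse` is a made-up parser standing in for CorrelationRule.parse and the counters stand in for calls to onFailure (all names here are hypothetical):

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

final class CatchInLoopDemo {
    // Hypothetical parser: rejects any hit whose id starts with "bad".
    static String parse(String hit) throws IOException {
        if (hit.startsWith("bad")) {
            throw new IOException("cannot parse " + hit);
        }
        return hit.toUpperCase();
    }

    // try/catch inside the loop: every bad hit reports a failure and iteration continues.
    static int failuresInsideLoop(List<String> hits) {
        int failures = 0;
        for (String hit : hits) {
            try {
                parse(hit);
            } catch (IOException e) {
                failures++; // stands in for onFailure(e)
            }
        }
        return failures;
    }

    // try/catch around the loop: the first bad hit ends the iteration.
    static int failuresOutsideLoop(List<String> hits) {
        int failures = 0;
        try {
            for (String hit : hits) {
                parse(hit);
            }
        } catch (IOException e) {
            failures++; // stands in for onFailure(e)
        }
        return failures;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("ok-1", "bad-1", "ok-2", "bad-2");
        System.out.println(failuresInsideLoop(hits));  // prints 2
        System.out.println(failuresOutsideLoop(hits)); // prints 1
    }
}
```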


CorrelationRule rule = CorrelationRule.parse(xcp, hit.getId(), hit.getVersion());
correlationRules.add(rule);
} catch (IOException e) {
Member

add error log

Collaborator Author

onFailure() calls TransportCorrelateFindingAction.onFailures(), which makes sure that we log this, and only once in the lifetime of the task.

Member

We don't need to log in generic failure-handling code blocks. We should either log in the method, so that we can communicate where the exception came from, or pass a custom error message to onFailure. (I would prefer the second approach, but it would require a lot more change in the implementation, so I was suggesting a simple error log.)

Collaborator Author

Makes sense. For now, I have logged the following information:

  1. Exception and stack trace
  2. Monitor and finding id of that request
  3. Trace should tell us where the exception originated from.

I'm not reverting this for now, as we are missing logging completely at this point. As a long-term fix, I created a GitHub issue to take this up: #883
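
The idea discussed in this thread, a single failure path that logs the monitor id and finding id and closes the parent listener only once per task, could be sketched roughly as below. This is not the plugin's actual implementation: `CorrelationTask`, its fields and the in-memory `logLines` list (standing in for log.error) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

final class SingleFailureDemo {
    // Hypothetical per-request task state; the real action class is not shown in this thread.
    static final class CorrelationTask {
        private final AtomicBoolean finished = new AtomicBoolean(false);
        private final String monitorId;
        private final String findingId;
        private final Consumer<Exception> parentOnFailure;
        final List<String> logLines = new ArrayList<>(); // stands in for log.error(...)

        CorrelationTask(String monitorId, String findingId, Consumer<Exception> parentOnFailure) {
            this.monitorId = monitorId;
            this.findingId = findingId;
            this.parentOnFailure = parentOnFailure;
        }

        // Central failure path: only the first caller logs and closes the parent listener.
        void onFailures(Exception e) {
            if (finished.compareAndSet(false, true)) {
                logLines.add("Exception in correlation workflow for monitor " + monitorId
                    + ", finding " + findingId + ": " + e);
                parentOnFailure.accept(e);
            }
        }
    }

    public static void main(String[] args) {
        List<Exception> parentCalls = new ArrayList<>();
        CorrelationTask task = new CorrelationTask("monitor-1", "finding-1", parentCalls::add);
        task.onFailures(new IllegalStateException("search failed"));
        task.onFailures(new IllegalStateException("second failure from another callback"));
        System.out.println(parentCalls.size());   // prints 1
        System.out.println(task.logLines.size()); // prints 1
    }
}
```

The compare-and-set guard is one way to get the "once in the lifetime of the task" behaviour when several asynchronous callbacks can fail independently.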

getValidDocuments(detectorType, indices, correlationRules, relatedDocIds, autoCorrelations);
client.search(searchRequest, ActionListener.wrap(response -> {
if (response.isTimedOut()) {
onFailure(new OpenSearchStatusException("Search request timed out", RestStatus.REQUEST_TIMEOUT));
Member

Can we throw this exception?

Collaborator Author

onFailure() will end up throwing the exception in the finishHim() function in TransportCorrelateFindingAction.java.

});
getValidDocuments(detectorType, indices, correlationRules, relatedDocIds, autoCorrelations);
}, e -> {
log.error("[CORRELATIONS] Exception encountered while searching correlation rule index for finding id {}",
Member

The actionListener.onFailure() operation is not surrounded by a try-catch because it is terminal, unlike onResponse, which catches generic Exception. You need to add a try-catch if you are not directly invoking onFailure.

Please add a try-catch here and anywhere else where you have additional business logic in the failure consumer.

Collaborator Author

Makes sense. Added in getTimesStampFeature() too.
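
One way to read the request above: when a failure consumer does more than forward the exception (for example, it builds a log message first), that extra logic should be guarded so the parent listener is still notified if it throws. A minimal sketch under that assumption; `guarded`, `extraLogic` and `parentOnFailure` are hypothetical names, not code from this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class GuardedFailureDemo {
    // Wraps extra failure-path logic so the parent listener is notified even if that logic throws.
    static Consumer<Exception> guarded(Consumer<Exception> extraLogic, Consumer<Exception> parentOnFailure) {
        return e -> {
            try {
                extraLogic.accept(e);
            } catch (Exception inner) {
                if (inner != e) {
                    e.addSuppressed(inner); // keep the secondary error for debugging
                }
            }
            parentOnFailure.accept(e); // always reached, so the parent listener is closed
        };
    }

    // Simulates a failure consumer whose logging step itself throws.
    static List<Exception> run() {
        List<Exception> received = new ArrayList<>();
        Consumer<Exception> failureConsumer = guarded(
            e -> { throw new IllegalStateException("logging blew up"); },
            received::add);
        failureConsumer.accept(new RuntimeException("search failed"));
        return received;
    }

    public static void main(String[] args) {
        List<Exception> received = run();
        System.out.println(received.size());                          // prints 1
        System.out.println(received.get(0).getMessage());             // prints search failed
        System.out.println(received.get(0).getSuppressed().length);   // prints 1
    }
}
```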

response.getResponse().getHits().getHits(), validFields.get(idx)));
}
++idx;
if (response.getResponse().getHits().getTotalHits().value > 0L) {
Member

should we be checking for hits.length here also?

Collaborator Author

Corrected.

onFailures(e);
SearchHits hits = response.getHits();
// Detectors Index hits count could be more even if we fetch one
if (hits.getTotalHits().value >= 1 && hits.getHits().length > 0) {
Member

This check looks different in every place.
Why can't we just iterate over the hits with a for loop and not do any of these checks?

Collaborator Author

I think the problem is that the usage is different everywhere. This usage expects only a single doc, which is probably why it was written like this. I can take a quick pass at all usages of getTotalHits() in a follow-up and fix this everywhere to make it consistent.

Removing the check on total hits entirely.
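
To illustrate why the array length is the check that matters here: the total hit count describes all matching documents, while the hits array holds only what was returned, so the two can disagree. The sketch below uses a hypothetical `Hits` stand-in rather than the real SearchHits class, and `firstId` is an illustrative helper.

```java
import java.util.Optional;

final class HitsCheckDemo {
    // Hypothetical stand-in for a search result: totalHits counts all matches,
    // while ids holds only the hits actually returned in this response.
    static final class Hits {
        final long totalHits;
        final String[] ids;

        Hits(long totalHits, String... ids) {
            this.totalHits = totalHits;
            this.ids = ids;
        }
    }

    // Guards the index access with the array length only; totalHits is not consulted.
    static Optional<String> firstId(Hits hits) {
        return hits.ids.length > 0 ? Optional.of(hits.ids[0]) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstId(new Hits(5, "detector-1"))); // prints Optional[detector-1]
        System.out.println(firstId(new Hits(5)));               // prints Optional.empty
    }
}
```

In the second call a check on the total alone would pass and an unguarded `ids[0]` would throw ArrayIndexOutOfBoundsException.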

onFailures(e);
}
} else {
onFailures(new OpenSearchStatusException("detector not found given monitor id", RestStatus.INTERNAL_SERVER_ERROR));
Member

log the monitor id

}
correlationIndices.setupCorrelationIndex(indexTimeout, setupTimestamp, ActionListener.wrap(bulkResponse -> {
if (bulkResponse.hasFailures()) {
log.error(new OpenSearchStatusException(bulkResponse.toString(), RestStatus.INTERNAL_SERVER_ERROR));
Member

onFailure()?

Collaborator Author

We haven't been following this practice for most bulk requests. But I checked that setupCorrelationIndex() is simply indexing two docs, which are needed later as well. Modifying this behavior to fail here itself. @sbcd90 please verify. This is one of the cases where we observed exceptions caused by the metadata index not having the right docs.

Rectifying this for other calls too.
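
The behaviour change described above, failing the listener when a bulk response reports failures instead of only logging and carrying on, might look roughly like this. `BulkResult` is a hypothetical stand-in that models only a failure flag and a summary string, and `handle` is an illustrative helper, not the plugin's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class BulkFailureDemo {
    // Hypothetical stand-in for a bulk response; only the two fields used here are modelled.
    static final class BulkResult {
        final boolean hasFailures;
        final String summary;

        BulkResult(boolean hasFailures, String summary) {
            this.hasFailures = hasFailures;
            this.summary = summary;
        }
    }

    // Fails the listener instead of only logging, and skips the next step on failure.
    static void handle(BulkResult result, Runnable nextStep, Consumer<Exception> onFailure) {
        if (result.hasFailures) {
            onFailure.accept(new IllegalStateException("bulk indexing failed: " + result.summary));
            return;
        }
        nextStep.run();
    }

    // Records which path was taken for a given bulk result.
    static List<String> run(BulkResult result) {
        List<String> events = new ArrayList<>();
        handle(result,
            () -> events.add("searched metadata index"),
            e -> events.add("failed: " + e.getMessage()));
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run(new BulkResult(false, "ok")));
        // prints [searched metadata index]
        System.out.println(run(new BulkResult(true, "1 of 2 docs rejected")));
        // prints [failed: bulk indexing failed: 1 of 2 docs rejected]
    }
}
```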

});
}
client.search(searchMetadataIndexRequest, ActionListener.wrap(searchMetadataResponse -> {
String id = searchMetadataResponse.getHits().getHits()[0].getId();
Member

size check on hits array?

Collaborator Author

Added.

}, this::onFailures));
}, this::onFailures));
} catch (Exception ex) {
onFailures(ex);
Member

log exception

Collaborator Author

Same comment as before

Signed-off-by: Megha Goyal <goyamegh@amazon.com>
getValidDocuments(detectorType, indices, correlationRules, relatedDocIds, autoCorrelations);
}, e -> {
try {
log.error("[CORRELATIONS] Exception encountered while searching correlation rule index for finding id {}",
Collaborator

Minor: we should remove the [CORRELATIONS] prefix

@goyamegh goyamegh merged commit ec0657d into opensearch-project:main Mar 6, 2024
11 of 18 checks passed
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-880-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 ec0657d74a3b147f304e5985250f0e3d8e0e3e4b
# Push it to GitHub
git push --set-upstream origin backport/backport-880-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-880-to-2.x.

@opensearch-trigger-bot
Contributor

The backport to 2.11 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.11 2.11
# Navigate to the new working tree
cd .worktrees/backport-2.11
# Create a new branch
git switch --create backport/backport-880-to-2.11
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 ec0657d74a3b147f304e5985250f0e3d8e0e3e4b
# Push it to GitHub
git push --set-upstream origin backport/backport-880-to-2.11
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.11

Then, create a pull request where the base branch is 2.11 and the compare/head branch is backport/backport-880-to-2.11.

goyamegh added a commit to goyamegh/security-analytics that referenced this pull request Mar 8, 2024
…roject#880)

* Refactor invocation of Action listeners in correlations

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

* Close hanging tasks in correlations workflow

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

* Logging finding id and monitor id in error logs

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

---------

Signed-off-by: Megha Goyal <goyamegh@amazon.com>
goyamegh added a commit that referenced this pull request Mar 8, 2024
* Refactor invocation of Action listeners in correlations



* Close hanging tasks in correlations workflow



* Logging finding id and monitor id in error logs



---------

Signed-off-by: Megha Goyal <goyamegh@amazon.com>
sbcd90 pushed a commit to sbcd90/security-analytics that referenced this pull request Mar 10, 2024
Signed-off-by: Joanne Wang <jowg@amazon.com>
(cherry picked from commit 4d4f5e3)

Co-authored-by: Joanne Wang <jowg@amazon.com>

Reduce log level for informative message (opensearch-project#203) (opensearch-project#833)

Signed-off-by: Enrico Tröger <enrico.troeger@uvena.de>
Co-authored-by: Enrico Tröger <enrico.troeger@uvena.de>

Updated alert creation following common-utils PR 584. (opensearch-project#837) (opensearch-project#839)

Signed-off-by: AWSHurneyt <hurneyt@amazon.com>
(cherry picked from commit 8adb9c3)

Co-authored-by: AWSHurneyt <hurneyt@amazon.com>

Release notes for 2.12.0 (opensearch-project#834) (opensearch-project#841)

* release notes for 2.12

Signed-off-by: Joanne Wang <jowg@amazon.com>

* update release notes

Signed-off-by: Joanne Wang <jowg@amazon.com>

* update release notes

Signed-off-by: Joanne Wang <jowg@amazon.com>

---------

Signed-off-by: Joanne Wang <jowg@amazon.com>
(cherry picked from commit 414484a)

Co-authored-by: Joanne Wang <jowg@amazon.com>

Remove blocking calls and change threat intel feed flow to event driven (opensearch-project#871) (opensearch-project#876)

* remove actionGet() and change threat intel feed flow to event driven

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

* fix javadocs

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

* revert try catch removals

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

* use action listener wrap() in detector threat intel code paths

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

* add try catch

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

---------

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
(cherry picked from commit 172d58d)

Co-authored-by: Surya Sashank Nistala <snistala@amazon.com>

Fail the flow when detector type is missing in the log types index (opensearch-project#845) (opensearch-project#857)

Signed-off-by: Megha Goyal <goyamegh@amazon.com>
(cherry picked from commit 8d19912)

Co-authored-by: Megha Goyal <56077967+goyamegh@users.noreply.github.com>

[BUG] ArrayIndexOutOfBoundsException for inconsistent detector index behavior  (opensearch-project#843) (opensearch-project#858)

* Catch ArrayIndexOutOfBoundsException when detector is missing

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

* Add a check on SearchHits.getHits() length

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

* Remove index out of bounds exception

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

---------

Signed-off-by: Megha Goyal <goyamegh@amazon.com>
(cherry picked from commit 0ef8543)

Co-authored-by: Megha Goyal <56077967+goyamegh@users.noreply.github.com>

Backport opensearch-project#873 and opensearch-project#789 (opensearch-project#895)

* support object fields in aggregation based sigma rules (opensearch-project#789)

Signed-off-by: Subhobrata Dey <sbcd90@gmail.com>

* Pass rule field names in doc level queries during monitor/creation. Remove blocking actionGet() calls  (opensearch-project#873)

* pass query field names in doc level queries during monitor creation/updation

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

* remove actionGet() and change get index mapping call to event driven flow

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

* fix chained findings monitor

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

* add finding mappings

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

* remove test messages from logs

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

* revert build.gradle change

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

---------

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

---------

Signed-off-by: Subhobrata Dey <sbcd90@gmail.com>
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Co-authored-by: Subhobrata Dey <sbcd90@gmail.com>

Fix duplicate ecs mappings which returns incorrect log index field in mapping view API (opensearch-project#786) (opensearch-project#788) (opensearch-project#898)

* field mapping changes

* add integ test

* turn unmappedfieldaliases as set and add integ test

* add comments

* fix integ tests

* moved logic to method for better readability

---------

Signed-off-by: Joanne Wang <jowg@amazon.com>

Add throw for empty strings in rules with modifier contains, startwith, and endswith (opensearch-project#860) (opensearch-project#896)

* add validation for empty strings with contains, startswith and endswith modifiers

* throw exception if empty string with contains, startswith, or endswith

* change var name

* add modifiers to log

---------

Signed-off-by: Joanne Wang <jowg@amazon.com>

Add an "exists" check for "not" condition in sigma rules (opensearch-project#852) (opensearch-project#897)

* test design

Signed-off-by: Joanne Wang <jowg@amazon.com>

* working version

Signed-off-by: Joanne Wang <jowg@amazon.com>

* cleaning up

Signed-off-by: Joanne Wang <jowg@amazon.com>

* testing

Signed-off-by: Joanne Wang <jowg@amazon.com>

* working version

Signed-off-by: Joanne Wang <jowg@amazon.com>

* working version

Signed-off-by: Joanne Wang <jowg@amazon.com>

* refactored querybackend

Signed-off-by: Joanne Wang <jowg@amazon.com>

* working on tests

Signed-off-by: Joanne Wang <jowg@amazon.com>

* fixed alerting and finding tests

Signed-off-by: Joanne Wang <jowg@amazon.com>

* fix correlation tests

Signed-off-by: Joanne Wang <jowg@amazon.com>

* working all tests

Signed-off-by: Joanne Wang <jowg@amazon.com>

* moved test and changed alias for adldap

Signed-off-by: Joanne Wang <jowg@amazon.com>

* added more tests

Signed-off-by: Joanne Wang <jowg@amazon.com>

* cleanup code

Signed-off-by: Joanne Wang <jowg@amazon.com>

* remove exists flag

Signed-off-by: Joanne Wang <jowg@amazon.com>

---------

Signed-off-by: Joanne Wang <jowg@amazon.com>
(cherry picked from commit 656a5fe)

Co-authored-by: Joanne Wang <jowg@amazon.com>

Add goyamegh as a maintainer (opensearch-project#868) (opensearch-project#899)

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

Refactor invocation of Action listeners in correlations (opensearch-project#880) (opensearch-project#900)

* Refactor invocation of Action listeners in correlations

* Close hanging tasks in correlations workflow

* Logging finding id and monitor id in error logs

---------

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

Add search request timeouts for correlations workflows (opensearch-project#893) (opensearch-project#901)

* Reinstating more leaks plugged-in for correlations workflows

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

* Add search timeouts to all correlation searches

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

* Fix logging and exception messages

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

* Change search timeout to 30 seconds

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

---------

Signed-off-by: Megha Goyal <goyamegh@amazon.com>
(cherry picked from commit 75c4429)

Co-authored-by: Megha Goyal <56077967+goyamegh@users.noreply.github.com>
@opensearch-trigger-bot
Contributor

The backport to 2.9 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.9 2.9
# Navigate to the new working tree
cd .worktrees/backport-2.9
# Create a new branch
git switch --create backport/backport-880-to-2.9
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 ec0657d74a3b147f304e5985250f0e3d8e0e3e4b
# Push it to GitHub
git push --set-upstream origin backport/backport-880-to-2.9
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.9

Then, create a pull request where the base branch is 2.9 and the compare/head branch is backport/backport-880-to-2.9.

@opensearch-trigger-bot
Contributor

The backport to 2.7 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.7 2.7
# Navigate to the new working tree
cd .worktrees/backport-2.7
# Create a new branch
git switch --create backport/backport-880-to-2.7
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 ec0657d74a3b147f304e5985250f0e3d8e0e3e4b
# Push it to GitHub
git push --set-upstream origin backport/backport-880-to-2.7
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.7

Then, create a pull request where the base branch is 2.7 and the compare/head branch is backport/backport-880-to-2.7.

riysaxen-amzn pushed a commit that referenced this pull request Mar 18, 2024
* Refactor invocation of Action listeners in correlations

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

* Close hanging tasks in correlations workflow

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

* Logging finding id and monitor id in error logs

Signed-off-by: Megha Goyal <goyamegh@amazon.com>

---------

Signed-off-by: Megha Goyal <goyamegh@amazon.com>
riysaxen-amzn pushed a commit to riysaxen-amzn/security-analytics that referenced this pull request Mar 25, 2024