Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation for rule-based anomaly detection and imputation #8202

Merged
merged 39 commits into from
Sep 13, 2024

Conversation

kaituo
Copy link
Contributor

@kaituo kaituo commented Sep 9, 2024

Description

This PR introduces new documentation for rule-based anomaly detection (AD) and imputation options, providing detailed guidance on configuring these features. It also updates the maximum shingle size information and enhances the documentation for window delay settings.

Testing done:

  • Successfully ran Jekyll build and reviewed the updated documentation to ensure all changes are correctly displayed.

Issues Resolved

closes #8169

Version

2.17+

Frontend features

If you're submitting documentation for an OpenSearch Dashboards feature, add a video that shows how a user will interact with the UI step by step. A voiceover is optional.

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

This PR introduces new documentation for rule-based anomaly detection (AD) and imputation options, providing detailed guidance on configuring these features. It also updates the maximum shingle size information and enhances the documentation for window delay settings.

Testing done:
- Successfully ran Jekyll build and reviewed the updated documentation to ensure all changes are correctly displayed.

Signed-off-by: Kaituo Li <kaituo@amazon.com>
Copy link

github-actions bot commented Sep 9, 2024

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
@vagimeli vagimeli added 5 - Editorial review PR: Editorial review in progress and removed 4 - Doc review PR: Doc review in progress labels Sep 10, 2024
@vagimeli vagimeli added 3 - Tech review PR: Tech review in progress and removed 5 - Editorial review PR: Editorial review in progress labels Sep 11, 2024
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Copy edit documentation 

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Doc review complete

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Copy link
Collaborator

@vagimeli vagimeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc review complete. Moving to editorial review.

@vagimeli vagimeli added 5 - Editorial review PR: Editorial review in progress and removed 3 - Tech review PR: Tech review in progress labels Sep 13, 2024
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Copy link
Collaborator

@natebower natebower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaituo @vagimeli Please see my comments and changes and let me know if you have any questions. Thanks!


Anomaly detection automatically detects anomalies in your OpenSearch data in near real-time using the Random Cut Forest (RCF) algorithm. RCF is an unsupervised machine learning algorithm that models a sketch of your incoming data stream to compute an `anomaly grade` and `confidence score` value for each incoming data point. These values are used to differentiate an anomaly from normal variations. For more information about how RCF works, see [Random Cut Forests](https://www.semanticscholar.org/paper/Robust-Random-Cut-Forest-Based-Anomaly-Detection-on-Guha-Mishra/ecb365ef9b67cd5540cc4c53035a6a7bd88678f9).
Anomaly detection automatically detects anomalies in your OpenSearch data in near real time using the Random Cut Forest (RCF) algorithm. RCF is an unsupervised machine learning algorithm that models a sketch of your incoming data stream to compute an _anomaly grade_ and _confidence score_ value for each incoming data point. These values are used to differentiate an anomaly from normal variations. For more information about how RCF works, see [Random Cut Forests](https://www.semanticscholar.org/paper/Robust-Random-Cut-Forest-Based-Anomaly-Detection-on-Guha-Mishra/ecb365ef9b67cd5540cc4c53035a6a7bd88678f9).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last sentence: Rather than "Random Cut Forests", it looks like the title of the page is "Robust Random Cut Forest Based Anomaly Detection on Streams".


## Step 1: Define a detector

A detector is an individual anomaly detection task. You can define multiple detectors, and all the detectors can run simultaneously, with each analyzing data from different sources.
A _detector_ is an individual anomaly detection task. You can define multiple detectors, and all detectors can run simultaneously, with each analyzing data from different sources.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a sentence here introducing the list.

A _detector_ is an individual anomaly detection task. You can define multiple detectors, and all detectors can run simultaneously, with each analyzing data from different sources.

1. On the **Anomaly detection** page, select the **Create detector** button.
2. On the **Define detector** page, enter the required information on the **Detector details** pane.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. On the **Define detector** page, enter the required information on the **Detector details** pane.
2. On the **Define detector** page, enter the required information in the **Detector details** pane.


1. On the **Anomaly detection** page, select the **Create detector** button.
2. On the **Define detector** page, enter the required information on the **Detector details** pane.
3. On the **Select data** pane, specify the data source by choosing a source from the **Index** dropdown menu. You can choose an index, index patterns, or alias.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
3. On the **Select data** pane, specify the data source by choosing a source from the **Index** dropdown menu. You can choose an index, index patterns, or alias.
3. In the **Select data** pane, specify the data source by choosing a source from the **Index** dropdown menu. You can choose an index, index patterns, or an alias.


#### Example filter using query DSL
The query is designed to retrieve documents in which the `urlPath.keyword` field matches one of the following specified values:
The following example query retrieves documents where the `urlPath.keyword` field matches any of the specified values:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The following example query retrieves documents where the `urlPath.keyword` field matches any of the specified values:
The following example query retrieves documents in which the `urlPath.keyword` field matches any of the specified values:


You can see the following additional fields:
Note that the result includes the following additional field:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Note that the result includes the following additional field:
Note that the result includes the following additional field.

At times, the detector might detect an anomaly late.
Let's say the detector sees a random mix of the triples {1, 2, 3} and {2, 4, 5} that correspond to `slow weeks` and `busy weeks`, respectively. For example 1, 2, 3, 1, 2, 3, 2, 4, 5, 1, 2, 3, 2, 4, 5, ... and so on.
If the detector comes across a pattern {2, 2, X} and it's yet to see X, the detector infers that the pattern is anomalous, but it can't determine at this point which of the 2's is the cause. If X = 3, then the detector knows it's the first 2 in that unfinished triple, and if X = 5, then it's the second 2. If it's the first 2, then the detector detects the anomaly late.
The detector may detect an anomaly late. For example, the detector observes a sequence of data that alternates between "slow weeks" (represented by the triples {1, 2, 3}) and "busy weeks" (represented by the triples {2, 4, 5}). If the detector comes across a pattern {2, 2, X}, where it has not yet seen the value that X will take, the detector infers that the pattern is anomalous. However, it cannot determine which of the 2's is the cause. If X = 3, then the first 2 is the anomaly. If X = 5, then the second 2 is the anomaly. If it is the first 2, then the detector would detect the anomaly late.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The detector may detect an anomaly late. For example, the detector observes a sequence of data that alternates between "slow weeks" (represented by the triples {1, 2, 3}) and "busy weeks" (represented by the triples {2, 4, 5}). If the detector comes across a pattern {2, 2, X}, where it has not yet seen the value that X will take, the detector infers that the pattern is anomalous. However, it cannot determine which of the 2's is the cause. If X = 3, then the first 2 is the anomaly. If X = 5, then the second 2 is the anomaly. If it is the first 2, then the detector would detect the anomaly late.
The detector may be late in detecting an anomaly. For example: The detector observes a sequence of data that alternates between "slow weeks" (represented by the triples {1, 2, 3}) and "busy weeks" (represented by the triples {2, 4, 5}). If the detector comes across a pattern {2, 2, X}, where it has not yet seen the value that X will take, then the detector infers that the pattern is anomalous. However, it cannot determine which 2 is the cause. If X = 3, then the first 2 is the anomaly. If X = 5, then the second 2 is the anomaly. If it is the first 2, then the detector will be late in detecting the anomaly.


If a detector detects an anomaly late, the result has the following additional fields:
When a detector detects an anomaly late, the result includes the following additional fields:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When a detector detects an anomaly late, the result includes the following additional fields:
When a detector is late in detecting an anomaly, the result includes the following additional fields.


Field | Description
:--- | :---
`past_values` | The actual input that triggered an anomaly. If `past_values` is null, the attributions or expected values are from the current input. If `past_values` is not null, the attributions or expected values are from a past input (for example, the previous two steps of the data [1,2,3]).
`approx_anomaly_start_time` | The approximate time of the actual input that triggers an anomaly. This field helps you understand when a detector flags an anomaly. Both single-stream and high-cardinality detectors don't query previous anomaly results because these queries are expensive operations. The cost is especially high for high-cardinality detectors that might have a lot of entities. If the data is not continuous, the accuracy of this field is low and the actual time that the detector detects an anomaly can be earlier.
`past_values` | The actual input that triggered an anomaly. If `past_values` is null, then the attributions or expected values are from the current input. If `past_values` is not null, then the attributions or expected values are from a past input (for example, the previous two steps of the data [1,2,3]).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should both instances of "null" be in code font?

`past_values` | The actual input that triggered an anomaly. If `past_values` is null, the attributions or expected values are from the current input. If `past_values` is not null, the attributions or expected values are from a past input (for example, the previous two steps of the data [1,2,3]).
`approx_anomaly_start_time` | The approximate time of the actual input that triggers an anomaly. This field helps you understand when a detector flags an anomaly. Both single-stream and high-cardinality detectors don't query previous anomaly results because these queries are expensive operations. The cost is especially high for high-cardinality detectors that might have a lot of entities. If the data is not continuous, the accuracy of this field is low and the actual time that the detector detects an anomaly can be earlier.
`past_values` | The actual input that triggered an anomaly. If `past_values` is null, then the attributions or expected values are from the current input. If `past_values` is not null, then the attributions or expected values are from a past input (for example, the previous two steps of the data [1,2,3]).
`approx_anomaly_start_time` | The approximate time of the actual input that triggers an anomaly. This field helps you understand when a detector flags an anomaly. Both single-stream and high-cardinality detectors do not query previous anomaly results because these queries are costly operations. The cost is especially high for high-cardinality detectors that may have many entities. If the data is not continuous, then the accuracy of this field is low and the actual time that the detector detects an anomaly can be earlier.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`approx_anomaly_start_time` | The approximate time of the actual input that triggers an anomaly. This field helps you understand when a detector flags an anomaly. Both single-stream and high-cardinality detectors do not query previous anomaly results because these queries are costly operations. The cost is especially high for high-cardinality detectors that may have many entities. If the data is not continuous, then the accuracy of this field is low and the actual time that the detector detects an anomaly can be earlier.
`approx_anomaly_start_time` | The approximate time of the actual input that triggered an anomaly. This field helps you understand the time at which a detector flags an anomaly. Both single-stream and high-cardinality detectors do not query previous anomaly results because these queries are costly operations. The cost is especially high for high-cardinality detectors that may have many entities. If the data is not continuous, then the accuracy of this field is low and the actual time at which the detector detects an anomaly can be earlier.

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
@vagimeli
Copy link
Collaborator

@natebower @kaituo I've addressed the editorial feedback and revised text that had comments. Do you want to give it another read?

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
1. Choose **Next**.
Using these options can improve recall in anomaly detection. For instance, if you are monitoring for drops in event counts, including both partial and complete drops, then filling missing values with zeros helps detect significant data absences, improving detection recall.

Be cautious when imputing extensively missing data, as excessive gaps can compromise model accuracy. Quality input is critical---poor data quality leads to poor model performance. You can check whether a feature value has been imputed using the `feature_imputed` field in the anomaly results index. See [Anomaly result mapping]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/result-mapping/) for more information.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add "The confidence score decreases when imputations occur."? So there are two signals from imputation: feature_imputed field and confidence score.

@vagimeli vagimeli removed the 5 - Editorial review PR: Editorial review in progress label Sep 13, 2024
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
@vagimeli vagimeli merged commit 8c74b88 into opensearch-project:main Sep 13, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DOC] Rule-based AD, remote index, imputation
4 participants