Add documentation for rule-based anomaly detection and imputation #8202

kaituo · 2024-09-09T21:15:27Z

Description

This PR introduces new documentation for rule-based anomaly detection (AD) and imputation options, providing detailed guidance on configuring these features. It also updates the maximum shingle size information and enhances the documentation for window delay settings.

Testing done:

Successfully ran Jekyll build and reviewed the updated documentation to ensure all changes are correctly displayed.

Issues Resolved

closes #8169

Version

2.17+

Frontend features

If you're submitting documentation for an OpenSearch Dashboards feature, add a video that shows how a user will interact with the UI step by step. A voiceover is optional.

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

This PR introduces new documentation for rule-based anomaly detection (AD) and imputation options, providing detailed guidance on configuring these features. It also updates the maximum shingle size information and enhances the documentation for window delay settings.

Testing done:

- Successfully ran Jekyll build and reviewed the updated documentation to ensure all changes are correctly displayed.

Signed-off-by: Kaituo Li <kaituo@amazon.com>

github-actions · 2024-09-09T21:15:39Z

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

_observing-your-data/ad/index.md

Copy edit documentation Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

Doc review complete Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

vagimeli

Doc review complete. Moving to editorial review.

natebower

@kaituo @vagimeli Please see my comments and changes and let me know if you have any questions. Thanks!

natebower · 2024-09-13T14:18:01Z

_observing-your-data/ad/index.md


-Anomaly detection automatically detects anomalies in your OpenSearch data in near real-time using the Random Cut Forest (RCF) algorithm. RCF is an unsupervised machine learning algorithm that models a sketch of your incoming data stream to compute an `anomaly grade` and `confidence score` value for each incoming data point. These values are used to differentiate an anomaly from normal variations. For more information about how RCF works, see [Random Cut Forests](https://www.semanticscholar.org/paper/Robust-Random-Cut-Forest-Based-Anomaly-Detection-on-Guha-Mishra/ecb365ef9b67cd5540cc4c53035a6a7bd88678f9).
+Anomaly detection automatically detects anomalies in your OpenSearch data in near real time using the Random Cut Forest (RCF) algorithm. RCF is an unsupervised machine learning algorithm that models a sketch of your incoming data stream to compute an _anomaly grade_ and _confidence score_ value for each incoming data point. These values are used to differentiate an anomaly from normal variations. For more information about how RCF works, see [Random Cut Forests](https://www.semanticscholar.org/paper/Robust-Random-Cut-Forest-Based-Anomaly-Detection-on-Guha-Mishra/ecb365ef9b67cd5540cc4c53035a6a7bd88678f9).


Last sentence: Rather than "Random Cut Forests", it looks like the title of the page is "Robust Random Cut Forest Based Anomaly Detection on Streams".

natebower · 2024-09-13T14:19:19Z

_observing-your-data/ad/index.md


 ## Step 1: Define a detector

-A detector is an individual anomaly detection task. You can define multiple detectors, and all the detectors can run simultaneously, with each analyzing data from different sources.
+A _detector_ is an individual anomaly detection task. You can define multiple detectors, and all detectors can run simultaneously, with each analyzing data from different sources.


Let's add a sentence here introducing the list.

natebower · 2024-09-13T14:19:42Z

_observing-your-data/ad/index.md

+A _detector_ is an individual anomaly detection task. You can define multiple detectors, and all detectors can run simultaneously, with each analyzing data from different sources.
+
+1. On the **Anomaly detection** page, select the **Create detector** button.
+2. On the **Define detector** page, enter the required information on the **Detector details** pane.


Suggested change

-2. On the **Define detector** page, enter the required information on the **Detector details** pane.
+2. On the **Define detector** page, enter the required information in the **Detector details** pane.

natebower · 2024-09-13T14:21:01Z

_observing-your-data/ad/index.md

+
+1. On the **Anomaly detection** page, select the **Create detector** button.
+2. On the **Define detector** page, enter the required information on the **Detector details** pane.
+3. On the **Select data** pane, specify the data source by choosing a source from the **Index** dropdown menu. You can choose an index, index patterns, or alias.


Suggested change

-3. On the **Select data** pane, specify the data source by choosing a source from the **Index** dropdown menu. You can choose an index, index patterns, or alias.
+3. In the **Select data** pane, specify the data source by choosing a source from the **Index** dropdown menu. You can choose an index, index patterns, or an alias.

natebower · 2024-09-13T14:22:02Z

_observing-your-data/ad/index.md


-#### Example filter using query DSL
-The query is designed to retrieve documents in which the `urlPath.keyword` field matches one of the following specified values:
+The following example query retrieves documents where the `urlPath.keyword` field matches any of the specified values:


Suggested change

-The following example query retrieves documents where the `urlPath.keyword` field matches any of the specified values:
+The following example query retrieves documents in which the `urlPath.keyword` field matches any of the specified values:
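
For readers following the thread, a filter that matches a field against any of several values is commonly expressed as a `terms` query in query DSL. The sketch below builds one as a Python dictionary and prints it as JSON. The URL path values are hypothetical placeholders and are not the values used on the documentation page; the surrounding `bool`/`filter` wrapper is likewise an assumption about how such a clause might be embedded.

```python
import json

# Hypothetical URL paths for illustration only; the documentation page
# lists its own values.
url_paths = ["/example/path/a", "/example/path/b"]

# A `terms` clause matches documents whose field equals any listed value.
data_filter = {
    "bool": {
        "filter": [
            {"terms": {"urlPath.keyword": url_paths}}
        ]
    }
}

print(json.dumps(data_filter, indent=2))
```

The `.keyword` subfield is used because an exact-value match needs the unanalyzed form of the string.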

natebower · 2024-09-13T15:08:55Z

_observing-your-data/ad/result-mapping.md


-You can see the following additional fields:
+Note that the result includes the following additional field: 


Suggested change

-Note that the result includes the following additional field:
+Note that the result includes the following additional field.

natebower · 2024-09-13T15:10:56Z

_observing-your-data/ad/result-mapping.md

-At times, the detector might detect an anomaly late.
-Let's say the detector sees a random mix of the triples {1, 2, 3} and {2, 4, 5} that correspond to `slow weeks` and `busy weeks`, respectively. For example 1, 2, 3, 1, 2, 3, 2, 4, 5, 1, 2, 3, 2, 4, 5, ... and so on.
-If the detector comes across a pattern {2, 2, X} and it's yet to see X, the detector infers that the pattern is anomalous, but it can't determine at this point which of the 2's is the cause. If X = 3, then the detector knows it's the first 2 in that unfinished triple, and if X = 5, then it's the second 2. If it's the first 2, then the detector detects the anomaly late.
+The detector may detect an anomaly late. For example, the detector observes a sequence of data that alternates between "slow weeks" (represented by the triples {1, 2, 3}) and "busy weeks" (represented by the triples {2, 4, 5}). If the detector comes across a pattern {2, 2, X}, where it has not yet seen the value that X will take, the detector infers that the pattern is anomalous. However, it cannot determine which of the 2's is the cause. If X = 3, then the first 2 is the anomaly. If X = 5, then the second 2 is the anomaly. If it is the first 2, then the detector would detect the anomaly late.


Suggested change

-The detector may detect an anomaly late. For example, the detector observes a sequence of data that alternates between "slow weeks" (represented by the triples {1, 2, 3}) and "busy weeks" (represented by the triples {2, 4, 5}). If the detector comes across a pattern {2, 2, X}, where it has not yet seen the value that X will take, the detector infers that the pattern is anomalous. However, it cannot determine which of the 2's is the cause. If X = 3, then the first 2 is the anomaly. If X = 5, then the second 2 is the anomaly. If it is the first 2, then the detector would detect the anomaly late.
+The detector may be late in detecting an anomaly. For example: The detector observes a sequence of data that alternates between "slow weeks" (represented by the triples {1, 2, 3}) and "busy weeks" (represented by the triples {2, 4, 5}). If the detector comes across a pattern {2, 2, X}, where it has not yet seen the value that X will take, then the detector infers that the pattern is anomalous. However, it cannot determine which 2 is the cause. If X = 3, then the first 2 is the anomaly. If X = 5, then the second 2 is the anomaly. If it is the first 2, then the detector will be late in detecting the anomaly.
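
The {2, 2, X} reasoning in this paragraph can be checked with a small toy script. This is only an illustration of the example's logic under the assumption that exactly two known patterns exist; it is not how RCF or the plugin attributes anomalies, and the name `locate_anomaly` is made up for this sketch.

```python
SLOW_WEEK = (1, 2, 3)
BUSY_WEEK = (2, 4, 5)


def locate_anomaly(observed):
    """Return the index of the single value that deviates from a known
    pattern, or None if the observation cannot be attributed yet."""
    for pattern in (SLOW_WEEK, BUSY_WEEK):
        mismatches = [
            i for i, (seen, expected) in enumerate(zip(observed, pattern))
            if seen != expected
        ]
        if len(mismatches) == 1:
            return mismatches[0]
    return None


print(locate_anomaly((2, 2, None)))  # X not seen yet: cannot attribute
print(locate_anomaly((2, 2, 3)))     # slow week with a bad first value
print(locate_anomaly((2, 2, 5)))     # busy week with a bad second value
```

When the result is index 0, the anomalous value arrived two steps before it could be attributed, which is the "late" case the paragraph describes.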

natebower · 2024-09-13T15:11:29Z

_observing-your-data/ad/result-mapping.md


-If a detector detects an anomaly late, the result has the following additional fields:
+When a detector detects an anomaly late, the result includes the following additional fields:


Suggested change

-When a detector detects an anomaly late, the result includes the following additional fields:
+When a detector is late in detecting an anomaly, the result includes the following additional fields.

natebower · 2024-09-13T15:12:18Z

_observing-your-data/ad/result-mapping.md


 Field | Description
 :--- | :---
-`past_values` | The actual input that triggered an anomaly. If `past_values` is null, the attributions or expected values are from the current input. If `past_values` is not null, the attributions or expected values are from a past input (for example, the previous two steps of the data [1,2,3]).
-`approx_anomaly_start_time` | The approximate time of the actual input that triggers an anomaly. This field helps you understand when a detector flags an anomaly. Both single-stream and high-cardinality detectors don't query previous anomaly results because these queries are expensive operations. The cost is especially high for high-cardinality detectors that might have a lot of entities. If the data is not continuous, the accuracy of this field is low and the actual time that the detector detects an anomaly can be earlier.
+`past_values` | The actual input that triggered an anomaly. If `past_values` is null, then the attributions or expected values are from the current input. If `past_values` is not null, then the attributions or expected values are from a past input (for example, the previous two steps of the data [1,2,3]).


Should both instances of "null" be in code font?

natebower · 2024-09-13T15:13:36Z

_observing-your-data/ad/result-mapping.md

-`past_values` | The actual input that triggered an anomaly. If `past_values` is null, the attributions or expected values are from the current input. If `past_values` is not null, the attributions or expected values are from a past input (for example, the previous two steps of the data [1,2,3]).
-`approx_anomaly_start_time` | The approximate time of the actual input that triggers an anomaly. This field helps you understand when a detector flags an anomaly. Both single-stream and high-cardinality detectors don't query previous anomaly results because these queries are expensive operations. The cost is especially high for high-cardinality detectors that might have a lot of entities. If the data is not continuous, the accuracy of this field is low and the actual time that the detector detects an anomaly can be earlier.
+`past_values` | The actual input that triggered an anomaly. If `past_values` is null, then the attributions or expected values are from the current input. If `past_values` is not null, then the attributions or expected values are from a past input (for example, the previous two steps of the data [1,2,3]).
+`approx_anomaly_start_time` | The approximate time of the actual input that triggers an anomaly. This field helps you understand when a detector flags an anomaly. Both single-stream and high-cardinality detectors do not query previous anomaly results because these queries are costly operations. The cost is especially high for high-cardinality detectors that may have many entities. If the data is not continuous, then the accuracy of this field is low and the actual time that the detector detects an anomaly can be earlier.


Suggested change

-`approx_anomaly_start_time` | The approximate time of the actual input that triggers an anomaly. This field helps you understand when a detector flags an anomaly. Both single-stream and high-cardinality detectors do not query previous anomaly results because these queries are costly operations. The cost is especially high for high-cardinality detectors that may have many entities. If the data is not continuous, then the accuracy of this field is low and the actual time that the detector detects an anomaly can be earlier.
+`approx_anomaly_start_time` | The approximate time of the actual input that triggered an anomaly. This field helps you understand the time at which a detector flags an anomaly. Both single-stream and high-cardinality detectors do not query previous anomaly results because these queries are costly operations. The cost is especially high for high-cardinality detectors that may have many entities. If the data is not continuous, then the accuracy of this field is low and the actual time at which the detector detects an anomaly can be earlier.

vagimeli · 2024-09-13T17:49:46Z

@natebower @kaituo I've addressed the editorial feedback and revised text that had comments. Do you want to give it another read?

kaituo · 2024-09-13T21:30:01Z

_observing-your-data/ad/index.md

-1. Choose **Next**.
+Using these options can improve recall in anomaly detection. For instance, if you are monitoring for drops in event counts, including both partial and complete drops, then filling missing values with zeros helps detect significant data absences, improving detection recall.
+
+Be cautious when imputing extensively missing data, as excessive gaps can compromise model accuracy. Quality input is critical---poor data quality leads to poor model performance. You can check whether a feature value has been imputed using the `feature_imputed` field in the anomaly results index. See [Anomaly result mapping]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/result-mapping/) for more information.


Can you also add "The confidence score decreases when imputations occur."? That way, there are two signals of imputation: the `feature_imputed` field and the confidence score.
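
To make the zero-fill option concrete, here is a minimal sketch of filling missing event counts with zeros while recording which points were imputed. It is a stand-alone illustration, not the plugin's implementation: the function name `impute_with_zeros` and the sample counts are hypothetical, and the `imputed` flags only mirror the role that the `feature_imputed` field plays in the anomaly results index.

```python
from typing import List, Optional, Tuple


def impute_with_zeros(
    counts: List[Optional[float]],
) -> Tuple[List[float], List[bool]]:
    """Replace missing counts (None) with 0.0 and flag each imputed point."""
    values = [0.0 if c is None else float(c) for c in counts]
    imputed = [c is None for c in counts]
    return values, imputed


# Hypothetical per-interval event counts; None marks an interval with no data.
values, imputed = impute_with_zeros([120, 115, None, None, 118])
print(values)
print(imputed)
```

With zeros in place of the gaps, a complete drop in events shows up as a sharp fall in the series rather than as absent points, which is why this option can help recall for drop detection.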

kaituo requested review from kolchfa-aws, Naarcha-AWS, vagimeli, AMoo-Miki, natebower, dlvenable, stephen-crawford and epugh as code owners September 9, 2024 21:15

github-actions bot assigned kolchfa-aws Sep 9, 2024

kolchfa-aws assigned vagimeli and unassigned kolchfa-aws Sep 9, 2024

vagimeli added 4 - Doc review PR: Doc review in progress v2.17.0 labels Sep 10, 2024

kaituo mentioned this pull request Sep 10, 2024

Adding documentation for remote index use in AD #8191

Merged

1 task

Doc review

b2af679

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

vagimeli added 5 - Editorial review PR: Editorial review in progress and removed 4 - Doc review PR: Doc review in progress labels Sep 10, 2024

vagimeli requested changes Sep 10, 2024

_observing-your-data/ad/index.md (resolved)

vagimeli added 3 - Tech review PR: Tech review in progress and removed 5 - Editorial review PR: Editorial review in progress labels Sep 11, 2024

vagimeli reviewed Sep 11, 2024

_observing-your-data/ad/index.md (outdated, resolved)

Update _observing-your-data/ad/index.md

e2c656e

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

vagimeli reviewed Sep 11, 2024

_observing-your-data/ad/index.md (outdated, resolved)

Update _observing-your-data/ad/index.md

fe79e71

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

vagimeli reviewed Sep 11, 2024

_observing-your-data/ad/index.md (outdated, resolved)

Update _observing-your-data/ad/index.md

dcbce5a

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

vagimeli reviewed Sep 11, 2024

_observing-your-data/ad/index.md (outdated, resolved)

vagimeli reviewed Sep 11, 2024

_observing-your-data/ad/index.md (outdated, resolved)

Update _observing-your-data/ad/index.md

8a3b25d

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

vagimeli reviewed Sep 12, 2024

_observing-your-data/ad/index.md (outdated, resolved)

Update _observing-your-data/ad/index.md

bc9488a

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

vagimeli reviewed Sep 12, 2024

_observing-your-data/ad/index.md (resolved)

Update _observing-your-data/ad/index.md

50eff8b

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

vagimeli reviewed Sep 12, 2024

_observing-your-data/ad/index.md (resolved)

vagimeli added 3 commits September 12, 2024 17:26

Update _observing-your-data/ad/index.md

2c2e06c

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

Update index.md

894efee

Copy edit documentation Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

Update result-mapping.md

5738739

Doc review complete Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

vagimeli reviewed Sep 13, 2024

_observing-your-data/ad/index.md (outdated, resolved)

Update _observing-your-data/ad/index.md

14dc454

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

vagimeli reviewed Sep 13, 2024

_observing-your-data/ad/index.md (outdated, resolved)

Update _observing-your-data/ad/index.md

4ad9e02

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

vagimeli approved these changes Sep 13, 2024

Merge branch 'main' into 2.17

4d7f738

vagimeli added 5 - Editorial review PR: Editorial review in progress and removed 3 - Tech review PR: Tech review in progress labels Sep 13, 2024

vagimeli added 2 commits September 13, 2024 08:20

Fix links

9afca30

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

Fix links

0067b5d

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

natebower reviewed Sep 13, 2024

vagimeli added 3 commits September 13, 2024 11:03

Address editorial feedback

a99969b

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

Address editorial feedback

7ea3d63

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

Merge branch 'main' into 2.17

4b42bc2

Merge branch 'main' into 2.17

f9434ec

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

kaituo commented Sep 13, 2024

vagimeli removed the 5 - Editorial review PR: Editorial review in progress label Sep 13, 2024

vagimeli reviewed Sep 13, 2024

_observing-your-data/ad/index.md (outdated, resolved)

Update _observing-your-data/ad/index.md

ca49c0c

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

vagimeli merged commit 8c74b88 into opensearch-project:main Sep 13, 2024
5 checks passed
