Skip to content

Latest commit

 

History

History
168 lines (137 loc) · 22.7 KB

File metadata and controls

168 lines (137 loc) · 22.7 KB
X-ROAD European Union / European Regional Development Fund / Investing in your future

Analysis Module

License

This document is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/

Customization

Back-end

The Analyzer back-end can be configured from analysis_module/analyzer/analyzer_conf.py.

Parameter Description Example
timestamp_field Database field to use as the primary timestamp in the analysis. timestamp_field = 'requestInTs'
service_call_fields Database fields that (together) define a service call. service_call_fields = ["clientMemberClass", "clientMemberCode", "clientXRoadInstance", "clientSubsystemCode", "serviceCode", "serviceVersion", "serviceMemberClass", "serviceMemberCode", "serviceXRoadInstance", "serviceSubsystemCode"]
relevant_cols_general Database fields from the clean_data collection that are relevant for the analyzer and appear at the top level of the request. relevant_cols_general = ["_id", 'totalDuration', 'producerDurationProducerView', 'requestNwDuration', 'responseNwDuration']
relevant_cols_nested Database fields from the clean_data collection that are relevant for the analyzer and are nested inside 'client' and 'producer'. relevant_cols_nested = ["succeeded", "messageId", timestamp_field] + service_call_fields
relevant_cols_general_alternative Database fields from the clean_data collection that are relevant for the analyzer and appear at the top level of the request, but are analogous for 'client' and 'producer' side.
For the Analyzer, only one field from each pair is necessary. In other words, if the field exists for the client side, then this value is used, otherwise the value from the producer side is used.
In configuration, these fields are presented as triplets, where the first element refers to the general name used in the Analyzer, the second and third value are the alternative fields in the database.
relevant_cols_general_alternative = [('requestSize', 'clientRequestSize', 'producerRequestSize')]
<timeunit>_aggregation_time_window Settings for a given aggregation time window. The following attributes should be speficied:
1) 'agg_window_name' - a name (can be chosen arbitrarily) that will be used to refer to the aggregation window,
2) 'agg_minutes' - number of minutes to use for aggregation,
3) 'pd_timeunit' - used in the pandas.to_timedelta method to refer to the same time period, should be one of (D,h,m,s,ms,us,ns). (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_timedelta.html).
hour_aggregation_time_window = {'agg_window_name': 'hour', 'agg_minutes': 60, 'pd_timeunit': 'h'}
<timeunits>_similarity_time_window Settings for a given similarity time window. For example, if the aggregation time window is hour, the similarity time window can be hour+weekday, meaning that the aggregated values from a given hour are compared to historic values collected from the same hour on the same weekday. The following attributes should be speficied:
1) 'timeunit_name' - a name (can be chosen arbitrarily) that will be used to refer to the similarity window,
2) 'agg_window' - one of <timeunit>_aggregation_time_window,
3) 'similar_periods' - a list of time periods. A given set of aggregated requests will be compared to the combination of these periods. Each value in the list is used to extract the necessary time component from a pandas.DatetimeIndex object, so each value should be one of (year, month, day, hour, minute, second, microsecond, nanosecond, dayofyear, weekofyear, week, dayofweek, weekday, quarter). (http://pandas.pydata.org/pandas-docs/version/0.17.0/api.html#time-date-components)
hour_weekday_similarity_time_window = {'timeunit_name': 'hour_weekday', 'agg_window': hour_aggregation_time_window, 'similar_periods': ['hour', 'weekday']}
time_windows A dictionary of pairs (anomaly_type, previously defined <timeunit>_aggregation_time_window) for anomaly types that do not require comparison with historic values. The specified time window will be used to aggregate requests for the given anomaly type. time_windows = {
"failed_request_ratio": hour_aggregation_time_window,
"duplicate_message_ids": day_aggregation_time_window,
"time_sync_errors": hour_aggregation_time_window}
historic_averages_time_windows A list of previously defined <timeunits>_similarity_time_windows for anomaly types that require comparison with historic averages. A separate AveragesByTimeperiodModel is constructed for each such similarity time window. historic_averages_time_windows = [hour_weekday_similarity_time_window, weekday_similarity_time_window]
historic_averages_thresholds A dictionary of confidence thresholds used in the AveragesByTimeperiodModel(s). An observation (an aggregation of requests within a given time window) is considered an anomaly if the confidence (estimated by the model) of being an anomaly is larger than this threshold. historic_averages_thresholds = {
'request_count': 0.95,
'mean_request_size': 0.95,
'mean_response_size': 0.95,
'mean_client_duration': 0.95,
'mean_producer_duration': 0.95} ]
time_sync_monitored_lower_thresholds A dictionary of minimum value thresholds used in the TimeSyncModel. If the observed value is smaller than this threshold, an incident is reported. time_sync_monitored_lower_thresholds = {'requestNwDuration': 0,
'responseNwDuration': 0}
failed_request_ratio_threshold Used in the FailedRequestRatioModel. If the ratio of failed requests in a given aggregation window is larger than this threshold, an incident is reported. failed_request_ratio_threshold = 0.9
incident_expiration_time After this time has passed since the creation of an anomaly (potential incident), the requests involved in these anomalies can be used to update the historic averages models. The time is specified in minutes. It is recommended to keep this parameter the same as the respective parameter in the front-end configuration. incident_expiration_time = 14400
(anomalies will expire after 10 days)
training_period_time After this time has passed since a given service call's first request, the first version of the historic averages model is trained and the first anomalies reported. The time is specified in months. training_period_time = 3
(training period lasts for 3 months)

Front-end

The user interface can be configured from analysis_module/analyzer_ui/gui/gui_conf.py.

Parameter Description Example
service_call_fields Database fields that (together) define a service call. service_call_fields = ["clientMemberClass", "clientMemberCode", "clientXRoadInstance", "clientSubsystemCode", "serviceCode", "serviceVersion", "serviceMemberClass", "serviceMemberCode", "serviceXRoadInstance", "serviceSubsystemCode"]
new_incident_columns List of columns that will be shown in the incident table (where new anomalies are presented). Each column is represented by a tuple, containing the following elements:
1) the name of the column (can be chosen arbitrarily)
2) the respective database field in the incident collection,
3) the data type of the column, must be one of (categorical, numeric, date, text),
4) the rounding precision (only relevant if the data type is numeric),
5) the date format to be used (only relevant if the data type is date)
new_incident_columns = [
("anomalous_metric", "anomalous_metric", "categorical", None, None),
("anomaly<br>confidence", "anomaly_confidence", "numeric", 2, None),
("period_start_time", "period_start_time", "date", None, "%a, %Y-%m-%d %H:%M"),
("description", "description", "text", None, None),
("request_count", "request_count", "numeric", 0, None)]
new_incident_order A list of conditions to use for ordering the incidents table. Each condition contains two elements:
1) database field name, must be one of the database fields defined in new_incident_columns,
2) order direction, must be one of (asc, desc)
new_incident_order = [["request_count", "desc"]]
historical_incident_columns List of columns that will be shown in the history table (where anomalies whose status has already been marked by the user are presented). Each column is represented by a tuple, containing the following elements:
1) the name of the column (can be chosen arbitrarily)
2) the respective database field in the incident collection,
3) the data type of the column, must be one of (categorical, numeric, date, text),
4) the rounding precision (only relevant if the data type is numeric),
5) the date format to be used (only relevant if the data type is date)
historical_incident_columns = [
("incident_status", "incident_status", "categorical", None, None),
("incident_update_timestamp", "incident_update_timestamp", "date", None, "%a, %Y-%m-%d %H:%M")]
historical_incident_order A list of conditions to use for ordering the history table. Each condition contains two elements:
1) database field name, must be one of the database fields defined in new_incident_columns,
2) order direction, must be one of (asc, desc)
historical_incident_order = [["incident_update_timestamp", "desc"]]
relevant_fields_for_example_requests_general A list of database fields from the clean_data collection, which appear at the top level of the request, to be shown in the example requests table. relevant_fields_for_example_requests_general = ['totalDuration', 'producerDurationProducerView']
relevant_fields_for_example_requests_nested A list of database fields from the clean_data collection, which are nested inside 'client' and 'producer', to be shown in the example requests table. relevant_fields_for_example_requests_nested = ['messageId', 'requestInTs', 'succeeded']
relevant_fields_for_example_requests_alternative A list of database fields from the clean_data collection, which appear at the top level of the request but are analogous for 'client' and 'producer' side, to be shown in the example requests table. relevant_fields_for_example_requests_alternative = [
('responseSize', 'clientResponseSize', 'producerResponseSize'),
('requestSize', 'clientRequestSize', 'producerRequestSize')]
example_request_limit Up to this many "example" requests will be shown for each anomaly. example_request_limit = 10
accepted_date_formats When filtering anomalies according to a date field, the user input must be in one of these date formats. accepted_date_formats = ["%a, %Y-%m-%d %H:%M", "%Y-%m-%d %H:%M", "%Y-%m-%d", "%d/%m/%Y %H:%M", "%d/%m/%Y"]
incident_expiration_time An anomaly will be shown in the user interface only until this time has passed since the creation of the anomaly. The time is specified in minutes. It is recommended to keep this parameter the same as the respective parameter in the back-end configuration. incident_expiration_time = 14400
(anomalies will expire after 10 days)

Database

In order to work properly, both the back-end and front-end of the Analysis module need configurations for accessing the database. These settings are specified separately for Analyzer (analysis_module/analyzer/settings.py) and Interface (analysis_module/analyzer_ui/analyzer_ui/settings.py).

Parameter Description Example
MDB_USER Username for accessing the database MDB_USER = "sample_user"
MDB_PWD Password for accessing the database MDB_PWD = "password_for_sample_user"
MDB_SERVER Database server location MDB_SERVER = "opmon"
MONGODB_URI Database URI MONGODB_URI = "mongodb://{0}:{1}@{2}/auth_db".format(MDB_USER, MDB_PWD, MDB_SERVER)
MONGODB_QD Query database name MONGODB_QD = "query_db_sample"
MONGODB_AD Analyzer database name MONGODB_AD = "analyzer_database_sample"

Databases

The Analyzer takes as input data from the Query database (clean_data collection). The results will be written to Analyzer database (a MongoDB instance). Namely, there are four collections in the incident database:

incident: All found anomalies will be saved here, as well as the last status of each anomaly/incident (automatic or marked by the user). incident_timestamps: This collection keeps track of the times when the historic averages model was last updated. Also, the last anomaly-finding times for each anomaly type will be saved here. incident_model: The historic averages models are saved here. service_call_first_timestamps: Timestamps for each service call's first request, first model training, first anomaly finding, and first model retraining.

Incident collection schema

All found anomalies (potential incidents) are saved in the incident collection. Each entry in this collection contains the following fields:

Field Description Possible values and data type
_id Automatically generated id for the collection entry. A MongoDB ObjectId value.
aggregation_timeunit The time interval used to aggregate requests for this anomaly. hour, day (categorical)
period_start_time Start time (included) of the aggregation time interval used for this anomaly. (date)
period_end_time End time (excluded) of the aggregation time interval used for this anomaly. (date)
request_count Number of requests in the aggregation time interval. (numeric)
request_ids List of request id-s included in this anomaly. List of MongoDB ObjectId-s.
anomalous_metric The anomaly type failed_request_ratio, duplicate_message_id, responseNwDuration, requestNwDuration, request_count, mean_request_size, mean_response_size, mean_request_duration, mean_response_duration (categorical)
monitored_metric_value The observed value, e.g. the observed mean request size for the mean_request_size anomaly, or the number of duplicated message ids in case of the duplicate_message_id anomaly. (numeric)
difference_from_normal Difference of the observed value and the "normal" value. The "normal" value is:
1) the historic average in case of historic average anomalies,
2) the failed request ratio threshold (the largest allowed value) in case of failed request ratio anomalies,
3) 1 in case of duplicate message id anomalies,
4) 0 in case of time sync anomalies.
(numeric)
anomaly_confidence Confidence of the anomaly, as estimated by the model. The higher the confidence, the more it deviates from the historical values. Anomaly types that do not require comparison with historic values always have a confidence of 1. Between 0 and 1. (numeric)
description Textual description of the anomaly. (text)
incident_creation_timestamp Time when the incident was created. (date)
incident_update_timestamp Time when the incident was last updated. If the status has not yet been marked by the user, this is the same as the incident_creation_timestamp, otherwise the time of the last status update. (date)
incident_status The status of the incident. Can be automatically assigned (new, showed), or marked by the user (incident, viewed, normal). new, showed, incident, viewed, normal (categorical)
model_params If relevant, some parameters of the model that was used to find the anomaly. For example, the model_params for anomaly type mean_request_size is a dictionary:
'model_params': {'hour': 9,
'metric_mean': 1729.4074074074076,
'metric_std': 42.25904826772402,
'model_timeunit': 'hour_weekday',
'weekday': 1}
model_version Version of the model that was used to find the anomaly. from 0 to any integer (numeric)
clientMemberClass The clientMemberClass (part of the service call) whom this anomaly belongs to.
clientMemberCode The clientMemberCode (part of the service call) whom this anomaly belongs to.
clientSubsystemCode The clientSubsystemCode (part of the service call) whom this anomaly belongs to.
clientXRoadInstance The clientXRoadInstance (part of the service call) whom this anomaly belongs to.
serviceCode The serviceCode (part of the service call) whom this anomaly belongs to.
serviceMemberClass The serviceMemberClass (part of the service call) whom this anomaly belongs to.
serviceMemberCode The serviceMemberCode (part of the service call) whom this anomaly belongs to.
serviceSubsystemCode The serviceSubsystemCode (part of the service call) whom this anomaly belongs to.
serviceVersion The serviceVersion (part of the service call) whom this anomaly belongs to.
serviceXRoadInstance The serviceXRoadInstance (part of the service call) whom this anomaly belongs to.

Incident timestamps collection schema

This collection is used internally, to ensure that the scripts for finding anomalies and for updating the historic averages model take as input data from the right time period. In particular, each request should only be used once by each model in both the anomaly finding and model updating phase. Each entry in this collection contains the following fields:

Field Description Possible values
_id Automatically generated id for the collection entry. A MongoDB ObjectId value.
model The name of the Analyzer model. hour_weekday, weekday, failed_request_ratio, duplicate_message_ids, time_sync_errors
type Type of the timestamp. last_fit_timestamp, last_transform_timestamp
timestamp The timestamp. A datetime value.

Incident model collection schema

This collection saves the historic averages models. The collection is 1) updated when the model is being retrained / updated, and 2) retrieved when anomalies are being found.

Field Description Possible values and data type
_id Automatically generated id for the collection entry. A MongoDB ObjectId value.
clientMemberClass The clientMemberClass (part of the service call) related to this model row.
clientMemberCode The clientMemberCode (part of the service call) related to this model row.
clientSubsystemCode The clientSubsystemCode (part of the service call) related to this model row.
clientXRoadInstance The clientXRoadInstance (part of the service call) related to this model row.
serviceCode The serviceCode (part of the service call) related to this model row.
serviceMemberClass The serviceMemberClass (part of the service call) related to this model row.
serviceMemberCode The serviceMemberCode (part of the service call) related to this model row.
serviceSubsystemCode The serviceSubsystemCode (part of the service call) related to this model row.
serviceVersion The serviceVersion (part of the service call) related to this model row.
serviceXRoadInstance The serviceXRoadInstance (part of the service call) related to this model row.
<metric>_mean Historic average (mean) of the given metric (request_count, mean_response_size, mean_request_size, mean_client_duration, mean_producer_duration) for the given service call in the given time period. numeric
<metric>_std Standard deviation of the given metric for the given service call in the given time period. numeric
<metric>_count Number of values (from aggregated time periods) that are used to calculate the mean and std. integer
<metric>_sum Sum of the values for the given metric. Necessary for incrementally updating the standard deviation values. integer
<metric>_ssq Sum of squares of the given metric. Necessary for incrementally updating the standard deviation values. integer
model_name Name of the model. hour_weekday, weekday
similar_periods Concatenated values of the "similar" time periods. E.g. for model "hour_weekday", similar_periods = "12_1" refer to 12 o'clock on Mondays.
model_creation_timestamp Creation time of the model (same for all service calls, even if they were added later). date
version Version of the model. Version gets incremented with every update. Only the last version of the model for each model_name is saved. integer

Service call first timestamps collection schema

This collection is used to keep track of the phases that each service call is in: training, first incidents reported but model not retrained, regular (model retrained).

Field Description Possible values and data type
_id Automatically generated id for the collection entry. A MongoDB ObjectId value.
clientMemberClass The clientMemberClass (part of the service call).
clientMemberCode The clientMemberCode (part of the service call).
clientSubsystemCode The clientSubsystemCode (part of the service call).
clientXRoadInstance The clientXRoadInstance (part of the service call).
serviceCode The serviceCode (part of the service call).
serviceMemberClass The serviceMemberClass (part of the service call).
serviceMemberCode The serviceMemberCode (part of the service call).
serviceSubsystemCode The serviceSubsystemCode (part of the service call).
serviceVersion The serviceVersion (part of the service call).
serviceXRoadInstance The serviceXRoadInstance (part of the service call).
first_request_timestamp Timestamp of the first request made by the service call. date
first_model_train_timestamp Timestamp of the first model trained for the service call. If the service call is still in the training phase, the timestamp is None. date
first_incident_timestamp Timestamp when the first anomaly-finding phase was performed for the service call. If the service call is still in the training phase, the timestamp is None. date
first_model_retrain_timestamp Timestamp when the second version of the model was trained (after the training period has passed and first incidents have expired). When the service call is still in training phase or the time for expiration of first incidents has not passed, the timestamp is None. If this timestamp is present, the service call has reached the regular analysis phase. date