Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Throughput Anomaly detector files and CLI support #152

Merged
merged 1 commit into from
Mar 1, 2023

Conversation

tushartathgur
Copy link
Contributor

This PR is resonsible for:
1. Adding throughput Anomaly detector files to Theia.
2. Adding CLI support.
3. Adding CRD for the anomaly detector.

Signed-off-by: Tushar Tathgur tathgurt@tathgurtFLVDL.vmware.com

@tushartathgur tushartathgur reopened this Jan 19, 2023
@tushartathgur tushartathgur linked an issue Jan 19, 2023 that may be closed by this pull request
@tushartathgur tushartathgur marked this pull request as draft January 19, 2023 03:46
@tushartathgur tushartathgur force-pushed the Anomaly_detector branch 11 times, most recently from 0296f01 to 95fd24d Compare January 25, 2023 23:17
@tushartathgur tushartathgur force-pushed the Anomaly_detector branch 3 times, most recently from d8b0fc0 to c9d897b Compare January 30, 2023 22:19
@tushartathgur tushartathgur marked this pull request as ready for review January 30, 2023 22:19
@tushartathgur tushartathgur force-pushed the Anomaly_detector branch 11 times, most recently from e59d9e0 to c81fb23 Compare February 6, 2023 22:49
@yanjunz97
Copy link
Contributor

I notice when running algorithm ARIMA with only Theia components installed, it will mark the traffic from ClickHouse pod to ZooKeeper pod as anomaly. I understand this might only indicate that there are many data from ClickHouse to ZooKeeper, but I'm thinking about whether this behavior is expected by the users. I just leave this as an open question about whether we should treat the throughput between Theia components different as we do for other traffic. For example, can we predefine some normal pattern for the Theia components or set a different sensitive based on our experiments, etc. to decrease the false-positive for the components in Theia.

@dreamtalen
Copy link

I notice when running algorithm ARIMA with only Theia components installed, it will mark the traffic from ClickHouse pod to ZooKeeper pod as anomaly. I understand this might only indicate that there are many data from ClickHouse to ZooKeeper, but I'm thinking about whether this behavior is expected by the users. I just leave this as an open question about whether we should treat the throughput between Theia components different as we do for other traffic. For example, can we predefine some normal pattern for the Theia components or set a different sensitive based on our experiments, etc. to decrease the false-positive for the components in Theia.

Good point, we're ignoring flows related to these 3 namespaces in policy recommendation: ["kube-system", "flow-aggregator", "flow-visibility"]. We could do something similiar here.

@yanjunz97
Copy link
Contributor

Good point, we're ignoring flows related to these 3 namespaces in policy recommendation: ["kube-system", "flow-aggregator", "flow-visibility"]. We could do something similiar here.

Yep, that could be a solution. But I'm thinking about anomaly detection might be different from policy recommendation as it is possible that the malware uses Theia components to propagate the malicious traffic. Directly excluding the namespace may leave them unmonitored.

@dreamtalen
Copy link

Good point, we're ignoring flows related to these 3 namespaces in policy recommendation: ["kube-system", "flow-aggregator", "flow-visibility"]. We could do something similiar here.

Yep, that could be a solution. But I'm thinking about anomaly detection might be different from policy recommendation as it is possible that the malware uses Theia components to propagate the malicious traffic. Directly excluding the namespace may leave them unmonitored.

I agree. This ignore list of namespace could be configured by CLI like policy recommendation so I think it would be good enough for now.
Later when we have the scoring/sensitivity system implemented, we could try different sensitivity settings for these system components.

@tushartathgur tushartathgur force-pushed the Anomaly_detector branch 2 times, most recently from 1628e5c to 5155514 Compare February 22, 2023 18:42
@@ -94,6 +94,7 @@ jobs:
python -m pip install --upgrade pip
python -m pip install pytest-cov
python -m pip install -r plugins/policy-recommendation/requirements.txt
python -m pip install -r plugins/anomaly-detection/requirements.txt

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add anomaly-detection folder in other checks in this file, including check-python-changes, python-lint, test-unit etc. This will enable our CI job to check code style and run UT for the pyspark files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

README.md Outdated
@@ -20,6 +20,10 @@ NetworkPolicy configuration to secure Kubernetes network and applications.
Please refer to the [NetworkPolicy Recommendation](docs/networkpolicy-recommendation.md)
user guide to learn more.

Theia also provides throughput anomaly detection, it can find the anomalies in the network, and report them to the user.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap text at 80 characters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

throughputStandardDeviation String,
algoType String,
algoCalc String,
Throughputs String,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throughputs -> throughputs or throughput if you only store a single throughput value in this column

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the datatype should not be String if you are storing throughput value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

flowEndSeconds String,
throughputStandardDeviation String,
algoType String,
algoCalc String,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datatype should not be String

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

destinationTransportPort UInt16,
protocolIdentifier UInt16,
flowStartSeconds DateTime,
flowEndSeconds String,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datatype should not be String

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

protocolIdentifier UInt16,
flowStartSeconds DateTime,
flowEndSeconds String,
throughputStandardDeviation String,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datatype should not be String

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

[NetworkPolicy Recommendation](networkpolicy-recommendation.md), please install
Theia by runnning the following commands:
To enable [Grafana Flow Collector](network-flow-visibility.md),
[NetworkPolicy Recommendation](networkpolicy-recommendation.md) and [Throughput Anomaly Detection](throughput-anomaly-detection.md), please install

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap at 80 characters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@@ -0,0 +1,226 @@
// Copyright 2022 Antrea Authors

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Yanjun here, anomaly_detection_run.go is a better name because it can be consistent with our command.
Regarding to your concern, we can set a default algorithm like ARIMA for this command.

return nil
}

func init() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Yanjun pointing out a concern and our discussion here #152 (comment), please add an option to let user ignore anomaly detection on some namespaces

Copy link
Contributor Author

@tushartathgur tushartathgur Feb 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you guys, however, introducing a new argument will lead to change in lot of files, thus in regards to the timeframe for 0.5 , I suggest we should do it with #168 . Please let me know if you guys agree?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with pushing this back to the next release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

"antrea.io/theia/pkg/util"
)

// throughputAnomalyDetectionEWMACmd represents the policy-recommendation delete command

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix this comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

"start-time",
"s",
"",
`The start time of the flow records considered for the policy recommendation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

policy recommendation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

"end-time",
"e",
"",
`The end time of the flow records considered for the policy recommendation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

policy recommendation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

"antrea.io/theia/pkg/util"
)

// anomalyDetectionDeleteCmd represents the policy-recommendation delete command

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

policy-recommendation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

It will return the recommended NetworkPolicies described in yaml.`,
Args: cobra.RangeArgs(0, 1),
Example: `
Get the recommendation result with job name tad-e998433e-accb-4888-9fc8-06563f073e86

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

$ theia throughput-anomaly-detection retrieve tad-e998433e-accb-4888-9fc8-06563f073e86
Use Service ClusterIP when getting the result
$ theia throughput-anomaly-detection retrieve tad-e998433e-accb-4888-9fc8-06563f073e86 --use-cluster-ip
Save the recommendation result to file

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@@ -1,4 +1,4 @@
// Copyright 2022 Antrea Authors.
// Copyright 2023 Antrea Authors.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to update this when editing an existing file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@@ -1,4 +1,4 @@
// Copyright 2022 Antrea Authors
// Copyright 2023 Antrea Authors

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

return nil
}

func tadrunJob(t *testing.T, data *TestData, algotype string) (stdout string, jobName string, err error) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have this series of functions runJob, getJobStatus, listJobs, deleteJob, retrieveJobResult in the policy recommendation e2e test, the only difference here is we are passing different command string. Could you please reuse these functions for these two e2e tests?

Copy link
Contributor Author

@tushartathgur tushartathgur Feb 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, added them to util.go

@tushartathgur tushartathgur force-pushed the Anomaly_detector branch 7 times, most recently from b360f56 to 48206a9 Compare February 27, 2023 21:35
@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

@@ -0,0 +1,2 @@
--Drop table
DROP tadetector_local
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new line at the end of the file.

DROP TABLE tadetector;

CREATE TABLE IF NOT EXISTS tadetector AS tadetector_local
engine=Distributed('{cluster}', default, tadetector_local, rand());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new line at the end of the file

Comment on lines 2 to 17
CREATE TABLE IF NOT EXISTS tadetector_local (
sourceIP String,
sourceTransportPort UInt16,
destinationIP String,
destinationTransportPort UInt16,
protocolIdentifier UInt16,
flowStartSeconds DateTime,
flowEndSeconds DateTime,
throughputStandardDeviation Float64,
algoType String,
algoCalc Float64,
throughput Float64,
anomaly String,
id String
) engine=ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/{table}', '{replica}')
ORDER BY (flowStartSeconds);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the extra indentation

Comment on lines 19 to 21
--Move data from old table and drop the old table
INSERT INTO tadetector_local SELECT * FROM tadetector;
DROP TABLE tadetector;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not get why we need this step as I think in v0.4.0 we do not have table tadetector?

@@ -54,6 +55,7 @@ func createAPIServerConfig(
tlsMinVersion uint16,
nprq querier.NPRecommendationQuerier,
chq querier.ClickHouseStatQuerier,
tad querier.ThroughputAnomalyDetectorQuerier,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename it as tadq to match the other naming?

@@ -0,0 +1,28 @@
// Copyright 2022 Antrea Authors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe instead of having a separate anomalydetector folder for TAD api, we can put these in intelligence, as TAD is a part of intelligence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t have a strong opinion on this, but imo it would be better have the separate set of APIs in a separate folder and then the sub apis in the same folder, however there is another thought process that all the apis should be in one folder.
lemme know which one works best ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that we should either moving anomalydetector into intelligence or renaming intelligence to policyrecommendation, otherwise the package naming strategy is not consistent. I personally prefer the first one, but do not have strong opinion. cc @dreamtalen @wsquan171 for suggestions.

Copy link

@dreamtalen dreamtalen Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should place anomalydetector under intelligence, I remember that was our design when introducing theia manager.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I would like to defer it to #168

@@ -0,0 +1,212 @@
// Copyright 2022 Antrea Authors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here, maybe move files under anomalydetector/throughputanomalydetector to intelligence/throughputanomalydetector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above


// Run will create defaultWorkers workers (go routines) which will process the Service events from the
// workqueue.
func (c *AnomalyDetectorController) Run(stopCh <-chan struct{}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may also need a resync logic like we have here #150 for NPR. And maybe reformat or reuse some of the code. But I'm fine to do it in a follow up PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I would like to defer it to #168

@tushartathgur tushartathgur force-pushed the Anomaly_detector branch 2 times, most recently from b30ce50 to f5db175 Compare February 28, 2023 17:58
@@ -135,7 +123,39 @@ func getPolicyRecommendationProgress(baseUrl string) (completedStages int, total
return completedStages, totalStages, nil
}

func deletePolicyRecommendationResult(connect *sql.DB, id string) (err error) {
func GetTADetectorProgress(baseUrl string) (completedStages int, totalStages int, err error) {
Copy link

@dreamtalen dreamtalen Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is exactly the same as GetPolicyRecommendationProgress, can you merge them into one, like GetSparkApplicationProgress

@@ -144,7 +164,16 @@ func deletePolicyRecommendationResult(connect *sql.DB, id string) (err error) {
return nil
}

func getPolicyRecommendationIds(connect *sql.DB) ([]string, error) {
func DeleteTADetectorResult(connect *sql.DB, id string) (err error) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, the only difference of DeleteTADetectorResult and DeletePolicyRecommendationResult is table name, could you combine them by passing a table name as argument?

w := tabwriter.NewWriter(os.Stdout, 15, 8, 1, '\t', tabwriter.AlignRight)
fmt.Fprintf(w, "id\tsourceIP\tsourceTransportPort\tdestinationIP\tdestinationTransportPort\tflowStartSeconds\tflowEndSeconds\tthroughput\talgoCalc\tanomaly\n")
for _, p := range tad.Stats {
fmt.Fprintf(w, "%v\t%v\t%v\t%v\t%v\t%v\t%v\t%v\t%v\t%v\n", p.Id, p.SourceIP, p.SourceTransportPort, p.DestinationIP, p.DestinationTransportPort, p.FlowStartSeconds, p.FlowEndSeconds, p.Throughput, p.AlgoCalc, p.Anomaly)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you format the throughput and algoCalc in the table output? For example, in #168, the example output

id                                      sourceIP        sourceTransportPort     destinationIP   destinationTransportPort        flowStartSeconds        flowEndSeconds          throughput                         algoCalc                anomaly
eec9d1be-7204-4d50-8f57-d9c8757a2668    10.10.1.25      58076                   10.10.1.33      5201                            2022-08-11T06:26:54Z    2022-08-11T08:24:54     10004969097.000000000000000000      4.0063773860532994E9    true

I feel users will have a hard time when reading these two numbers 10004969097.000000000000000000 and 4.0063773860532994E9

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, Could be handled in #168

@tushartathgur tushartathgur force-pushed the Anomaly_detector branch 3 times, most recently from af49d03 to e377768 Compare February 28, 2023 22:51
Copy link

@dreamtalen dreamtalen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve this PR.
I understand that we have decided to defer addressing some comments due to a tight schedule. Could you please track those items somewhere as a list, such as in a comment here or in the description of your next PR, just to make sure we're not missing anything?

Copy link
Contributor

@yanjunz97 yanjunz97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, only some nits

@@ -0,0 +1,212 @@
// Copyright 2022 Antrea Authors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The the new files should have 2023 instead of 2022

Suggested change
// Copyright 2022 Antrea Authors
// Copyright 2023 Antrea Authors

@@ -0,0 +1,223 @@
// Copyright 2020 Antrea Authors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Suggested change
// Copyright 2020 Antrea Authors
// Copyright 2023 Antrea Authors

@@ -0,0 +1,650 @@
// Copyright 2022 Antrea Authors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto for new files in this folder

@@ -0,0 +1,43 @@
// Copyright 2022 Antrea Authors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto for new files in this folder

@@ -234,10 +234,10 @@ grafana:
# volumeName: ""
sparkOperator:
# -- Determine whether to install Spark Operator. It is required to run Network
# Policy Recommendation jobs.
# Policy Recommendation and Throughput Anomaly detection jobs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Policy Recommendation and Throughput Anomaly detection jobs.
# Policy Recommendation and Throughput Anomaly Detection jobs.

Comment on lines 169 to 171
throughputAnomalyDetectionAlgoCmd.Flags().StringP("algo", "a", "",
`algorithm to use -> The algorithm used by throughput anomaly detection.
Currently supported Algorithms are EWMA, ARIMA and DBSCAN.`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
throughputAnomalyDetectionAlgoCmd.Flags().StringP("algo", "a", "",
`algorithm to use -> The algorithm used by throughput anomaly detection.
Currently supported Algorithms are EWMA, ARIMA and DBSCAN.`)
throughputAnomalyDetectionAlgoCmd.Flags().StringP("algo", "a", "",
`The algorithm used by throughput anomaly detection.
Currently supported Algorithms are EWMA, ARIMA and DBSCAN.`)

This PR is resonsible for:
    1. Adding throughput Anomaly detector files to Theia.
    2. Adding CLI support. CLI support for get, list and status is present, commands are yet to be implemented
    3. Adding CRD for the anomaly detector.
    4. Adding e2e Test for all three algos of TAD.
    5. Adding Unit tests
    6. Using same spark operator for all spark jobs

NOTE: TODO Add stat as an API

Signed-off-by: Tushar Tathgur <tathgurt@tathgurtFLVDL.vmware.com>
@tushartathgur
Copy link
Contributor Author

/theia-test-e2e

@tushartathgur tushartathgur merged commit 6435a00 into antrea-io:main Mar 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Network Threat Analytics - Throughput Anomaly Detector
3 participants