-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Throughput Anomaly detector files and CLI support #152
Conversation
0296f01
to
95fd24d
Compare
d8b0fc0
to
c9d897b
Compare
e59d9e0
to
c81fb23
Compare
I notice when running algorithm ARIMA with only Theia components installed, it will mark the traffic from ClickHouse pod to ZooKeeper pod as anomaly. I understand this might only indicate that there are many data from ClickHouse to ZooKeeper, but I'm thinking about whether this behavior is expected by the users. I just leave this as an open question about whether we should treat the throughput between Theia components different as we do for other traffic. For example, can we predefine some normal pattern for the Theia components or set a different sensitive based on our experiments, etc. to decrease the false-positive for the components in Theia. |
Good point, we're ignoring flows related to these 3 namespaces in policy recommendation: ["kube-system", "flow-aggregator", "flow-visibility"]. We could do something similiar here. |
Yep, that could be a solution. But I'm thinking about anomaly detection might be different from policy recommendation as it is possible that the malware uses Theia components to propagate the malicious traffic. Directly excluding the namespace may leave them unmonitored. |
I agree. This ignore list of namespace could be configured by CLI like policy recommendation so I think it would be good enough for now. |
1628e5c
to
5155514
Compare
@@ -94,6 +94,7 @@ jobs: | |||
python -m pip install --upgrade pip | |||
python -m pip install pytest-cov | |||
python -m pip install -r plugins/policy-recommendation/requirements.txt | |||
python -m pip install -r plugins/anomaly-detection/requirements.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add anomaly-detection folder in other checks in this file, including check-python-changes
, python-lint
, test-unit
etc. This will enable our CI job to check code style and run UT for the pyspark files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
README.md
Outdated
@@ -20,6 +20,10 @@ NetworkPolicy configuration to secure Kubernetes network and applications. | |||
Please refer to the [NetworkPolicy Recommendation](docs/networkpolicy-recommendation.md) | |||
user guide to learn more. | |||
|
|||
Theia also provides throughput anomaly detection, it can find the anomalies in the network, and report them to the user. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap text at 80 characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
throughputStandardDeviation String, | ||
algoType String, | ||
algoCalc String, | ||
Throughputs String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throughputs
-> throughputs
or throughput
if you only store a single throughput value in this column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the datatype should not be String
if you are storing throughput value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
flowEndSeconds String, | ||
throughputStandardDeviation String, | ||
algoType String, | ||
algoCalc String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datatype should not be String
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
destinationTransportPort UInt16, | ||
protocolIdentifier UInt16, | ||
flowStartSeconds DateTime, | ||
flowEndSeconds String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datatype should not be String
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
protocolIdentifier UInt16, | ||
flowStartSeconds DateTime, | ||
flowEndSeconds String, | ||
throughputStandardDeviation String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datatype should not be String
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
docs/getting-started.md
Outdated
[NetworkPolicy Recommendation](networkpolicy-recommendation.md), please install | ||
Theia by runnning the following commands: | ||
To enable [Grafana Flow Collector](network-flow-visibility.md), | ||
[NetworkPolicy Recommendation](networkpolicy-recommendation.md) and [Throughput Anomaly Detection](throughput-anomaly-detection.md), please install |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap at 80 characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
@@ -0,0 +1,226 @@ | |||
// Copyright 2022 Antrea Authors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Yanjun here, anomaly_detection_run.go
is a better name because it can be consistent with our command.
Regarding to your concern, we can set a default algorithm like ARIMA for this command.
return nil | ||
} | ||
|
||
func init() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As Yanjun pointing out a concern and our discussion here #152 (comment), please add an option to let user ignore anomaly detection on some namespaces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you guys, however, introducing a new argument will lead to change in lot of files, thus in regards to the timeframe for 0.5 , I suggest we should do it with #168 . Please let me know if you guys agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with pushing this back to the next release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you
"antrea.io/theia/pkg/util" | ||
) | ||
|
||
// throughputAnomalyDetectionEWMACmd represents the policy-recommendation delete command |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
"start-time", | ||
"s", | ||
"", | ||
`The start time of the flow records considered for the policy recommendation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
policy recommendation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
"end-time", | ||
"e", | ||
"", | ||
`The end time of the flow records considered for the policy recommendation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
policy recommendation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
"antrea.io/theia/pkg/util" | ||
) | ||
|
||
// anomalyDetectionDeleteCmd represents the policy-recommendation delete command |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
policy-recommendation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
It will return the recommended NetworkPolicies described in yaml.`, | ||
Args: cobra.RangeArgs(0, 1), | ||
Example: ` | ||
Get the recommendation result with job name tad-e998433e-accb-4888-9fc8-06563f073e86 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
$ theia throughput-anomaly-detection retrieve tad-e998433e-accb-4888-9fc8-06563f073e86 | ||
Use Service ClusterIP when getting the result | ||
$ theia throughput-anomaly-detection retrieve tad-e998433e-accb-4888-9fc8-06563f073e86 --use-cluster-ip | ||
Save the recommendation result to file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
test/e2e/framework.go
Outdated
@@ -1,4 +1,4 @@ | |||
// Copyright 2022 Antrea Authors. | |||
// Copyright 2023 Antrea Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need to update this when editing an existing file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
test/e2e/main_test.go
Outdated
@@ -1,4 +1,4 @@ | |||
// Copyright 2022 Antrea Authors | |||
// Copyright 2023 Antrea Authors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
return nil | ||
} | ||
|
||
func tadrunJob(t *testing.T, data *TestData, algotype string) (stdout string, jobName string, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have this series of functions runJob
, getJobStatus
, listJobs
, deleteJob
, retrieveJobResult
in the policy recommendation e2e test, the only difference here is we are passing different command string. Could you please reuse these functions for these two e2e tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack, added them to util.go
b360f56
to
48206a9
Compare
/theia-test-e2e |
@@ -0,0 +1,2 @@ | |||
--Drop table | |||
DROP tadetector_local |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a new line at the end of the file.
DROP TABLE tadetector; | ||
|
||
CREATE TABLE IF NOT EXISTS tadetector AS tadetector_local | ||
engine=Distributed('{cluster}', default, tadetector_local, rand()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a new line at the end of the file
CREATE TABLE IF NOT EXISTS tadetector_local ( | ||
sourceIP String, | ||
sourceTransportPort UInt16, | ||
destinationIP String, | ||
destinationTransportPort UInt16, | ||
protocolIdentifier UInt16, | ||
flowStartSeconds DateTime, | ||
flowEndSeconds DateTime, | ||
throughputStandardDeviation Float64, | ||
algoType String, | ||
algoCalc Float64, | ||
throughput Float64, | ||
anomaly String, | ||
id String | ||
) engine=ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/{table}', '{replica}') | ||
ORDER BY (flowStartSeconds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the extra indentation
--Move data from old table and drop the old table | ||
INSERT INTO tadetector_local SELECT * FROM tadetector; | ||
DROP TABLE tadetector; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not get why we need this step as I think in v0.4.0 we do not have table tadetector?
cmd/theia-manager/theia-manager.go
Outdated
@@ -54,6 +55,7 @@ func createAPIServerConfig( | |||
tlsMinVersion uint16, | |||
nprq querier.NPRecommendationQuerier, | |||
chq querier.ClickHouseStatQuerier, | |||
tad querier.ThroughputAnomalyDetectorQuerier, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename it as tadq
to match the other naming?
@@ -0,0 +1,28 @@ | |||
// Copyright 2022 Antrea Authors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe instead of having a separate anomalydetector
folder for TAD api, we can put these in intelligence, as TAD is a part of intelligence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t have a strong opinion on this, but imo it would be better have the separate set of APIs in a separate folder and then the sub apis in the same folder, however there is another thought process that all the apis should be in one folder.
lemme know which one works best ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that we should either moving anomalydetector
into intelligence
or renaming intelligence
to policyrecommendation
, otherwise the package naming strategy is not consistent. I personally prefer the first one, but do not have strong opinion. cc @dreamtalen @wsquan171 for suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should place anomalydetector
under intelligence
, I remember that was our design when introducing theia manager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I would like to defer it to #168
@@ -0,0 +1,212 @@ | |||
// Copyright 2022 Antrea Authors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar here, maybe move files under anomalydetector/throughputanomalydetector
to intelligence/throughputanomalydetector
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above
|
||
// Run will create defaultWorkers workers (go routines) which will process the Service events from the | ||
// workqueue. | ||
func (c *AnomalyDetectorController) Run(stopCh <-chan struct{}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may also need a resync logic like we have here #150 for NPR. And maybe reformat or reuse some of the code. But I'm fine to do it in a follow up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I would like to defer it to #168
b30ce50
to
f5db175
Compare
pkg/controller/util.go
Outdated
@@ -135,7 +123,39 @@ func getPolicyRecommendationProgress(baseUrl string) (completedStages int, total | |||
return completedStages, totalStages, nil | |||
} | |||
|
|||
func deletePolicyRecommendationResult(connect *sql.DB, id string) (err error) { | |||
func GetTADetectorProgress(baseUrl string) (completedStages int, totalStages int, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is exactly the same as GetPolicyRecommendationProgress
, can you merge them into one, like GetSparkApplicationProgress
pkg/controller/util.go
Outdated
@@ -144,7 +164,16 @@ func deletePolicyRecommendationResult(connect *sql.DB, id string) (err error) { | |||
return nil | |||
} | |||
|
|||
func getPolicyRecommendationIds(connect *sql.DB) ([]string, error) { | |||
func DeleteTADetectorResult(connect *sql.DB, id string) (err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, the only difference of DeleteTADetectorResult
and DeletePolicyRecommendationResult
is table name, could you combine them by passing a table name as argument?
w := tabwriter.NewWriter(os.Stdout, 15, 8, 1, '\t', tabwriter.AlignRight) | ||
fmt.Fprintf(w, "id\tsourceIP\tsourceTransportPort\tdestinationIP\tdestinationTransportPort\tflowStartSeconds\tflowEndSeconds\tthroughput\talgoCalc\tanomaly\n") | ||
for _, p := range tad.Stats { | ||
fmt.Fprintf(w, "%v\t%v\t%v\t%v\t%v\t%v\t%v\t%v\t%v\t%v\n", p.Id, p.SourceIP, p.SourceTransportPort, p.DestinationIP, p.DestinationTransportPort, p.FlowStartSeconds, p.FlowEndSeconds, p.Throughput, p.AlgoCalc, p.Anomaly) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you format the throughput
and algoCalc
in the table output? For example, in #168, the example output
id sourceIP sourceTransportPort destinationIP destinationTransportPort flowStartSeconds flowEndSeconds throughput algoCalc anomaly
eec9d1be-7204-4d50-8f57-d9c8757a2668 10.10.1.25 58076 10.10.1.33 5201 2022-08-11T06:26:54Z 2022-08-11T08:24:54 10004969097.000000000000000000 4.0063773860532994E9 true
I feel users will have a hard time when reading these two numbers 10004969097.000000000000000000
and 4.0063773860532994E9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, Could be handled in #168
af49d03
to
e377768
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve this PR.
I understand that we have decided to defer addressing some comments due to a tight schedule. Could you please track those items somewhere as a list, such as in a comment here or in the description of your next PR, just to make sure we're not missing anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, only some nits
@@ -0,0 +1,212 @@ | |||
// Copyright 2022 Antrea Authors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The the new files should have 2023 instead of 2022
// Copyright 2022 Antrea Authors | |
// Copyright 2023 Antrea Authors |
@@ -0,0 +1,223 @@ | |||
// Copyright 2020 Antrea Authors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
// Copyright 2020 Antrea Authors | |
// Copyright 2023 Antrea Authors |
@@ -0,0 +1,650 @@ | |||
// Copyright 2022 Antrea Authors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto for new files in this folder
@@ -0,0 +1,43 @@ | |||
// Copyright 2022 Antrea Authors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto for new files in this folder
build/charts/theia/values.yaml
Outdated
@@ -234,10 +234,10 @@ grafana: | |||
# volumeName: "" | |||
sparkOperator: | |||
# -- Determine whether to install Spark Operator. It is required to run Network | |||
# Policy Recommendation jobs. | |||
# Policy Recommendation and Throughput Anomaly detection jobs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Policy Recommendation and Throughput Anomaly detection jobs. | |
# Policy Recommendation and Throughput Anomaly Detection jobs. |
throughputAnomalyDetectionAlgoCmd.Flags().StringP("algo", "a", "", | ||
`algorithm to use -> The algorithm used by throughput anomaly detection. | ||
Currently supported Algorithms are EWMA, ARIMA and DBSCAN.`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throughputAnomalyDetectionAlgoCmd.Flags().StringP("algo", "a", "", | |
`algorithm to use -> The algorithm used by throughput anomaly detection. | |
Currently supported Algorithms are EWMA, ARIMA and DBSCAN.`) | |
throughputAnomalyDetectionAlgoCmd.Flags().StringP("algo", "a", "", | |
`The algorithm used by throughput anomaly detection. | |
Currently supported Algorithms are EWMA, ARIMA and DBSCAN.`) |
This PR is resonsible for: 1. Adding throughput Anomaly detector files to Theia. 2. Adding CLI support. CLI support for get, list and status is present, commands are yet to be implemented 3. Adding CRD for the anomaly detector. 4. Adding e2e Test for all three algos of TAD. 5. Adding Unit tests 6. Using same spark operator for all spark jobs NOTE: TODO Add stat as an API Signed-off-by: Tushar Tathgur <tathgurt@tathgurtFLVDL.vmware.com>
e377768
to
a1f3750
Compare
/theia-test-e2e |
This PR is resonsible for:
1. Adding throughput Anomaly detector files to Theia.
2. Adding CLI support.
3. Adding CRD for the anomaly detector.
Signed-off-by: Tushar Tathgur tathgurt@tathgurtFLVDL.vmware.com