Skip to content

implement of ML-based anomaly detection models to identify cyberattacks from NetFlow data

Notifications You must be signed in to change notification settings

Heewon-Hailey/cyberattack-analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 

Repository files navigation

Cyberattack Detecting Models

About the project

It aims to implement two unsupervised machine learning models, Cluster-based Local Outlier Factors and Isolation Forest, to identify malicious attacks.

Dataset

A collection of traffic flow (NetFlow) data is used. The dataset is collected from from 01/12/12 03:24:33 to 10/12/12 23:24:32 and contains botnet traffic and normal traffic. Training and test dataset for unsupervised models consist of 14 features; timestamp, duration, protocol, source IP, source port, direction, destination IP, destination port, state, source type service, destination type service, total packets, transmitted bytes from both sides and transmitted bytes from source. In addition to the 14 features, the other datasets include true labels.

Training Dataset Valid Dataset Test Dataset
13,882,035 940,062 1,053,845

Identified Attack Scenarios

Several abnormal traffic patterns were observed in the test dataset from Splunk. Firstly, each traffic of UDP and ICMP protocol take up 21% and 39% in the total traffic flow. In general, UDP volume contributes less than 9% while less than 1% of ICMP volume occurs [1]. Second, port 53 was used the most through UDP and its traffic significantly increased for a short time. Besides, the traffic volume via ICMP protocol soared regularly with a similar pattern over time. Lastly, there was frequent volume change via port 25 which is the most often used in TCP protocol. According to the observed pattern, the test data could include port scan attacks, ICMP flood attack in Distributed Denial of Service (DDoS) and spamming. Therefore, I focuse on these attacks to implement machine learning models and detect them.

Feature Generation Methods

No. Method Applied Techniques (in order)
1 A A1. PCA (optimal dimension = 3)
A2. Embedded Selection (Information Entropy)
2 A+B B. Aggregation by time-based clustering
A1. PCA (optimal dimension = 5)
A2. Embedded Selection(Information Entropy)
3 A+B+C B. Aggregation of features by time-based clustering
C. Pearson's correlation coefficient-based filtering
A1. PCA (optimal dimension = 4)
A2. Embedded Selection (Information Entropy)

*Standard Scaler and Normalizer are applied

Detecting Algorithms

Two well-known anomaly detection algorithms, Cluster-based Local Outlier Factor (CBLOF) and Isolation Forest (IF), are considered for classifiers.

No. Model Tuned Params
1 CBLOF n_clusrers = 8
contamination = 0.1
2 IF n_estimators = 100
max_samples = 0.25
contamination = 0.1

Total 6 combinated models (3 Feature Generation Methods * 2 Detecting Algorithms) are examined to compare detected attacks.

Version

Pandas 1.3.1
Numpy 1.19.5
Matplotlib 3.4.2

References

[1] D. Lee, B. E. Carpenter and N. Brownlee. (2010). Observations of UDP to TCP Ratio and Port Numbers. 2010 Fifth International Conference on Internet Monitoring and Protection, Barcelona, pp. 99-104, doi: 10.1109/ICIMP.2010.20.

Releases

No releases published

Packages

No packages published