MalwareDetection.txt

Malware Detection Using Machine Learning

Introduction
Covers the concepts behind, and the workings of, different types of threats: malware, spam, phishing, exploits, viruses, worms, trojans, spyware, logic bombs, and rootkits.

Working Of Anti-virus/Anti-malware
# The conventional way of detecting malware is SIGNATURE BASED DETECTION. In this method, attributes extracted from known malware (a particular
piece of code, a string, or any other identifying feature) are matched against every file; if a file contains the same attribute, it is
declared malware and terminated.
# The other way is a heuristics-based approach, i.e. ANOMALY BASED DETECTION. Certain anomalous activities are listed, and any file that performs
those activities is treated as malware and terminated. But these methods have drawbacks:
1. The attributes used to search for malware may not be unique; they can also be present in other files, so a non-suspicious file can
also get terminated.
2. Brand-new malware cannot be stopped, because it first has to be reverse engineered to extract its attributes, which takes time.
3. Polymorphic malware, which can change its signature, goes undetected.
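
The signature matching described above, along with drawbacks 1 and 3, can be illustrated with a minimal sketch. The signature names and byte patterns below are made up for illustration and do not correspond to any real malware or anti-virus product.

```python
# Hypothetical signature database: name -> byte pattern (invented for this example).
SIGNATURES = {
    "Demo.Trojan.A": b"\xde\xad\xbe\xef",
    "Demo.Worm.B": b"EVIL_PAYLOAD",
}


def scan_bytes(data):
    """Return the sorted names of all signatures whose pattern occurs in data."""
    return sorted(name for name, pattern in SIGNATURES.items() if pattern in data)


clean = b"hello world"
infected = b"header" + b"\xde\xad\xbe\xef" + b"trailer"
# Drawback 1: a harmless file that merely contains the string is flagged too.
benign_doc = b"This report describes the EVIL_PAYLOAD marker."
# Drawback 3: a variant with a single changed byte no longer matches.
mutated = b"header" + b"\xde\xad\xbe\xee" + b"trailer"

for label, sample in [("clean", clean), ("infected", infected),
                      ("benign_doc", benign_doc), ("mutated", mutated)]:
    print(label, scan_bytes(sample))
```

A real scanner would use far richer signatures than raw substrings, but the false positive and the missed variant arise in the same way.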


Machine Learning
Supervised Learning : learning is based on labeled data. We start with an initial dataset in which each sample is mapped to its correct outcome, and the model is trained on this dataset, where the correct results are known. e.g.
# Regression - Predict a continuous value from previous observations, i.e. the values of the samples in the training set.
# Classification - Given a set of labeled data, where each label defines the class a sample belongs to, predict the class of a previously unseen sample. The set of possible outputs is finite and usually small.
Unsupervised Learning : there is no initial labeling of the data. The goal is to find patterns in unsorted data rather than to predict a value. e.g.
# Clustering - Find hidden patterns in the unlabeled data and separate it into clusters according to similarity.
From a machine learning perspective, malware detection can be seen as a classification or clustering problem: unknown malware types
should be grouped into clusters based on properties identified by the algorithm.
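
As a rough illustration of the clustering view, here is a minimal k-means sketch in plain Python. The two features per sample (a count of suspicious API calls and a fraction of high-entropy sections) and all sample values are assumptions chosen for the example, not data from any real study.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                # Mean of each feature dimension over the cluster members.
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                # Keep the old centroid if no point was assigned to it.
                new_centroids.append(centroids[c])
        centroids = new_centroids
    return assignments, centroids


# Hypothetical samples: (suspicious API call count, fraction of high-entropy sections).
samples = [(1, 0.1), (2, 0.2), (1, 0.15), (9, 0.9), (10, 0.8), (11, 0.85)]
# Fixed starting centroids keep the run deterministic.
assignments, centroids = kmeans(samples, [samples[0], samples[3]])
print(assignments)
```

No labels are supplied here; the algorithm only groups similar samples, and it is left to an analyst to decide what each cluster represents.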

The following applications of machine learning can be used in malware detection :
# Data Mining - N-grams, API/system calls (zero-day malware detection), assembly instructions, hybrid features.
# Neural Networks
# Deep Learning
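
To make the N-gram feature concrete, the sketch below counts overlapping byte n-grams in a file's contents and turns them into a fixed-length feature vector. The function names and the tiny vocabulary are hypothetical; a real pipeline would choose its vocabulary from a training corpus.

```python
from collections import Counter


def byte_ngrams(data, n=2):
    """Count every overlapping n-byte window in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def to_feature_vector(data, vocabulary, n=2):
    """Map data to a list of counts, one per n-gram in the given vocabulary."""
    counts = byte_ngrams(data, n)
    return [counts[gram] for gram in vocabulary]


# Hypothetical vocabulary of 2-grams chosen for this example.
vocabulary = [b"AB", b"BA", b"ZZ"]
print(byte_ngrams(b"ABAB"))
print(to_feature_vector(b"ABAB", vocabulary))
```

Vectors of this kind are what a classifier or clustering algorithm would then consume.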

ML Algorithms to be used in Malware Detection :
1. K-Nearest Neighbours
2. Support Vector Machines
3. Naive Bayes
4. J48 Decision Tree
5. Random Forest
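
As an example of how the simplest of these algorithms works, here is a minimal K-Nearest Neighbours sketch in plain Python. The training samples, their two features, and the labels are invented for illustration; in practice a library implementation would normally be used.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the majority label among the k training samples closest to query.

    train is a list of (feature_vector, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labeled samples: (suspicious API call count, high-entropy fraction).
train = [
    ((1, 0.1), "benign"), ((2, 0.2), "benign"), ((1, 0.15), "benign"),
    ((9, 0.9), "malware"), ((10, 0.8), "malware"), ((11, 0.85), "malware"),
]

print(knn_predict(train, (8, 0.7)))
print(knn_predict(train, (1.5, 0.1)))
```

The unscaled features here let the call count dominate the distance; real feature vectors are usually normalized first.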

It was expected that the use of ML would increase accuracy exponentially, but it does not.
In classification problems, different models gave different results. The lowest accuracy was achieved by Naive Bayes (72.34% and 55%), followed by K-Nearest Neighbours and Support Vector Machines (87%-94.6% and 87.6%-94.6% respectively).
The highest accuracy was achieved by the J48 and Random Forest models: 93.3% and 95.69% respectively for multi-class classification, and 94.6% and 96.8% respectively for binary classification.
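
For reference, accuracy is the fraction of samples whose predicted label matches the true label. The sketch below computes it for a multi-class labeling and for the same predictions collapsed to binary (malware vs. benign). The labels are hypothetical and are not the data behind the figures above; the example only shows one way a binary score can come out higher, since confusing one malware family with another is an error in multi-class scoring but not in binary scoring.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def to_binary(label):
    """Collapse any malware family name to the single class 'malware'."""
    return "benign" if label == "benign" else "malware"


# Hypothetical true and predicted labels.
y_true = ["benign", "trojan", "worm", "trojan", "benign"]
y_pred = ["benign", "worm", "worm", "trojan", "benign"]

multi_class = accuracy(y_true, y_pred)
binary = accuracy([to_binary(t) for t in y_true], [to_binary(p) for p in y_pred])
print(multi_class, binary)
```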


******************************
Other AI projects and ideas :
1. Detecting suspicious URLs
2. Identifying spam messages
3. Intrusion detection system