Malware Analysis of Files
Introducing a malware sample into a computer and analysing it to determine whether or not it is malware.
Malware, short for malicious software, is a class of hostile or intrusive software. The taxonomy of malware classes is broad and varied, and antivirus software has been successfully detecting malware for decades. The traditional ways to detect malware are either signature based or behavior based. Signature based detection compares newly scanned software with stored patterns of known malware, called the "signature" of the malware. If a high similarity score is found, the sample is flagged as malware. Since the method inspects the software sample itself rather than its execution, it is also called the static method. The behavior based method, also called the "anomaly based" method, tries to capture anomalies in the runtime environment of the software; if abnormal behavior is observed, such as malicious network traffic, the software is quarantined.
This method is also called the dynamic method. In a real world scenario, dynamic methods generate many false positives, and static methods are generally preferred for antivirus software. However, the traditional methods are not sufficient for newly emerged classes of malware. These are called "polymorphic" and "metamorphic" malware, and they are able to rewrite themselves in different execution environments. The names indicate that traditional static analysis would not work, since the new malware no longer has a fixed signature attached to it. The problem lends itself to being framed as a classification problem, since machine learning algorithms can classify polymorphic malware based on attributes observed in a training set. Therefore, the motivation for the project is to use machine learning to classify known classes of polymorphic malware.
The data consists of a number of known malware samples parsed into JSON files, with each JSON file representing a single piece of malware. The original malware samples were encoded in the MIST format, which encodes the hex dumped assembly code of malware into more compact strings (Trinius et al., 2009). Data preprocessing consists of processing all the JSON files and constructing a pandas data frame from them, with columns for the attributes of each malware sample. Every JSON file has a specific malware name attached to it, so a label column is added to the data frame, which was later converted from a categorical scale to a numeric scale. Since the malware names are too specific, certain labels are grouped together. For example, "Trojan-Destover-Sony-1201172" and "Trojan.Dropper.Gen-1201172" are both labeled as "Trojan" and later as class label 3.
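The preprocessing step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the "name" key, the FAMILIES tuple, the helper names coarse_label and load_samples, and the sample attribute value are all assumptions made for the example, and the integer codes produced here simply follow alphabetical order, so they will not necessarily match the class numbers used in the project.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical family keywords; the text does not give the full grouping rules.
FAMILIES = ("Trojan", "Crypto", "Locker", "WMIGhost")


def coarse_label(name):
    """Map a specific malware name to a coarse family by prefix match."""
    for family in FAMILIES:
        if name.lower().startswith(family.lower()):
            return family
    return "Other"


def load_samples(directory):
    """Build one data frame row per *.json file found in `directory`."""
    rows = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            # Assumption: each file holds one flat JSON object with a "name" key.
            rows.append(json.load(handle))
    frame = pd.DataFrame(rows)  # keys missing from a file become NaN
    frame["label"] = frame["name"].map(coarse_label)
    # Integer codes follow the sorted order of the label strings.
    frame["label_code"] = frame["label"].astype("category").cat.codes
    return frame


# Tiny demonstration with made-up attribute values.
with tempfile.TemporaryDirectory() as tmp:
    samples = [
        {"name": "Trojan-Destover-Sony-1201172", "pesecentropy": 6.1},
        {"name": "Trojan.Dropper.Gen-1201172"},
    ]
    for index, sample in enumerate(samples):
        Path(tmp, f"sample_{index}.json").write_text(json.dumps(sample), encoding="utf-8")
    frame = load_samples(tmp)

print(frame[["name", "label", "label_code"]])
```

Attributes that are absent from a given JSON file show up as NaN in the resulting data frame, which is one plausible reason a frame built this way contains many missing values.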
The initial data frame has 137 features and contains many NaN values. Therefore, feature selection is required, and two statistical measures are applied. The first metric is the number of non-NaN values for a feature: among the 137 features, 9 have more than 2000 non-NaN values, which makes them better candidates for the training set, since more training data generally implies a better classification result. The second metric is the count of each distinct value for a single attribute. If the distribution of a feature is too sparse, it is omitted, since this indicates higher entropy and less information available for classification. This metric results in 3 final candidate attributes: peseccharacter, pesecentropy, and pesecname. The final 3 attributes and the label were then merged into a single pandas data frame for the classification task.
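The two selection metrics can be illustrated with a short pandas sketch. The function name select_features and the max_distinct_ratio threshold are assumptions introduced for this example; the text does not state how "too sparse" was quantified, so the ratio of distinct values to observed values is used here only as one possible stand-in.

```python
import numpy as np
import pandas as pd


def select_features(frame, min_non_nan, max_distinct_ratio):
    """Keep columns with enough observed values and a not-too-sparse distribution."""
    non_nan = frame.notna().sum()          # metric 1: observed values per feature
    distinct = frame.nunique(dropna=True)  # metric 2: distinct values per feature
    kept = []
    for column in frame.columns:
        if non_nan[column] <= min_non_nan:
            continue  # too few observed values to train on
        # Hypothetical sparsity rule: drop a feature whose observed values
        # are almost all different from each other.
        if distinct[column] / non_nan[column] > max_distinct_ratio:
            continue
        kept.append(column)
    return kept


# Toy frame standing in for the 137-feature data frame.
toy = pd.DataFrame({
    "dense": [1, 1, 2, 2, 1, 2],
    "mostly_nan": [1, np.nan, np.nan, np.nan, np.nan, np.nan],
    "all_unique": [10, 11, 12, 13, 14, 15],
})
selected = select_features(toy, min_non_nan=3, max_distinct_ratio=0.5)
print(selected)
```

With the project's data, min_non_nan would correspond to the threshold of 2000 mentioned above.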
Since the classes are prelabeled, a supervised learning algorithm is a plausible candidate for the model. Since the problem is essentially multiclass, the candidate models chosen are Logistic Regression (LR), Random Forest (RF), and Multi-layer Perceptron (MLP), all of which are capable of multiclass classification.
Logistic Regression does not require a linear relationship between the dependent and independent variables, and the errors need not be normally distributed. Since little can be assumed about the training set, it is a good benchmark for testing. However, some assumptions are still made: first, logistic regression requires observations to be independent of each other; second, independent variables should not be too highly correlated; and finally, it assumes linearity of the log odds.
These assumptions are reasonable for the dataset, since the malware samples do not come from a single source but from different hosts on different operating systems.
Random Forest: the random forest model finds the best splits and performs best when the samples are independent, which is the case here. Other than that, random forest makes no assumption about the distribution of the data, so it is a good classifier for the task at hand.
Multilayer Perceptron, or Feedforward Neural Network: since a neural network is a universal function approximator, no assumption needs to be made here about the dataset. Therefore, a neural network is a good model to test on the dataset.
4.2 Feasibility and Comparison
The assumptions of the three models are reasonable, and logistic regression makes the most assumptions. Since all the models are capable of multiclass classification, fitting the models is simple. In the end, each attribute is used as the training data and the label as the target value, and the performance of the models on each attribute is compared to see which model gives the best performance.
Other possible choices include Support Vector Machine, K-nearest neighbor, and Decision Tree. Unsupervised learning algorithms such as K-means, Gaussian Mixture, and DBSCAN could be considered as well. However, since the dataset is prelabeled and supervised learning algorithms generally perform better on labeled data, they are omitted. Decision Tree is a rudimentary version of Random Forest, so it is not considered. SVM could be a plausible candidate, but in the end, the models used are LR, RF, and MLP.
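The comparison of LR, RF, and MLP on a single attribute can be sketched with scikit-learn as below. This is an illustration under stated assumptions: the data is synthetic (generated with make_classification, not the malware attributes), the helper name compare_models and all hyperparameter values are chosen only for the example, macro-averaged F1 is assumed as the score, and the 0.33 split is assumed to mean test_size=0.33.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def compare_models(X, y, seed=0):
    """Fit LR, RF and MLP on the same split and return the macro F1 of each."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed, stratify=y
    )
    # Illustrative settings only; not the tuned values from the project.
    models = {
        "LR": LogisticRegression(max_iter=1000),
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
        "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = f1_score(y_test, model.predict(X_test), average="macro")
    return scores


# Synthetic three-class data standing in for one encoded attribute.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
scores = compare_models(X, y)
print(scores)
```

A categorical attribute such as a section name would need to be encoded numerically (and its NaN values handled) before it could be passed to these estimators.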
The models used in the project come directly from Python's scikit-learn module; therefore, the work to be done is mainly hyperparameter tuning. The hyperparameter tuning algorithm is the same for all models and is straightforward: it picks the set of hyperparameters with the highest F1 score, a popular metric for classifier performance.
The default optimizer used for MLP is a stochastic gradient-based method (scikit-learn's MLPClassifier defaults to the Adam solver), and the output function used by MLP and logistic regression in this project is the softmax function, with cross-entropy as the loss function. These choices make sense because the task is multiclass classification and because Newton's method is too expensive in this case.
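The idea of picking the hyperparameters with the highest F1 score can be sketched as a plain loop over a parameter grid. The function name tune_by_f1, the grid values, the use of a held-out validation split, and macro averaging are all assumptions for this example; the original algorithm listing is not reproduced here, and the data is again synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, train_test_split


def tune_by_f1(model_cls, grid, X_train, y_train, X_val, y_val):
    """Return the hyperparameter set with the highest macro F1 on the validation data."""
    best_params, best_f1 = None, -1.0
    for params in ParameterGrid(grid):
        model = model_cls(**params).fit(X_train, y_train)
        score = f1_score(y_val, model.predict(X_val), average="macro")
        if score > best_f1:
            best_params, best_f1 = params, score
    return best_params, best_f1


# Synthetic three-class data; not the project's data set.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# Hypothetical grid; the text does not list the hyperparameters that were searched.
grid = {"n_estimators": [10, 50], "max_depth": [3, None], "random_state": [0]}
best_params, best_f1 = tune_by_f1(
    RandomForestClassifier, grid, X_train, y_train, X_val, y_val
)
print(best_params, best_f1)
```

scikit-learn's GridSearchCV with scoring="f1_macro" performs a comparable search using cross-validation instead of a single validation split.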
For feature 1, peseccharacter, model performance was evaluated using a 0.33 train-test split on the dataset. The attribute performs well for Crypto and poorly for Locker. MLP and LR performed poorly on the other features, mainly because of the NA values in the dataset, but Random Forest gives better performance than the other two classifiers.
The results are similar to the previous ones, with Random Forest overfitting the other categories, which lack sufficient samples. WMIGhost, which has only 9 samples in the entire attribute, has a zero score across all combinations.
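A per-class F1 score of zero, as reported for the smallest class, is what the metric yields whenever a class is never predicted correctly. The sketch below shows this with hand-written labels; the y_true and y_pred lists are made up for illustration and are not results from the project.

```python
from sklearn.metrics import f1_score

# Hypothetical labels for illustration only.
classes = ["Crypto", "Locker", "WMIGhost"]
y_true = ["Crypto", "Crypto", "Crypto", "Locker", "Locker", "WMIGhost"]
y_pred = ["Crypto", "Crypto", "Crypto", "Locker", "Crypto", "Crypto"]

# average=None returns one F1 value per class, in the order given by `labels`.
# zero_division=0 reports 0 for a class that is never predicted.
per_class = f1_score(y_true, y_pred, labels=classes, average=None, zero_division=0)
scores = dict(zip(classes, per_class))
print(scores)
```

In this toy case the rare class is always absorbed by the majority class, so its precision and recall are both zero regardless of how well the other classes are handled.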