Gaining New Insight in Machine-Learning Datasets via Multiple Binary-Feature Frequency Ranks with a Mobile Benign/Malware Apps Example
This repository is a research compendium of my academic publication below.
Gürol Canbek. Gaining New Insight in Machine-Learning Datasets via Multiple Binary-Feature Frequency Ranks with a Mobile Benign/Malware Apps Example, Hittite Journal of Science & Engineering, 2020 (Submitted).
The DsFeatFreqComp package provides two categories of functions: visualization and dataset manipulation.
The visualization functions are
- plotDsFreqDistributionViolin
- plotQQ
- plotPairwiseDsPValuesHeatMap
The dataset manipulation functions are
- loadDsFeatFreqsFromCsv2
- meltDataFrame
- loadPairwiseDsComparisonOfMeanRanks
- getPairwiseDsPValueMatrix
Researchers compare their Machine Learning (ML) classification performances with other studies without examining and comparing the datasets they used in training, validating, and testing. One of the reasons is that there are not many convenient methods to give initial insights about datasets besides the descriptive statistics applied to individual continuous or quantitative features. After demonstrating initial manual analysis techniques, this study proposes a novel adaptation of the Kruskal-Wallis statistical test to compare a group of datasets over multiple prominent binary features that are very common in today’s datasets. As an illustrative example, the new method was tested on six benign/malign mobile application datasets over the frequencies of prominent binary features to explore the dissimilarity of the datasets per class. The feature vector consists of over a hundred “application permission requests”, which are binary flags for the Android platform’s primary access control that provides privacy and secures data/information on mobile devices. Permissions are also the leading transparent features for ML-based malware classification. The proposed data-analytical methodology can be applied in any domain through its prominent features of interest. The results, which are also visualized in new ways, show that the proposed method gives the degree of dissimilarity among the datasets. Specifically, the conducted test shows that the frequencies in the aggregated dataset and some of the datasets are not substantially different from each other, and that they are even in close agreement in the positive-class datasets. It is expected that the proposed domain-independent method will bring useful initial insight to researchers comparing different datasets.
Machine learning; binary classification; dataset comparison; dataset profiling; feature engineering; quantitative analysis; data quality; Android; malware detection
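The core idea described above, comparing several datasets over the frequencies of shared binary features with the Kruskal-Wallis test, can be sketched with base R alone. The example below is an illustrative assumption and not the package's implementation: the dataset names (DS1 to DS3), the feature count, and the frequencies are synthetic, and it calls only kruskal.test from the stats package rather than any DsFeatFreqComp function.

```r
# Minimal sketch with synthetic data (all names and values are hypothetical).
set.seed(42)

n_features <- 100  # e.g. number of binary permission flags

# Fraction of apps in each dataset that request each feature (between 0 and 1).
# DS1 and DS2 are drawn from the same distribution; DS3 is shifted higher.
freqs <- data.frame(
  DS1 = rbeta(n_features, 1, 8),
  DS2 = rbeta(n_features, 1, 8),
  DS3 = rbeta(n_features, 4, 4)
)

# Reshape to long format: one row per (feature frequency, dataset) pair.
long <- stack(freqs)
names(long) <- c("frequency", "dataset")

# Omnibus Kruskal-Wallis rank sum test over all three datasets.
kw_all <- kruskal.test(frequency ~ dataset, data = long)

# The same test restricted to one pair of datasets (DS1 vs. DS2).
pair <- droplevels(subset(long, dataset != "DS3"))
kw_pair <- kruskal.test(frequency ~ dataset, data = pair)

print(kw_all$p.value)
print(kw_pair$p.value)
```

A small p-value from the omnibus test suggests that at least one dataset's feature-frequency distribution differs from the others. Repeating the test for each pair of datasets is one way to see which datasets drive that difference.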
- Load the devtools package by running library(devtools). If the package is not already installed, install it first with install.packages("devtools").
- Install the package from the repository by running install_github("gurol/DsFeatFreqComp").
Alternatively, to install from the source package file, go to the file's directory and run the following command in the R console
install.packages('DsFeatFreqComp_1.0.0.tar.gz', repos = NULL, type = 'source')
DsFeatFreqDistFit - Dataset Feature-Frequency Distributions Fitting