Gaining New Insight in Machine-Learning Datasets via Multiple Binary-Feature Frequency Ranks with a Mobile Benign/Malware Apps Example

DsFeatFreqComp – Dataset Feature-Frequency Comparison R Package: A Research Compendium of

Gaining New Insight in Machine-Learning Datasets via Multiple Binary-Feature Frequency Ranks with a Mobile Benign/Malware Apps Example

This repository is a research compendium for my academic publication below.

Gürol Canbek. Gaining New Insight in Machine-Learning Datasets via Multiple Binary-Feature Frequency Ranks with a Mobile Benign/Malware Apps Example, Hittite Journal of Science & Engineering, 2020 (Submitted).

The DsFeatFreqComp package provides two categories of important functions: dataset manipulation and visualization.

Related visualization functions:

plotDsFreqDistributionViolin
plotQQ
plotPairwiseDsPValuesHeatMap

Dataset manipulation functions:

loadDsFeatFreqsFromCsv2
meltDataFrame
loadPairwiseDsComparisonOfMeanRanks
getPairwiseDsPValueMatrix
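
The function names above suggest a workflow of loading per-dataset feature frequencies from a "csv2" file and melting them into long form. The sketch below illustrates that idea in base R only. It does not call the package's functions, whose signatures are not documented in this README, and the file layout, dataset names, and frequency values are invented for illustration.

```r
# Synthetic wide table: one row per feature, one column per dataset.
# The frequency values are invented and are not taken from the paper.
wide <- data.frame(
  feature  = c("INTERNET", "SEND_SMS", "CAMERA"),
  DatasetA = c(0.91, 0.05, 0.20),
  DatasetB = c(0.88, 0.31, 0.15)
)

# Round-trip through a "csv2" file (semicolon separator, comma decimal mark).
path <- tempfile(fileext = ".csv")
write.csv2(wide, path, row.names = FALSE)
loaded <- read.csv2(path, stringsAsFactors = FALSE)

# "Melt" to long form: one row per (feature, dataset) pair.
long <- data.frame(
  feature = rep(loaded$feature, times = ncol(loaded) - 1),
  stack(loaded[-1])
)
names(long) <- c("feature", "frequency", "dataset")

print(long)
```

The long form is convenient because grouped tests and plots in R usually expect one value column and one grouping column.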

Abstract

Researchers compare their Machine Learning (ML) classification performance with that of other studies without examining and comparing the datasets used in training, validation, and testing. One reason is that few convenient methods give initial insight into datasets beyond the descriptive statistics applied to individual continuous or quantitative features. After demonstrating initial manual analysis techniques, this study proposes a novel adaptation of the Kruskal-Wallis statistical test to compare a group of datasets over multiple prominent binary features, which are very common in today’s datasets. As an illustrative example, the new method was tested on six benign/malign mobile application datasets over the frequencies of prominent binary features to explore the dissimilarity of the datasets per class. The feature vector consists of over a hundred “application permission requests”, which are binary flags for the Android platform’s primary access control that provides privacy and secures data/information on mobile devices. Permissions are also the leading transparent features for ML-based malware classification. The proposed data-analytical methodology can be applied in any domain through that domain's prominent features of interest. The results, which are also visualized in new ways, show that the proposed method gives the degree of dissimilarity among the datasets. Specifically, the conducted test shows that the frequencies in the aggregated dataset and some of the datasets are not substantially different from each other, and that they are even in close agreement in the positive-class datasets. It is expected that the proposed domain-independent method gives researchers useful initial insight when comparing different datasets.
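
To make the idea in the abstract concrete, here is a minimal sketch of a Kruskal-Wallis comparison of feature frequencies across datasets, using only base R. The three datasets and their frequencies are randomly generated placeholders, not the paper's data, and the pairwise Wilcoxon follow-up with Holm correction is a stand-in that may differ from the post-hoc procedure used in the paper.

```r
# Synthetic illustration only: these frequencies are randomly generated.
set.seed(42)
n_features <- 100  # e.g. one value per permission

# Share of apps (0..1) requesting each feature in three hypothetical datasets.
freqs <- data.frame(
  DatasetA = rbeta(n_features, 1, 8),
  DatasetB = rbeta(n_features, 1, 8),
  DatasetC = rbeta(n_features, 2, 4)
)

# Long form: one row per (dataset, feature) frequency.
long <- stack(freqs)
names(long) <- c("frequency", "dataset")

# Omnibus rank-based test: do the datasets share one frequency distribution?
kw <- kruskal.test(frequency ~ dataset, data = long)

# Pairwise follow-up (assumed stand-in): Wilcoxon rank-sum tests, Holm-adjusted.
pw <- pairwise.wilcox.test(long$frequency, long$dataset,
                           p.adjust.method = "holm")

print(kw)
print(pw$p.value)
```

A small omnibus p-value indicates that at least one dataset's feature frequencies tend to rank differently from the others. The pairwise matrix then points to which dataset pairs drive that difference.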

Keywords

Machine learning; binary classification; dataset comparison; dataset profiling; feature engineering; quantitative analysis; data quality; Android; malware detection

Package Installation

From this GitHub repository

Load the devtools package with library(devtools). If it is not already installed, install it first with install.packages("devtools").
Install the package from the repository with install_github("gurol/DsFeatFreqComp").
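
The two steps above can be combined into one small helper. This is a convenience sketch: install_dsfeatfreqcomp is a hypothetical name introduced here, not part of the package, and running it requires network access.

```r
# Install DsFeatFreqComp from GitHub, installing devtools first if it is missing.
# install_dsfeatfreqcomp is a hypothetical helper name used only in this sketch.
install_dsfeatfreqcomp <- function(repo = "gurol/DsFeatFreqComp") {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github(repo)
}

# Call it once from an R session with network access:
# install_dsfeatfreqcomp()
```

Using requireNamespace avoids reinstalling devtools when it is already available.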

From a local package archive file downloaded to your computer

Set the working directory to the folder containing the file and run the following command in the R console: install.packages('DsFeatFreqComp_1.0.0.tar.gz', repos = NULL, type = 'source')

Extra information about the datasets

DsFeatFreqDistFit - Dataset Feature-Frequency Distributions Fitting
