This is the README file for the reproduction package of the paper: "Are Latent Vulnerabilities Hidden Gems for Software Vulnerability Prediction? An Empirical Study".
The package contains the following artefacts:
- `data_bigvul`: contains the original and latent vulnerable functions, as well as non-vulnerable functions, from the Big-Vul dataset.
- `data_devign`: contains the original and latent vulnerable functions, as well as non-vulnerable functions, from the Devign dataset.
- `linevul`: contains the original code of the state-of-the-art LineVul vulnerability prediction model.
- `manual_analysis_rq`: contains the manual labeling and analysis of 140 samples of latent vulnerable functions from the Big-Vul and Devign datasets, as described in RQ1.
The `data_bigvul` and `data_devign` folders are large, so they are not included in the GitHub repository. Please download them from this link instead.
Before running any code, please install all the required Python packages using the following command: `pip install -r requirements.txt`
The next step is to run the scripts that generate the data splits for training, validating, and testing the models. The scripts for Big-Vul are `python extract_data_bigvul.py` and `python extract_data_bigvul_latest.py`; the scripts for Devign are `python extract_data_devign.py` and `python extract_data_devign_latest.py`. Note that the `*_latest.py` scripts split the data for the LIC scenario (using the same validation and testing sets, but different input latent vulnerable functions).
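As a quick sanity check after these scripts finish, you can inspect the generated splits. The following is only a minimal sketch: it assumes the splits are written as .csv files named `train.csv`, `valid.csv`, and `test.csv` with a binary `target` label column; these names are hypothetical, so adjust them to the actual output of the `extract_data_*` scripts.

```python
import pandas as pd

# Hypothetical file and column names: adapt them to the actual output
# produced by the extract_data_* scripts above.
for split in ["train", "valid", "test"]:
    df = pd.read_csv(f"data_bigvul/{split}.csv")
    # 'target' is assumed to be the vulnerability label (1 = vulnerable).
    print(split, len(df), "functions,", int(df["target"].sum()), "vulnerable")
```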
For RQ2 and RQ3, train and evaluate the models by running `sbatch` with one of the following scripts:
- `sbatch evaluate_linevul_bigvul.sh` (Baseline + Models with V-SZZ-based latent SVs)
- `sbatch evaluate_linevul_bigvul_latest.sh` (Models with V-SZZ-based latent SVs + Latest Introducing Commit)
- `sbatch evaluate_linevul_bigvul_predict.sh` (Models with V-SZZ-based latent SVs + Self-Training)
- `sbatch evaluate_linevul_bigvul_centroid.sh` (Models with V-SZZ-based latent SVs + Centroid-based Removal)
These scripts will generate results at both the function and line levels.
Note that after these training/evaluation scripts finish, they will generate the output folders `results_bigvul/` and `results_devign/`, containing the results (.csv files) of the RQ2 and RQ3 models for the Big-Vul and Devign datasets, respectively.
These .csv result files can then be used for analysis and comparison as described in the paper.
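For reference, the result files can be loaded and compared with a few lines of pandas. This is only a sketch: the file names and the metric column (`f1`) below are hypothetical, so adjust them to the actual .csv files produced in the output folders above.

```python
import pandas as pd

# Hypothetical file/column names: adapt to the .csv files in results_bigvul/.
baseline = pd.read_csv("results_bigvul/baseline.csv")
latent = pd.read_csv("results_bigvul/latent_vszz.csv")

# Compare a function-level metric (assumed to be in a column named 'f1')
# between the baseline model and the model trained with latent SVs.
print("baseline F1:", baseline["f1"].mean())
print("latent-SV F1:", latent["f1"].mean())
```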
The file `Statistical-test-results_RQ2_RQ3.xlsx` contains the results of the statistical tests comparing the performance of the optimal models with latent SVs against the baseline models without latent SVs, for both function-level and line-level tasks/metrics, according to Table 3 in the paper. Note that we only include the results in which using latent SVs outperformed the baseline.
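The spreadsheet already contains the computed test results. If you want to reproduce this kind of paired comparison yourself, a non-parametric Wilcoxon signed-rank test (whether this matches the exact test used in the paper is an assumption here) can be run with SciPy, e.g.:

```python
from scipy.stats import wilcoxon

# Paired per-run scores for the same metric. The values below are
# placeholders for illustration only, not results from the paper.
baseline_scores = [0.61, 0.63, 0.60, 0.62, 0.64]
latent_scores = [0.65, 0.66, 0.68, 0.63, 0.70]

stat, p_value = wilcoxon(latent_scores, baseline_scores)
print(f"Wilcoxon statistic={stat}, p-value={p_value:.4f}")
```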
The file `Low_Resource_Latent_SVs_Results.xlsx` includes the results of using a portion of the original vulnerabilities (SVs) together with latent SVs, compared with using all original SVs, as discussed in "Use of Latent SVs for Low-Resource Projects" in Section 7.1. Specifically, Devign required 50% of the original SVs + latent SVs, while Big-Vul required 70% of the original SVs + latent SVs. On average, 60% of the original SVs + latent SVs were thus required to surpass the performance of using all original SVs, as reported in the paper.
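The 60% figure is simply the average of the two dataset-specific fractions above:

```python
# Average fraction of original SVs (plus latent SVs) needed to surpass
# using all original SVs, per the per-dataset results above.
devign_fraction = 0.5   # Devign: 50% of original SVs + latent SVs
bigvul_fraction = 0.7   # Big-Vul: 70% of original SVs + latent SVs
print((devign_fraction + bigvul_fraction) / 2)  # 0.6 -> 60% on average
```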