DeepBugs is a framework for learning bug detectors from an existing code corpus.

livingdan/DeepBugs_replication

This branch is 24 commits ahead of michaelpradel/DeepBugs:master.

Latest commit: bd58df1 · Apr 30, 2024

DeepBugs Replication and Extension using a CNN model architecture

DeepBugs is a framework for learning name-based bug detectors from an existing code corpus. See our OOPSLA'18 paper for a detailed description.

I have extended DeepBugs to use a CNN architecture, which improves the performance of the three bug detectors at the cost of increased training time.

Use the three Python notebooks, one for each bug detector from the paper, to compare the performance of the new model:

  • DeepBugs_replicated_BinOperator_CNN_Comparison.ipynb
  • DeepBugs_replicated_IncorrectBinaryOperand_CNN_Comparison.ipynb
  • DeepBugs_replicated_swapped_args_CNN_Comparison.ipynb

Running the notebooks as-is requires Google Colab Pro's high-RAM machine with a T4 GPU; reducing the dataset further may allow them to run on low-RAM machines.

Follow the steps in the notebooks:

  1. Clone the repository.

  2. Download the dataset.

  3. Optional: learn your own embeddings (by default, the models use a pre-embedded file).

  4. Extract training examples:

  • Option 1: unzip the previously extracted files.

  • Option 2: extract your own training examples.

  5. Train and evaluate the original model and our CNN model to compare results.

Overview

  • All commands are called from the main directory.
  • Python code (most of the implementation) and JavaScript code (for extracting data from .js files) are in the /python and /javascript directories.
  • All data to learn from, e.g., .js files are expected to be in the /data directory.
  • All generated data, e.g., intermediate representations, is written into the main directory. It is recommended to move these files into separate directories.
  • All generated data files have a timestamp as part of the file name. Below, all files are used with *. When running commands multiple times, make sure to use the most recent files.
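Because each run writes freshly timestamped output files, a small helper can select the most recent file matching a pattern. This is a minimal sketch (not part of DeepBugs itself), assuming modification time is a reliable proxy for recency:

```python
import glob
import os

def most_recent(pattern):
    """Return the newest file matching a glob pattern, by modification time."""
    matches = glob.glob(pattern)
    if not matches:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return max(matches, key=os.path.getmtime)

# e.g. most_recent("calls_*.json") picks the latest extraction output
```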

Requirements

  • Node.js
  • npm modules (install with npm install module_name): acorn, estraverse, walk-sync
  • Python 3
  • Python packages: keras, scipy, numpy, sklearn

JavaScript Corpus

  • The full corpus can be downloaded here and is expected to be stored in data/js/programs_all. It consists of 100,000 training files, listed in data/js/programs_training.txt, and 50,000 files for validation, listed in data/js/programs_eval.txt.
  • This repository contains only a very small subset of the corpus. It is stored in data/js/programs_50. Training and validation files for the small corpus are listed in data/js/programs_50_training.txt and data/js/programs_50_eval.txt.

Learning a Bug Detector

Creating a bug detector consists of two main steps:

  1. Extract positive (i.e., likely correct) and negative (i.e., likely buggy) training examples from code.
  2. Train a classifier to distinguish correct from incorrect code examples.

Each bug detector addresses a particular bug pattern, e.g.:

  • The SwappedArgs bug detector looks for accidentally swapped arguments of a function call, e.g., calling setPoint(y,x) instead of setPoint(x,y).
  • The BinOperator bug detector looks for incorrect operators in binary operations, e.g., i <= len instead of i < len.
  • The IncorrectBinaryOperand bug detector looks for incorrect operands in binary operations, e.g., height - x instead of height - y.
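Negative (likely buggy) training examples are obtained by seeding bugs of these patterns into likely-correct code. A minimal sketch of the three mutations, assuming illustrative tuple shapes for extracted code pieces (not DeepBugs' actual data format):

```python
import random

def swap_args(call):
    """SwappedArgs: swap the first two arguments of a call."""
    name, args = call
    return (name, [args[1], args[0]] + args[2:])

def change_operator(binop, operators=("<", "<=", ">", ">=", "==", "!=")):
    """BinOperator: replace the operator with a different one."""
    left, op, right = binop
    return (left, random.choice([o for o in operators if o != op]), right)

def change_operand(binop, identifiers):
    """IncorrectBinaryOperand: replace the right operand with another identifier."""
    left, op, right = binop
    return (left, op, random.choice([i for i in identifiers if i != right]))

# swap_args(("setPoint", ["x", "y"])) -> ("setPoint", ["y", "x"])
```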

Step 1: Extract positive and negative training examples

node javascript/extractFromJS.js calls --parallel 4 data/js/programs_50_training.txt data/js/programs_50

  • The --parallel argument sets the number of processes to run.
  • programs_50_training.txt contains files to include (one file per line). To extract data for validation, run the command with data/js/programs_50_eval.txt.
  • The last argument is a directory that gets recursively scanned for .js files, considering only files listed in the file provided as the second argument.
  • The command produces calls_*.json files, which contain data suitable for the SwappedArgs bug detector. For the other two bug detectors, replace calls with binOps in the above command.

Step 2: Train a classifier to identify bugs

A) Train and validate the classifier python3 python/BugLearnAndValidate.py --pattern SwappedArgs --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --training_data calls_xx*.json --validation_data calls_yy*.json

  • The first argument selects the bug pattern.
  • The next three arguments are vector representations for tokens (here: identifiers and literals), for types, and for AST node types. These files are provided in the repository.
  • The remaining arguments are two lists of .json files. They contain the training and validation data extracted in Step 1.
  • After learning the bug detector, the command measures accuracy and recall w.r.t. seeded bugs and writes a list of potential bugs in the unmodified validation code (see poss_anomalies.txt).

B) Train a classifier for later use python3 python/BugLearn.py --pattern SwappedArgs --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --training_data calls_xx*.json

  • Optionally, pass --out some/dir to set the output directory for the trained model.

Note that learning a bug detector from the very small corpus of 50 programs will yield a classifier with low accuracy that is unlikely to be useful. To leverage the full power of DeepBugs, you'll need a larger code corpus, e.g., the JS150 corpus mentioned above.
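At its core, the classifier is a small feedforward network over vector representations of a code piece, trained to output the probability that the piece is buggy. A simplified NumPy sketch of that idea (DeepBugs' actual models are built with Keras and use different hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, hidden=8, lr=0.5, epochs=500):
    """Tiny one-hidden-layer binary classifier trained with full-batch gradient descent."""
    n, d = X.shape
    W1 = rng.normal(0, 0.5, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, hidden);      b2 = 0.0
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)            # hidden layer
        p = sigmoid(h @ W2 + b2)            # probability of "buggy"
        g = (p - y) / n                     # cross-entropy gradient at the output
        W2 -= lr * h.T @ g; b2 -= lr * g.sum()
        gh = np.outer(g, W2) * (1 - h**2)   # backprop through tanh
        W1 -= lr * X.T @ gh; b1 -= lr * gh.sum(axis=0)
    return W1, b1, W2, b2

def predict(params, X):
    W1, b1, W2, b2 = params
    return sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2)
```

Training data would pair vectors of unmodified code pieces (label 0) with vectors of mutated, bug-seeded pieces (label 1).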

Finding Bugs

Finding bugs in one or more source files consists of these two steps:

  1. Extract code pieces
  2. Use a trained classifier to identify bugs

Step 1: Extract code pieces

node javascript/extractFromJS.js calls --files <list of files>

  • <list of files> contains one or more files to be examined. Code pieces can be extracted from any JavaScript file (.js), given with a path relative to the main directory.
  • The command produces calls_*.json files, which contain data suitable for the SwappedArgs bug detector. For the other two bug detectors, replace calls with binOps in the above command.

Step 2: Use a trained classifier to identify bugs

python3 python/BugFind.py --pattern SwappedArgs --threshold 0.95 --model some/dir --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --testing_data calls_xx*.json

  • The first argument selects the bug pattern.
  • 0.95 is the threshold for reporting bugs; higher means fewer warnings of higher certainty.
  • --model sets the directory to load a trained model from.
  • The next three arguments are vector representations for tokens (here: identifiers and literals), for types, and for AST node types. These files are provided in the repository.
  • The remaining argument is a list of .json files. They contain the data extracted in Step 1.
  • The command examines every code piece and writes a list of potential bugs, each with its probability of being incorrect.
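The thresholding step itself is simple: only code pieces whose predicted probability of being buggy exceeds the threshold are reported. A minimal sketch (the pairing of code pieces with probabilities is illustrative, not BugFind.py's exact output format):

```python
def report_bugs(predictions, threshold=0.95):
    """Keep only code pieces whose bug probability exceeds the threshold,
    sorted so the most suspicious pieces come first."""
    warnings = [(piece, p) for piece, p in predictions if p > threshold]
    return sorted(warnings, key=lambda w: w[1], reverse=True)

# report_bugs([("setPoint(y,x)", 0.98), ("add(a,b)", 0.12)]) keeps only the first
```

Raising the threshold trades recall for precision, which is why fewer but more certain warnings are produced at 0.95 than at lower values.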

Embeddings for Identifiers

The above bug detectors rely on a vector representation for identifier names and literals. The easiest way to use our framework is with the shipped token_to_vector.json file. Alternatively, you can learn the embeddings via Word2Vec as follows:

  1. Extract identifiers and tokens:

node javascript/extractFromJS.js tokens --parallel 4 data/js/programs_50_training.txt data/js/programs_50

  • The command produces tokens_*.json files.
  2. Encode identifiers and literals with context into arrays of numbers (for faster reading during learning):

python3 python/TokensToTopTokens.py tokens_*.json

  • The arguments are the just created files.
  • The command produces encoded_tokens_*.json files and a file token_to_number_*.json that assigns a number to each identifier and literal.
  3. Learn embeddings for identifiers and literals:

python3 python/EmbeddingLearnerWord2Vec.py token_to_number_*.json encoded_tokens_*.json

  • The arguments are the just created files.
  • The command produces a file token_to_vector_*.json.
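The token-to-number step keeps only the most frequent identifiers and literals and maps everything else to a shared unknown token. A minimal sketch of that idea, assuming a vocabulary cap and an "UNK" convention (not necessarily TokensToTopTokens.py's exact behavior):

```python
from collections import Counter

UNK = "UNK"

def build_vocabulary(token_sequences, max_tokens=10000):
    """Assign a number to each of the most frequent tokens; all others map to UNK."""
    counts = Counter(t for seq in token_sequences for t in seq)
    top = [t for t, _ in counts.most_common(max_tokens)]
    token_to_number = {t: i for i, t in enumerate(top)}
    token_to_number.setdefault(UNK, len(token_to_number))
    return token_to_number

def encode(seq, token_to_number):
    """Replace each token with its number, using the UNK id for rare tokens."""
    unk = token_to_number[UNK]
    return [token_to_number.get(t, unk) for t in seq]
```

The resulting number sequences are what the Word2Vec-style embedding learner consumes.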
