A reproducible DVC pipeline from the notebook Detecting Algorithmically Generated Domains

The aim of this exercise, which is the lab assignment for the Social Computing course, is to create a reproducible pipeline from the notebook Detecting Algorithmically Generated Domains using DVC, an open-source version control system for data science and machine learning projects.

The notebook, part of the Data Hacking collection on security-oriented data analysis, deals with the classification of legitimate domains versus domains generated by a DGA (Domain Generation Algorithm).

The following sections describe the process followed to create the pipeline starting from the notebook. If you are not interested in this, you can skip directly to the setup instructions.

Process

πŸ—ƒοΈ Raw datasets download

After downloading and studying the content of the notebook, the raw datasets were downloaded from the Git repository associated with the notebook. After initializing DVC (with the command dvc init), they were copied into the folder data/raw and tracked using the dvc add command.

Three datasets are used:

  • alexa_1M: contains a list of the top domain names from Alexa
  • dga_domains: contains a list of DGA domain names
  • words.txt: a list of words taken from the dictionary

Notice that the original repository contains two different Alexa datasets: alexa_100k and alexa_1M. The former contains a list of the top 100k domains from Alexa, while the latter contains a list of the top 1M domains. For this exercise, the alexa_1M dataset was chosen: its larger size makes it more interesting to handle (for example, it is distributed as a compressed .zip file, so an additional decompression stage can be added to the pipeline). The original notebook, on the other hand, uses the smaller dataset, but just for speed reasons.
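
For reference, a minimal sketch of the tracking commands (the directory-level dvc add is inferred from the data\raw.dvc file that appears as the root node of the DAG shown later):

# initialize DVC inside the Git repository
dvc init

# track the whole folder of raw datasets with DVC
dvc add data/raw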

📒 Notebook code cells to Python modules

First of all, the code cells of the notebook have been converted into a single Python script using the nbconvert tool with the following command:

jupyter nbconvert --to script DGA_Domain_Detection.ipynb

Following this, the single Python script has been analyzed in detail to identify the main stages of the data science experiment. In particular, four main stages have been identified:

  1. Data preparation
  2. Feature engineering
  3. Training
  4. Evaluation

These stages are further divided into sub-stages. For instance, the data preparation stage contains a sub-stage that prepares the Alexa dataset, one that prepares the DGA dataset, and so on.

Then, the single script has been divided into several Python modules, each of which contains the code to execute one stage of the pipeline. During this process, as much of the original code as possible has been preserved, but in some cases it has been necessary to correct syntactical errors and to update/adapt the code to make it work with the current versions of the libraries used. Moreover, all the code that just prints strings to the standard output has been removed.

It is also important to mention that some parts of the notebook (e.g., the RandomForestClassifier class) return results based on a pseudo-random number generator. Hence, in contrast with the original notebook, a random seed has been fixed in order to achieve deterministic behavior and reproducible results.
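
As an illustration, fixing the seed boils down to something like the following sketch (the variable names are hypothetical, while the seed and n_estimators values match the params.yaml file shown in the next section):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

SEED = 0  # same value as the seed entries in params.yaml

# make NumPy-based shuffling and splitting deterministic
np.random.seed(SEED)

# fix the randomness of the Random Forest as well
clf = RandomForestClassifier(n_estimators=20, random_state=SEED)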

The modules extracted from the notebook are in the src folder of this repository, organized in multiple sub-folders according to the main stage they perform: preparation, feature, models and evaluation. Each module takes as input a set of arguments that specify the input files, and produces one or more outputs. Except for the evaluation module, all the modules output .pkl files using the Python pickle library.
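
A hypothetical skeleton of one of these modules (the actual function and argument names in src/ may differ):

import argparse
import pickle

import pandas as pd


def prepare(input_path):
    # illustrative body: read the raw file and drop blank lines,
    # as the prepare-* stages described later do
    with open(input_path) as f:
        domains = [line.strip() for line in f if line.strip()]
    return pd.DataFrame({"domain": domains})


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("input", help="path of the raw input file")
    parser.add_argument("output", help="path of the .pkl file to produce")
    args = parser.parse_args()

    # serialize the prepared dataset with pickle, as the real modules do
    with open(args.output, "wb") as f:
        pickle.dump(prepare(args.input), f)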

⚙️ Definition of the parameters

Some of the functions defined in the Python modules have specific parameters and hyper-parameters that can be changed to alter the experiment and the output results.

In order to track these parameters and to allow the comparison of multiple runs, the file params.yaml has been added: a YAML file containing the parameters for the modules, organized in a hierarchy. Its content is reported below. Notice the three top-level keys preparation, features and models, which correspond to some of the previously mentioned stages.

preparation:
  seed: 0

features:
  alexa_vectorization:
    range_low: 3
    range_high: 5
    min_df: 0.0001
    max_df: 1.0
    
  words_vectorization:
    range_low: 3
    range_high: 5
    min_df: 0.00001
    max_df: 1.0

models:
  n_estimators: 20
  seed: 0
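
A module can then read its own section of the file with the PyYAML API, which the pyaml dependency provides (a minimal sketch; the real modules may load the parameters differently):

import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)

n_estimators = params["models"]["n_estimators"]  # 20
seed = params["models"]["seed"]                  # 0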

🔁 Creation of the reproducible pipeline

Each stage of the pipeline has been added using the dvc run command, specifying the input dependencies, the outputs and the command to execute. Except for the first stage (extract-alexa), whose command is unzip, all the other stages execute one of the Python modules extracted from the code in the notebook (as described in the previous paragraph).

In addition to the parameters -d and -o, used to track the input dependencies and the outputs of each stage, two other parameters have been used: -p and --plots. The first specifies one or more parameters the stage depends on (so that the stage is re-executed when those parameters change); the second defines a special output produced by the stage, namely a plot metrics file (more on this in the Plot metrics section).
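
For example, the train-model stage may have been created with a command along these lines (the module path and its command-line arguments are assumptions, while the dependency and output paths match the data layout below):

dvc run -n train-model \
        -d src/models/train_model.py \
        -d data/build-features/training_set.pkl \
        -p models.n_estimators,models.seed \
        -o data/train-model/trained_model.pkl \
        python src/models/train_model.py \
               data/build-features/training_set.pkl \
               data/train-model/trained_model.pkl

The evaluate stage analogously declares its plot output with --plots data/evaluate/classes.csv instead of -o.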

The outputs of each stage are stored inside the data folder, organized in multiple sub-folders, one per stage. The structure of the data folder is reported below.

$ tree
.
├── build-features
│   ├── test_set.pkl
│   └── training_set.pkl
├── evaluate
│   └── classes.csv
├── extract-alexa
│   └── alexa_1M.csv
├── merge-test
│   └── merged_test_set.pkl
├── merge-training
│   └── merged_training_set.pkl
├── prepare-alexa
│   └── alexa_prepared.pkl
├── prepare-dga
│   └── dga_prepared.pkl
├── prepare-words
│   └── words_prepared.pkl
├── raw
│   ├── alexa_1M.zip
│   ├── dga_domains.txt
│   └── words.txt
├── split-alexa
│   ├── alexa_test.pkl
│   └── alexa_train.pkl
├── split-dga
│   ├── dga_test.pkl
│   └── dga_train.pkl
└── train-model
    └── trained_model.pkl

High-level description of the stages

The following table reports a high-level description of each stage in the pipeline. To look more in depth at the stages, open the file dvc.yaml, where all the stages that form the pipeline are specified, along with their commands, input dependencies, outputs, etc.

Name            Description
extract-alexa   Extracts the .zip file that contains the Alexa dataset.
prepare-alexa   Prepares the Alexa dataset: removes blank lines, sets the class of the domains, etc.
prepare-dga     Prepares the DGA domains dataset: removes blank lines, sets the class of the domains, etc.
prepare-words   Prepares the words dataset: removes duplicate words, lowers the case, etc.
split-alexa     Splits the prepared Alexa dataset into a training set (90%) and a test set (10%).
split-dga       Splits the prepared DGA dataset into a training set (90%) and a test set (10%).
merge-training  Merges the two training sets (Alexa and DGA) into a single one.
merge-test      Merges the two test sets (Alexa and DGA) into a single one.
build-features  Builds the additional features used to train the model: it adds the length and the entropy of each domain, and performs the vectorization (see the entropy sketch after this table).
train-model     Trains a Random Forest classifier model.
evaluate        Evaluates the performance of the trained model using the test set.
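
As an illustration of one of the built features, the entropy of a domain is the Shannon entropy of its character distribution. A minimal sketch (the actual implementation in the notebook may differ):

import math
from collections import Counter


def entropy(domain):
    # Shannon entropy of the character distribution: lower for strings
    # with many repeated characters, higher for random-looking ones
    counts = Counter(domain)
    total = len(domain)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


entropy("google")        # repeated characters, lower entropy (about 1.92 bits)
entropy("qx7zj2kfpa9w")  # hypothetical DGA-like string, higher entropy (about 3.58 bits)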

Directed acyclic graph (DAG) of the pipeline

DVC represents the pipeline as a directed acyclic graph whose nodes are the stages. The graph for this exercise, generated with the dvc dag command, is reported below:

$ dvc dag
                                   +--------------+
                                   | data\raw.dvc |***
                               ****+--------------+   *******
                           ****                  **          *******
                      *****                        **               *******
                   ***                               **                    *******
     +---------------+                                 **                         ****
     | extract-alexa |                                  *                            *
     +---------------+                                  *                            *
              *                                         *                            *
              *                                         *                            *
              *                                         *                            *
     +---------------+                          +-------------+                      *
     | prepare-alexa |                          | prepare-dga |                      *
     +---------------+                          +-------------+                      *
      **            **                                  *                            *
    **                **                                *                            *
  **                    **                              *                            *
**                  +-------------+              +-----------+                       *
*                   | split-alexa |              | split-dga |                       *
*                   +-------------+*****    *****+-----------+                       *
*                           *           ****            *                            *
*                           *      *****    *****       *                            *
*                           *   ***              ***    *                            *
****                +------------+             +----------------+            +---------------+
    *******         | merge-test |             | merge-training |          **| prepare-words |
           ******** +------------+             +----------------+   *******  +---------------+
                   *******        ***            **          *******
                          *******    **        **     *******
                                 ****  **    **   ****
                                  +----------------+
                                  | build-features |
                                  +----------------+
                                   **            **
                                 **                **
                               **                    **
                      +-------------+                  **
                      | train-model |                **
                      +-------------+              **
                                   **            **
                                     **        **
                                       **    **
                                     +----------+
                                     | evaluate |
                                     +----------+

📊 Plot metrics

The last stage of the pipeline (evaluate) outputs a CSV file which contains a row for each of the test domains and two columns: class and pred. The former is the actual class of the domain (legit or dga), the latter is the class predicted by the trained model.
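
For illustration, the file looks like this (the rows shown here are made up; the actual contents depend on the test set and the trained model):

class,pred
legit,legit
legit,legit
dga,dga
dga,legit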

Using the dvc plots command, it is possible to generate a confusion matrix to visualize the performance of the trained model (the exact command is reported in the Plot confusion matrix section).

The confusion matrix generated after the training and evaluation stages is reported below.

Setup

The following instructions explain how to get a copy of the project on the local machine and set it up for development and testing purposes.

The project (and the instructions that follow) has been developed and tested on Windows 10, using Git Bash for Windows, Python 3.9.1 and DVC 1.11.8.

⚠ These instructions assume that Python 3.6+ and DVC are already installed on the local machine. If they are not, please install them before executing the following commands.

❗ Important warning for Windows users ❗

The stage extract-alexa depends on the unzip command, whose availability is platform specific: it is not available out of the box in the Windows command prompt or in PowerShell. To make sure that the pipeline is reproducible, it is strongly recommended to execute the following instructions in a bash shell emulator (such as Git Bash or Cmder) or in the Windows Subsystem for Linux.

Download

To download the content of this repository on the local machine, simply execute the commands:

git clone https://github.com/davidelofrese/dvc-dga-notebook.git
cd dvc-dga-notebook

Configuration

Before reproducing the pipeline, create a Python virtual environment:

python -m venv .env

Then, activate it:

  • Windows
source .env/Scripts/activate
  • POSIX
source .env/bin/activate

Next, install the dependencies listed in the requirements.txt file:

pip install -r requirements.txt

Data download

This exercise includes a preconfigured Google Drive remote storage that contains the raw datasets and the outputs of each stage of the pipeline.

$ dvc remote list
gdrive  gdrive://14uIMjkjSUbisQ-quCbMhLUCU5f6DYo94

Before reproducing the pipeline, download the tracked files from the remote storage onto the local machine by running the command:

dvc pull

⚠ On the first execution of the pull command, DVC will generate a URL to authorize access to Google Drive. Open the URL, sign in to a Google account and grant DVC the necessary permissions; this produces a verification code needed to complete the connection. Additional information on the authorization process is available in the DVC documentation.

Run

To reproduce the entire pipeline, run the command:

dvc repro

⚠ Warning: if the pipeline is reproduced right after a dvc pull command, without any change to the stages' dependencies or parameters, DVC will produce the following output:

$ dvc repro
'data\raw.dvc' didn't change, skipping
Stage 'prepare-words' didn't change, skipping
Stage 'extract-alexa' didn't change, skipping
Stage 'prepare-alexa' didn't change, skipping
Stage 'split-alexa' didn't change, skipping
Stage 'prepare-dga' didn't change, skipping
Stage 'split-dga' didn't change, skipping
Stage 'merge-test' didn't change, skipping
Stage 'merge-training' didn't change, skipping
Stage 'build-features' didn't change, skipping
Stage 'train-model' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.

Hence, to reproduce the pipeline even if no changes were made, use the --force parameter to force the execution:

dvc repro --force

Otherwise, change some parameters (in the params.yaml file) or some dependencies, then run the command dvc repro (without parameters) to reproduce the pipeline.

Plot confusion matrix

After reproducing the pipeline, it is possible to generate the confusion matrix from the output of the evaluation stage with the following command:

dvc plots show data/evaluate/classes.csv --template confusion -x pred -y class

This command generates an HTML file (plots.html) which can be opened with a web browser to visualize the confusion matrix.

Resources & Libraries

  • pandas - A data analysis and manipulation tool
  • numpy - Adds support for large, multi-dimensional arrays and matrices
  • tldextract - A tool to separate the TLD from the registered domain and subdomains of a URL
  • scikit-learn - A set of tools for predictive data analysis
  • pyaml - A module to read and write YAML-serialized data

Authors

License

This project is licensed under the MIT License - see the LICENSE file for details.
