USAGE

Instructions for running these experiments. For more information about how it works, click here.

TABLE OF CONTENTS

  • Getting Started
  • Dependencies
  • Install
  • Get Data
  • Building a vg graph
  • Reads Simulation
  • Models
  • Notebooks
  • py_scripts
  • scripts

Getting Started

This repo contains three folders that are part of our tools for this project.

  • notebooks: Jupyter notebooks with the data processing pipeline, Keras models, and analysis.
  • py_scripts: Python scripts for the model training pipeline, Keras models, data generation, shell script generation, and stats generation.
  • scripts: automation scripts for data generation, data labeling, and the vowpal_wabbit models inside vg.

Dependencies

  • vg
  • Jupyter Notebook
  • SciPy
  • Keras
  • R

Install

git clone https://github.com/binarySequoia/vg_recal.git

Get Data

cd vg_recal
wget -r -np -nH --cut-dirs=3 -R "index.html*" http://public.gi.ucsc.edu/~anovak/outbox/jeffrey/data/

Building a vg graph

vg construct -r [fasta file] -v [vcf file] > vg_graph.vg

We use vg graphs in the xg/gcsa index pair format. To create these indexes, run the following:

# store the graph in the xg/gcsa index pair
vg index -x vg_graph.xg -g vg_graph.gcsa -k 16 vg_graph.vg

If you want to know more about vg functionality, please go to vg.

NOTE: The data folder already contains the xg and gcsa files you need.
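As an optional sanity check of a freshly constructed graph, you can print its size before indexing. This is a minimal sketch, assuming your vg build supports vg stats -z and that vg_graph.vg is the file produced by the construct step above.

# Optional: print node and edge counts of the constructed graph
vg stats -z vg_graph.vg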

Reads Simulation

Training and testing data generation

To generate data we are using vg sim.

vg sim -x [xg file] -n [data size] -l [read size] \
       -i [indel rate] --sub-rate [base substitution rate] --random-seed 1 -a > simulated.gam

To create our training and testing reads, we run:

# Generate Training data
vg sim -x data/genome_data/snp1kg-CHR21.xg -n 1000000 -l 100 -i 0.001 --sub-rate 0.01 --random-seed 1 -a > train_sim_len100.gam
# Generate Testing data
vg sim -x data/genome_data/snp1kg-CHR21.xg -n 1000000 -l 100 -i 0.001 --sub-rate 0.01 --random-seed 42 -a > test_sim_len100.gam

These commands create 1,000,000 reads each for training and testing. Each read is 100 base pairs long, with a substitution rate of 0.01 and an indel rate of 0.001.
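If you want to verify how many reads were simulated, one option is to count the JSON records emitted by vg view's GAM-to-JSON mode (-a); this is a quick check, not part of the pipeline.

# Optional: count simulated reads (vg view -a prints one JSON record per alignment)
vg view -a train_sim_len100.gam | wc -l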

Label training and testing data

After running vg sim we have simulated reads that are similar to some region of the genome graph, but not identical to it. To label a read, we map it back to the graph and check whether it maps to the place it was simulated from. If a read maps to the same place, we label it as correct; otherwise we label it as incorrect.

To do that, we use vg map and vg annotate.

vg map -x [xg file] -g [gcsa file] -G simulated.gam | vg annotate -x [xg file]  -a - -p > mapped.gam

To map our training and testing reads, we run:

# Mapping Training data
vg map -x data/genome_data/snp1kg-CHR21.xg -g data/genome_data/snp1kg-CHR21.gcsa \
     -G train_sim_len100.gam | vg annotate -x data/genome_data/snp1kg-CHR21.xg -a - -p > train_mapped_len100.gam
# Mapping Testing data
vg map -x data/genome_data/snp1kg-CHR21.xg -g data/genome_data/snp1kg-CHR21.gcsa \
     -G test_sim_len100.gam | vg annotate -x data/genome_data/snp1kg-CHR21.xg -a - -p > test_mapped_len100.gam

Once we have these mappings, we can label them.

vg gamcompare -r [distance threshold] mapped.gam simulated.gam > labeled.gam

The distance threshold is the distance within which a read is considered correctly mapped.

To label our data, we run:

# Label Train reads
vg gamcompare -r 100 train_mapped_len100.gam train_sim_len100.gam > train_compared_len100.gam
# Label Test reads
vg gamcompare -r 100 test_mapped_len100.gam test_sim_len100.gam > test_compared_len100.gam
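To spot-check the labels, you can dump one labeled alignment as JSON and look for the correctness flag that vg gamcompare adds. This is a quick inspection sketch, assuming jq is installed; the exact name of the flag field can differ between vg versions.

# Optional: inspect one labeled read as JSON (the gamcompare correctness flag
# appears among the alignment fields; its exact name may vary by vg version)
vg view -a train_compared_len100.gam | head -n 1 | jq .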

Models

At the beginning of this project we developed our models in C++ using vowpal-wabbit. Because we wanted to try more deep learning approaches, we moved to Python for fast prototyping, with the plan of loading a good model back into a C++ framework once we found one. In this repo you can find the sklearn and Keras implementations; the vowpal-wabbit implementation can be found in this special fork of vg.

| model | framework |
| --- | --- |
| Logistic Regression with mapping quality and metadata | vowpal-wabbit |
| Logistic Regression with bag of words | vowpal-wabbit |
| Logistic Regression with MEMs | vowpal-wabbit |
| Logistic Regression with MEM stats | vowpal-wabbit |
| Logistic Regression with mapping quality | sklearn |
| Logistic Regression with mapping quality and metadata | sklearn |
| Neural Network with mapping quality and metadata | keras |
| Neural Network with bag of words | keras |

vowpal wabbit model

Vowpal Wabbit models live in my special fork of vg. You can train them using vg recalibrate.

vg recalibrate

usage: vg recalibrate [options] --model learned.model mapped.gam > recalibrated.gam
       vg recalibrate [options] --model learned.model --train compared.gam

options:
    -T, --train              read the input GAM file, and use the mapped_correctly flags from vg gamcompare to train a model
    -m, --model FILE         load/save the model to/from the given file
    -t, --threads N          number of threads to use
    -b, --bow                bag of words as features
    -e, --mems               add mems as features
    -s, --memstats           add mems stats as features
    -o, --nomapq             remove mapping quality features

Train

vg recalibrate --model recal.model --train train_compared_len100.gam

Test

vg recalibrate has multiple options for choosing a model.

# Prediction for training data
vg recalibrate --model recal.model train_mapped_len100.gam > train_recalibrated.gam
# Prediction for testing data
vg recalibrate --model recal.model test_mapped_len100.gam > test_recalibrated.gam

Evaluate

# Stats for the recalibrated data
vg gamcompare -r 100 test_recalibrated.gam test_sim_len100.gam --tsv --aligner recal > stats_len100.tsv
# Stats for the original mapping
vg gamcompare -r 100 test_mapped_len100.gam test_sim_len100.gam --tsv --aligner orig | sed 1d >> stats_len100.tsv
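Before plotting, it can help to confirm that the combined stats file looks reasonable. This is just an optional check using standard coreutils.

# Optional: peek at the combined stats file and its row count
head -n 5 stats_len100.tsv
wc -l stats_len100.tsv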

Analyze Evaluation

Rscript vg/scripts/plot-qq.R stats_len100.tsv qq-plot_len100.svg
Rscript vg/scripts/plot-roc.R stats_len100.tsv roc-plot_len100.svg

other models

The sklearn and Keras models are in the notebooks and py_scripts folders. We have a pipeline that builds a stats.tsv for evaluating the Keras and sklearn models. After creating the stats.tsv, run the scripts above to analyze the performance of the models.

Notebooks

We have Jupyter notebooks that are part of our pipeline. If you want to know more, go to notebooks.

py_scripts

We have Python scripts that are part of our pipeline. Some of these scripts mirror the Jupyter notebooks but are written as standalone scripts so they can be run as background jobs. If you want to know more, go to py_scripts.

scripts

We have shell scripts that are part of our pipeline. Here you can find scripts for data generation and data labeling using vg. The logistic regression models using vowpal-wabbit were trained and tested with these scripts. If you want to know more, go to scripts.