DeepSVP is a computational method to prioritize structural variants (SVs) involved in genetic diseases by combining genomic information with information about gene functions. We incorporate phenotypes linked to genes, functions of gene products, gene expression in individual cell types, and anatomical sites of expression. DeepSVP systematically relates these features to their phenotypic consequences through ontologies and machine learning.
We train and evaluate our method using human SVs collected from the dbVar database.
We integrated annotations from the following sources:
- Gene Ontology (GO)
- Uber-anatomy Ontology (UBERON)
- Mammalian Phenotype Ontology (MP)
- Human Phenotype Ontology (HPO)
The ontology embeddings are generated with DL2vec, which converts different types of Description Logic axioms into a graph representation and then generates an embedding for each node and edge type.
We collected genomic features using the public tool AnnotSV (v2.2).
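The graph conversion described above can be illustrated with a toy example. This is a minimal sketch and not the DL2vec implementation: the axioms, the class names, and the `axioms_to_graph` and `random_walk` helpers are all hypothetical, and it assumes that the resulting sequences of node and edge labels are later passed to an embedding model (not shown).

```python
import random
from collections import defaultdict

# Hypothetical toy axioms written as (source, relation, target).
# "subClassOf" stands for a plain subclass axiom; any other relation name
# stands for an existential restriction (source SubClassOf: relation some target).
AXIOMS = [
    ("A", "subClassOf", "B"),
    ("A", "part_of", "C"),
    ("gene1", "has_function", "A"),
]


def axioms_to_graph(axioms):
    """Turn each axiom into a labelled directed edge: source -> (relation, target)."""
    graph = defaultdict(list)
    for source, relation, target in axioms:
        graph[source].append((relation, target))
    return graph


def random_walk(graph, start, length, rng):
    """Walk up to `length` edges, recording nodes and edge labels as tokens."""
    walk = [start]
    node = start
    for _ in range(length):
        edges = graph.get(node)  # .get avoids creating empty entries
        if not edges:
            break  # dead end: no outgoing edges
        relation, node = rng.choice(edges)
        walk.extend([relation, node])
    return walk


graph = axioms_to_graph(AXIOMS)
rng = random.Random(0)
walks = [random_walk(graph, "gene1", length=2, rng=rng) for _ in range(3)]
for walk in walks:
    print(" ".join(walk))
```

Because edge labels appear in the walks alongside node names, an embedding model trained on these token sequences can assign a vector to each node and each edge type.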
Using pip version 20.3.1:
pip install deepsvp
Or create a dedicated Conda environment (e.g. named "deepsvp-py38-pip2031"):
conda create -n deepsvp-py38-pip2031 python=3.8 pip=20.3.1
conda activate deepsvp-py38-pip2031
pip3 install deepsvp
pip3 install networkx
pip3 install torch
pip3 list
conda deactivate
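After installation, a quick check can confirm that the pieces installed above are visible from the active environment. The following is a minimal sketch: the `check_environment` helper is hypothetical and only tests whether the `networkx` and `torch` modules and the `deepsvp` command can be located, not whether they work correctly.

```python
import importlib.util
import shutil


def check_environment(modules=("networkx", "torch"), command="deepsvp"):
    """Report whether the given modules and the CLI command can be found."""
    status = {
        name: importlib.util.find_spec(name) is not None for name in modules
    }
    # shutil.which searches the PATH of the current environment.
    status[command + " (command)"] = shutil.which(command) is not None
    return status


for name, found in check_environment().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Run it inside the activated Conda environment so that the lookup uses that environment's packages and PATH.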
- Download the data files and place the uncompressed contents in a folder named "data":
mkdir DeepSVP/ ;# /path_of_your_DeepSVP_repository/
cd DeepSVP
wget "https://bio2vec.cbrc.kaust.edu.sa/data/DeepSVP/data.zip"
unzip data.zip
cd data ;# /path_of_your_DeepSVP_data_repository/
wget "https://bio2vec.cbrc.kaust.edu.sa/data/DeepSVP/experiments.zip" # can be very long
unzip experiments.zip
- Download and install the required AnnotSV (v2.3) tool in the "data" folder:
cd /path_of_your_DeepSVP_data_repository/
git clone git@github.com:lgmgeo/AnnotSV.git --branch v2.3
cd AnnotSV/
make PREFIX=. install
make DESTDIR= PREFIX=. install-human-annotation
cd ..
- Add genomic features to your VCF input file (/path_and_name_of_your_vcf_input_file/) using AnnotSV (v2.3):
e.g. /path_and_name_of_your_vcf_input_file/ = ./input.vcf
e.g. /path_and_name_of_your_annotsv_output_file/ = ./data/output.annotsv.annotated.tsv
export ANNOTSV=/path_of_your_DeepSVP_data_repository/AnnotSV
$ANNOTSV/bin/AnnotSV -SVinputFile ./input.vcf -genomeBuild GRCh38 -outputFile ./data/output.annotsv.annotated.tsv
The annotated output file (./data/output.annotsv.annotated.tsv) should be placed in the data folder (/path_of_your_DeepSVP_data_repository/).
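Before running DeepSVP, it can help to confirm that the AnnotSV output is in place and not empty. This is a minimal sketch that assumes only that the output is a tab-separated file with a header row; the `summarize_tsv` helper is hypothetical and makes no assumption about specific column names.

```python
import csv
from pathlib import Path


def summarize_tsv(path):
    """Return (column_names, row_count) for a tab-separated file with a header."""
    with Path(path).open(newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])  # empty list if the file has no lines
        row_count = sum(1 for _ in reader)
    return header, row_count


# Path taken from the example above; adjust it to your own output file.
annotated = Path("data/output.annotsv.annotated.tsv")
if annotated.exists():
    columns, rows = summarize_tsv(annotated)
    print(f"{len(columns)} columns, {rows} annotated rows")
else:
    print(f"{annotated} not found")
```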
- Run the command
deepsvp --help
to display help and parameters:
Usage: deepsvp [OPTIONS]

  DeepSVP: A phenotype-based tool to prioritize caustive CNV using WGS data
  and Phenotype/Gene Functional Similarity

Options:
  -d, --data-root TEXT      Data root folder  [required]
  -i, --in-file TEXT        Annotated Input file  [required]
  -p, --hpo TEXT            List of phenotype ids separated by commas
                            [required]
  -maf, --maf_filter FLOAT  Allele frequency filter using gnomAD and 1000G
                            default<=0.01
  -m, --model_type TEXT     Ontology model, one of the following (go , mp ,
                            hp, cl, uberon, union), default=mp
  -ag, --aggregation TEXT   Aggregation method for the genes within CNV (max
                            or mean) default=max
  -o, --outfile TEXT        Output result file
  --help                    Show this message and exit.
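The `-maf` and `-ag` options in the help text above can be pictured with a small example. This is a minimal sketch under stated assumptions, not DeepSVP's code: the `passes_maf_filter` and `aggregate_gene_scores` helpers and the example scores are hypothetical, and how DeepSVP treats variants with no recorded allele frequency is not specified here.

```python
from statistics import mean


def passes_maf_filter(allele_frequency, threshold=0.01):
    """Keep a variant when its allele frequency is at or below the threshold."""
    return allele_frequency <= threshold


def aggregate_gene_scores(scores, method="max"):
    """Combine the per-gene scores of one CNV into a single value."""
    if not scores:
        raise ValueError("a CNV needs at least one gene score")
    if method == "max":
        return max(scores)
    if method == "mean":
        return mean(scores)
    raise ValueError(f"unknown aggregation method: {method!r}")


# Hypothetical phenotype-similarity scores for three genes in one CNV.
gene_scores = [0.25, 0.75, 0.5]
print(aggregate_gene_scores(gene_scores, "max"))
print(aggregate_gene_scores(gene_scores, "mean"))
print(passes_maf_filter(0.002), passes_maf_filter(0.05))
```

With `max`, a CNV is scored by its single most relevant gene; with `mean`, every gene in the CNV contributes equally, so large CNVs that contain many unrelated genes tend to score lower.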
- Run the example (with your own HPO terms):
deepsvp -d data/ -i output.annotsv.annotated.tsv -p HP:0003701,HP:0001324,HP:0010628,HP:0003388,HP:0000774,HP:0002093,HP:0000508,HP:0000218 -m cl -maf 0.01 -ag max -o example_output.txt
Or run the example with the deepsvp-py38-pip2031 Conda Environment:
conda activate deepsvp-py38-pip2031
deepsvp -d data/ -i output.annotsv.annotated.tsv -p HP:0003701,HP:0001324,HP:0010628,HP:0003388,HP:0000774,HP:0002093,HP:0000508,HP:0000218 -m cl -maf 0.01 -ag max -o example_output.txt
conda deactivate
Or, to use cwl-runner, edit the input file entry in the example deepsvp.yaml file and then run:
cwl-runner deepsvp.cwl deepsvp.yaml
|======== | 25% Reading the input phenotypes...
|================ | 50% Phenotype prediction...
|======================== | 75% CNV Prediction...
|================================| 100% DONE! You can find the prediction results in the output file: example_output.txt
The script outputs a ranking score for each candidate causative CNV.
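When the example command is run for many patients, assembling it programmatically avoids typos in the comma-separated phenotype list. The following is a minimal sketch: the `build_deepsvp_command` helper is hypothetical, it uses only the options listed in the help text, and it assumes HPO identifiers have the form `HP:` followed by seven digits.

```python
import re

HPO_ID = re.compile(r"^HP:\d{7}$")


def build_deepsvp_command(data_root, in_file, hpo_terms, model_type="mp",
                          maf=0.01, aggregation="max",
                          outfile="example_output.txt"):
    """Return the deepsvp command as an argument list, after validating HPO ids."""
    invalid = [term for term in hpo_terms if not HPO_ID.match(term)]
    if invalid:
        raise ValueError(f"malformed HPO ids: {invalid}")
    return [
        "deepsvp",
        "-d", data_root,
        "-i", in_file,
        "-p", ",".join(hpo_terms),
        "-m", model_type,
        "-maf", str(maf),
        "-ag", aggregation,
        "-o", outfile,
    ]


command = build_deepsvp_command(
    "data/", "output.annotsv.annotated.tsv",
    ["HP:0003701", "HP:0001324"], model_type="cl",
)
print(" ".join(command))
# To execute it: import subprocess; subprocess.run(command, check=True)
```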
- Details for predicting pathogenic variants and comparison with other methods can be found in the experiment folder.
- annotations.sh: script used to annotate the variants.
- data_preprocessing.py: preprocesses the annotations and features.
- pheno_model.py: script to get the DL2vec score using the trained model.
- deepsvp_training.py: script to train and test the model, with hyperparameter optimization.
- BWA_GATK.sh: script to run the GATK workflow on the input FASTQ files for the real samples; run on the KAUST Supercomputing IBEX cluster.
- run_Manta.sh: script to generate a VCF with the structural variants (SVs); we used Manta to identify the candidate SVs. Run on the KAUST Supercomputing IBEX cluster.
For any questions or comments please contact: azza.althagafi@kaust.edu.sa