Here are scripts and data used for the paper Assessment of Variant Effect Predictors Unveils Variants Difficulty as a Critical Performance Indicator.
All data generated for the benchmark are available in the Data
folder. This folder comprises two folders:
- difficulty
This folder contains variants classified either as Easy, Moderate or Hard according to their error_rate, we also provide solvent accessibility for each of them when available.
- predictions
This folder contains 65 predictions scores and predicted labels for each dataset. There are the raw data and the processed ones. Data processing consisted in removing variant potentially biased for the evaluation. See Figure 1 of the paper.
Provided pipeline allows you to retrieve 65 predictions from Variant Effect Predictors (VEPs) by using dbNSFP database, precomputed predictions or scripts provided by authors for concerned VEPs.
To retrieve predictions, you will need to download each databases required by running the script download_databases.sh
:
bash scripts/download_databases.sh
- This will download precomputed predictions for: CAPICE, CPT, DeepSAV, Envision, InMeRF, LASSIE, MISTIC, UNEECON, VARITY, VESPA and dbNSFP4.7a.
- The dbNSFP4.4a database contains precomputed predictions for: AlphaMissense, EVE, ESM1b, SIFT, SIFT4G, Polyphen2_HDIV, Polyphen2_HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, PROVEAN, VEST4, MetaSVM, MetaLR, MetaRNN, M-CAP, REVEL, MutPred, MVP, gMVP, MPC, PrimateAI, DEOGEN2, BayesDel_addAF, BayesDel_noAF, ClinPred, LIST-S2, VARITY_R, VARITY_ER, VARITY_R_LOO, VARITY_ER_LOO, DANN, fathmm-MKL_coding, fathmm-XF_coding, GenoCanyon, integrated_fitCons, GM12878_fitCons, H1-hESC_fitCons, HUVEC_fitCons, CADD, Eigen-raw_coding, Eigen-PC-raw_coding, GERP++, phastCons100way_vertebrate, phastCons470way_mammalian, phastCons17way_primate, SiPhy_29way_logOdds, and bStatistic.
- Approximately 80 GB of disk space is needed for the download of databases.
- Databases are indexed to speed up the retrieving of predictions.
If you are familiar with docker, you can use the docker image provided within this repo to download every necessary packages. To build the image, run:
docker build -t vep . # vep is just the name of the image built, put whatever you want
Then, you can direcly run the pipeline script with this example command line (also present in the docker_run.sh script). For example, if the variant file is in the time_check
folder in this repo :
docker run --rm -it -e LOCAL_UID=$(id -u $USER) -e LOCAL_GID=$(id -g $USER) -v $(pwd):/data vep -f /data/time_check/clinvar_10.tsv -m panno -g 38 -n clinvar_10_final -d
Be aware to use the -d
option, indicating that the main script is running through a docker image.
If your variant file is in the same folder as the repo (in the $(pwd)
, please add /data/
before the path of the variant file, since the repo will be mounted in the docker image. If the file is not in the repo, you need to use the following command line :
docker run --rm -it -e LOCAL_UID=$(id -u $USER) -e LOCAL_GID=$(id -g $USER) -v $(pwd):/data -v /path/to/external_folder:/ext vep -f /ext/time_check/clinvar_10.tsv -m panno -g 38 -n clinvar_10_final -d
Changed options are -v external_folder:/ext vep -f /ext/time_check/clinvar_10.tsv
, which create another mount within the docker image corresponding to the folder containing the variant file.
Usage of the pipeline script is described below.
If you are not using Docker, you have to set up conda environments.
Two Conda environments are required:
- Main Environment: Set up using the
environment.yml
file:
conda env create -f envs/environment.yml
- Secondary Environment: Use this environment specifically for retrieving predictions from PhD-SNPg, as it requires Python 2 compatibility:
conda create -n phdsnp python=2 scipy
Tabix is used to create index files for large database of genomic positions, which significantly speed up data retrieving. It must be installed using apt-get install
:
sudo apt-get install tabix
Be aware that the first run of the pipeline script will be use to configure downloaded databases (the majority of them are already configured, those that are downloaded from our Zenodo repo). This configuration may take some time.
- panno mode
If your input file resembles the following format
B3GALT6 P36T Benign
ATAD3A R218W Pathogenic
ESPN R113C Benign
CAMTA1 I589V Benign
CAMTA1 R1135W Pathogenic
RERE H1272R Benign
ATP13A2 R460Q Benign
ATP13A2 R76Q Benign
SDHB R230C Pathogenic
SDHB S163P Benign
Then you want to use the panno (protein annotations) mode of the pipeline by running
conda activate VEP
bash scripts/all_pipeline_light.sh -f variant_file.tsv -m panno -g 38
- ganno mode
Otherwise, if your data are in the following format
7 143342009 C T Pathogenic
3 25733901 G A Pathogenic
5 176891149 G A Benign
13 20189344 G A Pathogenic
4 39215732 A T Benign
4 154565832 C T Pathogenic
17 41725075 T C Benign
16 78099348 C T Benign
2 54649628 G A Benign
1 167431690 G A Benign
Then you want to use the ganno (genomic annotations) mode of the pipeline by running
bash scripts/all_pipeline_light.sh -f variant_file.tsv -m ganno -g 38
If genomic position are from GR37 reference genome, use instead
bash scripts/all_pipeline_light.sh -f variant_file.tsv -m ganno -g 37
The pipeline will generate two main folders:
- input_files: Contains all input files necessary to properly extract predictions.
- predictions: Each subfolder corresponds to a specific VEP and contains prediction files.
.
├── input_files
│ └── ESM1v
└── predictions
├── AlphaMissense
├── CAPICE
├── CPT
├── dbNSFP
├── DeepSAV
├── Envision
├── ESM1b
├── ESM1v
├── EVE
├── InMeRF
├── LASSIE
├── MISTIC
├── MutFormer
├── MutScore
├── PhDSNP
├── PONP2
├── SIGMA
├── SuSPect
├── UNEECON
├── VARITY
└── VESPA
If you have suggestions for optimizing the pipeline or would like us to add additional databases, please reach out to us by posting on GitHub or sending an email to radja.ragou@gmail.com.
paper