diff --git a/.gitignore b/.gitignore index 5df7def..a1adfcd 100644 --- a/.gitignore +++ b/.gitignore @@ -7,6 +7,7 @@ __pycache__/ # Distribution / packaging .Python +#env/ build/ develop-eggs/ dist/ @@ -18,9 +19,18 @@ lib64/ parts/ sdist/ var/ +dask-worker-space/ *.egg-info/ .installed.cfg *.egg +*.err +*.out +*.db +*.py*.sh +*.tsv +*.csv +*.gz* + # PyInstaller # Usually these files are written by a python script from a template @@ -41,6 +51,7 @@ htmlcov/ nosetests.xml coverage.xml *,cover +*.pdf # Translations *.mo @@ -72,12 +83,13 @@ target/ .ipynb_checkpoints/ # exclude data from source control by default -# data/ -variant_annotation/data/ +/data/ +cagi*/ #snakemake .snakemake/ - +# data/ +variant_annotation/data/ # exclude test data used for development to_be_deleted/test_data/data/ref @@ -92,3 +104,4 @@ logs/ # .java/fonts dir get created when creating fastqc conda env .java/ +/.vscode/settings.json diff --git a/README.md b/README.md index e4e0f10..5af6ce5 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,140 @@ # DITTO -Diagnosis prediction tool using AI \ No newline at end of file +***!!! For research purposes only !!!*** + +- [DITTO](#ditto) + - [Data](#data) + - [Usage](#usage) + - [Installation](#installation) + - [Requirements](#requirements) + - [Activate conda environment](#activate-conda-environment) + - [Steps to run DITTO predictions](#steps-to-run-ditto-predictions) + - [Run VEP annotation](#run-vep-annotation) + - [Parse VEP annotations](#parse-vep-annotations) + - [Filter variants for Ditto prediction](#filter-variants-for-ditto-prediction) + - [DITTO prediction](#ditto-prediction) + - [Combine with Exomiser scores](#combine-with-exomiser-scores) + - [Cohort level analysis](#cohort-level-analysis) + - [Contact information](#contact-information) + +**Aim:** We aim to develop a pipeline for accurate and rapid prioritization of variants using patient’s genotype (VCF) and/or phenotype (HPO) information. + +## Data + +Input for this project is a single sample VCF file. This will be annotated using VEP and given to Ditto for predictions. + +## Usage + +### Installation + +Installation simply requires fetching the source code. Following are required: + +- Git + +To fetch source code, change in to directory of your choice and run: + +```sh +git clone -b master \ + --recurse-submodules \ + git@gitlab.rc.uab.edu:center-for-computational-genomics-and-data-science/sciops/ditto.git +``` + +### Requirements + +*OS:* + +Currently works only in Linux OS. Docker versions may need to be explored later to make it useable in Mac (and +potentially Windows). + +*Tools:* + +- Anaconda3 + - Tested with version: 2020.02 + +### Activate conda environment + +Change in to root directory and run the commands below: + +```sh +# create conda environment. Needed only the first time. +conda env create --file configs/envs/testing.yaml + +# if you need to update existing environment +conda env update --file configs/envs/testing.yaml + +# activate conda environment +conda activate testing +``` + +### Steps to run DITTO predictions + +Remove variants with `*` in `ALT Allele` column. These are called "Spanning or overlapping deletions" introduced in the VCF v4.3 specification. More on this [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele-). +Current version of VEP that we're using doesn't support these variants. We will work on this in our future release. 
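+The command below expects a bgzip-compressed, tabix-indexed VCF. If you start from a plain VCF, it can be prepared first; the following is a minimal sketch with placeholder paths, assuming `bgzip` and `tabix` (HTSlib/tabix, e.g. via `module load tabix`) are available:
+
+```sh
+# compress and index the input VCF (paths are examples only)
+bgzip -c path/to/sample.vcf > path/to/indexed_vcf.gz
+tabix -p vcf path/to/indexed_vcf.gz
+```
+
+The unsupported variants are then removed with the `bcftools` command below: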
+ +```sh +bcftools annotate -e'ALT="*" || type!="snp"' path/to/indexed_vcf.gz -Oz -o path/to/indexed_vcf_filtered.vcf.gz +``` + +#### Run VEP annotation + +Please look at the steps to run VEP [here](variant_annotation/README.md) + + +#### Parse VEP annotations + +Please look at the steps to parse VEP annotations [here](annotation_parsing/README.md) + + +#### Filter variants for Ditto prediction + +Filtering step includes imputation and one-hot encoding of columns. + +```sh +python src/Ditto/filter.py -i path/to/parsed_vcf_file.tsv -O path/to/output_directory +``` + +Output from this step includes - + +```directory +output_directory/ +├── data.csv <--- used for Ditto predictions +├── Nulls.csv - indicates number of Nulls in each column +├── stats_nssnv.csv - variant stats from the vcf +├── correlation_plot.pdf- Plot to check if any columns are directly correlated (cutoff >0.95) +└── columns.csv - columns before and after filtering step + +``` + +#### Ditto prediction + +```sh +python src/Ditto/predict.py -i path/to/output_directory/data.csv --sample sample_name -o path/to/output_directory/ditto_predictions.csv -o100 .path/to/output_directory/ditto_predictions_100.csv +``` + +#### Combine with Exomiser scores + +If phenotype terms are present for the sample, one could use Exomiser to rank genes and then prioritize Ditto predictions according to the phenotype. Once you have Exomiser scores, please run the following command to combine Exomiser and Ditto scores + +```sh +python src/Ditto/combine_scores.py --raw .path/to/parsed_vcf_file.tsv --sample sample_name --ditto path/to/output_directory/ditto_predictions.csv -ep path/to/exomiser_scores/directory -o .path/to/output_directory/predictions_with_exomiser.csv -o100 path/to/output_directory/predictions_with_exomiser_100.csv +``` + + +### Cohort level analysis + +Please refer to [CAGI6-RGP](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/mana/mini_projects/rgp_cagi6) project for filtering and annotation of variants as done above for single sample VCF along with calculating Exomiser scores. + +For predictions, make necessary directory edits to the snakemake [workflow](workflow/Snakefile) and run the following command. + +```sh +sbatch src/predict_variant_score.sh +``` + +**Note**: The commit used for CAGI6 challenge pipeline is [be97cf5d](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/ditto/-/merge_requests/3/diffs?commit_id=be97cf5dbfcb099ac82ef28d5d8b0919f28aed99). It was used along with annotated VCFs and exomiser scores obtained from [rgp_cagi6 workflow](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/mana/mini_projects/rgp_cagi6). 
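+As a quick check of the per-sample outputs described above, the top-ranked variants can be inspected on the command line. A minimal sketch, assuming the example output paths used earlier (the combined-score file is comma-separated and sorted by the combined score `P`; the Top-100 file uses `:` as the field separator):
+
+```sh
+# view the ten highest-ranked variants for one sample
+head -n 11 path/to/output_directory/predictions_with_exomiser.csv | column -s, -t
+
+# the Top-100 file is ':'-separated
+column -s: -t path/to/output_directory/predictions_with_exomiser_100.csv | head
+```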
+ + +## Contact information + +For issues, please send an email with clear description to + +Tarun Mamidi - tmamidi@uab.edu diff --git a/annotation_parsing/parse_annotated_vars.py b/annotation_parsing/parse_annotated_vars.py index dc94f88..3ed4147 100644 --- a/annotation_parsing/parse_annotated_vars.py +++ b/annotation_parsing/parse_annotated_vars.py @@ -17,10 +17,10 @@ def parse_n_print(vcf, outfile): output_header = ["Chromosome", "Position", "Reference Allele", "Alternate Allele"] + \ line.replace(" Allele|"," VEP_Allele_Identifier|").split("Format: ")[1].rstrip(">").rstrip('"').split("|") elif line.startswith("#CHROM"): - vcf_header = line.split("\t") + vcf_header = line.split("\t") else: break - + for idx, sample in enumerate(vcf_header): if idx > 8: output_header.append(f"{sample} allele depth") @@ -36,7 +36,8 @@ def parse_n_print(vcf, outfile): line = line.rstrip("\n") cols = line.split("\t") csq = parse_csq(next(filter(lambda info: info.startswith("CSQ="),cols[7].split(";"))).replace("CSQ=","")) - var_info = parse_var_info(vcf_header, cols) + #print(line, file=open("var_info.txt", "w")) + #var_info = parse_var_info(vcf_header, cols) alt_alleles = cols[4].split(",") alt2csq = format_alts_for_csq_lookup(cols[3], alt_alleles) for alt_allele in alt_alleles: @@ -45,14 +46,14 @@ def parse_n_print(vcf, outfile): possible_alt_allele4lookup = alt_allele try: write_parsed_variant( - out, - vcf_header, - cols[0], - cols[1], - cols[3], - alt_allele, - csq[possible_alt_allele4lookup], - var_info[alt_allele] + out, + vcf_header, + cols[0], + cols[1], + cols[3], + alt_allele, + csq[possible_alt_allele4lookup] + #,var_info[alt_allele] ) except KeyError: print("Variant annotation matching based on allele failed!") @@ -62,15 +63,15 @@ def parse_n_print(vcf, outfile): raise SystemExit(1) -def write_parsed_variant(out_fp, vcf_header, chr, pos, ref, alt, annots, var_info): +def write_parsed_variant(out_fp, vcf_header, chr, pos, ref, alt, annots):#, var_info): var_list = [chr, pos, ref, alt] for annot_info in annots: full_fmt_list = var_list + annot_info - for idx, sample in enumerate(vcf_header): - if idx > 8: - full_fmt_list.append(str(var_info[sample]["alt_depth"])) - full_fmt_list.append(str(var_info[sample]["total_depth"])) - full_fmt_list.append(str(var_info[sample]["prct_reads"])) + #for idx, sample in enumerate(vcf_header): + # if idx > 8: + # full_fmt_list.append(str(var_info[sample]["alt_depth"])) + # full_fmt_list.append(str(var_info[sample]["total_depth"])) + # full_fmt_list.append(str(var_info[sample]["prct_reads"])) out_fp.write("\t".join(full_fmt_list) + "\n") @@ -103,9 +104,9 @@ def parse_csq(csq): parsed_annot = annot.split("|") if parsed_annot[0] not in csq_allele_dict: csq_allele_dict[parsed_annot[0]] = list() - + csq_allele_dict[parsed_annot[0]].append(parsed_annot) - + return csq_allele_dict @@ -129,13 +130,13 @@ def parse_var_info(headers, cols): alt_depth = int(ad_info[alt_index + 1]) total_depth = sum([int(dp) for dp in ad_info]) prct_reads = (alt_depth / total_depth) * 100 - + allele_dict[sample] = { "alt_depth": alt_depth, "total_depth": total_depth, "prct_reads": prct_reads } - + parsed_alleles[alt_allele] = allele_dict return parsed_alleles @@ -184,5 +185,5 @@ def is_valid_file(p, arg): inputf = Path(ARGS.input_vcf) outputf = Path(ARGS.output) if ARGS.output else inputf.parent / inputf.stem.rstrip(".vcf") + ".tsv" - + parse_n_print(inputf, outputf) diff --git a/configs/cluster_config.json b/configs/cluster_config.json new file mode 100644 index 0000000..12fec63 --- /dev/null 
+++ b/configs/cluster_config.json @@ -0,0 +1,16 @@ +{ + "__default__": { + "ntasks": 1, + "partition": "express", + "cpus-per-task": "{threads}", + "mem-per-cpu": "4G", + "output": "logs/rule_logs/{rule}-%j.log" + }, + "ditto_filter": { + "partition": "largemem", + "mem-per-cpu": "200G" + }, + "combine_scores": { + "mem-per-cpu": "50G" + } +} diff --git a/configs/columns_config.yaml b/configs/columns_config.yaml new file mode 100644 index 0000000..d4ecc19 --- /dev/null +++ b/configs/columns_config.yaml @@ -0,0 +1,295 @@ +# columns to be needed in dataset +columns: + - Chromosome + - Position + - Reference Allele + - Alternate Allele + - Consequence + - IMPACT + - SYMBOL + - Feature + - BIOTYPE + - SIFT + - PolyPhen + - CADD_PHRED + - CADD_RAW + - DANN_score + - Eigen-PC-phred_coding + - Eigen-PC-raw_coding + - Eigen-PC-raw_coding_rankscore + - Eigen-phred_coding + - Eigen-raw_coding + - Eigen-raw_coding_rankscore + - FATHMM_score + - GERP++_RS + - GenoCanyon_score + - LRT_score + - M-CAP_score + - MetaLR_score + - MetaSVM_score + - MutationAssessor_score + - MutationTaster_score + - PROVEAN_score + - SiPhy_29way_logOdds + - VEST4_score + - fathmm-MKL_coding_score + - integrated_fitCons_score + - phastCons100way_vertebrate + - phastCons30way_mammalian + - phyloP100way_vertebrate + - phyloP30way_mammalian + - GERP + - gnomADv3_AF + - gnomADv3_AF_afr + - gnomADv3_AF_afr_female + - gnomADv3_AF_afr_male + - gnomADv3_AF_ami + - gnomADv3_AF_ami_female + - gnomADv3_AF_ami_male + - gnomADv3_AF_amr + - gnomADv3_AF_amr_female + - gnomADv3_AF_amr_male + - gnomADv3_AF_asj + - gnomADv3_AF_asj_female + - gnomADv3_AF_asj_male + - gnomADv3_AF_eas + - gnomADv3_AF_eas_female + - gnomADv3_AF_eas_male + - gnomADv3_AF_female + - gnomADv3_AF_fin + - gnomADv3_AF_fin_female + - gnomADv3_AF_fin_male + - gnomADv3_AF_male + - gnomADv3_AF_nfe + - gnomADv3_AF_nfe_female + - gnomADv3_AF_nfe_male + - gnomADv3_AF_oth + - gnomADv3_AF_oth_female + - gnomADv3_AF_oth_male + - gnomADv3_AF_raw + - gnomADv3_AF_sas + - gnomADv3_AF_sas_female + - gnomADv3_AF_sas_male + - clinvar_CLNREVSTAT + - clinvar_CLNSIG + - hgmd_class + +ClinicalSignificance: + - DP + - DFP + - FP + - DM? + - DM + - Benign + - Benign/Likely_benign + - Pathogenic/Likely_pathogenic + - Pathogenic + - Likely_pathogenic + - Likely_benign + +Clinsig_train: + - Pathogenic/Likely_pathogenic + - DM + - Benign + - Pathogenic + - Likely_benign + - Likely_pathogenic + +Clinsig_test: + + - DM? 
+ - DP + - DFP + - FP + - Benign/Likely_benign + + +CLNREVSTAT: #https://www.ncbi.nlm.nih.gov/clinvar/docs/review_status/ + - practice_guideline + - reviewed_by_expert_panel + - criteria_provided,_multiple_submitters,_no_conflicts + - criteria_provided,_single_submitter + +col_conv: + - MutationTaster_score + - MutationAssessor_score + - PROVEAN_score + - VEST4_score + - FATHMM_score + - GERP + +ML_VAR: + - SYMBOL + - Feature + - Consequence + - clinvar_CLNREVSTAT + - clinvar_CLNSIG + - Chromosome + - Position + - Alternate Allele + - Reference Allele + - ID + +var: + - SYMBOL + - Feature + - Consequence + - clinvar_CLNREVSTAT + - clinvar_CLNSIG + - Chromosome + - Position + - Alternate Allele + - Reference Allele + +Consequence: + - missense_variant + - missense_variant&splice_region_variant + - stop_gained + - start_lost&NMD_transcript_variant + - start_lost + - stop_gained&splice_region_variant + - stop_gained&NMD_transcript_variant + - missense_variant&splice_region_variant&NMD_transcript_variant + - stop_gained&splice_region_variant&NMD_transcript_variant + - start_lost&splice_region_variant + - stop_lost&splice_region_variant + - stop_lost&splice_region_variant&NMD_transcript_variant + - splice_donor_variant&missense_variant + - start_lost&splice_region_variant&NMD_transcript_variant + +gnomad_columns: + - gnomADv3_AF + - gnomADv3_AF_afr + - gnomADv3_AF_afr_female + - gnomADv3_AF_afr_male + - gnomADv3_AF_ami + - gnomADv3_AF_ami_female + - gnomADv3_AF_ami_male + - gnomADv3_AF_amr + - gnomADv3_AF_amr_female + - gnomADv3_AF_amr_male + - gnomADv3_AF_asj + - gnomADv3_AF_asj_female + - gnomADv3_AF_asj_male + - gnomADv3_AF_eas + - gnomADv3_AF_eas_female + - gnomADv3_AF_eas_male + - gnomADv3_AF_female + - gnomADv3_AF_fin + - gnomADv3_AF_fin_female + - gnomADv3_AF_fin_male + - gnomADv3_AF_male + - gnomADv3_AF_nfe + - gnomADv3_AF_nfe_female + - gnomADv3_AF_nfe_male + - gnomADv3_AF_oth + - gnomADv3_AF_oth_female + - gnomADv3_AF_oth_male + - gnomADv3_AF_raw + - gnomADv3_AF_sas + - gnomADv3_AF_sas_female + - gnomADv3_AF_sas_male + +nssnv_columns: + SIFT: 0.5 + PolyPhen: 0.5 + CADD_PHRED: 20 + MetaSVM_score: 0.5 # range - -2 to 3 + FATHMM_score: 0 #weighted for human inherited- disease mutations with Disease Ontology (FATHMMori). Scores range from -18.09 to 11.0. This is for coding and MKL for non-coding variants + MutationAssessor_score: 0 #MutationAssessor functional impact combined score (MAori). The score ranges from -5.135 to 6.49 in dbNSFP. + PROVEAN_score: 0 # range - -14 to 13.57 + VEST4_score: 0.5 # range - 0 to 1 + GERP: 0 #ranges from -12.36 to 6.18 + MutationTaster_score: 0.5 + DANN_score: 0.5 + Eigen-PC-phred_coding: 20 + Eigen-PC-raw_coding: 0 #functional annotations. input range - -3.252 to 8.426. + Eigen-PC-raw_coding_rankscore: 0 + Eigen-phred_coding: 20 + Eigen-raw_coding: 0 + Eigen-raw_coding_rankscore: 0 + GERP++_RS: 0 + GenoCanyon_score: 0.5 + LRT_score: 0.5 #The score ranges from 0 to 1 and a larger score signifies that the codon is more constrained or a NS is more likely to be deleterious. + M-CAP_score: 0.5 # range - 0 to 1 + MetaLR_score: 0.5 # range - 0 to 1 + SiPhy_29way_logOdds: 15 # input range - 0 to 33.272. SiPhy score based on 29 mammals genomes. The larger the score, the more conserved the site. + fathmm-MKL_coding_score: 0.5 # range - 0 to 1. 
This is for non-coding variants - functional prediction tool + integrated_fitCons_score: 0.5 # range - 0 to 1 + phastCons100way_vertebrate: 0.5 + phastCons30way_mammalian: 0.5 + phyloP100way_vertebrate: 20 + phyloP30way_mammalian: 20 + IMPACT_HIGH: 0 + IMPACT_LOW: 0 + IMPACT_MODERATE: 0 + IMPACT_MODIFIER: 0 + BIOTYPE_RNase_MRP_RNA: 0 + BIOTYPE_RNase_P_RNA: 0 + BIOTYPE_antisense_RNA: 0 + BIOTYPE_guide_RNA: 0 + BIOTYPE_lncRNA: 0 + BIOTYPE_miRNA: 0 + BIOTYPE_misc_RNA: 0 + BIOTYPE_ncRNA_pseudogene: 0 + BIOTYPE_protein_coding: 0 + BIOTYPE_scRNA: 0 + BIOTYPE_snRNA: 0 + BIOTYPE_snoRNA: 0 + BIOTYPE_telomerase_RNA: 0 + BIOTYPE_transcribed_pseudogene: 0 + BIOTYPE_nonsense_mediated_decay: 0 + BIOTYPE_polymorphic_pseudogene: 0 + BIOTYPE_IG_C_gene: 0 + BIOTYPE_non_stop_decay: 0 + +non_nssnv_columns: + CADD_PHRED: 20 + GERP: 0 + IMPACT_HIGH: 0 + IMPACT_LOW: 0 + IMPACT_MODERATE: 0 + IMPACT_MODIFIER: 0 + BIOTYPE_RNase_MRP_RNA: 0 + BIOTYPE_RNase_P_RNA: 0 + BIOTYPE_antisense_RNA: 0 + BIOTYPE_guide_RNA: 0 + BIOTYPE_lncRNA: 0 + BIOTYPE_miRNA: 0 + BIOTYPE_misc_RNA: 0 + BIOTYPE_protein_coding: 0 + BIOTYPE_snRNA: 0 + BIOTYPE_snoRNA: 0 + BIOTYPE_transcribed_pseudogene: 0 + +nssnv_median_3_0_1: + gnomADv3_AF: 0 + SIFT: 0.02 + PolyPhen: 0.757 + CADD_PHRED: 25.3 + DANN_score: 0.995325924 + Eigen-PC-phred_coding: 4.859703 + Eigen-PC-raw_coding: 0.465540408 + FATHMM_score: -1.26 + GERP++_RS: 4.73 + GenoCanyon_score: 0.999998451 + LRT_score: 2.30E-05 + M-CAP_score: 0.226631 + MetaSVM_score: -0.1847 + MutationAssessor_score: 2.215 + MutationTaster_score: 1 + PROVEAN_score: -2.88 + SiPhy_29way_logOdds: 13.8644 + VEST4_score: 0.762 + fathmm-MKL_coding_score: 0.94206 + integrated_fitCons_score: 0.675202 + phastCons100way_vertebrate: 1 + phastCons30way_mammalian: 0.99 + phyloP100way_vertebrate: 4.564 + phyloP30way_mammalian: 1.026 + GERP: 5.67 + IMPACT_HIGH: 0 + BIOTYPE_IG_C_gene: 0 + BIOTYPE_non_stop_decay: 0 + BIOTYPE_polymorphic_pseudogene: 0 + BIOTYPE_protein_coding: 1 diff --git a/configs/envs/environment.yaml b/configs/envs/environment.yaml new file mode 100644 index 0000000..8bdfe13 --- /dev/null +++ b/configs/envs/environment.yaml @@ -0,0 +1,26 @@ +name: training + +channels: + - conda-forge + - anaconda + +dependencies: + - python=3.8.5 + - pandas=1.2.1 + - numpy=1.18.5 + - optuna=2.5.0 + - scikit-learn=0.24.1 + - imbalanced-learn=0.7.0 + - scipy=1.4.1 + - shap=0.37.0 + - pip + - gpy=1.9.9 + - scikit-optimize=0.8.1 + - hyperopt=0.2.5 + - tune-sklearn=0.2.1 + - seaborn=0.11.2 + - pip: + - ray==1.6.0 + - ray[tune] + - tensorflow-gpu==2.3 + - lz4==3.1.3 diff --git a/configs/envs/testing.yaml b/configs/envs/testing.yaml new file mode 100644 index 0000000..7e5100f --- /dev/null +++ b/configs/envs/testing.yaml @@ -0,0 +1,24 @@ +name: testing + +channels: + - conda-forge + - anaconda + - bioconda + +dependencies: + - python=3.9.7 + - pandas=1.3.3 + - numpy=1.19.5 + - scikit-learn=0.24.2 + - imbalanced-learn=0.7.0 + - scipy=1.7.1 + - shap=0.39.0 + - bcftools=1.13 + - pip=21.2.4 + - bioconda::snakefmt==0.4.0 + - bioconda::snakemake==6.0.5 + - seaborn=0.11.2 + - black=20.8b1 + - pylint=2.11.1 + - lz4=3.1.3 + - gpy=1.10.0 diff --git a/configs/testing.yaml b/configs/testing.yaml new file mode 100644 index 0000000..175c398 --- /dev/null +++ b/configs/testing.yaml @@ -0,0 +1,159 @@ +# columns to be needed in dataset +columns: + - Chromosome + - Position + - Reference Allele + - Alternate Allele + - Consequence + - Gene + - HGNC_ID + - IMPACT + - SYMBOL + - Feature + - BIOTYPE + - SIFT + - PolyPhen + - CADD_PHRED + - CADD_RAW + - 
DANN_score + - Eigen-PC-phred_coding + - Eigen-PC-raw_coding + - Eigen-PC-raw_coding_rankscore + - Eigen-phred_coding + - Eigen-raw_coding + - Eigen-raw_coding_rankscore + - FATHMM_score + - GERP++_RS + - GenoCanyon_score + - LRT_score + - M-CAP_score + - MetaLR_score + - MetaSVM_score + - MutationAssessor_score + - MutationTaster_score + - PROVEAN_score + - SiPhy_29way_logOdds + - VEST4_score + - fathmm-MKL_coding_score + - integrated_fitCons_score + - phastCons100way_vertebrate + - phastCons30way_mammalian + - phyloP100way_vertebrate + - phyloP30way_mammalian + - GERP + - gnomADv3_AF + - gnomADv3_AF_afr + - gnomADv3_AF_afr_female + - gnomADv3_AF_afr_male + - gnomADv3_AF_ami + - gnomADv3_AF_ami_female + - gnomADv3_AF_ami_male + - gnomADv3_AF_amr + - gnomADv3_AF_amr_female + - gnomADv3_AF_amr_male + - gnomADv3_AF_asj + - gnomADv3_AF_asj_female + - gnomADv3_AF_asj_male + - gnomADv3_AF_eas + - gnomADv3_AF_eas_female + - gnomADv3_AF_eas_male + - gnomADv3_AF_female + - gnomADv3_AF_fin + - gnomADv3_AF_fin_female + - gnomADv3_AF_fin_male + - gnomADv3_AF_male + - gnomADv3_AF_nfe + - gnomADv3_AF_nfe_female + - gnomADv3_AF_nfe_male + - gnomADv3_AF_oth + - gnomADv3_AF_oth_female + - gnomADv3_AF_oth_male + - gnomADv3_AF_raw + - gnomADv3_AF_sas + - gnomADv3_AF_sas_female + - gnomADv3_AF_sas_male + - clinvar_CLNREVSTAT + - clinvar_CLNSIG + +col_conv: + - MutationTaster_score + - MutationAssessor_score + - PROVEAN_score + - VEST4_score + - FATHMM_score + - GERP + +ML_VAR: + - SYMBOL + - Feature + - Consequence + - Gene + - HGNC_ID + - clinvar_CLNREVSTAT + - clinvar_CLNSIG + - Chromosome + - Position + - Alternate Allele + - Reference Allele + #- ID + +var: + - SYMBOL + - Feature + - Consequence + - Gene + - HGNC_ID + - clinvar_CLNREVSTAT + - clinvar_CLNSIG + - Chromosome + - Position + - Alternate Allele + - Reference Allele + +Consequence: + - missense_variant + - missense_variant&splice_region_variant + - stop_gained + - start_lost&NMD_transcript_variant + - start_lost + - stop_gained&splice_region_variant + - stop_gained&NMD_transcript_variant + - missense_variant&splice_region_variant&NMD_transcript_variant + - stop_gained&splice_region_variant&NMD_transcript_variant + - start_lost&splice_region_variant + - stop_lost&splice_region_variant + - stop_lost&splice_region_variant&NMD_transcript_variant + - splice_donor_variant&missense_variant + - start_lost&splice_region_variant&NMD_transcript_variant + +nssnv_median_3_0_1: + gnomADv3_AF: 0 + SIFT: 0.02 + PolyPhen: 0.757 + CADD_PHRED: 25.3 + DANN_score: 0.995325924 + Eigen-PC-phred_coding: 4.859703 + Eigen-PC-raw_coding: 0.465540408 + FATHMM_score: -1.26 + GERP++_RS: 4.73 + GenoCanyon_score: 0.999998451 + LRT_score: 2.30E-05 + M-CAP_score: 0.226631 + MetaSVM_score: -0.1847 + MutationAssessor_score: 2.215 + MutationTaster_score: 1 + PROVEAN_score: -2.88 + SiPhy_29way_logOdds: 13.8644 + VEST4_score: 0.762 + fathmm-MKL_coding_score: 0.94206 + integrated_fitCons_score: 0.675202 + phastCons100way_vertebrate: 1 + phastCons30way_mammalian: 0.99 + phyloP100way_vertebrate: 4.564 + phyloP30way_mammalian: 1.026 + GERP: 5.67 + IMPACT_HIGH: 0 + BIOTYPE_IG_C_gene: 0 + BIOTYPE_non_stop_decay: 0 + BIOTYPE_polymorphic_pseudogene: 0 + BIOTYPE_protein_coding: 1 diff --git a/dag.png b/dag.png new file mode 100644 index 0000000..25841e4 Binary files /dev/null and b/dag.png differ diff --git a/docs/Pipeline.txt b/docs/Pipeline.txt new file mode 100644 index 0000000..769fdd0 --- /dev/null +++ b/docs/Pipeline.txt @@ -0,0 +1,65 @@ +** Current workflow: ** +`module load 
Anaconda3/2020.02 +module load tabix +module load BCFtools` +Download Clinvar (2/21/21 ) - + `wget -P /data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/external/ https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz + tabix -fp vcf clinvar.vcf.gz ` +Check chromosomes and keep note for modifications - + `zgrep -v ^# clinvar.vcf.gz | cut -f1 -d$'\t' | sort -u` +Copy HGMD VCF - + `cp /data/project/worthey_lab/manual_datasets_central/hgmd/2020q4/hgmd_pro_2020.4_hg38.vcf ./` +Check chromosomes and keep note for modifications - + `grep -v ^# hgmd_pro_2020.4_hg38.vcf | cut -f1 -d$'\t' | sort -u` +Fix the INFO column and index for merging - + `sed -E 's/(^[^#]+)(=")([^;"]+)(;)+([^;]*?)(")/\1\2\3%3B\5\6/' hgmd_pro_2020.4_hg38.vcf > hgmd_pro_2020.4_hg38_fixed_info.vcf` + bgzip -c hgmd_pro_2020.4_hg38_fixed_info.vcf > hgmd_pro_2020.4_hg38_fixed_info.vcf.gz + tabix -fp vcf hgmd_pro_2020.4_hg38_fixed_info.vcf.gz ` +Merge Clinvar and HGMD - + `bcftools merge clinvar.vcf.gz hgmd_pro_2020.4_hg38_fixed_info.vcf.gz -Ov -o ../interim/merged.vcf` +Add `chr` to chromosomes columns - + `sed -E -i 's/(^[^#]+)/chr\1/' ../interim/merged.vcf ` +Fix chromosome issues noted before - + `sed -i 's/^chrMT/chrM/g' ../interim/merged.vcf + grep -v ^chrNW ../interim/merged.vcf > ../interim/merged_chr_fix.vcf` +Check chromosomes and fix any remaining issues - + `grep -v ^# ../interim/merged_chr_fix.vcf | cut -f1 -d$'\t' | sort -u` +Normalize the variants using reference genome - + `bcftools norm -f /data/project/worthey_lab/datasets_central/human_reference_genome/processed/GRCh38/no_alt_rel20190408/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna ../interim/merged_chr_fix.vcf -Oz -o ../interim/merged_norm.vcf.gz` +Filter variants by size (<30kb) and class - + `python ../../src/training/data-prep/extract_variants.py` + Clinvar variants: 305386 + HGMD variants: 156402 +bgzip and Tabix index the file - + `bgzip -c merged_sig_norm.vcf > merged_sig_norm.vcf.gz + tabix -fp vcf ../interim/merged_sig_norm.vcf.gz` +Copy paths to dataset yaml file - + ``` + cadd_snv: "/data/project/worthey_lab/temp_datasets_central/mana/cadd/raw/hg38/v1.6/whole_genome_SNVs.tsv.gz" + cadd_indel: "/data/project/worthey_lab/temp_datasets_central/mana/cadd/raw/hg38/v1.6/gnomad.genomes.r3.0.indel.tsv.gz" + gerp: "/data/project/worthey_lab/temp_datasets_central/mana/gerp/processed/hg38/v1.6/gerp_score_hg38.bg.gz" + gnomad_genomes: "/data/project/worthey_lab/temp_datasets_central/mana/gnomad/v3.0/data/gnomad.genomes.r3.0.sites.vcf.bgz" + clinvar: "/data/project/worthey_lab/temp_datasets_central/mana/clinvar/data/grch38/20210119/clinvar_20210119.vcf.gz" + dbNSFP: "/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/variant_annotation/formatting/dbNSFP4.1a_variant.complete.gz" + ``` +Run variant annotation as shown in ReadMe file - + `./src/run_pipeline.sh -s -v ../data/interim/merged_sig_norm.vcf.gz -o ../data/interim -d ~/.ditto_datasets.yaml` +Parse the annotated vcf file - + `python parse_annotated_vars.py -i ../data/interim/merged_sig_norm_vep-annotated.vcf.gz -o ../data/interim/merged_sig_norm_vep-annotated.tsv` +Extract Class information for all these variants - + `python extract_class.py` +Filter, stats and prep the data - + `python filter.py` + + +For testing Ditto - +```sh + cp /data/project/worthey_lab/projects/experimental_pipelines/mana/small_tasks/cagi6/rgp/data/processed/annotated_vcf/train/CAGI6_RGP_TRAIN_12_PROBAND_vep-annotated.vcf.gz ./data/processed/testing/ + module load BCFtools + 
bcftools annotate -e'ALT="*" || type!="snp"' ./data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND_vep-annotated.vcf.gz -Oz -o ./data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND_vep-annotated_filtered.vcf.gz + python annotation_parsing/parse_annotated_vars.py -i ./data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND_vep-annotated_filtered.vcf.gz -o ./data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND_vep-annotated_filtered.tsv + python src/Ditto/filter.py -i ./data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND_vep-annotated_filtered.tsv -O ./data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND + python src/Ditto/predict.py -i ./data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND/data.csv --sample CAGI6_RGP_TRAIN_12_PROBAND -o ./data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND/ditto_predictions.csv -o100 ./data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND/ditto_predictions_100.csv + python src/Ditto/combine_scores.py --raw ./data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND_vep-annotated_filtered.tsv --sample CAGI6_RGP_TRAIN_12_PROBAND --ditto ./data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND/ditto_predictions.csv -ep /data/project/worthey_lab/projects/experimental_pipelines/mana/small_tasks/cagi6/rgp/data/processed/exomiser/train/CAGI6_RGP_TRAIN_12_PROBAND -o ./data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND/predictions_with_exomiser.csv -o100 ./data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND/predictions_with_exomiser_100.csv + +``` diff --git a/docs/Plan-for-project.txt b/docs/Plan-for-project.txt new file mode 100644 index 0000000..b7e5fdf --- /dev/null +++ b/docs/Plan-for-project.txt @@ -0,0 +1,85 @@ +Plan for Variant prioritization tool + +Objective: +Application of classification algorithms that ingest variant annotations along with phenotype information for predicting whether a variant will ultimately be clinically reported and returned to a patient. It should also predict and classify the variants based on ACMG guidelines. +Background: In the field of genomic medicine, the primary goal of diagnosing rare disease patient is to identify one or more genomic variants that may be responsible for a particular phenotype. This process is typically through annotation, filtering and ranking/prioritizing variants for manual curation. These curated variants are then clinically reported back to the patient. +Problem: Manual curation of thousands of variants even after filtering is time consuming process. Average time of curation is 100 variants per man-hour. Thus, methods/tools that can identify variants to be clinically reported, even in presence of high degree of variability in phenotype presentation, are of critical importance. +Solution: Develop a tool for accurate prioritization of variants, reduces time to review and diagnose rare disease patients. +Dataset: +Genotypes – +• SNVs – +o allele frequency, +o sequence based, structure-based predictions (ClinVarRVRD), +o sequence conservation, splice factor motifs, splice donor/acceptor sites, RNA folding energy, codon usage, CpG content (SilVa tool) +o splice donors and acceptors in pre-mRNA transcripts (SpliceAI) +• oligogenic or multilocus genetic pattern (VarCoPP) +• CNVs (CNVdigest) +• Gene expression pathway level training (MuliPLIER) – database (recount2) +• Protein-chaperon interaction (DeepNEU) +• eQTL as feature? +• Can we use polygenic risk scores? probably to identify modifier variants? +Note – population based = improved prediction? MAF filter? 
+ +Genotype-Phenotype – +• Variant pathogenicity prediction and annotation (eDiVA) for WES datasets +• Genotypic and phenotypic features (Xrare and DeepPVP – deep phenomeNET variant predictor) +Note - HPO terms for human and orthologous genes from model organisms (mouse, zebrafish etc.,) and phenotypes from protein-protein interactions. OMIM orphaned, IMPC, string DB. Monarch +Phenotype – +• Phenotypes (HPO, upheno), diseases (Mondo), genes and pathways (HANRD - heterogeneous association network for rare diseases, Orphamizer) +• EHR data (Ada XD, Dr. Warehouse) – this is not HPO +• Image based (Face2Gene, DeepGestalt) +• IR fingerprint (artificial neural networks) +Variants – ClinVar, InterVar, CancerVar, CNVinter, HGMD, DGV (db of genomic variants), dbSNP, Varibench, HumVar, ExoVar, predictSNP, and SwissVar +Features – allele frequency, specific populations, evolutionary conservation, functional impact (SIFT), segmental duplication, simple sequence repeats, ClinVar and OMIM; local context: GC content within 10 flanking bases on the reference genome; amino acid constraint, including blosum62 and pam250; Protein structure, interaction, and modifications, including predicted secondary structures, number of protein interactions from the BioPlex 2.0 Network, whether the protein is involved in complexes formation from CORUM database, number of high-confidence interacting proteins by PrePPI , probability of a residue being located the interaction interface by PrePPI (based on PPISP, PINUP, PredU), predicted accessible surface areas were obtained from dbPTM, SUMO scores in 7-amino acids neighborhood by GPS-SUMO, phosphorylation sites predictions within 7 amino acids neighborhood by GPS3.0, and ubiquitination scores within 14- amino acids neighborhood by UbiProber ; Gene mutation intolerance, including ExAC metrics – loss of function (check mvp paper) +Algorithms – +• Machine learning – sklearn, imblearn +• Deep learning/neural networks +• Convolutional Neural networks +• AI ?? +Training and Testing +• Probably use 80:20 from all clinvar variants +Tuning parameters +• Use population based tuning for large parameter space and large train data - https://deepmind.com/blog/article/population-based-training-neural-networks +• Use Optuna for define-by-run search space and parallelize it. +• ELI5 or SHAP for feature importance +Simulation data- +• Use 1000genome project VCF and add variants with HPO terms and create multiple vcf files. Use this as test. +• Use SNV vcfs and compare to SNV prioritization tools. Same with other type of variants and tools. Finally, combine all types of variants and test the model. +• Also simulate using inheritance patterns – a bit complicated – refer to eDiVA paper +Real data – +• Use UDN phase-1 data as a test set. +Results – +Variant ranking – +• Top 1 variant predicted by tool +• In top 10 list predicted by tool +• In top 20-30% list predicted by tool +• Not predicted/ unpredictable +Things to keep in mind – +• When seeing new values in some categorical columns, look for additive smoothing, also called Laplace smoothing +• Nuclear variants vs mitochondrial variants +• Hyperparameter tuning of features and check for F1 score +• Sequencing errors +• Default features and in combination with other predictors? +• Similar phenotype – similar mutation in the same gene? +• Inheritance patterns? a) dominant de novo, (b) autosomal dominant inherited, (c) autosomal recessive homozygous, (d) autosomal recessive compound heterozygous, or (e) X‐linked. +• Xu et al. 
developed Dic-Att-BiLSTM-CRF (DABLC), a deep attention NN method. By incorporating dictionary-based (using disease ontology) and document-based attention mechanisms, this new method outperformed existing ones at identifying rare and complex disease names + + +Tools to look/read: +1. rvtests - https://github.com/zhanxw/rvtests +2. favor - http://favor.genohub.org/ +3. LINSIGHT - https://www.nature.com/articles/ng.3810 +4. STAAR - https://github.com/xihaoli/STAAR ; https://www.nature.com/articles/s41588-020-0676-4 +5. Methylome mappability - https://bismap.hoffmanlab.org/ +6. Linkage disequilibrium - https://www.nature.com/articles/ng.3954 +7. Assay info - Encode (https://www.encodeproject.org/report/?type=Experiment) +8. Phen2Gene - https://academic.oup.com/nargab/article/2/2/lqaa032/5843800 +9. Hyperas - Hyperparameter tuning using the treestructured Parzen estimator (TPE) algorithm (DeepPVP). +10. Miscastv1.0 - Protein surface prediction + - dbs to consider - PhyreRisk, VarMap +11. AMELIE - page-13 in https://stm.sciencemag.org/content/scitransmed/suppl/2020/05/18/12.544.eaau9113.DC1/aau9113_SM.pdf +12. Cool plots for parameter tuning using hiplot - https://medium.com/roonyx/neural-network-hyper-parameter-tuning-with-keras-tuner-and-hiplot-7637677821fa + +Bluesky/User requirements/ideas: +1. Can we give 2 scores per variant: One as driver (monogenic), one as modifier (complex diseases). +2. Can it also predict if a variant is protective? \ No newline at end of file diff --git a/src/Ditto/combine_scores.py b/src/Ditto/combine_scores.py new file mode 100644 index 0000000..79a5bd0 --- /dev/null +++ b/src/Ditto/combine_scores.py @@ -0,0 +1,185 @@ +import pandas as pd +import warnings + +warnings.simplefilter("ignore") +import argparse +import os +import glob + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--raw", type=str, required=True, help="Input raw annotated file with path." + ) + parser.add_argument( + "--ditto", type=str, required=True, help="Input Ditto file with path." 
+ ) + parser.add_argument( + "--exomiser", + "-ep", + type=str, + # default="predictions.csv", + help="Path to Exomiser output directory", + ) + parser.add_argument( + "--sample", + type=str, + # required=True, + help="Input sample name to showup in results", + ) + parser.add_argument( + "--output", + "-o", + type=str, + default="predictions_with_exomiser.csv", + help="Output csv file with path", + ) + parser.add_argument( + "--output100", + "-o100", + type=str, + default="predictions_with_exomiser_100.csv", + help="Output csv file with path for Top 100 variants", + ) + parser.add_argument( + "--output1000", + "-o1000", + type=str, + default="predictions_with_exomiser_1000.csv", + help="Output csv file with path for Top 1000 variants", + ) + args = parser.parse_args() + # print (args) + + ditto = pd.read_csv(args.ditto) + raw = pd.read_csv( + args.raw, + sep="\t", + usecols=[ + "SYMBOL", + "Chromosome", + "Position", + "Reference Allele", + "Alternate Allele", + "SYMBOL", + "Gene", + "Feature", + "HGNC_ID", + ], + ) + # raw = raw[['Chromosome','Position','Reference Allele','Alternate Allele','SYMBOL','Gene','Feature', 'HGNC_ID']] + print("Raw file loaded!") + + overall = pd.merge( + raw, + ditto, + how="left", + on=[ + "Chromosome", + "Position", + "Alternate Allele", + "Reference Allele", + "Feature", + ], + ) + # print(overall.columns.values.tolist()) + del raw, ditto + id_map = pd.read_csv( + "/data/project/worthey_lab/temp_datasets_central/tarun/HGNC/biomart_9_23_21.txt", + sep="\t", + ) + + if args.exomiser: + print("Reading Exomiser scores...") + all_files = glob.glob(os.path.join(args.exomiser, "*.tsv")) + exo_scores = pd.concat( + (pd.read_csv(f, sep="\t") for f in all_files), ignore_index=True + ) + exo_scores = exo_scores[ + ["#GENE_SYMBOL", "ENTREZ_GENE_ID", "EXOMISER_GENE_PHENO_SCORE"] + ] + id_map = id_map.merge( + exo_scores, left_on="NCBI gene ID", right_on="ENTREZ_GENE_ID" + ) + overall = overall.merge( + id_map, how="left", left_on="HGNC_ID_x", right_on="HGNC ID" + ) + del id_map, exo_scores + # overall = overall.sort_values(by = ['Ditto_Deleterious','EXOMISER_GENE_PHENO_SCORE'], axis=0, ascending=[False,False], kind='quicksort', ignore_index=True) + # overall['Exo_norm'] = (overall['EXOMISER_GENE_PHENO_SCORE'] - overall['EXOMISER_GENE_PHENO_SCORE'].min()) / (overall['EXOMISER_GENE_PHENO_SCORE'].max() - overall['EXOMISER_GENE_PHENO_SCORE'].min()) + overall["combined"] = ( + overall["EXOMISER_GENE_PHENO_SCORE"].fillna(0) + + overall["Ditto_Deleterious"].fillna(0) + ) / 2 + overall = overall[ + [ + "SYMBOL_x", + "Chromosome", + "Position", + "Reference Allele", + "Alternate Allele", + "EXOMISER_GENE_PHENO_SCORE", + "Ditto_Deleterious", + "combined", + "SD", + "C", + ] + ] + overall.insert(0, "PROBANDID", args.sample) + overall.columns = [ + "PROBANDID", + "SYMBOL", + "CHROM", + "POS", + "REF", + "ALT", + "E", + "D", + "P", + "SD", + "C", + ] + # genes = genes[genes['EXOMISER_GENE_PHENO_SCORE'] != 0] + + # overall.sort_values('pred_Benign', ascending=False).head(500).to_csv(args.output500, index=False) + else: + # overall = overall.sort_values('Ditto_Deleterious', ascending=False) + overall = overall[ + [ + "SYMBOL_x", + "Chromosome", + "Position", + "Reference Allele", + "Alternate Allele", + "Ditto_Deleterious", + "SD", + "C", + ] + ] + overall.insert(0, "PROBANDID", args.sample) + overall.columns = [ + "PROBANDID", + "SYMBOL", + "CHROM", + "POS", + "REF", + "ALT", + "P", + "SD", + "C", + ] + + overall = overall.sort_values("P", ascending=False) + overall = 
overall.reset_index(drop=True) + overall["SD"] = 0 + overall["C"] = "*" + overall.to_csv(args.output, index=False) + + overall = overall.drop_duplicates( + subset=["CHROM", "POS", "REF", "ALT"], keep="first" + ).reset_index(drop=True) + overall = overall[["PROBANDID", "CHROM", "POS", "REF", "ALT", "P", "SD", "C"]] + overall.head(100).to_csv(args.output100, index=False, sep=":") + overall.head(1000).to_csv(args.output1000, index=False, sep=":") + + # del genes, overall diff --git a/src/Ditto/filter.py b/src/Ditto/filter.py new file mode 100644 index 0000000..55af4da --- /dev/null +++ b/src/Ditto/filter.py @@ -0,0 +1,194 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# python slurm-launch.py --exp-name testing --command "python Ditto/filter.py -i ../data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND_vep-annotated_filtered.tsv -O ../data/processed/testing/CAGI6_RGP_TRAIN_12_PROBAND" + +import pandas as pd + +pd.set_option("display.max_rows", None) +import numpy as np +from tqdm import tqdm +import seaborn as sns +import yaml +import os +import argparse +import matplotlib.pyplot as plt + +# from sklearn.linear_model import LinearRegression +# from sklearn.experimental import enable_iterative_imputer +# from sklearn.impute import IterativeImputer +# import pickle + + +def get_col_configs(config_f): + with open(config_f) as fh: + config_dict = yaml.safe_load(fh) + + # print(config_dict) + return config_dict + + +def extract_col(config_dict, df, stats): + print("Extracting columns and rows according to config file !....") + df = df[config_dict["columns"]] + if "non_snv" in stats: + # df= df.loc[df['hgmd_class'].isin(config_dict['Clinsig_train'])] + df = df[ + (df["Alternate Allele"].str.len() > 1) + | (df["Reference Allele"].str.len() > 1) + ] + print("\nData shape (non-snv) =", df.shape, file=open(stats, "a")) + else: + # df= df.loc[df['hgmd_class'].isin(config_dict['Clinsig_train'])] + df = df[ + (df["Alternate Allele"].str.len() < 2) + & (df["Reference Allele"].str.len() < 2) + ] + if "protein" in stats: + df = df[df["BIOTYPE"] == "protein_coding"] + else: + pass + print("\nData shape (snv) =", df.shape, file=open(stats, "a")) + df = df.loc[df["Consequence"].isin(config_dict["Consequence"])] + print("\nData shape (nsSNV) =", df.shape, file=open(stats, "a")) + + # print('\nhgmd_class:\n', df['hgmd_class'].value_counts(), file=open(stats, "a")) + print( + "\nclinvar_CLNSIG:\n", + df["clinvar_CLNSIG"].value_counts(), + file=open(stats, "a"), + ) + print( + "\nclinvar_CLNREVSTAT:\n", + df["clinvar_CLNREVSTAT"].value_counts(), + file=open(stats, "a"), + ) + print("\nConsequence:\n", df["Consequence"].value_counts(), file=open(stats, "a")) + print("\nIMPACT:\n", df["IMPACT"].value_counts(), file=open(stats, "a")) + print("\nBIOTYPE:\n", df["BIOTYPE"].value_counts(), file=open(stats, "a")) + # df = df.drop(['CLNVC','MC'], axis=1) + # CLNREVSTAT, CLNVC, MC + return df + + +def fill_na(df, config_dict, column_info, stats): # (config_dict,df): + + var = df[config_dict["var"]] + df = df.drop(config_dict["var"], axis=1) + print("parsing difficult columns......") + # df['GERP'] = [np.mean([float(item.replace('.', '0')) if item == '.' else float(item) for item in i]) if type(i) is list else i for i in df['GERP'].str.split('&')] + if "nssnv" in stats: + # df['MutationTaster_score'] = [np.mean([float(item.replace('.', '0')) if item == '.' 
else float(item) for item in i]) if type(i) is list else i for i in df['MutationTaster_score'].str.split('&')] + # df['MutationAssessor_score'] = [np.mean([float(item.replace('.', '0')) if item == '.' else float(item) for item in i]) if type(i) is list else i for i in df['MutationAssessor_score'].str.split('&')] + # df['PROVEAN_score'] = [np.mean([float(item.replace('.', '0')) if item == '.' else float(item) for item in i]) if type(i) is list else i for i in df['PROVEAN_score'].str.split('&')] + # df['VEST4_score'] = [np.mean([float(item.replace('.', '0')) if item == '.' else float(item) for item in i]) if type(i) is list else i for i in df['VEST4_score'].str.split('&')] + # df['FATHMM_score'] = [np.mean([float(item.replace('.', '0')) if item == '.' else float(item) for item in i]) if type(i) is list else i for i in df['FATHMM_score'].str.split('&')] + # else: + for col in tqdm(config_dict["col_conv"]): + df[col] = [ + np.mean( + [ + float(item.replace(".", "0")) if item == "." else float(item) + for item in i.split("&") + ] + ) + if "&" in str(i) + else i + for i in df[col] + ] + df[col] = df[col].astype("float64") + + print("One-hot encoding...") + df = pd.get_dummies(df, prefix_sep="_") + print(df.columns.values.tolist(), file=open(column_info, "w")) + + # lr = LinearRegression() + # imp= IterativeImputer(estimator=lr, verbose=2, max_iter=10, tol=1e-10, imputation_order='roman') + print("Filling NAs ....") + # df = imp.fit_transform(df) + # df = pd.DataFrame(df, columns = columns) + + df1 = pd.DataFrame() + + if "non_nssnv" in stats: + for key in tqdm(config_dict["non_nssnv_columns"]): + if key in df.columns: + df1[key] = ( + df[key] + .fillna(config_dict["non_nssnv_columns"][key]) + .astype("float64") + ) + else: + df1[key] = config_dict["non_nssnv_columns"][key] + else: + for key in tqdm(config_dict["nssnv_median_3_0_1"]): + if key in df.columns: + df1[key] = ( + df[key] + .fillna(config_dict["nssnv_median_3_0_1"][key]) + .astype("float64") + ) + else: + df1[key] = config_dict["nssnv_median_3_0_1"][key] + df = df1 + # df = df.drop(df.std()[(df.std() == 0)].index, axis=1) + del df1 + df = df.reset_index(drop=True) + print(df.columns.values.tolist(), file=open(column_info, "a")) + + fig = plt.figure(figsize=(20, 15)) + sns.heatmap(df.corr(), fmt=".2g", cmap="coolwarm") # annot = True, + plt.savefig(f"correlation_plot.pdf", format="pdf", dpi=1000, bbox_inches="tight") + + # df.dropna(axis=1, how='all', inplace=True) + # df['ID'] = [f'var_{num}' for num in range(len(df))] + print("NAs filled!") + df = pd.concat([var.reset_index(drop=True), df], axis=1) + return df + + +def main(df, config_dict, stats, column_info, null_info): + + print("\nData shape (Before filtering) =", df.shape, file=open(stats, "w")) + df = extract_col(config_dict, df, stats) + print("Columns extracted! 
Extracting class info....") + df.isnull().sum(axis=0).to_csv(null_info) + # df.drop_duplicates() + df.dropna(axis=1, how="all", inplace=True) + df = fill_na(df, config_dict, column_info, stats) + return df + + +if __name__ == "__main__": + + parser = argparse.ArgumentParser() + parser.add_argument( + "--out-dir", "-O", type=str, required=True, help="File path to output directory" + ) + parser.add_argument( + "--input", "-i", type=str, required=True, help="Input file with path" + ) + + args = parser.parse_args() + + config_f = "/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/configs/testing.yaml" + # read QA config file + config_dict = get_col_configs(config_f) + print("Config file loaded!") + + print("Loading data...") + var_f = pd.read_csv(args.input, sep="\t", usecols=config_dict["columns"]) + print("Data Loaded !....") + + if not os.path.exists(args.out_dir): + os.makedirs(args.out_dir) + os.chdir(args.out_dir) + stats = "stats_nssnv.csv" + # print("Filtering "+var+" variants with at-least 50 percent data for each variant...") + column_info = "columns.csv" + null_info = "Nulls.csv" + df = main(var_f, config_dict, stats, column_info, null_info) + + print("\nData shape (After filtering) =", df.shape, file=open(stats, "a")) + print("writing to csv...") + df.to_csv("data.csv", index=False) + del df diff --git a/src/Ditto/model.job b/src/Ditto/model.job new file mode 100644 index 0000000..d41f5a3 --- /dev/null +++ b/src/Ditto/model.job @@ -0,0 +1,30 @@ +#!/bin/bash +# +#SBATCH --job-name=Ditto_ranks +#SBATCH --output=Ditto_ranks.out +# +# Number of tasks needed for this job. Generally, used with MPI jobs +#SBATCH --ntasks=1 +#SBATCH --partition=express +# +# Number of CPUs allocated to each task. +#SBATCH --cpus-per-task=10 +# +# Mimimum memory required per allocated CPU in MegaBytes. 
+#SBATCH --mem=40G +# +# Send mail to the email address when the job fails +#SBATCH --mail-type=FAIL +#SBATCH --mail-user=tmamidi@uab.edu + +#Set your environment here +module load Anaconda3/2020.02 +source activate testing + +#Run your commands here +#python ranks.py -id /data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/debugged/filter_vcf_by_DP8_AB --json /data/project/worthey_lab/projects/experimental_pipelines/mana/small_tasks/cagi6/rgp/data/processed/metadata/train_test_metadata_original.json +#python ranks.py -id /data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/debugged/filter_vcf_by_DP6_AB --json /data/project/worthey_lab/projects/experimental_pipelines/mana/small_tasks/cagi6/rgp/data/processed/metadata/train_test_metadata_original.json +python ranks.py -id /data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/trial/filter_vcf_by_DP8_AB_hpo_removed --json /data/project/worthey_lab/projects/experimental_pipelines/mana/small_tasks/cagi6/rgp/data/processed/metadata/train_test_metadata_original.json +#python ranks.py -id /data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/debugged/filter_vcf_by_DP6_AB_hpo_removed --json /data/project/worthey_lab/projects/experimental_pipelines/mana/small_tasks/cagi6/rgp/data/processed/metadata/train_test_metadata_original.json +#python ranks.py -id /data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/debugged/annotated_vcf --json /data/project/worthey_lab/projects/experimental_pipelines/mana/small_tasks/cagi6/rgp/data/processed/metadata/train_test_metadata_original.json +#python ranks.py -id /data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/debugged/annotated_vcf_hpo_removed --json /data/project/worthey_lab/projects/experimental_pipelines/mana/small_tasks/cagi6/rgp/data/processed/metadata/train_test_metadata_original.json diff --git a/src/Ditto/predict.py b/src/Ditto/predict.py new file mode 100644 index 0000000..6109505 --- /dev/null +++ b/src/Ditto/predict.py @@ -0,0 +1,110 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- + +import pandas as pd +import yaml +import warnings + +warnings.simplefilter("ignore") +from joblib import load +import argparse +import os +import glob + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--input", + "-i", + type=str, + required=True, + help="Input csv file with path for predictions", + ) + parser.add_argument( + "--sample", + type=str, + # required=True, + help="Input sample name to showup in results", + ) + parser.add_argument( + "--output", + "-o", + type=str, + default="ditto_predictions.csv", + help="Output csv file with path", + ) + parser.add_argument( + "--output100", + "-o100", + type=str, + default="ditto_predictions_100.csv", + help="Output csv file with path for Top 100 variants", + ) + # parser.add_argument( + # "--variant", + # type=str, + # help="Check index/rank of variant of interest. 
Format: chrX,101412604,C,T") + args = parser.parse_args() + + # print("Loading data....") + + with open( + "/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/configs/testing.yaml" + ) as fh: + config_dict = yaml.safe_load(fh) + + # with open('SL212589_genes.yaml') as fh: + # config_dict = yaml.safe_load(fh) + + X = pd.read_csv(args.input) + X_test = X + # print('Data Loaded!') + var = X_test[config_dict["ML_VAR"]] + X_test = X_test.drop(config_dict["ML_VAR"], axis=1) + X_test = X_test.values + + with open( + "/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/models/F_3_0_1_nssnv/StackingClassifier_F_3_0_1_nssnv.joblib", + "rb", + ) as f: + clf = load(f) + + # print('Ditto Loaded!\nRunning predictions.....') + + y_score = clf.predict_proba(X_test) + del X_test + # print('Predictions finished!\nSorting ....') + pred = pd.DataFrame(y_score, columns=["Ditto_Benign", "Ditto_Deleterious"]) + + overall = pd.concat([var, pred], axis=1) + + # overall = overall.merge(X,on='Gene') + del X, pred, y_score, clf + overall.drop_duplicates(inplace=True) + overall.insert(0, "PROBANDID", args.sample) + overall["SD"] = 0 + overall["C"] = "*" + overall = overall.sort_values("Ditto_Deleterious", ascending=False) + # print('writing to database...') + overall.to_csv(args.output, index=False) + # print('Database storage complete!') + + overall = overall.drop_duplicates( + subset=["Chromosome", "Position", "Alternate Allele", "Reference Allele"], + keep="first", + ).reset_index(drop=True) + overall = overall[ + [ + "PROBANDID", + "Chromosome", + "Position", + "Reference Allele", + "Alternate Allele", + "Ditto_Deleterious", + "SD", + "C", + ] + ] + overall.columns = ["PROBANDID", "CHROM", "POS", "REF", "ALT", "P", "SD", "C"] + overall.head(100).to_csv(args.output100, index=False, sep=":") + del overall diff --git a/src/Ditto/ranks.py b/src/Ditto/ranks.py new file mode 100644 index 0000000..e652572 --- /dev/null +++ b/src/Ditto/ranks.py @@ -0,0 +1,73 @@ +import json +import pandas as pd +import argparse + +parser = argparse.ArgumentParser() +parser.add_argument( + "--input-dir", + "-id", + type=str, + required=True, + help="Input raw annotated file with path.", +) +parser.add_argument( + "--json", type=str, required=True, help="Input raw annotated file with path." 
+) +parser.add_argument( + "--output", "-o", type=str, default="ranks.csv", help="Output csv filename only" +) +args = parser.parse_args() + +# json_file = json.load(open("/data/project/worthey_lab/projects/experimental_pipelines/mana/small_tasks/cagi6/rgp/data/processed/metadata/train_test_metadata_original.json", 'r')) +json_file = json.load(open(args.json, "r")) + +with open(f"{args.input_dir}/{args.output}", "w") as f: + f.write(f"PROBANDID,[CHROM,POS,REF,ALT],SYMBOL,Exomiser,Ditto,Combined,Rank\n") +rank_list = [] +for samples in json_file["train"].keys(): + if "PROBAND" in samples: + # genes = pd.read_csv(f"/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/debugged/filter_vcf_by_DP6_AB/train/{samples}/combined_predictions.csv")#, sep=':') + genes = pd.read_csv( + f"{args.input_dir}/train/{samples}/combined_predictions.csv" + ) # , sep=':') + # genes = genes.sort_values(by = ['E','P'], axis=0, ascending=[False,False], kind='quicksort', ignore_index=True) + # genes = genes.drop_duplicates(subset=['Chromosome','Position','Alternate Allele','Reference Allele'], keep='first').reset_index(drop=True) + genes = genes.drop_duplicates( + subset=["CHROM", "POS", "ALT", "REF"], keep="first" + ).reset_index(drop=True) + for i in range(len(json_file["train"][samples]["solves"])): + variants = str( + "chr" + + str(json_file["train"][samples]["solves"][i]["Chrom"]).split(".")[0] + + "," + + str(json_file["train"][samples]["solves"][i]["Pos"]) + + "," + + json_file["train"][samples]["solves"][i]["Ref"] + + "," + + json_file["train"][samples]["solves"][i]["Alt"] + ).split(",") + # rank = ((genes.loc[(genes['Chromosome'] == variants[0]) & (genes['Position'] == int(variants[1])) & (genes['Alternate Allele'] == variants[3]) & (genes['Reference Allele'] == variants[2])].index)+1) + rank = ( + genes.loc[ + (genes["CHROM"] == variants[0]) + & (genes["POS"] == int(variants[1])) + & (genes["ALT"] == variants[3]) + & (genes["REF"] == variants[2]) + ].index + ) + 1 + rank_list = [*rank_list, *rank] # unpack both iterables in a list literal + with open(f"{args.input_dir}/{args.output}", "a") as f: + f.write( + f"{samples}, {variants}, {genes.loc[rank-1]['SYMBOL'].values}, {genes.loc[rank-1]['E'].values}, {genes.loc[rank-1]['D'].values}, {genes.loc[rank-1]['P'].values}, {rank.tolist()}\n" + ) + # f.write(f"{samples}, {variants}, {genes.loc[rank-1]['SYMBOL'].values}, {genes.loc[rank-1]['Ditto_Deleterious'].values}, {rank.tolist()}\n") + +with open(f"{args.input_dir}/{args.output}", "a") as f: + # f.write(f"\nList,{rank_list}\n") + f.write(f"Rank-1,{sum(i < 2 for i in rank_list)}\n") + f.write(f"Rank-5,{sum(i < 6 for i in rank_list)}\n") + f.write(f"Rank-10,{sum(i < 11 for i in rank_list)}\n") + f.write(f"Rank-20,{sum(i < 21 for i in rank_list)}\n") + f.write(f"Rank-50,{sum(i < 51 for i in rank_list)}\n") + f.write(f"Rank-100,{sum(i < 101 for i in rank_list)}\n") + f.write(f"#Predictions,{len(rank_list)}\n") diff --git a/src/Ditto/submission.py b/src/Ditto/submission.py new file mode 100644 index 0000000..f4cf87f --- /dev/null +++ b/src/Ditto/submission.py @@ -0,0 +1,29 @@ +import json +import pandas as pd + +json_file = json.load( + open( + "/data/project/worthey_lab/projects/experimental_pipelines/mana/small_tasks/cagi6/rgp/data/processed/metadata/train_test_metadata_original.json", + "r", + ) +) + +fnames = [] +for samples in json_file["test"].keys(): + # for train_test in json_file.keys(): + # if "TEST" in train_test: + # for samples in json_file[train_test].keys(): + if 
"PROBAND" in samples: + # fnames.append(train_test+samples) + fnames.append( + f"/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/debugged/annotated_vcf/test/{samples}/combined_predictions_100.csv" + ) # , sep=':') +# print(fnames) +model = pd.concat((pd.read_csv(f, sep=":") for f in fnames), ignore_index=True) +model["SD"] = 0 +model["C"] = "*" +model.to_csv( + "/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/debugged/ditto_model_debugged_annotated_vcf.txt", + index=False, + sep=":", +) diff --git a/src/predict_variant_score.sh b/src/predict_variant_score.sh new file mode 100755 index 0000000..97bd231 --- /dev/null +++ b/src/predict_variant_score.sh @@ -0,0 +1,21 @@ +#!/usr/bin/env bash +#SBATCH --job-name=predict_variant_score +#SBATCH --output=logs/predict_variant_score-%j.log +#SBATCH --cpus-per-task=1 +#SBATCH --mem-per-cpu=4G +#SBATCH --partition=short + +set -euo pipefail + +module reset +module load Anaconda3/2020.11 +conda activate testing + +mkdir -p "logs/rule_logs" + +snakemake \ + --snakefile "../workflow/Snakefile" \ + --use-conda \ + --profile '../variant_annotation/configs/snakemake_slurm_profile/{{cookiecutter.profile_name}}' \ + --cluster-config '../configs/cluster_config.json' \ + --cluster 'sbatch --ntasks {cluster.ntasks} --partition {cluster.partition} --cpus-per-task {cluster.cpus-per-task} --mem-per-cpu {cluster.mem-per-cpu} --output {cluster.output} --parsable' diff --git a/src/slurm-launch.py b/src/slurm-launch.py new file mode 100644 index 0000000..099309e --- /dev/null +++ b/src/slurm-launch.py @@ -0,0 +1,128 @@ +# slurm-launch.py +# Usage: +# for i in {0..4}; do python slurm-launch.py --exp-name Ditto_tuning --command "python optuna-tpe-stacking_training.ipy --vtype snv_protein_coding" sleep 60 ; done + +import argparse +import subprocess +import sys +import time +import os + +# template_file = "slurm-template.sh" #Path(__file__) / +JOB_NAME = "${JOB_NAME}" +NUM_NODES = "${NUM_NODES}" +NUM_GPUS_PER_NODE = "${NUM_GPUS_PER_NODE}" +NUM_CPUS_PER_NODE = "${NUM_CPUS_PER_NODE}" +TOT_MEM = "${TOT_MEM}" +PARTITION_OPTION = "${PARTITION_OPTION}" +COMMAND_PLACEHOLDER = "${COMMAND_PLACEHOLDER}" +GIVEN_NODE = "${GIVEN_NODE}" +LOAD_ENV = "${LOAD_ENV}" + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--exp-name", + type=str, + required=True, + help="The job name and path to logging file (exp_name.log).", + ) + parser.add_argument( + "--slurm-template", + "-temp", + type=str, + default="./", + help="Path to slurm template. Default: ./ (current location)", + ) + parser.add_argument( + "--num-nodes", "-n", type=int, default=1, help="Number of nodes to use." + ) + parser.add_argument( + "--node", + "-w", + type=str, + help="The specified nodes to use. Same format as the " + "return of 'sinfo'. Default: ''.", + ) + parser.add_argument( + "--num-gpus", + type=int, + default=0, + help="Number of GPUs to use in each node. (Default: 0)", + ) + parser.add_argument( + "--num-cpus", + type=int, + default=10, + help="Number of CPUs to use in each node. (Default: 10)", + ) + parser.add_argument( + "--mem", type=str, default="150G", help="Total Memory to use. (Default: 150G)" + ) + parser.add_argument( + "--partition", type=str, default="short", help="Default partition: short" + ) + parser.add_argument( + "--load-env", + type=str, + default="training", + help="Environment name to load before running script. 
(Default: 'training')", + ) + parser.add_argument( + "--command", + type=str, + required=True, + help="The command you wish to execute. For example: " + " --command 'python ML_models.py'. " + "Note that the command must be a string.", + ) + args = parser.parse_args() + + if args.node: + # assert args.num_nodes == 1 + node_info = "#SBATCH -w {}".format(args.node) + else: + node_info = "" + + job_name = "{}_{}".format( + args.exp_name, time.strftime("%m%d-%H%M%S", time.localtime()) + ) + + partition_option = ( + "#SBATCH --partition={}".format(args.partition) if args.partition else "" + ) + + # ===== Modified the template script ===== + with open(f"{args.slurm_template}slurm-template.sh", "r") as f: + text = f.read() + text = text.replace(JOB_NAME, job_name) + text = text.replace(NUM_NODES, str(args.num_nodes)) + text = text.replace(NUM_GPUS_PER_NODE, str(args.num_gpus)) + text = text.replace(NUM_CPUS_PER_NODE, str(args.num_cpus)) + text = text.replace(TOT_MEM, str(args.mem)) + text = text.replace(PARTITION_OPTION, str(args.partition)) + text = text.replace(COMMAND_PLACEHOLDER, str(args.command)) + text = text.replace(LOAD_ENV, str(args.load_env)) + text = text.replace(GIVEN_NODE, node_info) + text = text.replace( + "# THIS FILE IS A TEMPLATE AND IT SHOULD NOT BE DEPLOYED TO " "PRODUCTION!", + "# THIS FILE IS MODIFIED AUTOMATICALLY FROM TEMPLATE AND SHOULD BE " + "RUNNABLE!", + ) + + # ===== Save the script ===== + if not os.path.exists("./logs/"): + os.makedirs("./logs/") + script_file = "./logs/{}.sh".format(job_name) + with open(script_file, "w") as f: + f.write(text) + + # ===== Submit the job ===== + print("Starting to submit job!") + subprocess.Popen(["sbatch", script_file]) + print( + "Job submitted! Script file is at: <{}>. Log file is at: <{}>".format( + script_file, "./logs/{}.log".format(job_name) + ) + ) + sys.exit(0) diff --git a/src/slurm-template.sh b/src/slurm-template.sh new file mode 100644 index 0000000..cb08db8 --- /dev/null +++ b/src/slurm-template.sh @@ -0,0 +1,70 @@ +#!/bin/bash +# shellcheck disable=SC2206 +# THIS FILE IS GENERATED BY AUTOMATION SCRIPT! PLEASE REFER TO ORIGINAL SCRIPT! +# THIS FILE IS A TEMPLATE AND IT SHOULD NOT BE DEPLOYED TO PRODUCTION! 
+#SBATCH --job-name=${JOB_NAME} +#SBATCH --output=./logs/${JOB_NAME}.log +${GIVEN_NODE} +### This script works for any number of nodes, Ray will find and manage all resources +#SBATCH --nodes=${NUM_NODES} +#SBATCH --exclusive +#SBATCH --partition=${PARTITION_OPTION} +### Give all resources to a single Ray task, ray can manage the resources internally +#SBATCH --ntasks-per-node=1 +#SBATCH --cpus-per-task=${NUM_CPUS_PER_NODE} +#SBATCH --mem=${TOT_MEM} +#SBATCH --gpus-per-task=${NUM_GPUS_PER_NODE} +# Send mail to the email address when the job fails +#SBATCH --mail-type=FAIL +#SBATCH --mail-user=tmamidi@uab.edu +# Load modules or your own conda environment here +module load Anaconda3/2020.02 +# conda activate ${CONDA_ENV} +conda activate ${LOAD_ENV} +#module load CUDA/10.1.243 +#module load cuDNN/7.6.2.24-CUDA-10.1.243 + +## ===== DO NOT CHANGE THINGS HERE UNLESS YOU KNOW WHAT YOU ARE DOING ===== +## This script is a modification to the implementation suggest by gregSchwartz18 here: +## https://github.com/ray-project/ray/issues/826#issuecomment-522116599 +redis_password=$(uuidgen) +export redis_password + +nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST") # Getting the node names +nodes_array=($nodes) + +node_1=${nodes_array[0]} +ip=$(srun --nodes=1 --ntasks=1 -w "$node_1" hostname --ip-address) # making redis-address + +# if we detect a space character in the head node IP, we'll +# convert it to an ipv4 address. This step is optional. +if [[ "$ip" == *" "* ]]; then + IFS=' ' read -ra ADDR <<< "$ip" + if [[ ${#ADDR[0]} -gt 16 ]]; then + ip=${ADDR[1]} + else + ip=${ADDR[0]} + fi + echo "IPV6 address detected. We split the IPV4 address as $ip" +fi + +port=6379 +ip_head=$ip:$port +export ip_head +echo "IP Head: $ip_head" + +echo "STARTING HEAD at $node_1" +srun --nodes=1 --ntasks=1 -w "$node_1" \ + ray start --head --node-ip-address="$ip" --port=$port --redis-password="$redis_password" --temp-dir=$USER_SCRATCH --block & +sleep 30 + +worker_num=$((SLURM_JOB_NUM_NODES - 1)) #number of nodes other than the head node +for ((i = 1; i <= worker_num; i++)); do + node_i=${nodes_array[$i]} + echo "STARTING WORKER $i at $node_i" + srun --nodes=1 --ntasks=1 -w "$node_i" ray start --address "$ip_head" --redis-password="$redis_password" --block & + sleep 5 +done + +# ===== Call your code below ===== +${COMMAND_PLACEHOLDER} \ No newline at end of file diff --git a/src/training/data-prep/extract_class.py b/src/training/data-prep/extract_class.py new file mode 100644 index 0000000..4bbe418 --- /dev/null +++ b/src/training/data-prep/extract_class.py @@ -0,0 +1,35 @@ +import os +import gzip + +os.chdir( + "/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/interim/" +) +vcf = "merged_sig_norm.vcf.gz" +vcf1 = "merged_sig_norm_vep-annotated.tsv" +output = "merged_sig_norm_class_vep-annotated.tsv" + +print("Collecting variant class...") +class_dict = dict() +with gzip.open(vcf, "rt") as vcffp: + for cnt, line in enumerate(vcffp): + if not line.startswith("#"): + line = line.rstrip("\n") + cols = line.split("\t") + var_info = cols[0] + "\t" + cols[1] + "\t" + cols[3] + "\t" + cols[4] + # hgmd_class = cols[7].split(";")[0].split("=")[1] + class_dict[var_info] = cols[5] + +# print(class_dict) +print("Writing variant class...") +with open(output, "w") as out: + with open(vcf1, "rt") as vcffp: + for cnt, line in enumerate(vcffp): + if not line.startswith("Chromosome"): + line = line.rstrip("\n") + cols = line.split("\t") + var_info = cols[0] + "\t" + cols[1] + "\t" + cols[2] + "\t" + cols[3] 
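+                # append the class collected in the first pass over the VCF;
+                # assumes every variant in the annotated TSV was present there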
+ new_line = line + "\t" + class_dict[var_info] + out.write(new_line + "\n") + # print(line+"\t"+class_dict[var_info]) + else: + out.write(line.rstrip("\n") + "\thgmd_class\n") diff --git a/src/training/data-prep/extract_variants.py b/src/training/data-prep/extract_variants.py new file mode 100644 index 0000000..466a868 --- /dev/null +++ b/src/training/data-prep/extract_variants.py @@ -0,0 +1,67 @@ +import os +import gzip +import yaml +import re + +regex = re.compile("[@_!#$%^&*()<>?/\|}{~:]") + +os.chdir( + "/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/interim/" +) +vcf = "merged_norm.vcf.gz" +output = "merged_sig_norm.vcf" + +with open("../../configs/columns_config.yaml") as fh: + config_dict = yaml.safe_load(fh) + +cln = hgmd = 0 +print("Collecting variant class...") +with open(output, "w") as out: + with gzip.open(vcf, "rt") as vcffp: # gzip. + for cnt, line in enumerate(vcffp): + if not line.startswith("#"): + line = line.rstrip("\n") + cols = line.split("\t") + if ( + (len(cols[3]) < 30000) + and (len(cols[4]) < 30000) + and (regex.search(cols[3]) == None) + and (regex.search(cols[4]) == None) + ): + var_info = ( + cols[0] + + "\t" + + cols[1] + + "\t" + + cols[2] + + "\t" + + cols[3] + + "\t" + + cols[4] + ) + if "CLASS" in cols[7]: + var_class = cols[7].split(";")[0].split("=")[1] + if var_class in config_dict["ClinicalSignificance"]: + hgmd = hgmd + 1 + new_line = var_info + "\t" + var_class + out.write(new_line + "\n") + elif "CLNSIG" in line: + var_class = cols[7].split(";CLN")[5].split("=")[1] + var_sub = cols[7].split(";CLN")[4].split("=")[1] + if (var_class in config_dict["ClinicalSignificance"]) and ( + var_sub in config_dict["CLNREVSTAT"] + ): + cln = cln + 1 + new_line = var_info + "\t" + var_class + out.write(new_line + "\n") + # class_dict[var_info] = var_class + else: + pass + else: + pass + else: + pass + else: + out.write(line) + +print(f"Clinvar variants: {cln}\nHGMD variants: {hgmd}\n") diff --git a/src/training/data-prep/filter.py b/src/training/data-prep/filter.py new file mode 100644 index 0000000..0a3d346 --- /dev/null +++ b/src/training/data-prep/filter.py @@ -0,0 +1,380 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# python slurm-launch.py --exp-name no_null --command "python training/data-prep/filter.py --var-tag no_null_nssnv --cutoff 1" --mem 50G + +import pandas as pd + +pd.set_option("display.max_rows", None) +import numpy as np +from tqdm import tqdm +import seaborn as sns +import yaml +import os +import argparse +import matplotlib.pyplot as plt + +# from sklearn.linear_model import LinearRegression +# from sklearn.experimental import enable_iterative_imputer +# from sklearn.impute import IterativeImputer +# import pickle + + +def get_col_configs(config_f): + with open(config_f) as fh: + config_dict = yaml.safe_load(fh) + + # print(config_dict) + return config_dict + + +def extract_col(config_dict, df, stats, list_tag): + print("Extracting columns and rows according to config file !....") + df = df[config_dict["columns"]] + if "non_snv" in stats: + # df= df.loc[df['hgmd_class'].isin(config_dict['Clinsig_train'])] + df = df[ + (df["Alternate Allele"].str.len() > 1) + | (df["Reference Allele"].str.len() > 1) + ] + print("\nData shape (non-snv) =", df.shape, file=open(stats, "a")) + else: + # df= df.loc[df['hgmd_class'].isin(config_dict['Clinsig_train'])] + df = df[ + (df["Alternate Allele"].str.len() < 2) + & (df["Reference Allele"].str.len() < 2) + ] + if "protein" in stats: + df = df[df["BIOTYPE"] == "protein_coding"] 
+ else: + pass + print("\nData shape (snv) =", df.shape, file=open(stats, "a")) + # df = df[config_dict['Consequence']] + df = df.loc[df["Consequence"].isin(config_dict["Consequence"])] + print("\nData shape (nsSNV) =", df.shape, file=open(stats, "a")) + if "train" in stats: + df = df.loc[df["hgmd_class"].isin(config_dict["Clinsig_train"])] + else: + df = df.loc[df["hgmd_class"].isin(config_dict["Clinsig_test"])] + + if "train" in stats: + print("Dropping empty columns and rows along with duplicate rows...") + # df.dropna(axis=1, thresh=(df.shape[1]*0.3), inplace=True) #thresh=(df.shape[0]/4) + df.dropna( + axis=0, thresh=(df.shape[1] * list_tag[1]), inplace=True + ) # thresh=(df.shape[1]*0.3), how='all', + df.drop_duplicates() + df.dropna(axis=1, how="all", inplace=True) # thresh=(df.shape[0]/4) + print("\nhgmd_class:\n", df["hgmd_class"].value_counts(), file=open(stats, "a")) + print( + "\nclinvar_CLNSIG:\n", + df["clinvar_CLNSIG"].value_counts(), + file=open(stats, "a"), + ) + print( + "\nclinvar_CLNREVSTAT:\n", + df["clinvar_CLNREVSTAT"].value_counts(), + file=open(stats, "a"), + ) + print("\nConsequence:\n", df["Consequence"].value_counts(), file=open(stats, "a")) + print("\nIMPACT:\n", df["IMPACT"].value_counts(), file=open(stats, "a")) + print("\nBIOTYPE:\n", df["BIOTYPE"].value_counts(), file=open(stats, "a")) + # df = df.drop(['CLNVC','MC'], axis=1) + # CLNREVSTAT, CLNVC, MC + return df + + +def fill_na(df, config_dict, column_info, stats, list_tag): # (config_dict,df): + + var = df[config_dict["var"]] + df = df.drop(config_dict["var"], axis=1) + print("parsing difficult columns......") + # df['GERP'] = [np.mean([float(item.replace('.', '0')) if item == '.' else float(item) for item in i]) if type(i) is list else i for i in df['GERP'].str.split('&')] + if "nssnv" in stats: + # df['MutationTaster_score'] = [np.mean([float(item.replace('.', '0')) if item == '.' else float(item) for item in i]) if type(i) is list else i for i in df['MutationTaster_score'].str.split('&')] + # else: + for col in tqdm(config_dict["col_conv"]): + df[col] = [ + np.mean( + [ + float(item.replace(".", "0")) if item == "." 
else float(item) + for item in i.split("&") + ] + ) + if "&" in str(i) + else i + for i in df[col] + ] + df[col] = df[col].astype("float64") + if "train" in stats: + fig = plt.figure(figsize=(20, 15)) + sns.heatmap(df.corr(), fmt=".2g", cmap="coolwarm") # annot = True, + plt.savefig( + f"train_{list_tag[0]}/correlation_filtered_raw_{list_tag[0]}.pdf", + format="pdf", + dpi=1000, + bbox_inches="tight", + ) + print("One-hot encoding...") + df = pd.get_dummies(df, prefix_sep="_") + print(df.columns.values.tolist(), file=open(column_info, "w")) + # df.head(2).to_csv(column_info, index=False) + # lr = LinearRegression() + # imp= IterativeImputer(estimator=lr, verbose=2, max_iter=10, tol=1e-10, imputation_order='roman') + print("Filling NAs ....") + # df = imp.fit_transform(df) + # df = pd.DataFrame(df, columns = columns) + + if list_tag[2] == 1: + print("Including AF columns...") + df1 = df[config_dict["gnomad_columns"]] + df1 = df1.fillna(list_tag[3]) + + if list_tag[4] == 1: + df = df.drop(config_dict["gnomad_columns"], axis=1) + df = df.fillna(df.median()) + if "train" in stats: + print("\nColumns:\t", df.columns.values.tolist(), file=open(stats, "a")) + print( + "\nMedian values:\t", + df.median().values.tolist(), + file=open(stats, "a"), + ) + else: + pass + else: + print("Excluding AF columns...") + if list_tag[4] == 1: + df = df.drop(config_dict["gnomad_columns"], axis=1) + df1 = df.fillna(df.median()) + if "train" in stats: + print("\nColumns:\t", df.columns.values.tolist(), file=open(stats, "a")) + print( + "\nMedian values:\t", + df.median().values.tolist(), + file=open(stats, "a"), + ) + else: + df1 = pd.DataFrame() + + if "non_nssnv" in stats: + for key in tqdm(config_dict["non_nssnv_columns"]): + if key in df.columns: + df1[key] = ( + df[key] + .fillna(config_dict["non_nssnv_columns"][key]) + .astype("float64") + ) + else: + df1[key] = config_dict["non_nssnv_columns"][key] + else: + for key in tqdm(config_dict["nssnv_columns"]): + if key in df.columns: + df1[key] = ( + df[key].fillna(config_dict["nssnv_columns"][key]).astype("float64") + ) + else: + df1[key] = config_dict["nssnv_columns"][key] + df = df1 + df = df.drop(df.std()[(df.std() == 0)].index, axis=1) + del df1 + df = df.reset_index(drop=True) + print(df.columns.values.tolist(), file=open(column_info, "a")) + if "train" in stats: + fig = plt.figure(figsize=(20, 15)) + sns.heatmap(df.corr(), fmt=".2g", cmap="coolwarm") # annot = True, + plt.savefig( + f"train_{list_tag[0]}/correlation_before_{list_tag[0]}.pdf", + format="pdf", + dpi=1000, + bbox_inches="tight", + ) + + # Create correlation matrix + corr_matrix = df.corr().abs() + + # Select upper triangle of correlation matrix + upper = corr_matrix.where( + np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool) + ) + + # Find features with correlation greater than 0.90 + to_drop = [column for column in upper.columns if any(upper[column] > 0.90)] + print( + f"Correlated columns being dropped: {to_drop}", file=open(column_info, "a") + ) + + # Drop features + df.drop(to_drop, axis=1, inplace=True) + df = df.reset_index(drop=True) + print(df.columns.values.tolist(), file=open(column_info, "a")) + # df.dropna(axis=1, how='all', inplace=True) + df["ID"] = [f"var_{num}" for num in range(len(df))] + print("NAs filled!") + df = pd.concat([var.reset_index(drop=True), df], axis=1) + return df + + +def main(df, config_f, stats, column_info, null_info, list_tag): + # read QA config file + config_dict = get_col_configs(config_f) + print("Config file loaded!") + # read clinvar data + + 
print("\nData shape (Before filtering) =", df.shape, file=open(stats, "a")) + df = extract_col(config_dict, df, stats, list_tag) + print("Columns extracted! Extracting class info....") + df.isnull().sum(axis=0).to_csv(null_info) + # print('\n Unique Impact (Class):\n', df.hgmd_class.unique(), file=open("./data/processed/stats1.csv", "a")) + df["hgmd_class"] = ( + df["hgmd_class"] + .str.replace(r"DFP", "high_impact") + .str.replace(r"DM\?", "high_impact") + .str.replace(r"DM", "high_impact") + ) + df["hgmd_class"] = ( + df["hgmd_class"] + .str.replace(r"Pathogenic/Likely_pathogenic", "high_impact") + .str.replace(r"Likely_pathogenic", "high_impact") + .str.replace(r"Pathogenic", "high_impact") + ) + df["hgmd_class"] = ( + df["hgmd_class"] + .str.replace(r"DP", "low_impact") + .str.replace(r"FP", "low_impact") + ) + df["hgmd_class"] = ( + df["hgmd_class"] + .str.replace(r"Benign/Likely_benign", "low_impact") + .str.replace(r"Likely_benign", "low_impact") + .str.replace(r"Benign", "low_impact") + ) + df.drop_duplicates() + df.dropna(axis=1, how="all", inplace=True) + y = df["hgmd_class"] + class_dummies = pd.get_dummies(df["hgmd_class"]) + # del class_dummies[class_dummies.columns[-1]] + print("\nImpact (Class):\n", y.value_counts(), file=open(stats, "a")) + # y = df.hgmd_class + df = df.drop("hgmd_class", axis=1) + df = fill_na(df, config_dict, column_info, stats, list_tag) + + if "train" in stats: + var = df[config_dict["ML_VAR"]] + df = df.drop(config_dict["ML_VAR"], axis=1) + df = pd.concat([class_dummies.reset_index(drop=True), df], axis=1) + fig = plt.figure(figsize=(20, 15)) + sns.heatmap(df.corr(), fmt=".2g", cmap="coolwarm") + plt.savefig( + f"train_{list_tag[0]}/correlation_after_{list_tag[0]}.pdf", + format="pdf", + dpi=1000, + bbox_inches="tight", + ) + df = pd.concat([var, df], axis=1) + df = df.drop(["high_impact", "low_impact"], axis=1) + return df, y + + +if __name__ == "__main__": + + parser = argparse.ArgumentParser() + parser.add_argument( + "--var-tag", + "-v", + type=str, + required=True, + default="nssnv", + help="The tag used when generating train/test data. Default:'nssnv'", + ) + parser.add_argument( + "--cutoff", + type=float, + default=0.5, + help=f"Cutoff to include at least __% of data for all rows. Default:0.5 (i.e. 50%)", + ) + parser.add_argument( + "--af-columns", + "-af", + type=int, + default=0, + help=f"To include columns with Allele frequencies or not. Default:0", + ) + parser.add_argument( + "--af-values", + "-afv", + type=float, + default=0, + help=f"value to impute nulls for allele frequency columns. Default:0", + ) + parser.add_argument( + "--other-values", + "-otv", + type=int, + default=0, + help=f"Impute other columns with either custom defined values (0) or median (1). 
Default:0", + ) + + args = parser.parse_args() + list_tag = [ + args.var_tag, + args.cutoff, + args.af_columns, + args.af_values, + args.other_values, + ] + print(list_tag) + var = list_tag[0] + + if not os.path.exists( + "/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/train_test" + ): + os.makedirs( + "/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/train_test" + ) + os.chdir( + "/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/train_test" + ) + print("Loading data...") + var_f = pd.read_csv( + "../../interim/merged_sig_norm_class_vep-annotated.tsv", sep="\t" + ) + print("Data Loaded !....") + config_f = "../../../configs/columns_config.yaml" + + # variants = ['train_non_snv','train_snv','train_snv_protein_coding','test_snv','test_non_snv','test_snv_protein_coding'] + variants = ["train_" + var, "test_" + var] + # variants = ['train_'+var] + for var in variants: + if not os.path.exists(var): + os.makedirs(var) + stats = var + "/stats_" + var + ".csv" + print( + "Filtering " + + var + + " variants with at-least " + + str(list_tag[1] * 100) + + " percent data for each variant...", + file=open(stats, "w"), + ) + # print("Filtering "+var+" variants with at-least 50 percent data for each variant...") + column_info = var + "/" + var + "_columns.csv" + null_info = var + "/Nulls_" + var + ".csv" + df, y = main(var_f, config_f, stats, column_info, null_info, list_tag) + if "train" in stats: + train_columns = df.columns.values.tolist() + else: + df1 = pd.DataFrame() + for key in tqdm(train_columns): + if key in df.columns: + df1[key] = df[key] + else: + df1[key] = 0 + df = df1 + del df1 + + print("\nData shape (After filtering) =", df.shape, file=open(stats, "a")) + print("Class shape=", y.shape, file=open(stats, "a")) + print("writing to csv...") + df.to_csv(var + "/" + "merged_data-" + var + ".csv", index=False) + y.to_csv(var + "/" + "merged_data-y-" + var + ".csv", index=False) + del df, y diff --git a/src/training/data-prep/parse_clinvar.py b/src/training/data-prep/parse_clinvar.py new file mode 100644 index 0000000..779ff98 --- /dev/null +++ b/src/training/data-prep/parse_clinvar.py @@ -0,0 +1,26 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- + +# module load Anaconda3/2020.02 +# source activate envi +# python /data/project/worthey_lab/projects/experimental_pipelines/annovar_vcf_annotation/Annovar_Tarun.py /data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/external/clinvar.vcf /data/scratch/tmamidi/ /data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/interim/ /data/project/worthey_lab/tools/annovar/annovar_hg19_db + +import allel + +# print(allel.__version__) + +import os + +os.chdir("/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/") +# print(os.listdir()) + +print("Converting vcf.....") +# df = allel.vcf_to_dataframe('./interim/try.vcf', fields='*') +df = allel.vcf_to_dataframe("./interim/clinvar.out.hg19_multianno.vcf", fields="*") +# df.head(2) +# df.SIFT_score.unique() +print("vcf converted to dataframe.\nWriting it to a csv file.....") +df.to_csv("./external/clinvar.out.hg19_multianno.csv", index=False) +print("vcf to csv conversion completed!") +# df.to_csv("./external/clinvar.out.hg19_multianno.csv", index=False) +# print(df.head(20)) diff --git a/src/training/training/ML_models.py b/src/training/training/ML_models.py new file mode 100644 index 0000000..892a7bd --- /dev/null +++ 
b/src/training/training/ML_models.py @@ -0,0 +1,232 @@ +# python slurm-launch.py --exp-name no_AF_F_50 --command "python training/training/ML_models_AF0.py --var-tag no_AF_F_50" + +import numpy as np +import pandas as pd +import time +import argparse +import ray + +# Start Ray. +ray.init(ignore_reinit_error=True) +import warnings + +warnings.simplefilter("ignore") +from joblib import dump, load +import shap + +# from sklearn.preprocessing import StandardScaler +# from sklearn.preprocessing import MinMaxScaler +from sklearn.model_selection import train_test_split, cross_validate +from sklearn.preprocessing import label_binarize +from sklearn.metrics import ( + precision_score, + roc_auc_score, + accuracy_score, + confusion_matrix, + recall_score, + plot_roc_curve, + plot_precision_recall_curve, +) +from sklearn.tree import DecisionTreeClassifier +from sklearn.ensemble import ( + RandomForestClassifier, + AdaBoostClassifier, + GradientBoostingClassifier, + ExtraTreesClassifier, +) +from sklearn.naive_bayes import GaussianNB +from imblearn.ensemble import BalancedRandomForestClassifier +from sklearn.neural_network import MLPClassifier +from sklearn.discriminant_analysis import LinearDiscriminantAnalysis +import matplotlib.pyplot as plt +import yaml +import gc +import os + +os.chdir( + "/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/train_test/" +) + +TUNE_STATE_REFRESH_PERIOD = 10 # Refresh resources every 10 s + +# @ray.remote(num_returns=6) +def data_parsing(var, config_dict, output): + # Load data + # print(f'\nUsing merged_data-train_{var}..', file=open(output, "a")) + X_train = pd.read_csv(f"train_{var}/merged_data-train_{var}.csv") + # var = X_train[config_dict['ML_VAR']] + X_train = X_train.drop(config_dict["ML_VAR"], axis=1) + feature_names = X_train.columns.tolist() + X_train = X_train.values + Y_train = pd.read_csv(f"train_{var}/merged_data-y-train_{var}.csv") + Y_train = label_binarize( + Y_train.values, classes=["low_impact", "high_impact"] + ).ravel() + + X_test = pd.read_csv(f"test_{var}/merged_data-test_{var}.csv") + # var = X_test[config_dict['ML_VAR']] + X_test = X_test.drop(config_dict["ML_VAR"], axis=1) + # feature_names = X_test.columns.tolist() + X_test = X_test.values + Y_test = pd.read_csv(f"test_{var}/merged_data-y-test_{var}.csv") + print("Data Loaded!") + # Y = pd.get_dummies(y) + Y_test = label_binarize( + Y_test.values, classes=["low_impact", "high_impact"] + ).ravel() + + # scaler = StandardScaler().fit(X_train) + # X_train = scaler.transform(X_train) + # X_test = scaler.transform(X_test) + # explain all the predictions in the test set + background = shap.kmeans(X_train, 10) + return X_train, X_test, Y_train, Y_test, background, feature_names + + +@ray.remote # (num_cpus=9) +def classifier( + name, clf, var, X_train, X_test, Y_train, Y_test, background, feature_names, output +): + os.chdir( + "/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/train_test/" + ) + start = time.perf_counter() + score = cross_validate( + clf, + X_train, + Y_train, + cv=10, + return_train_score=True, + return_estimator=True, + n_jobs=-1, + verbose=0, + scoring=("roc_auc", "neg_log_loss"), + ) + clf = score["estimator"][np.argmin(score["test_neg_log_loss"])] + # y_score = cross_val_predict(clf, X_train, Y_train, cv=5, n_jobs=-1, verbose=0) + # class_weights = class_weight.compute_class_weight('balanced', np.unique(Y_train), Y_train) + # clf.fit(X_train, Y_train) #, class_weight=class_weights) + # name = 
str(type(clf)).split("'")[1] #.split(".")[3] + with open(f"../models/{var}/{name}_{var}.joblib", "wb") as f: + dump(clf, f, compress="lz4") + # del clf + # with open(f"./models/{var}/{name}_{var}.joblib", 'rb') as f: + # clf = load(f) + y_score = clf.predict(X_test) + prc = precision_score(Y_test, y_score, average="weighted") + recall = recall_score(Y_test, y_score, average="weighted") + roc_auc = roc_auc_score(Y_test, y_score) + # roc_auc = roc_auc_score(Y_test, np.argmax(y_score, axis=1)) + accuracy = accuracy_score(Y_test, y_score) + # score = clf.score(X_train, Y_train) + # matrix = confusion_matrix(np.argmax(Y_test, axis=1), np.argmax(y_score, axis=1)) + matrix = confusion_matrix(Y_test, y_score) + + # explain all the predictions in the test set + # background = shap.kmeans(X_train, 6) + explainer = shap.KernelExplainer(clf.predict, background) + del clf, X_train + background1 = X_test[np.random.choice(X_test.shape[0], 10000, replace=False)] + shap_values = explainer.shap_values(background1) + plt.figure() + shap.summary_plot(shap_values, background1, feature_names, show=False) + # shap.plots.waterfall(shap_values[0], max_display=15) + plt.savefig( + f"../models/{var}/{name}_{var}_features.pdf", + format="pdf", + dpi=1000, + bbox_inches="tight", + ) + del shap_values, background1, explainer + finish = (time.perf_counter() - start) / 60 + with open(output, "a") as f: + f.write( + f"{name}\t{np.mean(score['train_roc_auc'])}\t{np.mean(score['test_roc_auc'])}\t{np.mean(score['train_neg_log_loss'])}\t{np.mean(score['test_neg_log_loss'])}\t{prc}\t{recall}\t{roc_auc}\t{accuracy}\t{finish}\n{matrix}\n" + ) + return None + + +if __name__ == "__main__": + + parser = argparse.ArgumentParser() + parser.add_argument( + "--var-tag", + "-v", + type=str, + required=True, + default="nssnv", + help="The tag used when generating train/test data. 
Default:'nssnv'", + ) + + args = parser.parse_args() + + # Classifiers I wish to use + classifiers = { + "DecisionTree": DecisionTreeClassifier(class_weight="balanced"), + "RandomForest": RandomForestClassifier(class_weight="balanced", n_jobs=-1), + "BalancedRF": BalancedRandomForestClassifier(), + "AdaBoost": AdaBoostClassifier(), + "ExtraTrees": ExtraTreesClassifier(class_weight="balanced", n_jobs=-1), + "GaussianNB": GaussianNB(), + "LDA": LinearDiscriminantAnalysis(), + "GradientBoost": GradientBoostingClassifier(), + "MLP": MLPClassifier(), + } + + with open("../../../configs/columns_config.yaml") as fh: + config_dict = yaml.safe_load(fh) + + var = args.var_tag + if not os.path.exists("../models/" + var): + os.makedirs("../models/" + var) + output = "../models/" + var + "/ML_results_" + var + "_.csv" + # print('Working with '+var+' dataset...', file=open(output, "a")) + print("Working with " + var + " dataset...") + X_train, X_test, Y_train, Y_test, background, feature_names = data_parsing( + var, config_dict, output + ) + with open(output, "a") as f: + f.write( + "Model\tCross_validate(avg_train_roc_auc)\tCross_validate(avg_test_roc_auc)\tCross_validate(avg_train_neg_log_loss)\tCross_validate(avg_test_neg_log_loss)\tPrecision(test_data)\tRecall\troc_auc\tAccuracy\tTime(min)\tConfusion_matrix[low_impact, high_impact]\n" + ) + remote_ml = [ + classifier.remote( + name, + clf, + var, + X_train, + X_test, + Y_train, + Y_test, + background, + feature_names, + output, + ) + for name, clf in classifiers.items() + ] + ray.get(remote_ml) + gc.collect() + + # prepare plots + fig, [ax_roc, ax_prc] = plt.subplots(1, 2, figsize=(20, 10)) + fig.suptitle(f"Model performances on Testing data with filters", fontsize=20) + + for name, clf in classifiers.items(): + + with open(f"../models/{var}/{name}_{var}.joblib", "rb") as f: + clf = load(f) + + plot_precision_recall_curve(clf, X_test, Y_test, ax=ax_prc, name=name) + plot_roc_curve(clf, X_test, Y_test, ax=ax_roc, name=name) + + ax_roc.set_title("Receiver Operating Characteristic (ROC) curves") + ax_prc.set_title("Precision Recall (PRC) curves") + + ax_roc.grid(linestyle="--") + ax_prc.grid(linestyle="--") + + plt.legend() + plt.savefig( + f"../models/{var}/roc_{var}.pdf", format="pdf", dpi=1000, bbox_inches="tight" + ) + gc.collect() diff --git a/src/training/training/plot_roc.py b/src/training/training/plot_roc.py new file mode 100644 index 0000000..c1439f4 --- /dev/null +++ b/src/training/training/plot_roc.py @@ -0,0 +1,123 @@ +# from numpy import mean +import numpy as np +import pandas as pd +import time +import warnings +import argparse + +warnings.simplefilter("ignore") +from joblib import dump, load + +# from sklearn.preprocessing import StandardScaler +# from sklearn.feature_selection import RFE +# from sklearn.preprocessing import MinMaxScaler +from sklearn.preprocessing import label_binarize +from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve + +# from sklearn.multiclass import OneVsRestClassifier +from sklearn.tree import DecisionTreeClassifier +from sklearn.ensemble import ( + RandomForestClassifier, + AdaBoostClassifier, + GradientBoostingClassifier, + ExtraTreesClassifier, +) +from sklearn.naive_bayes import GaussianNB +from imblearn.ensemble import BalancedRandomForestClassifier +from sklearn.neural_network import MLPClassifier +from sklearn.discriminant_analysis import LinearDiscriminantAnalysis +import matplotlib.pyplot as plt +import yaml +import gc +import os + + +TUNE_STATE_REFRESH_PERIOD = 10 # Refresh 
resources every 10 s + + +def data_parsing(var, config_dict): + # Load data + X_test = pd.read_csv(f"test_{var}/merged_data-test_{var}.csv") + # var = X_test[config_dict['ML_VAR']] + X_test = X_test.drop(config_dict["ML_VAR"], axis=1) + # feature_names = X_test.columns.tolist() + X_test = X_test.values + Y_test = pd.read_csv(f"test_{var}/merged_data-y-test_{var}.csv") + print("Data Loaded!") + # Y = pd.get_dummies(y) + Y_test = label_binarize( + Y_test.values, classes=["low_impact", "high_impact"] + ).ravel() + + # scaler = StandardScaler().fit(X_train) + # X_train = scaler.transform(X_train) + # X_test = scaler.transform(X_test) + # explain all the predictions in the test set + # background = shap.kmeans(X_train, 10) + return X_test, Y_test + + +if __name__ == "__main__": + os.chdir( + "/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/train_test/" + ) + + parser = argparse.ArgumentParser() + parser.add_argument( + "--var-tag", + "-v", + type=str, + required=True, + default="nssnv", + help="The tag used when generating train/test data. Default:'nssnv'", + ) + + args = parser.parse_args() + var = args.var_tag + + # Classifiers I wish to use + classifiers = [ + "DecisionTree", + "RandomForest", + "BalancedRF", + "AdaBoost", + "ExtraTrees", + "GaussianNB", + "LDA", + "GradientBoost", + "MLP", + "StackingClassifier", + ] + + with open("../../../configs/columns_config.yaml") as fh: + config_dict = yaml.safe_load(fh) + + if not os.path.exists("../models/" + var): + os.makedirs("../models/" + var) + + print("Working with " + var + " dataset...") + X_test, Y_test = data_parsing(var, config_dict) + + # prepare plots + fig, [ax_roc, ax_prc] = plt.subplots(1, 2, figsize=(20, 10)) + fig.suptitle(f"Model performances on Testing data", fontsize=20) + + for name in classifiers: + + with open(f"../models/{var}/{name}_{var}.joblib", "rb") as f: + clf = load(f) + + plot_precision_recall_curve(clf, X_test, Y_test, ax=ax_prc, name=name) + plot_roc_curve(clf, X_test, Y_test, ax=ax_roc, name=name) + + ax_roc.set_title("Receiver Operating Characteristic (ROC) curves") + ax_prc.set_title("Precision Recall (PRC) curves") + + ax_roc.grid(linestyle="--") + ax_prc.grid(linestyle="--") + + plt.legend() + plt.savefig( + f"../models/{var}/roc_{var}.pdf", format="pdf", dpi=1000, bbox_inches="tight" + ) + gc.collect() diff --git a/src/training/training/stacking.py b/src/training/training/stacking.py new file mode 100644 index 0000000..3656402 --- /dev/null +++ b/src/training/training/stacking.py @@ -0,0 +1,209 @@ +# from numpy import mean +import numpy as np +import pandas as pd +import time +import argparse +import ray + +# Start Ray. 
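+# ignore_reinit_error avoids raising if Ray was already initialized in this process.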
+ray.init(ignore_reinit_error=True) +import warnings + +warnings.simplefilter("ignore") +from joblib import dump, load +import shap + +# from sklearn.preprocessing import StandardScaler +# from sklearn.feature_selection import RFE +# from sklearn.preprocessing import MinMaxScaler +from sklearn.model_selection import train_test_split, cross_validate +from sklearn.preprocessing import label_binarize +from sklearn.metrics import ( + precision_score, + roc_auc_score, + accuracy_score, + confusion_matrix, + recall_score, +) + +# from sklearn.multiclass import OneVsRestClassifier +from sklearn.tree import DecisionTreeClassifier +from sklearn.linear_model import LogisticRegression +from sklearn.ensemble import ( + RandomForestClassifier, + AdaBoostClassifier, + GradientBoostingClassifier, + ExtraTreesClassifier, + StackingClassifier, +) +from sklearn.naive_bayes import GaussianNB +from imblearn.ensemble import BalancedRandomForestClassifier +from sklearn.neural_network import MLPClassifier +from sklearn.discriminant_analysis import LinearDiscriminantAnalysis +import matplotlib.pyplot as plt +import yaml +import functools + +print = functools.partial(print, flush=True) +import matplotlib.pyplot as plt +import warnings + +warnings.simplefilter("ignore") +import gc +import os + +os.chdir( + "/data/project/worthey_lab/projects/experimental_pipelines/tarun/ditto/data/processed/train_test" +) + +TUNE_STATE_REFRESH_PERIOD = 10 # Refresh resources every 10 s + +# @ray.remote(num_returns=6) +def data_parsing(var, config_dict, output): + # Load data + # print(f'\nUsing merged_data-train_{var}..', file=open(output, "a")) + X_train = pd.read_csv(f"train_{var}/merged_data-train_{var}.csv") + # var = X_train[config_dict['ML_VAR']] + X_train = X_train.drop(config_dict["ML_VAR"], axis=1) + feature_names = X_train.columns.tolist() + X_train = X_train.values + Y_train = pd.read_csv(f"train_{var}/merged_data-y-train_{var}.csv") + Y_train = label_binarize( + Y_train.values, classes=["low_impact", "high_impact"] + ).ravel() + + X_test = pd.read_csv(f"test_{var}/merged_data-test_{var}.csv") + # var = X_test[config_dict['ML_VAR']] + X_test = X_test.drop(config_dict["ML_VAR"], axis=1) + # feature_names = X_test.columns.tolist() + X_test = X_test.values + Y_test = pd.read_csv(f"test_{var}/merged_data-y-test_{var}.csv") + print("Data Loaded!") + # Y = pd.get_dummies(y) + Y_test = label_binarize( + Y_test.values, classes=["low_impact", "high_impact"] + ).ravel() + + # scaler = StandardScaler().fit(X_train) + # X_train = scaler.transform(X_train) + # X_test = scaler.transform(X_test) + # explain all the predictions in the test set + background = shap.kmeans(X_train, 10) + return X_train, X_test, Y_train, Y_test, background, feature_names + + +# @ray.remote #(num_cpus=9) +def classifier( + clf, var, X_train, X_test, Y_train, Y_test, background, feature_names, output +): + start = time.perf_counter() + # score = cross_validate(clf, X_train, Y_train, cv=10, return_train_score=True, return_estimator=True, n_jobs=-1, verbose=0, scoring=('roc_auc','neg_log_loss')) + # clf = score['estimator'][np.argmin(score['test_neg_log_loss'])] + # y_score = cross_val_predict(clf, X_train, Y_train, cv=5, n_jobs=-1, verbose=0) + # class_weights = class_weight.compute_class_weight('balanced', np.unique(Y_train), Y_train) + clf.fit(X_train, Y_train) # , class_weight=class_weights) + clf_name = "StackingClassifier" + with open(f"../models/{var}/{clf_name}_{var}.joblib", "wb") as f: + dump(clf, f, compress="lz4") + # del clf + # with 
open(f"./models/{var}/{clf_name}_{var}.joblib", 'rb') as f: + # clf = load(f) + train_score = clf.score(X_train, Y_train) + y_score = clf.predict(X_test) + prc = precision_score(Y_test, y_score, average="weighted") + recall = recall_score(Y_test, y_score, average="weighted") + roc_auc = roc_auc_score(Y_test, y_score) + # roc_auc = roc_auc_score(Y_test, np.argmax(y_score, axis=1)) + accuracy = accuracy_score(Y_test, y_score) + # score = clf.score(X_train, Y_train) + # matrix = confusion_matrix(np.argmax(Y_test, axis=1), np.argmax(y_score, axis=1)) + matrix = confusion_matrix(Y_test, y_score) + finish = (time.perf_counter() - start) / 60 + with open(output, "a") as f: + f.write( + f"{clf_name}\t{train_score}\t{prc}\t{recall}\t{roc_auc}\t{accuracy}\t{finish}\n{matrix}\n" + ) + + # explain all the predictions in the test set + # background = shap.kmeans(X_train, 6) + explainer = shap.KernelExplainer(clf.predict, background) + del clf, X_train + background = X_test[np.random.choice(X_test.shape[0], 10000, replace=False)] + shap_values = explainer.shap_values(background) + plt.figure() + shap.summary_plot(shap_values, background, feature_names, show=False) + # shap.plots.waterfall(shap_values[0], max_display=15) + plt.savefig( + f"../models/{var}/{clf_name}_{var}_features.pdf", + format="pdf", + dpi=1000, + bbox_inches="tight", + ) + return None + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--var-tag", + "-v", + type=str, + required=True, + default="nssnv", + help="The tag used when generating train/test data. Default:'nssnv'", + ) + + args = parser.parse_args() + + # Classifiers I wish to use + classifiers = StackingClassifier( + estimators=[ + ("DecisionTree", DecisionTreeClassifier(class_weight="balanced")), + ( + "RandomForest", + RandomForestClassifier(class_weight="balanced", n_jobs=-1), + ), + ("BalancedRF", BalancedRandomForestClassifier()), + ("AdaBoost", AdaBoostClassifier()), + # ("ExtraTrees", ExtraTreesClassifier(class_weight='balanced', n_jobs=-1)), + ("GaussianNB", GaussianNB()), + # ("LDA", LinearDiscriminantAnalysis()), + ("GradientBoost", GradientBoostingClassifier()), + # ("MLP", MLPClassifier()) + ], + cv=5, + stack_method="predict_proba", + n_jobs=-1, + passthrough=False, + final_estimator=LogisticRegression(n_jobs=-1), + verbose=1, + ) + + with open("../../../configs/columns_config.yaml") as fh: + config_dict = yaml.safe_load(fh) + + var = args.var_tag + if not os.path.exists("../models/" + var): + os.makedirs("../models/" + var) + output = "../models/" + var + "/ML_results_" + var + "_.csv" + # print('Working with '+var+' dataset...', file=open(output, "a")) + print("Working with " + var + " dataset...") + X_train, X_test, Y_train, Y_test, background, feature_names = data_parsing( + var, config_dict, output + ) + with open(output, "a") as f: + f.write( + "Model\ttrain_score\tPrecision\tRecall\troc_auc\tAccuracy\tTime(min)\tConfusion_matrix[low_impact, high_impact]\n" + ) + classifier( + classifiers, + var, + X_train, + X_test, + Y_train, + Y_test, + background, + feature_names, + output, + ) + gc.collect() diff --git a/variant_annotation/.test/data/processed/vep/testing_variants_hg38_vep-annotated.vcf.gz b/variant_annotation/.test/data/processed/vep/testing_variants_hg38_vep-annotated.vcf.gz index b401d04..e4c5255 100644 Binary files a/variant_annotation/.test/data/processed/vep/testing_variants_hg38_vep-annotated.vcf.gz and b/variant_annotation/.test/data/processed/vep/testing_variants_hg38_vep-annotated.vcf.gz differ diff 
--git a/variant_annotation/README.md b/variant_annotation/README.md index 950ead4..105f9ad 100644 --- a/variant_annotation/README.md +++ b/variant_annotation/README.md @@ -6,14 +6,21 @@ Script [`src/run_pipeline.sh`](src/run_pipeline.sh) runs the snakemake workflow, ## Setup -1. Create necessary directories to store log files +1. Download submodules + +```sh +git submodule update --init +``` + + +2. Create necessary directories to store log files ```sh cd variant_annotation mkdir -p logs/rule_logs ``` -2. Create dataset config YAML and populate with paths +3. Create dataset config YAML and populate with paths ```sh touch ~/.ditto_datasets.yaml diff --git a/variant_annotation/src/Snakefile b/variant_annotation/src/Snakefile index 8f43a87..ba0980c 100644 --- a/variant_annotation/src/Snakefile +++ b/variant_annotation/src/Snakefile @@ -11,7 +11,7 @@ configfile: config["datasets"] #### VEP parameters #### -VEP_CACHE = 'homo_sapiens_refseq' +VEP_CACHE = 'homo_sapiens_merged' #'homo_sapiens_refseq' SPECIES = 'homo_sapiens' REF_BUILD = "GRCh38" ENSEMBL_DATASET_VERSION = "102" @@ -23,12 +23,12 @@ INPUT_VCF = config["vcf"] PROCESSED_DIR = Path(config["outdir"]) EXTERNAL_DIR = Path("data/external") -if not (INPUT_VCF.endswith('vcf') or INPUT_VCF.endswith('vcf.gz')): - print (f"Error: Input file extension not in expected format: found {INPUT_VCF}, expecting *.vcf or *.vcf.gz") +if not (INPUT_VCF.endswith('vcf') or INPUT_VCF.endswith('vcf.gz') or INPUT_VCF.endswith('vcf.bgz')): + print (f"Error: Input file extension not in expected format: found {INPUT_VCF}, expecting *.vcf, *.vcf.gz or *.vcf.bgz") raise SystemExit(1) INPUT_VCF = Path(INPUT_VCF) -OUTPUT_VCF = PROCESSED_DIR / ((INPUT_VCF.name).rstrip(".gz").rstrip(".vcf") + "_vep-annotated.vcf.gz") +OUTPUT_VCF = PROCESSED_DIR / ((INPUT_VCF.name).rstrip(".bgz").rstrip(".gz").rstrip(".vcf") + "_vep-annotated.vcf.gz") rule all: @@ -92,14 +92,15 @@ rule annotate_variants: release = ENSEMBL_DATASET_VERSION, species = SPECIES, build = REF_BUILD, - refseq_flag = "--refseq" if 'refseq' in VEP_CACHE else "", + #refseq_flag = "--refseq" if 'refseq' in VEP_CACHE else "", + refseq_flag = "--merged" if 'merged' in VEP_CACHE else "", hgvs_flag = "--hgvs" if HGVS else "", stats_flag = lambda wildcards, output: f"--stats_file {output.stats}" if STATS else "--no_stats", gnomad_fields = "AC,AN,AF,AF_afr,AF_afr_female,AF_afr_male,AF_ami,AF_ami_female,AF_ami_male,AF_amr,AF_amr_female,AF_amr_male,AF_asj,AF_asj_female," \ "AF_asj_male,AF_eas,AF_eas_female,AF_eas_male,AF_female,AF_fin,AF_fin_female,AF_fin_male,AF_male,AF_nfe,AF_nfe_female,AF_nfe_male," \ "AF_oth,AF_oth_female,AF_oth_male,AF_raw,AF_sas,AF_sas_female,AF_sas_male", clinvar_fields = "AF_ESP,AF_EXAC,AF_TGP,ALLELEID,CLNDN,CLNDNINCL,CLNDISDB,CLNDISDBINCL,CLNREVSTAT,CLNSIG,CLNSIGCONF,CLNSIGINCL,CLNVC,GENEINFO,MC,ORIGIN,RS,SSR", - dbNSFP_fields = "LRT_score,MutationTaster_score,MutationAssessor_score,FATHMM_score,PROVEAN_score,VEST4_score,MetaSVM_score,MetaLR_score,M-CAP_score," \ + dbNSFP_fields = "Ensembl_transcriptid,LRT_score,MutationTaster_score,MutationAssessor_score,FATHMM_score,PROVEAN_score,VEST4_score,MetaSVM_score,MetaLR_score,M-CAP_score," \ "CADD_phred,DANN_score,fathmm-MKL_coding_score,GenoCanyon_score,integrated_fitCons_score,GERP++_RS,phyloP100way_vertebrate,phyloP30way_mammalian," \ "phastCons100way_vertebrate,phastCons30way_mammalian,SiPhy_29way_logOdds,Eigen-raw_coding,Eigen-raw_coding_rankscore,Eigen-phred_coding," \ "Eigen-PC-raw_coding,Eigen-PC-raw_coding_rankscore,Eigen-PC-phred_coding", diff 
--git a/variant_annotation/src/run_pipeline.sh b/variant_annotation/src/run_pipeline.sh index 7661770..0a8a81e 100755 --- a/variant_annotation/src/run_pipeline.sh +++ b/variant_annotation/src/run_pipeline.sh @@ -5,7 +5,7 @@ #SBATCH --mem-per-cpu=4G #SBATCH --partition=medium -set -euo pipefail +set -eo pipefail usage() { echo "usage: $0" diff --git a/workflow/Snakefile b/workflow/Snakefile new file mode 100644 index 0000000..bd1dcd0 --- /dev/null +++ b/workflow/Snakefile @@ -0,0 +1,140 @@ +from pathlib import Path + +WORKFLOW_PATH = Path(workflow.basedir).parent + +configfile: "/data/project/worthey_lab/projects/experimental_pipelines/mana/small_tasks/cagi6/rgp/data/processed/metadata/train_test_metadata_original.json" +PROCESSED_DIR = Path("data/processed/trial/filter_vcf_by_DP8_AB_hpo_removed") +EXOMISER_DIR = Path("/data/project/worthey_lab/projects/experimental_pipelines/mana/small_tasks/cagi6/rgp/data/processed/exomiser/hpo_nonGeneticHPOsRemoved") +ANNOTATED_VCF_DIR = Path("/data/project/worthey_lab/projects/experimental_pipelines/mana/small_tasks/cagi6/rgp/data/processed/filter_vcf_by_DP8_AB") + +TRAIN_TEST = list(config.keys()) +SAMPLES = {} +SAMPLES['train'] = list(config['train'].keys()) +SAMPLES['test'] = list(config['test'].keys()) + +wildcard_constraints: + #sample="|".join(SAMPLE_LIST) #"TRAIN_12|TRAIN_13" + train_test = '|'.join(TRAIN_TEST), + sample = '|'.join(SAMPLES['train'] + SAMPLES['test']), + + +rule all: + input: + #TODO: specify all important output files here + #[PROCESSED_DIR / dataset_type / f"{sample}_vep-annotated_filtered.vcf.gz" for dataset_type in TRAIN_TEST for sample in SAMPLES[dataset_type] if config[dataset_type][sample]["affected_status"]=="Affected"], + [PROCESSED_DIR / dataset_type / sample / "ditto_predictions.csv" for dataset_type in TRAIN_TEST for sample in SAMPLES[dataset_type] if "PROBAND" in sample], + [PROCESSED_DIR / dataset_type / sample / "ditto_predictions_100.csv" for dataset_type in TRAIN_TEST for sample in SAMPLES[dataset_type] if "PROBAND" in sample], + [PROCESSED_DIR / dataset_type / sample / "combined_predictions.csv" for dataset_type in TRAIN_TEST for sample in SAMPLES[dataset_type] if "PROBAND" in sample], + [PROCESSED_DIR / dataset_type / sample / "combined_predictions_100.csv" for dataset_type in TRAIN_TEST for sample in SAMPLES[dataset_type] if "PROBAND" in sample], + [PROCESSED_DIR / dataset_type / sample / "combined_predictions_1000.csv" for dataset_type in TRAIN_TEST for sample in SAMPLES[dataset_type] if "PROBAND" in sample], + #expand(str(PROCESSED_DIR / "train/CAGI6_RGP_{sample}_PROBAND/predictions.csv"), sample=SAMPLE_LIST), + + +rule filter_variants: + input: + ANNOTATED_VCF_DIR / "{train_test}" / "{sample}_vep-annotated.vcf.gz" + output: + PROCESSED_DIR / "{train_test}" / "{sample}_vep-annotated_filtered.vcf.gz" + message: + "Filter variants from vcf using BCFTools: {wildcards.sample}" + conda: + str(WORKFLOW_PATH / "configs/envs/testing.yaml") + # threads: 2 + shell: + r""" + bcftools annotate \ + -e'ALT="*"' \ + {input} \ + -Oz \ + -o {output} + """ + + +rule parse_annotated_vars: + input: + PROCESSED_DIR / "{train_test}" / "{sample}_vep-annotated_filtered.vcf.gz" + output: + PROCESSED_DIR / "{train_test}" / "{sample}_vep-annotated_filtered.tsv" + message: + "Parse variants from annotated vcf to tsv: {wildcards.sample}" + conda: + str(WORKFLOW_PATH / "configs/envs/testing.yaml") + shell: + r""" + python annotation_parsing/parse_annotated_vars.py \ + -i {input} \ + -o {output} + """ + +rule ditto_filter: + input: + 
PROCESSED_DIR / "{train_test}" / "{sample}_vep-annotated_filtered.tsv" + output: + col = PROCESSED_DIR / "{train_test}" / "{sample}/columns.csv", + data = PROCESSED_DIR / "{train_test}" / "{sample}" / "data.csv", + nulls = PROCESSED_DIR / "{train_test}" / "{sample}/Nulls.csv", + stats = PROCESSED_DIR / "{train_test}" / "{sample}/stats_nssnv.csv", + plot = PROCESSED_DIR / "{train_test}" / "{sample}/correlation_plot.pdf", + message: + "Filter variants from annotated tsv for predictions: {wildcards.sample}" + conda: + str(WORKFLOW_PATH / "configs/envs/testing.yaml") + params: + outdir = lambda wildcards, output: Path(output['data']).parent + shell: + r""" + python src/Ditto/filter.py \ + -i {input} \ + -O {params.outdir} + """ + + +rule ditto_predict: + input: + data = PROCESSED_DIR / "{train_test}" / "{sample}/data.csv", + output: + pred = PROCESSED_DIR / "{train_test}" / "{sample}" / "ditto_predictions.csv", + pred_100 = PROCESSED_DIR / "{train_test}" / "{sample}" / "ditto_predictions_100.csv" + message: + "Run Ditto predictions: {wildcards.sample}" + params: + sample_name = lambda wildcards: f"{wildcards.sample}", + conda: + str(WORKFLOW_PATH / "configs/envs/testing.yaml") + shell: + r""" + python src/Ditto/predict.py \ + -i {input.data} \ + --sample {params.sample_name} \ + -o {output.pred} \ + -o100 {output.pred_100} \ + """ + +rule combine_scores: + input: + raw = PROCESSED_DIR / "{train_test}" / "{sample}_vep-annotated_filtered.tsv", + ditto = PROCESSED_DIR / "{train_test}" / "{sample}" / "ditto_predictions.csv", + exomiser = EXOMISER_DIR / "{train_test}" / "{sample}", + output: + pred = PROCESSED_DIR / "{train_test}" / "{sample}" / "combined_predictions.csv", + pred_100 = PROCESSED_DIR / "{train_test}" / "{sample}" / "combined_predictions_100.csv", + pred_1000 = PROCESSED_DIR / "{train_test}" / "{sample}" / "combined_predictions_1000.csv" + message: + "Combine Ditto predictions with Exomiser: {wildcards.sample}" + params: + sample_name = lambda wildcards: f"{wildcards.sample}", + conda: + str(WORKFLOW_PATH / "configs/envs/testing.yaml") + #params: + # variant= lambda wildcards: --variant str('Chr' + str(config[f"{wildcards.train_test}"][f"{wildcards.sample}"]["solves"][0]["Chrom"]) + ',' + str(config[f"{wildcards.train_test}"][f"{wildcards.sample}"]["solves"][0]["Pos"]) + ',' + config[f"{wildcards.train_test}"][f"{wildcards.sample}"]["solves"][0]["Ref"] + ',' + config[f"{wildcards.train_test}"][f"{wildcards.sample}"]["solves"][0]["Alt"]) if 'TRAIN' in {wildcards.sample} + shell: + r""" + python src/Ditto/combine_scores.py \ + --raw {input.raw} \ + --ditto {input.ditto} \ + -ep {input.exomiser} \ + --sample {params.sample_name} \ + -o {output.pred} \ + -o100 {output.pred_100} \ + -o1000 {output.pred_1000} \ + """
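A note on the cluster settings referenced above: `src/predict_variant_score.sh` points Snakemake at `../configs/cluster_config.json` and builds its `sbatch` call from `{cluster.ntasks}`, `{cluster.partition}`, `{cluster.cpus-per-task}`, `{cluster.mem-per-cpu}` and `{cluster.output}`, so that JSON is expected to supply those keys per rule (or under `__default__`). The file itself is not part of this diff; the sketch below is an assumed minimal example, not the project's actual configuration.

```json
{
    "__default__": {
        "ntasks": 1,
        "partition": "short",
        "cpus-per-task": 1,
        "mem-per-cpu": "4G",
        "output": "logs/rule_logs/slurm-%j.log"
    }
}
```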