#!/usr/bin/env python
'''
Find potential de novo variants in a given VCF.
de_novo_finder_3.py is a major update to the original caller. The main purpose is to
identify events that appear to be de novo in a specified VCF file that contains
sequence information from trios. Due to the nature of this variation, parents are required
to be homozygous reference, while children should usually be heterozygous for the
variant. Confidence in the calls is established with quality filters:
1) The variant must pass all of the filters applied by the variant caller
To accept TruthSensitivityTranche variants, use the -q flag.
** Removed in 3.93 **
2) The PL (normalized, Phred-scaled likelihoods for AA, AB, and BB genotypes where
A = ref and B = alt) score of the child is required to be >T : 0 : >0 for a given
threshold, T.
T is set at a default of 20. It can be adjusted with the -t flag.
3) The allelic balance (# alternative reads/total reads) of the child is required
to be at least 20%. The allelic balance in the parents should be less than or
equal to 5%.
These numbers can be adjusted with the -c and -p flags, respectively.
4) The depth in the child is required to be greater than a tenth of the sum of the
depths in both parents.
The fraction of the sum of depths can be adjusted with the -d flag.
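As a rough illustration of criteria 2 to 4, the sketch below applies the PL, allelic
balance, and depth checks to one trio at one site. The function name
passes_basic_filters and its arguments are hypothetical and exist only for this example.
The default thresholds mirror the ones stated above; whether each comparison is strict
or inclusive is an assumption.

```python
def passes_basic_filters(child_pl, child_ad, dad_ad, mom_ad,
                         child_dp, dad_dp, mom_dp,
                         thresh=20, min_child_ab=0.2,
                         max_parent_ab=0.05, depth_fraction=0.1):
    """Hypothetical helper: True if a trio passes criteria 2-4 at one site.

    child_pl is (PL_AA, PL_AB, PL_BB); each *_ad is (ref_reads, alt_reads).
    """
    # Criterion 2: the child's PL must look like >T : 0 : >0
    if not (child_pl[0] > thresh and child_pl[1] == 0 and child_pl[2] > 0):
        return False

    def allelic_balance(ad):
        total = ad[0] + ad[1]
        return float(ad[1]) / total if total > 0 else None

    child_ab = allelic_balance(child_ad)
    dad_ab = allelic_balance(dad_ad)
    mom_ab = allelic_balance(mom_ad)
    # A trio member with no reads at the site cannot be evaluated
    if child_ab is None or dad_ab is None or mom_ab is None:
        return False
    # Criterion 3: allelic balance in the child and in both parents
    if child_ab < min_child_ab:
        return False
    if dad_ab > max_parent_ab or mom_ab > max_parent_ab:
        return False
    # Criterion 4: child depth relative to the summed parental depth
    parent_dp = dad_dp + mom_dp
    if parent_dp == 0:
        return False
    return float(child_dp) / parent_dp > depth_fraction


# Hypothetical trio: het child with balanced reads, parents nearly all reference
print(passes_basic_filters((60, 0, 80), (20, 18), (40, 0), (35, 1), 38, 40, 36))
```

Passing these checks only makes a site a candidate; the p_dn calculation described
below still has to be applied.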
This script processes both single nucleotide variants (SNVs) and small insertions and
deletions (indels). To skip indels, use the -i flag. ** Removed in 3.81 **
Lines in the VCF that have multiple alternative alleles are processed only if all
alleles are single bases.
Instead of requiring a hard PL threshold in the parents, we have defined a relative
probability of an event being truly de novo versus the probability that it was a missed
heterozygote call in one of the two parents (the most likely error mode).
p_dn = P(true de novo | data) / (P(true de novo | data) + P(missed het in parent | data))
where P(true de novo | data) = P(data | true de novo) * P(true de novo)
P(data | true de novo) = Pdad_ref * Pmom_ref * Pchild_het
P(true de novo) = 1/30 Mb
and P(missed het in parent | data) = P(data | at least one parent is het) * P(one parent het)
P(data | at least one parent is het) = (Pdad_ref*Pmom_het + Pdad_het*Pmom_ref) * Pchild_het
P(one parent het) = 1 - (1-f)^4
where f is the maximum of the frequency of the variant in the VCF or ESP
The minimum p_dn considered is 0.05, but can be adjusted with the -m flag.
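The calculation above can be sketched as follows. This is a minimal, standalone
illustration of the formula, assuming PLs are given as (AA, AB, BB) lists; the names
pl_to_probs and p_dn and the example PL values are hypothetical and chosen only for
this sketch.

```python
def pl_to_probs(pls):
    """Convert Phred-scaled likelihoods (AA, AB, BB) to normalized probabilities."""
    raw = [10 ** (-float(pl) / 10) for pl in pls]
    total = sum(raw)
    return [r / total for r in raw]


def p_dn(child_pl, dad_pl, mom_pl, f, dn_rate=1.0 / 30000000):
    """Relative probability of a true de novo vs. a missed het in one parent."""
    child_ref, child_het, child_alt = pl_to_probs(child_pl)
    dad_ref, dad_het, dad_alt = pl_to_probs(dad_pl)
    mom_ref, mom_het, mom_alt = pl_to_probs(mom_pl)
    # P(data | true de novo) * P(true de novo)
    p_true_dn = dad_ref * mom_ref * child_het * dn_rate
    # P(data | at least one parent is het) * P(one parent het)
    p_data_parent_het = (dad_ref * mom_het + dad_het * mom_ref) * child_het
    p_one_parent_het = 1 - (1 - f) ** 4
    p_missed_het = p_data_parent_het * p_one_parent_het
    return p_true_dn / (p_true_dn + p_missed_het)


rare = 100.0 / 30000000
# Hypothetical trio: confident het child and confident hom-ref parents
confident = p_dn([60, 0, 80], [0, 60, 600], [0, 60, 600], rare)
# Same child, but the father's hom-ref call is weak (PL for AB is only 10)
weak_dad = p_dn([60, 0, 80], [0, 10, 600], [0, 60, 600], rare)
print(confident, weak_dad)
```

With these made-up values the first trio comes out above 0.99, while the weak paternal
call drops the second below the 0.05 minimum, which shows how strongly the metric
depends on the parental PLs.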
The potential de novo variants are then split by SNVs and indels into HIGH, MEDIUM,
and LOW validation likelihood by the following criteria.
HIGH_SNV:
p_dn > 0.99 and child_AD > 0.3 and dp_ratio (kid depth/combined parental depth) > 0.2
or
p_dn > 0.99 and child_AD > 0.3 and allele count (AC) == 1
MEDIUM_SNV:
p_dn > 0.5 and child_AD > 0.3
or
p_dn > 0.5 and child_AD > 0.2 and AC == 1
LOW_SNV:
p_dn > 0.05 and child_AD > 0.2
HIGH_indel:
p_dn > 0.99 and child_AD > 0.3 and AC == 1
MEDIUM_indel:
p_dn > 0.5 and child_AD > 0.3 and AC <= 5
LOW_indel:
p_dn > 0.05 and child_AD > 0.2
If SnpEff annotations have been included in the annotation line of the VCF, the -a
flag can be used to extract and print the gene name and mutation category. The same is
true for VEP annotations using the -v flag.
The updates to this caller require changes in the input. A PED file is required to
establish the family relations. PED format has 6 columns: family ID, child, dad, mom,
sex of the child, affected status of the child. The ESP counts file is required to
determine the population frequency of an event.
Current ESP counts file: all_ESP_counts_5.28.13.txt
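A minimal sketch of reading the six PED columns described above into a
child -> (dad, mom, sex, affected status) dictionary. The function names
parse_ped_line and load_trios are hypothetical, and treating a parent ID of '0' as
"no parent listed" is an assumption based on the usual PED convention rather than
something this script is known to do.

```python
def parse_ped_line(line):
    """Hypothetical parser for one whitespace-delimited PED line.

    Returns (child_id, (dad_id, mom_id, sex, affected_status)).
    """
    fields = line.split()
    if len(fields) < 6:
        raise ValueError('Expected 6 PED columns, found {0}'.format(len(fields)))
    family_id, child, dad, mom, sex, affected = fields[:6]
    return child, (dad, mom, sex, affected)


def load_trios(ped_lines):
    "Build a child -> (dad, mom, sex, affected) dictionary, skipping founders."
    trios = {}
    for line in ped_lines:
        if not line.strip() or line.startswith('#'):
            continue
        child, info = parse_ped_line(line)
        # Assumption: a parent coded as '0' means this individual is not a trio child
        if info[0] == '0' or info[1] == '0':
            continue
        trios[child] = info
    return trios


example = ['FAM1 kid1 dad1 mom1 1 2',
           'FAM1 dad1 0 0 1 1',
           'FAM1 mom1 0 0 2 1']
print(load_trios(example))
```

The resulting dictionary has the same shape that trimfamily() unpacks: the child ID as
key and a (dad, mom, sex, affected status) tuple as value.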
Output contains many more columns than earlier versions of the script so downstream
scripts may need to be adjusted.
3.4: different flags
3.5: fixed depth bug and now prints DP_ratio, child sex and affected status
3.52: modified to remove "chr" if it is in the "chr" column and skip lines
that have no PL information
3.6: Changed the INCORRECT MEDIUM_indel variant AC requirement. It used
to be AC >= 5, but it should really be AC <= 5. (This may have been seen in
some versions of 3.52.) In addition, 3.6 skips lines with no AD information
and keeps track of the number.
3.7: Adjusted validation likelihood filters so that MEDIUM_SNV that meet
the following criteria are moved to HIGH_SNV:
AC < 10
child AD > 0.3
child depth > 10
3.71: Added ability to handle .gz VCFs
3.72: Haplotype caller for 0/0 individuals will list an AD of "." which breaks the
script. I am now assuming that there are 0 alternative reads in these cases.
3.73: The Haplotype caller fix from 3.72 now also applies to hemizygous variants
3.74: Fixed the labels reading line so that it strips off the newline character
3.75: Added the ability to handle VEP annotations (courtesy of Jack Kosmicki)
3.8: The VEP annotation section was completely wrong. Rewrote with code from Konrad
Karczewski loftee_utils.py (https://github.com/konradjk/loftee/blob/master/src/loftee_utils.py)
3.9: DROPPED THE -i FLAG. The script no longer throws out multiallelic lines where a SNV
and an indel are both present. It does, however, still not handle multiallelic lines
with more than two alternative alleles.
3.91: Slight update to VEP annotation list
3.92: Added a quick fix for situations where the SnpEff EFFECT (such as
'INTRON') is missing
3.93: Removed the -q flag that allowed individuals to look at Truth Sensitivity
Tranche variants
3.94: If the greatest frequency (f) == 0, f is now set to 100/30Mbp. This rule
had never been incorporated before.
'''
__version__ = 3.94
__author__ = 'Kaitlin E. Samocha <ksamocha@fas.harvard.edu>'
__date__ = 'March 10th, 2016'
import argparse
import os.path
import sys
import re
import time
import gzip
# Note that this list of VEP annotations is current as of v77 with 2 included for backwards compatibility (VEP <= 75)
# From Konrad Karczewski
# Slight updates from Jack Kosmicki
csq_order = ["transcript_ablation",
             "splice_donor_variant",
             "splice_acceptor_variant",
             "stop_gained",
             "frameshift_variant",
             "stop_lost",
             "start_lost",
             "initiator_codon_variant",  # deprecated
             "transcript_amplification",
             "inframe_insertion",
             "inframe_deletion",
             "missense_variant",
             "protein_altering_variant",
             "splice_region_variant",
             "incomplete_terminal_codon_variant",
             "stop_retained_variant",
             "synonymous_variant",
             "coding_sequence_variant",
             "mature_miRNA_variant",
             "5_prime_UTR_variant",
             "3_prime_UTR_variant",
             "non_coding_transcript_exon_variant",
             "non_coding_exon_variant",  # deprecated
             "intron_variant",
             "NMD_transcript_variant",
             "non_coding_transcript_variant",
             "nc_transcript_variant",  # deprecated
             "upstream_gene_variant",
             "downstream_gene_variant",
             "TFBS_ablation",
             "TFBS_amplification",
             "TF_binding_site_variant",
             "regulatory_region_ablation",
             "regulatory_region_amplification",
             "feature_elongation",
             "regulatory_region_variant",
             "feature_truncation",
             "intergenic_variant",
             ""]
csq_order_dict = dict(zip(csq_order, range(len(csq_order))))
rev_csq_order_dict = dict(zip(range(len(csq_order)), csq_order))


def trimfamily(Fam, labels):
    "Trim family dictionary to leave only families that are in the VCF"
    Fam2 = {}
    for child, (dad, mom, gender, aff_status) in Fam.items():
        if child not in labels:
            sys.stderr.write("Could not find child: {0}\n".format(child))
            continue
        if dad not in labels:
            sys.stderr.write("Could not find dad: {0}\n".format(dad))
            continue
        if mom not in labels:
            sys.stderr.write("Could not find mom: {0}\n".format(mom))
            continue
        if gender in ('Male', 'male', 'M', 'm', '1'):
            gender = 1
        elif gender in ('Female', 'female', 'F', 'f', '2'):
            gender = 2
        else:
            gender = 0  # consider it as missing
        child = labels.index(child)
        dad = labels.index(dad)
        mom = labels.index(mom)
        Fam2[child] = (dad, mom, gender, aff_status)

    # Now to make the arrays for easier look up
    am_kid = ['N' for i in range(9, len(labels))]
    who_dad = ['N' for i in range(9, len(labels))]
    who_mom = ['N' for i in range(9, len(labels))]
    for idx in range(9, len(labels)):
        if idx in Fam2.keys():
            am_kid[idx-9] = 'Y'
            (dad_pos, mom_pos, gender, aff_status) = Fam2[idx]
            who_dad[idx-9] = dad_pos
            who_mom[idx-9] = mom_pos
    return (Fam2, am_kid, who_dad, who_mom)


def split_Fam(Fam_dict, labels):
    "Split the family dictionary into female vs male children"
    fem_Fam = {}
    male_Fam = {}
    female_kid = ['N' for i in range(9, len(labels))]
    male_kid = ['N' for i in range(9, len(labels))]
    for family, fam_info in Fam_dict.items():
        if fam_info[2] == 1:  # male
            male_Fam[family] = fam_info
        elif fam_info[2] == 2:  # female
            fem_Fam[family] = fam_info
        else:
            continue
    for idx in range(9, len(labels)):
        if idx in fem_Fam.keys():
            female_kid[idx-9] = 'Y'
        elif idx in male_Fam.keys():
            male_kid[idx-9] = 'Y'
        else:
            continue
    return (fem_Fam, female_kid, male_Fam, male_kid)


def process_line(line, args):
    '''Processes the variant line and runs quality checks on the variant
    May not need to be its own function given the removal of -q in 3.93
    '''
    filter_pos = 6  # FILTER is the 7th fixed column of a VCF line
    # The only remaining check: the variant must have passed the caller's filters
    return line[filter_pos] == 'PASS'


def is_child(column, Fam):
    "Checks if the het entry is a child"
    if column in Fam.keys():
        return (Fam[column][0], Fam[column][1])  # Should return the parents
    return None


def child_cuts(record, PL_pos, AD_pos, args):
    "Apply the PL and AD filters"
    # To skip lines that have no PL information
    try:
        PL = record[PL_pos].split(",")
    except IndexError:
        return None
    if int(PL[0]) <= args.thresh:
        return None
    if PL[1] != "0":
        return None
    if record[AD_pos] == '.':
        sys.stderr.write('Child had AD of ".": {0}\n'.format(record))
        return None
    AD = record[AD_pos].split(',')  # should be for hets only. not modifying
    if ((AD[0] == '0') and (AD[1] == '0')):
        return None
    ratio = float(AD[1]) / (float(AD[0]) + float(AD[1]))
    if ratio <= args.minchildAB:
        return None
    return (PL, ratio)


def DPcheck(DPlist, args):  # revamped
    "Check that the child's depth is appropriate given the parents' depths"
    # DPlist is [child depth, dad depth, mom depth]
    percent_depthratio = args.depthratio / 100.0
    dp_ratio = float(DPlist[0]) / (float(DPlist[1]) + float(DPlist[2]))
    if dp_ratio >= percent_depthratio:
        return dp_ratio
    else:
        return None


def snpeff_annotate(var_annotation):
    "Take variant annotation and extract gene name and mutation type"
    # Set for SnpEff annotation
    var_annotation = var_annotation.split(';')
    gene_name = "."
    functional_class = "."
    effect = "."
    for entry in var_annotation:
        if re.search('GENE_NAME', entry):
            gene_name = entry.split('=')[1]
        elif re.search('FUNCTIONAL_CLASS', entry):
            functional_class = entry.split('=')[1]
        elif re.search('EFFECT', entry):
            effect = entry.split('=')[1]
        else:
            continue
    if functional_class in ('NONSENSE', 'MISSENSE', 'SILENT'):
        return (gene_name, functional_class)
    elif (functional_class == 'NONE'):
        return (gene_name, effect)
    else:
        return (gene_name, 'NA')


def VEP_annotate(var_annotation, vep_field_names, alt_allele):
    '''Take variant annotation and extract gene name and mutation type
    Set for VEP annotation, which has all the crazy pipes
    Based on code from Konrad Karczewski
    '''
    gene_name = "."
    functional_class = "."
    # Raw string so that \w reaches the regex engine as a regex escape
    info_field = dict([(x.split('=', 1)) if '=' in x else (x, x) for x in re.split(r';(?=\w)', var_annotation)])
    if 'CSQ' not in info_field:
        return (gene_name, functional_class)
    # array with dictionaries containing the information
    annotations = [dict(zip(vep_field_names, x.split('|'))) for x in info_field['CSQ'].split(',') if len(vep_field_names) == len(x.split('|'))]
    # loop through and choose the canonical annotation
    # check that alternative allele matches
    for entry in annotations:
        if entry['Allele'] != alt_allele:
            continue
        if entry['CANONICAL'] != 'YES':
            continue
        gene_name = entry['SYMBOL']
        entry['major_consequence'] = worst_csq_from_csq(entry['Consequence'])
        functional_class = entry['major_consequence']
    # If there is no canonical transcript, return worst consequence
    # Code taken from loftee_utils.py by Konrad Karczewski
    if gene_name == '.':
        worst_annotation = worst_csq_with_vep(annotations)
        if worst_annotation is not None:
            gene_name = worst_annotation['SYMBOL']
            functional_class = worst_annotation['major_consequence']
    return (gene_name, functional_class)


def worst_csq_with_vep(annotation_list):
    """
    Takes list of VEP annotations [{'Consequence': 'frameshift', Feature: 'ENST'}, ...]
    Returns most severe annotation (as full VEP annotation [{'Consequence': 'frameshift', Feature: 'ENST'}])
    Also tacks on worst consequence for that annotation (i.e. worst_csq_from_csq)
    :param annotation_list:
    :return worst_annotation:
    """
    if len(annotation_list) == 0:
        return None
    worst = annotation_list[0]
    for annotation in annotation_list:
        if compare_two_consequences(annotation['Consequence'], worst['Consequence']) < 0:
            worst = annotation
        elif compare_two_consequences(annotation['Consequence'], worst['Consequence']) == 0 and annotation['CANONICAL'] == 'YES':
            worst = annotation
    worst['major_consequence'] = worst_csq_from_csq(worst['Consequence'])
    return worst


def compare_two_consequences(csq1, csq2):
    'From Konrad Karczewski'
    # A lower index in csq_order means a more severe consequence
    if csq_order_dict[worst_csq_from_csq(csq1)] < csq_order_dict[worst_csq_from_csq(csq2)]:
        return -1
    elif csq_order_dict[worst_csq_from_csq(csq1)] == csq_order_dict[worst_csq_from_csq(csq2)]:
        return 0
    return 1


def worst_csq_from_csq(csq):
    """
    Input possibly &-filled csq string (e.g. 'non_coding_exon_variant&nc_transcript_variant')
    Return the worst annotation (In this case, 'non_coding_exon_variant')
    :param csq:
    :return most_severe_consequence:
    From Konrad Karczewski
    """
    return rev_csq_order_dict[worst_csq_index(csq.split('&'))]


def worst_csq_index(csq_list):
    """
    Input list of consequences (e.g. ['frameshift_variant', 'missense_variant'])
    Return index of the worst annotation (In this case, index of 'frameshift_variant', so 4)
    Works well with csqs = 'non_coding_exon_variant&nc_transcript_variant' by worst_csq_index(csqs.split('&'))
    :param csq_list:
    :return most_severe_consequence_index:
    From Konrad Karczewski
    """
    return min([csq_order_dict[ann] for ann in csq_list])


def parent_AD_cuts(dad_AD_info, mom_AD_info, args):
    "Apply the AD filter to the parental data"
    # An AD of "." (Haplotype caller, 0/0 individuals) is treated as 0 alt reads
    if dad_AD_info == '.':
        dad_AD_ratio = 0.0
    else:
        dad_AD = dad_AD_info.split(',')
        if ((dad_AD[0] == '0') and (dad_AD[1] == '0')):
            return None
        dad_AD_ratio = float(dad_AD[1])/(float(dad_AD[0]) + float(dad_AD[1]))
    if mom_AD_info == '.':
        mom_AD_ratio = 0.0
    else:
        mom_AD = mom_AD_info.split(',')
        if ((mom_AD[0] == '0') and (mom_AD[1] == '0')):
            return None
        mom_AD_ratio = float(mom_AD[1])/(float(mom_AD[0]) + float(mom_AD[1]))
    if ((args.maxparentAB <= dad_AD_ratio) or
            (args.maxparentAB <= mom_AD_ratio)):
        return None
    return (dad_AD_ratio, mom_AD_ratio)


def load_esp_counts(esp_file, chrom):
    "Open the ESP counts file and save variants for a given chromosome"
    esp_counts = {}
    (count_ea, numchr_ea, af_ea, count_aa, numchr_aa, af_aa) = range(9, 15)
    with open(esp_file, 'r') as esp_data:
        for line in esp_data:
            line = line.split()
            if line[0] != chrom:
                continue
            # key format -- chr:pos:ref:alt
            chr_pos_change = '{0}:{1}:{2}:{3}'.format(line[0], line[1],
                                                      line[2], line[3])
            # value format -- frequency
            allele_count = float(line[count_ea]) + float(line[count_aa])
            chr_count = float(line[numchr_ea]) + float(line[numchr_aa])
            allele_freq = allele_count/chr_count
            esp_counts[chr_pos_change] = allele_freq
    return esp_counts


def determine_validation_likelihood(ref, alt, child_AD, annotation, p_dn, dp_ratio, child_dp):
    '''Determine the likelihood of a de novo variant validating (HIGH, MEDIUM,
    LOW) split by SNVs and indels'''
    qual_flag = 'None'
    # Assumes AC is the first entry of the annotation (INFO) field
    variant_AC = annotation.split(';')[0].split('=')[1]
    # Indels
    if (ref not in ('A', 'C', 'G', 'T')) or (alt not in ('A', 'C', 'G', 'T')):
        if (p_dn > 0.99) and (child_AD > 0.3) and (variant_AC == '1'):
            qual_flag = 'HIGH_indel'
        elif (p_dn > 0.5) and (child_AD > 0.3) and (float(variant_AC) <= 5):
            qual_flag = 'MEDIUM_indel'
        elif (p_dn > 0.05) and (child_AD > 0.2):
            qual_flag = 'LOW_indel'
    # SNVs
    else:
        if (p_dn > 0.99) and (child_AD > 0.3) and (dp_ratio > 0.2):
            qual_flag = 'HIGH_SNV'
        elif (p_dn > 0.99) and (child_AD > 0.3) and (variant_AC == '1'):
            qual_flag = 'HIGH_SNV'
        # Added to move some MEDIUM variants into HIGH
        elif (p_dn > 0.5) and (child_AD >= 0.3) and (float(variant_AC) < 10) and (float(child_dp) >= 10):
            qual_flag = 'HIGH_SNV'
        elif (p_dn > 0.5) and (child_AD > 0.3):
            qual_flag = 'MEDIUM_SNV'
        elif (p_dn > 0.5) and (child_AD > 0.2) and (variant_AC == '1'):
            qual_flag = 'MEDIUM_SNV'
        elif (p_dn > 0.05) and (child_AD > 0.2):
            qual_flag = 'LOW_SNV'
    return qual_flag


def get_variant_freq(esp_chr_counts, chr_pos_change, variant_annotation):
    "Determine the frequency of alternative alleles at the site"
    esp_freq = 0.0
    vcf_freq = 0.0
    # Determine allele frequency if found in ESP
    if chr_pos_change in esp_chr_counts.keys():
        #sys.stderr.write('Found in ESP: {0}\n'.format(chr_pos_change))
        esp_freq = esp_chr_counts[chr_pos_change]
    # Determine VCF allele frequency (divide AC-1 by AN)
    found_counter = 0
    all_annotation = variant_annotation.split(';')
    for entry in all_annotation:
        entry = entry.split('=')
        if entry[0] == 'AC':
            allele_count = float(entry[1])
            found_counter += 1
        elif entry[0] == 'AN':
            allele_num = float(entry[1])
            found_counter += 1
        if found_counter >= 2:
            break
    try:
        vcf_freq = (allele_count-1)/allele_num
    except UnboundLocalError:
        sys.exit('What is wrong: {0}\n'.format(variant_annotation))
    # If both esp_freq and vcf_freq are 0, f = 100/30Mbp
    if (esp_freq == 0.0) and (vcf_freq == 0.0):
        return (100.0/30000000)
    # Return the greater of the two allele frequencies
    if esp_freq < vcf_freq:
        return vcf_freq
    else:
        return esp_freq


def transform_PL_to_prob(PLs):
    "Take the list of PLs and transform into probability of observing variant"
    # PLs are weird and have to be adjusted like so:
    # P_ref = 10^(-PLref/10)/(10^(-PLref/10) + 10^(-PLhet/10) + 10^(-PLalt/10))
    adj_PL_ref = 10**(-float(PLs[0])/10)
    adj_PL_het = 10**(-float(PLs[1])/10)
    adj_PL_alt = 10**(-float(PLs[2])/10)
    sum_adj_PLs = adj_PL_ref + adj_PL_het + adj_PL_alt
    P_ref = adj_PL_ref/sum_adj_PLs
    P_het = adj_PL_het/sum_adj_PLs
    P_alt = adj_PL_alt/sum_adj_PLs
    return (P_ref, P_het, P_alt)


def get_prob_true_dn(child_PL, dad_PL, mom_PL, variant_pop_freq):
    '''Determine the relative probabilities of a true de novo vs missed het in
    parents'''
    # metric = p(de novo|data)/(p(de novo|data) + p(missed het in parent|data))
    # p(de novo|data) = P_dadref*P_momref*P_kidhet*(1/30Mb)
    # p(mhip|data) = ((P_dadhet*P_momref + P_momhet*P_dadref)*P_kidhet)
    #                * (1-(1-f)^4)
    # First step: transform all the PLs back into probabilities
    (child_P_ref, child_P_het, child_P_alt) = transform_PL_to_prob(child_PL)
    (dad_P_ref, dad_P_het, dad_P_alt) = transform_PL_to_prob(dad_PL)
    (mom_P_ref, mom_P_het, mom_P_alt) = transform_PL_to_prob(mom_PL)
    # Determine p(de novo | data) -- 1 in 30Mbp is what we expect for dn rate
    p_dn_data = dad_P_ref*mom_P_ref*child_P_het*(1.0/30000000)
    # Determine p(missed het in parent | data) -- split for clarity
    p_data_onehet = (dad_P_het*mom_P_ref + dad_P_ref*mom_P_het)*child_P_het
    p_oneparent_het = 1-((1-variant_pop_freq)**4)
    p_mhip_data = p_data_onehet*p_oneparent_het
    # Determine the new metric
    metric = p_dn_data/(p_dn_data + p_mhip_data)
    return metric


def process_autosome_variant(line, Fam, PL_pos, AD_pos, DP_pos, args,
                             esp_chr_counts, labels, chrom_under_study,
                             am_kid, who_dad, who_mom, vep_field_names):
    "Go through the VCF line by line to find de novo variants"
    for column, entry in enumerate(line):
        if not entry.startswith(('0/1', '1/0')):
            continue
        # If a het site has been found, check if the het is a child
        if am_kid[column-9] == 'N':
            continue
        else:
            dad_pos = who_dad[column-9]
            mom_pos = who_mom[column-9]
        # Make sure the het variant passes the quality filters
        child_data = child_cuts(entry.split(':'), PL_pos, AD_pos, args)
        if child_data is None:
            continue
        else:
            (child_PL, child_AD_ratio) = child_data
        # Check that the parents are both homozygous reference
        if not line[dad_pos].startswith('0/0'):
            continue
        if not line[mom_pos].startswith('0/0'):
            continue
        dad_record = line[dad_pos].split(':')
        mom_record = line[mom_pos].split(':')
        dad_PL = dad_record[PL_pos].split(',')
        mom_PL = mom_record[PL_pos].split(',')
        dad_AD_info = dad_record[AD_pos]
        mom_AD_info = mom_record[AD_pos]
        # Make sure that both parent genotypes pass AD and DP filters
        parent_AD_ratios = parent_AD_cuts(dad_AD_info, mom_AD_info, args)
        if parent_AD_ratios is None:
            continue
        else:
            (dad_AD_ratio, mom_AD_ratio) = parent_AD_ratios
        child_dp = entry.split(':')[DP_pos]
        DP = [child_dp,
              dad_record[DP_pos],
              mom_record[DP_pos]]
        dp_ratio = DPcheck(DP, args)
        if dp_ratio is None:
            continue
        # Start of the new block with allele frequencies
        variant_position = line[1]
        ref_allele = line[3]
        alt_allele = line[4]
        variant_annotation = line[7]
        chr_pos_change = '{0}:{1}:{2}:{3}'.format(chrom_under_study,
                                                  variant_position,
                                                  ref_allele, alt_allele)
        # Find the population frequency of the variant
        # Max of ESP frequency and frequency in the VCF
        variant_pop_freq = get_variant_freq(esp_chr_counts,
                                            chr_pos_change,
                                            variant_annotation)
        if variant_pop_freq == 0:
            variant_pop_freq = (100.0/30000000)
            # Rough expected number of het sites not in ESP
        # Establish the chance that this is a true de novo event
        # using PL scores and variant population frequency
        prob_true_dn = get_prob_true_dn(child_PL, dad_PL, mom_PL,
                                        variant_pop_freq)
        if prob_true_dn < args.pdnmetric:
            continue
        # Extract child sex and affected status
        (dad_col, mom_col, child_sex, child_aff_status) = Fam[column]
        qual_flag = determine_validation_likelihood(ref_allele, alt_allele,
                                                    child_AD_ratio,
                                                    variant_annotation,
                                                    prob_true_dn, dp_ratio,
                                                    child_dp)
        if args.annotatevar:
            (var_gene, var_category) = snpeff_annotate(variant_annotation)
            res_indiv = [
                line[0], line[1], line[2], line[3], line[4],
                labels[column], labels[dad_pos], labels[mom_pos],
                child_sex, child_aff_status, child_PL[0], dad_PL[1],
                mom_PL[1], child_AD_ratio, dad_AD_ratio, mom_AD_ratio,
                DP[0], DP[1], DP[2], dp_ratio, prob_true_dn, var_gene,
                var_category, qual_flag, variant_annotation
            ]
        elif args.annotatevar_VEP:
            (var_gene, var_category) = VEP_annotate(variant_annotation, vep_field_names, line[4])  # alt allele also provided
            res_indiv = [
                line[0], line[1], line[2], line[3], line[4],
                labels[column], labels[dad_pos], labels[mom_pos],
                child_sex, child_aff_status, child_PL[0], dad_PL[1],
                mom_PL[1], child_AD_ratio, dad_AD_ratio, mom_AD_ratio,
                DP[0], DP[1], DP[2], dp_ratio, prob_true_dn, var_gene,
                var_category, qual_flag, variant_annotation
            ]
        else:
            res_indiv = [
                line[0], line[1], line[2], line[3], line[4],
                labels[column], labels[dad_pos], labels[mom_pos],
                child_sex, child_aff_status, child_PL[0], dad_PL[1],
                mom_PL[1], child_AD_ratio, dad_AD_ratio, mom_AD_ratio,
                DP[0], DP[1], DP[2], dp_ratio, prob_true_dn, qual_flag,
                line[7]
            ]
        print('\t'.join(map(str, res_indiv)))


def process_multi_variant(line, Fam, PL_pos, AD_pos, DP_pos, args,
                          esp_chr_counts, labels, chrom_under_study,
                          am_kid, who_dad, who_mom, vep_field_names):
    '''Go through the VCF line by line to find de novo variants on lines with
    multiple alt alleles'''
    for column, entry in enumerate(line):
        if not entry.startswith(('0/1', '0/2')):
            continue
        # If a het site has been found, check if the het is a child
        if am_kid[column-9] == 'N':
            continue
        else:
            dad_pos = who_dad[column-9]
            mom_pos = who_mom[column-9]
        # Check that the parents are both homozygous reference -- logic moved
        if not line[dad_pos].startswith('0/0'):
            continue
        if not line[mom_pos].startswith('0/0'):
            continue
        dad_record = line[dad_pos].split(':')
        mom_record = line[mom_pos].split(':')
        # Test if the PL is there
        try:
            d_PL = dad_record[PL_pos].split(',')
        except IndexError:
            continue
        # Replacing the PL and AD information for 0/2
        # Before 0/2 : R,A1,A2 : DP : GQ : RR,RA1,A1A1,RA2,A1A2,A2A2
        # After 0/2 : R,A2 : DP : GQ : RR,RA2,A2A2
        if entry.startswith('0/2'):
            # Fix child
            new_entry = entry.split(':')
            k_AD = new_entry[AD_pos].split(',')
            new_entry[AD_pos] = ','.join([k_AD[0], k_AD[2]])
            k_PL = new_entry[PL_pos].split(',')
            new_entry[PL_pos] = ','.join([k_PL[0], k_PL[3], k_PL[5]])
            entry = ':'.join(new_entry)
            # Fix dad
            if dad_record[AD_pos] == '.':
                dad_record[AD_pos] = '{0},0'.format(dad_record[DP_pos])
            else:
                d_AD = dad_record[AD_pos].split(',')
                dad_record[AD_pos] = ','.join([d_AD[0], d_AD[2]])
            d_PL = dad_record[PL_pos].split(',')
            dad_record[PL_pos] = ','.join([d_PL[0], d_PL[3], d_PL[5]])
            # Fix mom
            if mom_record[AD_pos] == '.':
                mom_record[AD_pos] = '{0},0'.format(mom_record[DP_pos])
            else:
                m_AD = mom_record[AD_pos].split(',')
                mom_record[AD_pos] = ','.join([m_AD[0], m_AD[2]])
            m_PL = mom_record[PL_pos].split(',')
            mom_record[PL_pos] = ','.join([m_PL[0], m_PL[3], m_PL[5]])
        # Make sure the het variant passes the quality filters
        child_data = child_cuts(entry.split(':'), PL_pos, AD_pos, args)
        if child_data is None:
            continue
        else:
            (child_PL, child_AD_ratio) = child_data
        dad_PL = dad_record[PL_pos].split(',')
        mom_PL = mom_record[PL_pos].split(',')
        dad_AD_info = dad_record[AD_pos]
        mom_AD_info = mom_record[AD_pos]
        # Make sure that both parent genotypes pass AD and DP filters
        parent_AD_ratios = parent_AD_cuts(dad_AD_info, mom_AD_info, args)
        if parent_AD_ratios is None:
            continue
        else:
            (dad_AD_ratio, mom_AD_ratio) = parent_AD_ratios
        child_dp = entry.split(':')[DP_pos]
        DP = [child_dp,
              dad_record[DP_pos],
              mom_record[DP_pos]]
        dp_ratio = DPcheck(DP, args)
        if dp_ratio is None:
            continue
        # Start of the new block with allele frequencies
        variant_position = line[1]
        ref_allele = line[3]
        alt_alleles = line[4].split(',')
        variant_annotation_s = line[7].split(';')
        v_AC = variant_annotation_s[0].split(',')
        # AC should be the 1st position of annotation, "AC=1,2" -> "AC=1" and "2"
        if entry.startswith('0/2'):
            alt_allele = alt_alleles[1]
            variant_annotation_s[0] = 'AC=' + v_AC[1]  # to make "AC=2"
        else:
            alt_allele = alt_alleles[0]
            variant_annotation_s[0] = v_AC[0]  # should be "AC=1"
        variant_annotation = ';'.join(variant_annotation_s)
        chr_pos_change = '{0}:{1}:{2}:{3}'.format(chrom_under_study,
                                                  variant_position,
                                                  ref_allele, alt_allele)
        # Find the population frequency of the variant
        # Max of ESP frequency and frequency in the VCF
        variant_pop_freq = get_variant_freq(esp_chr_counts,
                                            chr_pos_change,
                                            variant_annotation)
        if variant_pop_freq == 0:
            variant_pop_freq = (100.0/30000000)
            # Rough expected number of het sites not in ESP
        # Establish the chance that this is a true de novo event
        # using PL scores and variant population frequency
        prob_true_dn = get_prob_true_dn(child_PL, dad_PL, mom_PL,
                                        variant_pop_freq)
        if prob_true_dn < args.pdnmetric:
            continue
        # Extract child sex and affected status
        (dad_col, mom_col, child_sex, child_aff_status) = Fam[column]
        qual_flag = determine_validation_likelihood(ref_allele, alt_allele,
                                                    child_AD_ratio,
                                                    variant_annotation, prob_true_dn,
                                                    dp_ratio, child_dp)
        if args.annotatevar:
            (var_gene, var_category) = snpeff_annotate(variant_annotation)
            if ',' in var_gene:
                var_genes = var_gene.split(',')
                if entry.startswith('0/2'):
                    var_gene = var_genes[1]
                else:
                    var_gene = var_genes[0]
            elif ',' in var_category:
                var_cats = var_category.split(',')
                if entry.startswith('0/2'):
                    var_category = var_cats[1]
                else:
                    var_category = var_cats[0]
            res_indiv = [
                line[0], line[1], line[2], line[3], alt_allele,
                labels[column], labels[dad_pos], labels[mom_pos],
                child_sex, child_aff_status, child_PL[0], dad_PL[1],
                mom_PL[1], child_AD_ratio, dad_AD_ratio, mom_AD_ratio,
                DP[0], DP[1], DP[2], dp_ratio, prob_true_dn, var_gene,
                var_category, qual_flag, line[7]
            ]
        elif args.annotatevar_VEP:
            (var_gene, var_category) = VEP_annotate(variant_annotation, vep_field_names, alt_allele)  # alt allele also provided
            if ',' in var_gene:
                var_genes = var_gene.split(',')
                if entry.startswith('0/2'):
                    var_gene = var_genes[1]
                else:
                    var_gene = var_genes[0]
            elif ',' in var_category:
                var_cats = var_category.split(',')
                if entry.startswith('0/2'):
                    var_category = var_cats[1]
                else:
                    var_category = var_cats[0]
            res_indiv = [
                line[0], line[1], line[2], line[3], alt_allele,
                labels[column], labels[dad_pos], labels[mom_pos],
                child_sex, child_aff_status, child_PL[0], dad_PL[1],
                mom_PL[1], child_AD_ratio, dad_AD_ratio, mom_AD_ratio,
                DP[0], DP[1], DP[2], dp_ratio, prob_true_dn, var_gene,
                var_category, qual_flag, line[7]
            ]
        else:
            res_indiv = [
                line[0], line[1], line[2], line[3], alt_allele,
                labels[column], labels[dad_pos], labels[mom_pos],
                child_sex, child_aff_status, child_PL[0], dad_PL[1],
                mom_PL[1], child_AD_ratio, dad_AD_ratio, mom_AD_ratio,
                DP[0], DP[1], DP[2], dp_ratio, prob_true_dn, qual_flag,
                line[7]
            ]
        print('\t'.join(map(str, res_indiv)))


def process_hemizygous_variants(line, Fam, PL_pos, AD_pos, DP_pos, args,
                                esp_chr_counts, labels, chrom_under_study,
                                parent, gender_kid, who_parent, vep_field_names):
    "Look for de novo variants when the chromosome is hemizygous"
    for column, entry in enumerate(line):
        if not entry.startswith('1/1'):
            continue
        # Check if the column is a kid's
        if gender_kid[column-9] == 'N':
            continue
        else:
            par_pos = who_parent[column-9]
        # Only keep lines where the parent is homozygous reference
        if not line[par_pos].startswith('0/0'):
            continue
        child_record = entry.split(':')
        par_record = line[par_pos].split(':')
        # Expect child to be homozygous alternative
        child_PL = child_record[PL_pos].split(',')
        if child_PL[2] != '0':
            continue
        if int(child_PL[1]) <= args.thresh:
            continue
        child_AD = child_record[AD_pos].split(',')
        if ((child_AD[0] == '0') and (child_AD[1] == '0')):
            continue
        child_AD_ratio = float(child_AD[1]) / (float(child_AD[0]) +
                                               float(child_AD[1]))
        if child_AD_ratio <= 0.95:
            continue
        # Check that parent's reads match homozygous reference
        if par_record[AD_pos] == '.':
            par_AD_ratio = 0.0
        else:
            par_AD = par_record[AD_pos].split(',')
            if ((par_AD[0] == '0') and (par_AD[1] == '0')):
                continue
            par_AD_ratio = float(par_AD[1]) / (float(par_AD[0]) +
                                               float(par_AD[1]))
        if par_AD_ratio >= 0.05:
            continue
        par_PL = par_record[PL_pos].split(',')
        # Depth filter
        child_dp = child_record[DP_pos]
        dp_ratio = float(child_dp)/float(par_record[DP_pos])
        percent_depthratio = args.depthratio/100.0
        if dp_ratio <= percent_depthratio:
            continue
        # Find variant information
        variant_position = line[1]
        ref_allele = line[3]
        alt_allele = line[4]
        variant_annotation = line[7]
        chr_pos_change = '{0}:{1}:{2}:{3}'.format(chrom_under_study,
                                                  variant_position,
                                                  ref_allele, alt_allele)
        # Establish the chance that this is a true de novo event
        # using PL scores and variant population frequency
        # First step: transform all the PLs back into probabilities
        (child_P_ref, child_P_het, child_P_alt) = transform_PL_to_prob(child_PL)
        (par_P_ref, par_P_het, par_P_alt) = transform_PL_to_prob(par_PL)
        # Determine p(de novo | data) and part of p(missed alt | data)
        p_dn_data = par_P_ref*child_P_alt*(1.0/30000000)
        p_data_missedcall = (par_P_het + par_P_alt)*child_P_alt
        # Find the population frequency of the variant