Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unclear criteria for when merge sets genotypes missing vs ref homoz #1891

Closed
jcm6t opened this issue Mar 23, 2023 · 5 comments
Closed

Unclear criteria for when merge sets genotypes missing vs ref homoz #1891

jcm6t opened this issue Mar 23, 2023 · 5 comments

Comments

@jcm6t
Copy link

jcm6t commented Mar 23, 2023

Thanks for supporting these tools.

bcftools 1.16, file formats are VCF 4.2.

I have merged 1379 gcvf files using the following pipeline:

$ bcftools merge --merge none -f PASS,. --gvcf $REFGENOME -i DP:sum,MQ:avg,DQUAL:min -r chr22  --file-list $gvcflistfile -Ou | \
$ bcftools norm -m -any  -Ou | \
$ bcftools view  -e 'ALT="<NON_REF>"' -Ou | \
$ bcftools annotate -x FORMAT $THREADSOPT -Ob > $outfile

These are the top few variants in the chr22 run and first 2 samples:

POS	REF	ALT	QUAL	FILTER	FORMAT	77900074	77900324
10510105	T	A	8.33	PASS	GT	./.	./.
10510113	C	T	17.93	PASS	GT	./.	./.
10510212	A	T	13.59	PASS	GT	./.	./.
10510258	C	T	41.37	PASS	GT	./.	./.
10510341	C	G	6.8	PASS	GT	./.	0/0
10510352	AT	A	15.8	PASS	GT	./.	./.
10510352	ATAAT	A	10.95	PASS	GT	./.	./.
10510355	AT	A	12.48	PASS	GT	./.	./.
10510356	T	A	46.64	PASS	GT	./.	1/1

if we focus on the 2nd sample 77900324, we see that first 4 GTs are missing ./. while the 5th at 10510341 is called 0/0. I'm trying to rationalize why. One of the key issues in multiplex gvcf merging is to utilize confidence in deciding between ./. and 0/0 for rare variants at positions that were not variable in the individual gvcfs, rather than assuming they are all 0/0 because they weren't variable and therefore are more likely to ref homozyg. Note also that I have chosen this example carefully to remove the multi-allelic issues - the positions I am referring to were unique and were SNV biallelic (but I note the split alleles at 10510352).

The block for the first merged variant at 10510105 in the individual 77900324 gvcf is:

chr22	10510083	.	A	<NON_REF>	.	PASS	END=10510105	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:6,0:6:12:4:0,12,141:0,12,141:6,0

ie, this PASSED in the individual gcvf with a GQ of 12, 0/0. Similarly the next 3.

The first called homozygous REF variant in this sample was at 10510341 corresponding to this block:

chr22	10510341	.	C	<NON_REF>	.	PASS	END=10510341	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:23,1:24:57:24:0,57,933:0,57,255:22,0

This PASSED in the gvcf with a GQ of 57, 0/0.

I would have expected all of the top 5 variants to be called 0/0 in the merged bcf file. Is merge applying hidden quality/depth threshold ?

Thanks for your insights.

(I have reproduced the first lines of the gvcf for the sample below for reference:)

chr22	10510000	.	N	<NON_REF>	.	LowDepth;LowGQ	END=10510000	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	./.:0,1:1:0:1:0,0,0:42,3,0:0,0
chr22	10510001	.	G	<NON_REF>	.	LowDepth	END=10510037	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:1,0:1:3:1:0,3,15:0,3,15:1,0
chr22	10510038	.	T	<NON_REF>	.	LowDepth	END=10510082	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:1,0:1:3:1:0,3,15:0,3,15:3,0
chr22	10510083	.	A	<NON_REF>	.	PASS	END=10510105	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:6,0:6:12:4:0,12,141:0,12,141:6,0
chr22	10510106	.	T	<NON_REF>	.	PASS	END=10510117	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:8,0:8:21:7:0,21,254:0,21,254:9,0
chr22	10510118	.	A	<NON_REF>	.	PASS	END=10510175	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:12,0:12:30:10:0,30,391:0,30,255:13,0
chr22	10510176	.	T	<NON_REF>	.	PASS	END=10510249	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:16,0:16:42:14:0,42,552:0,42,255:19,0
chr22	10510250	.	C	<NON_REF>	.	PASS	END=10510250	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:16,1:17:36:17:0,36,641:0,36,255:14,0
chr22	10510251	.	T	<NON_REF>	.	PASS	END=10510257	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:17,0:17:42:16:0,42,630:0,48,255:15,0
chr22	10510258	.	C	<NON_REF>	.	PASS	END=10510262	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:16,0:16:39:15:0,39,585:0,45,255:13,0
chr22	10510263	.	A	<NON_REF>	.	PASS	END=10510265	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:16,0:16:42:16:0,42,630:0,48,255:14,0
chr22	10510266	.	T	<NON_REF>	.	PASS	END=10510266	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:14,2:16:17:16:0,17,542:0,17,255:12,0
chr22	10510267	.	T	<NON_REF>	.	PASS	END=10510267	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:16,0:16:42:16:0,42,630:0,48,255:14,0
chr22	10510268	.	G	<NON_REF>	.	PASS	END=10510275	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:14,0:14:39:14:0,39,585:0,42,255:13,0
chr22	10510276	.	G	<NON_REF>	.	PASS	END=10510303	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:17,0:17:45:17:0,45,675:0,51,255:19,0
chr22	10510304	.	G	<NON_REF>	.	PASS	END=10510306	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:20,0:20:60:20:0,60,823:0,60,255:20,0
chr22	10510307	.	G	<NON_REF>	.	PASS	END=10510311	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:20,0:20:57:20:0,57,855:0,60,255:19,0
chr22	10510312	.	C	<NON_REF>	.	PASS	END=10510314	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:21,0:21:60:21:0,60,900:0,63,255:20,0
chr22	10510315	.	G	<NON_REF>	.	PASS	END=10510318	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:21,0:21:48:20:0,48,782:0,48,255:19,0
chr22	10510319	.	G	<NON_REF>	.	PASS	END=10510319	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:21,0:21:60:21:0,60,900:0,63,255:20,0
chr22	10510320	.	G	<NON_REF>	.	PASS	END=10510327	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:21,0:21:57:20:0,57,855:0,60,255:19,0
chr22	10510328	.	G	<NON_REF>	.	PASS	END=10510330	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:21,0:21:60:21:0,60,900:0,63,255:20,0
chr22	10510331	.	G	<NON_REF>	.	PASS	END=10510331	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:19,1:20:45:20:0,45,752:0,45,255:19,0
chr22	10510332	.	A	<NON_REF>	.	PASS	END=10510334	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:21,0:21:60:20:0,60,797:0,60,255:22,0
chr22	10510335	.	C	<NON_REF>	.	PASS	END=10510335	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:21,1:22:51:22:0,51,862:0,51,255:21,0
chr22	10510336	.	C	<NON_REF>	.	PASS	END=10510340	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:23,0:23:66:23:0,66,937:0,69,255:24,0
chr22	10510341	.	C	<NON_REF>	.	PASS	END=10510341	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:23,1:24:57:24:0,57,933:0,57,255:22,0
chr22	10510342	.	T	<NON_REF>	.	PASS	END=10510344	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:24,0:24:60:24:0,60,900:0,72,255:20,0
chr22	10510345	.	A	<NON_REF>	.	PASS	END=10510345	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:23,0:23:57:23:0,57,855:0,69,255:19,0
@pd3
Copy link
Member

pd3 commented Mar 24, 2023

There is no hidden filtering, so probably you hit a rare bug. Merging of 1379 gVCF files is complex, hard to tell without seeing the data. Can you provide a small reproducible test case please?

@jcm6t
Copy link
Author

jcm6t commented Mar 24, 2023

Petr,
Thanks. I will try but before doing so, I wanted to check that my understanding of what should happen is correct and you seem to be affirming that. In general, I do think there is an increasing need for a performant tool to do very large scale merging of gvcfs/vcfs and I was hopeful that bcftools merge could be the one, not having used bcftools for this purpose before. There's GATK and that's fine if you are using their pipeline but because they apply additional logic in population calling, that may not be appropriate if you generate the gvcfs from a different starting pipeline. There are multiple other options for calling variants in WGS and generating gvcfs now including other bcftools, DRAGEN, etc. GLNEXUS doesn't seem to be actively supported and there are multiple outstanding issues with it.

You have probably already thought this through but I would say the situations that arise are:
PASS ./.
PASS 0/0
PASS 0/1 or 1/1 ie called variant in single sample
and the corresponding NON-PASS situations.

I also think this is sufficiently complex that a section in the more detailed bcftools web documentation would be useful. I looked there for an explanation and there is nothing explaining the merge logic and algorithm. However I do realize that documentation is not always top of the list ;-)

@jcm6t
Copy link
Author

jcm6t commented Mar 27, 2023 via email

@pd3
Copy link
Member

pd3 commented Mar 28, 2023

@jcm6t Your github profile does not show an email address, I listed mine on the profile page or you check the README in the bcftools distribution. Thanks

@pd3 pd3 closed this as completed in 60c65eb Apr 4, 2023
@pd3
Copy link
Member

pd3 commented Apr 4, 2023

Thank you for the test case, it was very helpful. The problem should be now fixed, please try it out. As this was a bigger change, please check the outputs carefully and let me know if you spot anything odd.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants