Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option -t / --targets is not fit for purpose: misses events starting before the region #1421

Closed
jmarshall opened this issue Feb 18, 2021 · 3 comments

Comments

@jmarshall
Copy link
Member

jmarshall commented Feb 18, 2021

Consider events.vcf (this is the same file as in #1420), which begins with the following VCF records:

chr1	90	outsv1	A	ATGC	.	BEFORE	END=99
chr1	90	sv1	A	ATGC	.	PASS	END=100
chr1	90	sv2	A	ATGC	.	PASS	END=110
chr1	95	d1	AATGCATGCATGCATGCATGC	A	.	PASS	.
chr1	98	outd1	AT	A	.	BEFORE	.
chr1	98	d2	ATT	A	.	PASS	.
chr1	98	outi1	A	AT	.	BEFORE	.
chr1	99	outs1	A	T	.	BEFORE	.
chr1	99	d3	AT	A	.	PASS	.
chr1	99	outi2	A	AT	.	BEFORE	.
chr1	99	outi3	A	ATTTT	.	BEFORE	.
chr1	100	s1	A	T	.	PASS	.
chr1	100	d4	AT	A	.	PASS	.
chr1	100	i1	A	ATTTT	.	PASS	.
… …

If we query this for chr1:100-200, we would expect to receive the PASS records but not the BEFORE records. In particular:

  • sv1 and sv2 have END fields indicating that they extend into the specified region;
  • d1 affects bases at positions 100–115, and d2 and d3 affect the base at position 100.

and indeed these records are produced by a bcftools view -r chr1:100-200 query.

However bcftools's non-index-based -t/--targets/-T/--targets-file query (both as released and on develop) fails to return these records:

$ bcftools view -t chr1:100-200 events.vcf
##fileformat=VCF4.3
… …
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	100	s1	A	T	.	PASS	.
chr1	100	d4	AT	A	.	PASS	.
chr1	100	i1	A	ATTTT	.	PASS	.
chr1	101	s2	A	T	.	PASS	.
chr1	101	i2	A	AT	.	PASS	.
chr1	198	i3	A	ATTTTT	.	PASS	.
chr1	199	s3	A	T	.	PASS	.
chr1	199	d5f	AT	A	.	PASS	.
chr1	199	i4f	A	ATTTT	.	PASS	.
chr1	200	s4f	A	T	.	PASS	.
chr1	200	outd2	AT	A	.	AFTER	.
chr1	200	outi4	A	AT	.	AFTER	.

This behaviour has been documented for a long time (105c41e) and the manual page currently says

-t, --targets [^]chr|chr:pos|chr:from-to|chr:from-[,...]

Yet another difference between the -t/-T and -r/-R [options] is that -r/-R checks for proper overlaps and considers both POS and the end position of an indel, while -t/-T considers the POS coordinate only.

Nonetheless I for one was not aware of this behaviour before today, and I suspect many other bcftools users will be similarly unaware of it. Regardless of this sentence in the documentation, I believe most users would expect the -t option to correctly output records that overlap the specified region(s) (i.e., to produce the same results as -r only more slowly) and will be disappointed to find that it silently omits some records that they would expect to have in their output. IMNSHO this means that ‑t/‑‑targets is not fit for purpose as it does not produce scientifically meaningful results.


It also incorrectly returns outd2 and outi4, which are just after the region of interest, as in #1420. This is arguably a less serious problem than omitting events that should be included.

@pd3
Copy link
Member

pd3 commented Feb 18, 2021

This has been brought up many times. I understand your frustration. However, sometimes it is actually desirable to include records starting in the regions, not overlapping it, still arriving at scientifically meaningful results. This has been a contentious issue from the beginning and has been reconsidered many times with the conclusion that backward compatibility should be maintained.

Regarding the last two record, although the VCF record starts at position 200, the affected sequence starts after the position 200, the returned results are valid.

@jmarshall
Copy link
Member Author

This has been brought up many times.

Do you have links to other times it has been brought up? I find only #14, which presumably is what led to the sentence in the manual page documentation.

Note for example the bcftools view usage:

    -r, --regions <region>              restrict to comma-separated list of regions
    -R, --regions-file <file>           restrict to regions listed in a file
    -t, --targets [^]<region>           similar to -r but streams rather than index-jumps. Exclude regions with "^" prefix
    -T, --targets-file [^]<file>        similar to -R but streams rather than index-jumps. Exclude regions with "^" prefix

which gives no indication that -r and -t in fact produce entirely different results, apparently by design.

However, sometimes it is actually desirable to include records starting in the regions, not overlapping it, […] backward compatibility should be maintained.

I am all for backward compatibility. I am also all for tools providing facilities to answer the queries that users will ask of them. I will even accept the claim that “record starts within specified regions” and “record overlaps with specified regions” are both scientifically useful queries, though I would suggest that the latter is the more commonly desired.

Have you considered adding options (e.g., --targets-overlap REGION, or a flag to change the behaviour of the existing -t) so that either desired query could be made in a streaming context? This would add flexibility and also provide documentation as to the two behaviours.

(Yes, I could reformulate my pipeline to use -r instead of -t. However I'm starting from a small uncompressed text VCF file, and if I had to compress and index it in order to use -r, I might as well just write the filtering myself in Perl instead.)

pd3 added a commit to pd3/htslib that referenced this issue Sep 7, 2021
This is to address a long-standing design flaw in handling regions and targets,
as described in these BCFtools issues:
    samtools/bcftools#1420
    samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of behaviors / options for resctricting VCF/BCF files by region, one
is for streaming (`-t/-T`) and one for index-gumping (`-r/-R`). They behave differently, the first includes
only records with POS coordinate within the regions, the other includes overlapping regions. This allows
to modify the default behavior and provides three options:

- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the
  difference between `TC>T-` and `C>-`

Most importantly, this allows to make the regions and targets behave the same way.

Note that the default behavior remains unchanged.
pd3 added a commit that referenced this issue Sep 7, 2021
This is to address a long-standing design flaw in handling regions and targets. BCFtools (and HTSlib)
recognize two sets of behaviors / options for resctricting VCF/BCF files by region, one is for streaming
(`-t/-T`) and one for index-gumping (`-r/-R`). They behave differently, the first includes only records with
POS coordinate within the regions, the other includes overlapping regions. This allows to modify the default
behavior and provides three options:

- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the
  difference between `TC>T-` and `C>-`

Most importantly, this allows to make the regions and targets behave the same way.

Note that the default behavior remains unchanged.

After samtools/htslib#1327 is merged, this commit resolves #1420 and #1421
pd3 added a commit to pd3/htslib that referenced this issue Sep 8, 2021
This is to address a long-standing design flaw in handling regions and targets,
as described in these BCFtools issues:
    samtools/bcftools#1420
    samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of behaviors / options for resctricting VCF/BCF files by region, one
is for streaming (`-t/-T`) and one for index-gumping (`-r/-R`). They behave differently, the first includes
only records with POS coordinate within the regions, the other includes overlapping regions. This allows
to modify the default behavior and provides three options:

- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the
  difference between `TC>T-` and `C>-`

Most importantly, this allows to make the regions and targets behave the same way.

Note that the default behavior remains unchanged.
pd3 added a commit to pd3/htslib that referenced this issue Sep 8, 2021
This is to address a long-standing design flaw in handling regions and targets,
as described in these BCFtools issues:
    samtools/bcftools#1420
    samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of behaviors / options for resctricting VCF/BCF files by region, one
is for streaming (`-t/-T`) and one for index-gumping (`-r/-R`). They behave differently, the first includes
only records with POS coordinate within the regions, the other includes overlapping regions. This allows
to modify the default behavior and provides three options:

- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the
  difference between `TC>T-` and `C>-`

Most importantly, this allows to make the regions and targets behave the same way.

Note that the default behavior remains unchanged.
pd3 added a commit to pd3/htslib that referenced this issue Sep 8, 2021
This is to address a long-standing design flaw in handling regions and targets,
as described in these BCFtools issues:
    samtools/bcftools#1420
    samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of behaviors / options for resctricting VCF/BCF files by region, one
is for streaming (`-t/-T`) and one for index-gumping (`-r/-R`). They behave differently, the first includes
only records with POS coordinate within the regions, the other includes overlapping regions. This allows
to modify the default behavior and provides three options:

- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the
  difference between `TC>T-` and `C>-`

Most importantly, this allows to make the regions and targets behave the same way.

Note that the default behavior remains unchanged.
whitwham pushed a commit to samtools/htslib that referenced this issue Sep 9, 2021
This is to address a long-standing design flaw in handling regions and targets,
as described in these BCFtools issues:
    samtools/bcftools#1420
    samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of behaviors / options for resctricting VCF/BCF files by region, one
is for streaming (`-t/-T`) and one for index-gumping (`-r/-R`). They behave differently, the first includes
only records with POS coordinate within the regions, the other includes overlapping regions. This allows
to modify the default behavior and provides three options:

- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the
  difference between `TC>T-` and `C>-`

Most importantly, this allows to make the regions and targets behave the same way.

Note that the default behavior remains unchanged.
@jkbonfield
Copy link
Contributor

Given samtools/htslib#1327 is now merged and the corresponding bcftools commits are in (0d04159) we believe this to be fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants