Option -t / --targets is not fit for purpose: misses events starting before the region #1421
Comments
This has been brought up many times. I understand your frustration. However, it is sometimes actually desirable to include records starting in the regions rather than overlapping them, while still arriving at scientifically meaningful results. This has been a contentious issue from the beginning and has been reconsidered many times, with the conclusion that backward compatibility should be maintained. Regarding the last two records: although the affected sequence starts after position 200, the VCF record itself starts at position 200, so the returned results are valid.
Do you have links to other times it has been brought up? I find only #14, which presumably is what led to the sentence in the manual page documentation. Note for example the
which gives no indication that
I am all for backward compatibility. I am also all for tools providing facilities to answer the queries that users will ask of them. I will even accept the claim that “record starts within specified regions” and “record overlaps with specified regions” are both scientifically useful queries, though I would suggest that the latter is the more commonly desired. Have you considered adding options (e.g., (Yes, I could reformulate my pipeline to use
This is to address a long-standing design flaw in handling regions and targets, as described in these BCFtools issues:
samtools/bcftools#1420
samtools/bcftools#1421

HTSlib (and BCFtools) recognize two sets of behaviors / options for restricting VCF/BCF files by region: one for streaming (`-t/-T`) and one for index-jumping (`-r/-R`). They behave differently: the first includes only records whose POS coordinate lies within the regions, while the other includes records that overlap the regions.

This change makes it possible to modify the default behavior and provides three options:
- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the difference between `TC>T-` and `C>-`

Most importantly, this makes it possible for regions and targets to behave the same way. Note that the default behavior remains unchanged.
This is to address a long-standing design flaw in handling regions and targets.

BCFtools (and HTSlib) recognize two sets of behaviors / options for restricting VCF/BCF files by region: one for streaming (`-t/-T`) and one for index-jumping (`-r/-R`). They behave differently: the first includes only records whose POS coordinate lies within the regions, while the other includes records that overlap the regions.

This change makes it possible to modify the default behavior and provides three options:
- Include only records with POS starting in the regions/targets
- Include VCF records that overlap regions/targets, even if POS itself is outside the regions
- Include only VCF records where the true variation overlaps regions/targets, e.g. consider the difference between `TC>T-` and `C>-`

Most importantly, this makes it possible for regions and targets to behave the same way. Note that the default behavior remains unchanged.

After samtools/htslib#1327 is merged, this commit resolves #1420 and #1421
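The three inclusion rules listed in the commit message can be illustrated with a small Python sketch. This is not the HTSlib or BCFtools implementation: the `Record` class, the helper functions and the mode names `"pos"`, `"record"` and `"variant"` are all hypothetical, chosen only for this illustration. It assumes 1-based inclusive coordinates, a single ALT allele, trimming of shared leading bases only, and (as a further assumption) that a pure insertion is placed on the base after its anchor.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Record:
    pos: int                   # 1-based POS column
    ref: str                   # REF allele
    alt: str                   # single ALT allele (multi-allelic sites not handled)
    end: Optional[int] = None  # INFO/END, if the record carries one


def record_span(rec: Record) -> Tuple[int, int]:
    """Reference interval covered by the record as written."""
    end = rec.end if rec.end is not None else rec.pos + len(rec.ref) - 1
    return rec.pos, end


def variant_span(rec: Record) -> Tuple[int, int]:
    """Interval of the bases that actually change (shared leading bases trimmed)."""
    if rec.end is not None or rec.alt.startswith("<"):
        return record_span(rec)  # symbolic allele: nothing to trim
    shared = 0
    while (shared < min(len(rec.ref), len(rec.alt))
           and rec.ref[shared] == rec.alt[shared]):
        shared += 1
    start = rec.pos + shared
    end = rec.pos + len(rec.ref) - 1
    # Assumption: a pure insertion is placed on the base after its anchor.
    return start, max(start, end)


def keep(rec: Record, start: int, end: int, mode: str) -> bool:
    """Decide whether rec is selected by the region [start, end] under mode."""
    if mode == "pos":
        return start <= rec.pos <= end
    if mode == "record":
        lo, hi = record_span(rec)
    elif mode == "variant":
        lo, hi = variant_span(rec)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return lo <= end and hi >= start
```

Under these assumptions, a `TC>T` deletion at POS 99 queried with the region 100-200 is rejected by `"pos"` but kept by `"record"` and `"variant"`, whereas the same deletion at POS 200 is kept by `"pos"` and `"record"` but rejected by `"variant"`, because the deleted base lies at position 201.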
Given samtools/htslib#1327 is now merged and the corresponding bcftools commits are in (0d04159), we believe this to be fixed.
Consider events.vcf (this is the same file as in #1420), which begins with the following VCF records:
If we query this for `chr1:100-200`, we would expect to receive the `PASS` records but not the `BEFORE` records. In particular: `sv1` and `sv2` have `END` fields indicating that they extend into the specified region; `d1` affects bases at positions 100–115; and `d2` and `d3` affect the base at position 100.

Indeed, these records are produced by a `bcftools view -r chr1:100-200` query.

However, bcftools's non-index-based `-t`/`--targets`/`-T`/`--targets-file` query (both as released and on develop) fails to return these records:

This behaviour has been documented for a long time (105c41e) and the manual page currently says
Nonetheless I for one was not aware of this behaviour before today, and I suspect many other bcftools users will be similarly unaware of it. Regardless of this sentence in the documentation, I believe most users would expect the `-t` option to correctly output records that overlap the specified region(s) (i.e., to produce the same results as `-r`, only more slowly) and will be disappointed to find that it silently omits some records that they would expect to have in their output. IMNSHO this means that `-t`/`--targets` is not fit for purpose as it does not produce scientifically meaningful results.

It also incorrectly returns `outd2` and `outi4`, which are just after the region of interest, as in #1420. This is arguably a less serious problem than omitting events that should be included.
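To make the difference reported here concrete, the following Python sketch applies a POS-only filter and an overlap filter to a few tab-separated VCF lines. It is an illustration only, not how bcftools is implemented: the functions `parse_region`, `record_end` and `select` are hypothetical, and the three records are invented for this example and are not the contents of `events.vcf`.

```python
import re


def parse_region(region):
    """Split 'chr1:100-200' into (chrom, start, end), 1-based inclusive."""
    m = re.fullmatch(r"([^:]+):(\d+)-(\d+)", region)
    if m is None:
        raise ValueError(f"unsupported region: {region!r}")
    return m.group(1), int(m.group(2)), int(m.group(3))


def record_end(fields):
    """Last reference position covered: INFO/END if present, else POS + len(REF) - 1."""
    pos, ref, info = int(fields[1]), fields[3], fields[7]
    for item in info.split(";"):
        if item.startswith("END="):
            return int(item[4:])
    return pos + len(ref) - 1


def select(lines, region, overlap):
    """Return the IDs of records selected by region.

    overlap=False keeps records whose POS lies inside the region;
    overlap=True keeps records whose reference span overlaps the region.
    """
    chrom, start, end = parse_region(region)
    kept = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] != chrom:
            continue
        pos = int(fields[1])
        if overlap:
            selected = pos <= end and record_end(fields) >= start
        else:
            selected = start <= pos <= end
        if selected:
            kept.append(fields[2])
    return kept


# Hypothetical records for illustration only (not taken from events.vcf).
LINES = [
    "chr1\t50\tsv_example\tN\t<DEL>\t.\t.\tEND=150",
    "chr1\t99\tdel_example\tGA\tG\t.\t.\t.",
    "chr1\t120\tsnv_example\tA\tC\t.\t.\t.",
]

print(select(LINES, "chr1:100-200", overlap=False))
print(select(LINES, "chr1:100-200", overlap=True))
```

With these invented records, the POS-only filter keeps just `snv_example`, while the overlap filter also keeps `sv_example` (through its `END` field) and `del_example` (whose deleted base is at position 100).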