[feature request] 3 small VCF processing tools: variantdistance, streamsort and unique #1690

Closed
WimSpee opened this issue Mar 28, 2022 · 4 comments

Comments

@WimSpee

WimSpee commented Mar 28, 2022

Dear BCFTools developers,

Our genotype exports on large (100 GB+) whole-genome sequencing BCF files are currently limited in performance by a dependency on three vcflib tools, which require the export pipe to switch to VCF (text) format and then back to BCF.

| Functionality | Use case | vcflib tool |
|---|---|---|
| variantdistance | Knowing whether a variant has a free flank for assay design. Adds a tag to each variant record that indicates the distance to the nearest variant. Defaults to BasesToClosestVariant if no custom tag name is given. Annotating several distances under custom INFO field names is useful for knowing the distance in the population and in the samples of interest, before and after extra (e.g. quality) filtering. | https://github.com/vcflib/vcflib/blob/master/doc/vcfdistance.md |
| StreamSort | Ensuring variants are in sorted order after normalization | https://github.com/vcflib/vcflib/blob/master/doc/vcfstreamsort.md |
| Unique | Ensuring variants are unique after normalization | https://github.com/vcflib/vcflib/blob/master/doc/vcfuniq.md |
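
To make the variantdistance request concrete, here is a minimal Python sketch of the annotation computed on a stream. It is only an illustration, not how vcflib or bcftools implement it. It assumes records arrive sorted by chromosome and position, that the distance is the difference between POS values on the same chromosome, and that a variant with no neighbour on its chromosome gets no value. The function name `annotate_nearest_distance` and the two-field record are hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical minimal record: (chromosome, 1-based position).
Record = Tuple[str, int]


def annotate_nearest_distance(
    records: Iterable[Record],
) -> Iterator[Tuple[str, int, Optional[int]]]:
    """Yield (chrom, pos, distance to the nearest other record on that chrom).

    Assumes the input is sorted by chromosome and position. Only one record
    is buffered at a time, so the function works on a stream of any size.
    """
    prev: Optional[Record] = None     # record still waiting for its right neighbour
    prev_left: Optional[int] = None   # distance from prev to its left neighbour
    for chrom, pos in records:
        if prev is not None:
            # A right neighbour only counts if it is on the same chromosome.
            right = pos - prev[1] if prev[0] == chrom else None
            candidates = [d for d in (prev_left, right) if d is not None]
            yield prev[0], prev[1], min(candidates) if candidates else None
            # The same gap is the left distance of the record just read.
            prev_left = right
        prev = (chrom, pos)
    if prev is not None:
        # The last record has no right neighbour.
        yield prev[0], prev[1], prev_left


if __name__ == "__main__":
    demo = [("chr1", 100), ("chr1", 105), ("chr1", 130), ("chr2", 50)]
    for row in annotate_nearest_distance(demo):
        print(row)
```

Because each record needs only its immediate left and right neighbours, a one-record lookahead is enough, which is why this annotation should fit naturally into a BCF pipe.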

We would like to run the export pipe entirely as a BCF stream. This would mean a large reduction in wall time for exports.

It does not look like BCF support will be added to vcflib (see vcflib/vcflib#304).

Would it make sense to support this functionality in BCFTools? Thank you for considering it.

@pd3
Member

pd3 commented Mar 29, 2022

I believe StreamSort has an alternative in bcftools sort, and Unique in bcftools norm --rm-dup.
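
For readers unfamiliar with the Unique step, the idea can be sketched in a few lines of Python. This is an illustration only, not the bcftools or vcflib code. It assumes the stream is already sorted, so duplicates are adjacent, and that two records count as duplicates when chromosome, position, REF and ALT all match. The name `unique_sorted` and the tuple layout are hypothetical.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple

# Hypothetical record layout: (chrom, pos, ref, alt, *other_fields).
Record = Tuple


def unique_sorted(records: Iterable[Record]) -> Iterator[Record]:
    """Keep the first of each run of adjacent records sharing chrom, pos, ref, alt.

    Assumes the input is sorted so that duplicates sit next to each other.
    """
    for _key, group in groupby(records, key=lambda rec: rec[:4]):
        yield next(group)


if __name__ == "__main__":
    demo = [
        ("chr1", 100, "A", "T", "first"),
        ("chr1", 100, "A", "T", "second"),
        ("chr1", 100, "A", "G", "other allele"),
        ("chr1", 200, "C", "G", "unique"),
    ]
    for rec in unique_sorted(demo):
        print(rec)
```

Since only adjacent records are compared, the sort has to run before the duplicate removal.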

Regarding variantdistance, yes, this could be added. I am not sure what the best place for it would be; the +prune plugin is currently best positioned to do that. Alternatively, the +fill-tags plugin could be extended.

@WimSpee
Author

WimSpee commented Mar 29, 2022

I just checked, and we already run bcftools sort and vcflib uniq as extra trips to disk on the BCF output of the main genotype export pipe.

The input might be 100 GB+, but the genotype export results range from a few MB to a few GB, so the performance penalty of extra trips to disk does not matter much there.

We run the variantdistance annotation at least twice:

  1. once on the original 100 GB+ population variant calling BCF file
  2. frequently, and early in each WGS genotype export pipe, directly after subsetting to the samples of interest

So we run the variantdistance annotation often and on a lot of data, and there is not yet a tool that can update variants in a BCF stream/pipe with this information.

Therefore, variantdistance is really our only feature request.

It would be very nice if variantdistance could become available in bcftools or in a plugin.

@pd3 pd3 closed this as completed in 425fb67 Apr 6, 2022
@pd3
Member

pd3 commented Apr 6, 2022

This was added as a new plugin, variant-distance, which can be used as follows:

bcftools +variant-distance input.vcf

pd3 added a commit that referenced this issue Apr 6, 2022
The plugin adds a custom annotation to indicate the distance to the
nearest variant

Resolves #1690
@WimSpee
Author

WimSpee commented Apr 7, 2022

Thank you for adding this feature.

pd3 added a commit that referenced this issue May 13, 2022
The plugin adds a custom annotation to indicate the distance to the
nearest variant

Resolves #1690