Implement outputting .vcf from kbo map #8

Closed
tmaklin wants to merge 15 commits

tmaklin (Owner) commented Mar 4, 2025

  • Add a -f/--format option to select the output format (default: .aln; .vcf is also supported)

.vcf files written by kbo-cli have this format:

##fileformat=VCFv4.4
##contig=<ID=30224_1#305_1,length=4971108>
##fileDate=20250304
##source=kbo-cli v0.1.1
##reference=30224_1#305_1.fna
##phasing=none
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	unknown
30224_1#305_1	28660	.	C	T	.	.	.	GT	1
30224_1#305_1	87002	.	A	G	.	.	.	GT	1
30224_1#305_1	169420	.	G	A	.	.	.	GT	1
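
For readers who want to consume these records without a VCF library, here is a minimal std-only sketch of parsing one data line. The `SnpRecord` struct and `parse_snp_line` function are hypothetical names introduced for illustration; they are not part of kbo-cli or noodles_vcf. The sketch assumes tab-separated columns and single-base REF/ALT alleles, as in the example above.

```rust
/// One SNP record from a VCF data line (hypothetical type, not a kbo-cli type).
#[derive(Debug, PartialEq)]
struct SnpRecord {
    chrom: String,
    pos: u64, // 1-based position, as written in the POS column
    ref_base: char,
    alt_base: char,
}

/// Parse a single tab-separated VCF data line.
/// Returns None for header lines, malformed lines, and records whose
/// REF or ALT allele is not exactly one character long.
fn parse_snp_line(line: &str) -> Option<SnpRecord> {
    if line.starts_with('#') {
        return None; // meta-information or column header line
    }
    let fields: Vec<&str> = line.split('\t').collect();
    if fields.len() < 5 {
        return None;
    }
    let pos = fields[1].parse::<u64>().ok()?;
    let mut ref_chars = fields[3].chars();
    let mut alt_chars = fields[4].chars();
    let ref_base = ref_chars.next()?;
    let alt_base = alt_chars.next()?;
    if ref_chars.next().is_some() || alt_chars.next().is_some() {
        return None; // longer alleles are not SNPs
    }
    Some(SnpRecord {
        chrom: fields[0].to_string(),
        pos,
        ref_base,
        alt_base,
    })
}
```

Calling `parse_snp_line` on each line of a file and keeping the `Some` values yields the SNP list; header lines are skipped because they start with `#`.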

When writing .vcf, kbo-cli always processes the reference contigs separately and reports the results per contig. This differs from writing .aln, where the contigs are also processed separately but the results are concatenated.

The .vcf header and records are built using noodles_vcf.
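
To make the output layout concrete, the following is a rough std-only sketch that renders the same kind of file as plain text. It deliberately does not use noodles_vcf, so it is not how this PR builds the file; `ContigSnps` and `render_vcf` are hypothetical names, and the fixed `GT` value of `1` simply mirrors the example records earlier in this comment.

```rust
use std::fmt::Write;

/// Hypothetical per-contig input: contig name, length, and SNPs given as
/// (1-based position, REF base, ALT base).
struct ContigSnps<'a> {
    name: &'a str,
    length: usize,
    snps: Vec<(usize, char, char)>,
}

/// Render a SNP-only VCF as text. Each contig gets its own ##contig header
/// line and its records carry the contig name in the CHROM column, so the
/// results stay separated per contig.
fn render_vcf(
    contigs: &[ContigSnps<'_>],
    date: &str,
    source: &str,
    reference: &str,
    sample: &str,
) -> String {
    let mut out = String::new();
    out.push_str("##fileformat=VCFv4.4\n");
    for c in contigs {
        writeln!(out, "##contig=<ID={},length={}>", c.name, c.length).unwrap();
    }
    writeln!(out, "##fileDate={}", date).unwrap();
    writeln!(out, "##source={}", source).unwrap();
    writeln!(out, "##reference={}", reference).unwrap();
    out.push_str("##phasing=none\n");
    writeln!(
        out,
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t{}",
        sample
    )
    .unwrap();
    for c in contigs {
        for &(pos, ref_base, alt_base) in &c.snps {
            // ID, QUAL, FILTER and INFO are left missing ("."), as in the example.
            writeln!(
                out,
                "{}\t{}\t.\t{}\t{}\t.\t.\t.\tGT\t1",
                c.name, pos, ref_base, alt_base
            )
            .unwrap();
        }
    }
    out
}
```

Passing one `ContigSnps` per reference contig keeps positions relative to each contig, in contrast to the concatenated coordinates of the .aln output.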

Caveats:

  • .vcf files currently contain only SNPs.
  • Calling INDELs should be algorithmically possible but requires some research.
  • map does not handle SNPs that are very close to each other (distance << k). It may be possible to resolve this by traversing the SBWT whenever short gaps are encountered.
  • The default options for map do not work well if the reference is very fragmented; changing -k and --max-error-prob can give better results.

tmaklin (Owner, Author) commented Mar 20, 2025

Changes merged into #9

tmaklin closed this Mar 20, 2025