How to determine SNP frequency in multi-copy genes (e.g., 23S rRNA). #337

Open
slvrshot opened this issue Jun 13, 2023 · 2 comments

@slvrshot

Hello Martin,

In your paper it is stated that ARIBA "4) identifies SNP frequency in multicopy genes, which has been traditionally difficult to resolve due to the complexities of de novo assembly".

Do you mean ARIBA can detect how many copies (taking 23S rRNA, with 4 copies, as an example) are actually mutated? How is this reported, and how is ARIBA able to do this? I have seen others use custom scripts (usually involving breseq or bwatools) that map reads to a single copy of the gene and then determine the frequency of each base among all the reads that align to a particular SNP position.

Thanks!

@martinghunt
Collaborator

It can't tell how many copies there are. If there are differences between the copies then it can give an estimate of the ratio of the copies...

It looks at the pileup of the reads mapped to the assembled contig. Unless the copies are identical, in the report file (https://github.com/sanger-pathogens/ariba/wiki/Task:-run#report-file) you'll see "heterozygous" SNPs. The smtls_nts column will have more than one nucleotide, and the corresponding depth of each nucleotide is in the smtls_nts_depth column. You'll probably also see that the flag (https://github.com/sanger-pathogens/ariba/wiki/Task:-flag) has variants_suggest_collapsed_repeat.

Also, multiple copies can be flagged because their flanking sequences will be different. In this case, the flag will have scaffold_graph_bad (because it's ambiguous which contig ends can link together).
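
For reference, the per-nucleotide ratio described above can be worked out from a single report row. This is a minimal Python sketch, not part of ARIBA: the helper name `allele_fractions` is hypothetical, and it assumes the smtls_nts and smtls_nts_depth values for one position are comma-separated and listed in the same order (for example `G,A` and `703,2`).

```python
def allele_fractions(smtls_nts, smtls_nts_depth, total_depth=None):
    """Return {nucleotide: fraction of reads} for one report position.

    Assumption: both fields are comma-separated and in matching order.
    If total_depth is None, the denominator is the sum of the listed
    per-nucleotide depths; otherwise the supplied total (for example
    the smtls_total_depth value) is used.
    """
    nucleotides = smtls_nts.split(",")
    depths = [int(d) for d in smtls_nts_depth.split(",")]
    if len(nucleotides) != len(depths):
        raise ValueError("nucleotide and depth lists differ in length")

    denominator = sum(depths) if total_depth is None else total_depth
    if denominator <= 0:
        raise ValueError("depth must be positive")

    return {nt: depth / denominator for nt, depth in zip(nucleotides, depths)}


if __name__ == "__main__":
    # Hypothetical values for a single "heterozygous" position.
    for nt, frac in allele_fractions("G,A", "703,2").items():
        print(f"{nt}: {frac:.4f}")
```

Whether to divide by the summed per-nucleotide depths or by smtls_total_depth is a judgment call; the two can differ, so the sketch leaves the choice to the caller.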

@slvrshot
Author

Martin, thanks for the explanation. So, to be clear, it can give an estimate of the ratio of copies with a particular mutation?

For example, I have the following. I am interested in 2045G, and the mutation is detected. Both G and A are reported, meaning that the copies are "heterozygous". The depth of G is 703, and the depth of A is 2. (See below)

703/722 = 0.974
2/722 = 0.0028

I would gather that, based on smtls_total_depth, we can estimate that all of the copies likely carry this particular mutation. Does this sound correct? Thanks again for developing and supporting this tool.

image
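
The arithmetic in this comment can be sketched in Python as below. This is an illustration under stated assumptions, not an ARIBA feature: the function name `estimate_mutated_copies` is hypothetical, the copy number (4 for 23S rRNA in this thread) must be supplied by the user because ARIBA does not report it, and the estimate assumes every copy contributes roughly equal read depth.

```python
def estimate_mutated_copies(variant_depth, total_depth, copy_number):
    """Return (read fraction, estimated number of copies with the variant).

    Assumptions: copy_number is known from elsewhere (ARIBA does not
    report it) and each copy is covered at a similar depth.
    """
    if total_depth <= 0 or copy_number <= 0:
        raise ValueError("total_depth and copy_number must be positive")
    fraction = variant_depth / total_depth
    return fraction, fraction * copy_number


if __name__ == "__main__":
    # Depths quoted in this comment: 703 reads with G out of 722 total.
    fraction, copies = estimate_mutated_copies(703, 722, 4)
    print(f"fraction of reads with G: {fraction:.3f}")
    print(f"estimated copies carrying G: {copies:.1f} of 4")
```

An estimate close to the full copy number would be consistent with every copy carrying the variant, while the 2 reads supporting A are few enough that they could equally be sequencing or mapping noise.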
