Missing species designation when using sourmash lca classify on a Haemophilus influenzae #3546

Open
replikation opened this issue Feb 24, 2025 · 2 comments

Comments

@replikation

Dear sourmash team,

I have been using your great tool for years now and stumbled upon a strange behavior.

Issue: Missing species description when using sourmash lca classify on a Haemophilus influenzae genome.

Example fasta: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=1355929925&rettype=fasta
(I tried a few other H. influenzae strains and saw the same missing-species issue. Other species, such as E. coli and S. aureus, work fine.)
Database used: https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214-k31.lca.json.gz
(same issue on an older DB)

sourmash version: 4.8.14

Results:

| ID | status | superkingdom | phylum | class | order | family | genus | species | strain |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NZ_CP020010.1 Haemophilus influenzae strain 67P38H1 chromosome, complete genome | found | d__Bacteria | p__Pseudomonadota | c__Gammaproteobacteria | o__Enterobacterales_A | f__Pasteurellaceae | g__Haemophilus | | |

Side info: results for the same query when using sourmash gather:

| column | value |
| --- | --- |
| intersect_bp | 1840000 |
| f_orig_query | 1 |
| f_match | 1 |
| f_unique_to_query | 1 |
| f_unique_weighted | 1 |
| average_abund | |
| median_abund | |
| std_abund | |
| filename | gtdb-rs214-k31.lca.json.gz |
| name | GCF_002966675.1 Haemophilus influenzae strain=67P56H1, ASM296667v1 |
| md5 | 2f2a17a4bfe161d8e25ecbc0beaffd27 |
| f_match_orig | 1 |
| unique_intersect_bp | 1840000 |
| gather_result_rank | 0 |
| remaining_bp | 0 |
| query_filename | h.influenzae.fasta |
| query_name | |
| query_md5 | e0f34f5b |
| query_bp | 184000 |
| ksize | 31 |
| moltype | DNA |
| scaled | 10000 |
| query_n_hashes | 184 |
| query_abundance | FALSE |
| query_containment_ani | 1 |
| match_containment_ani | 1 |
| average_containment_ani | 1 |
| max_containment_ani | 1 |
| potential_false_negative | FALSE |
| n_unique_weighted_found | |
| sum_weighted_found | 184 |
| total_weighted_hashes | 184 |

@ctb
Contributor

ctb commented Feb 24, 2025

thanks for reporting, @replikation!

I did:

grep GCF_002966675.1  gtdb-rs214.lineages.csv

and it does indeed have a species designation:

GCF_002966675.1,f,d__Bacteria,p__Pseudomonadota,c__Gammaproteobacteria,o__Enterobacterales_A,f__Pasteurellaceae,g__Haemophilus,s__Haemophilus influenzae

I am wondering if maybe the k-mers found were also largely shared with other species in g__Haemophilus? That would, I think, explain the result - LCA "pulls" the classification back to the lowest common ancestor that matches the hashes. Since you gave me enough info to reproduce, I'll take a look!
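
To illustrate the "pull back" effect described above, here is a toy sketch in Python. It is not sourmash's implementation: the function names, the hash values, and the second species name are all hypothetical, and it ignores details such as hash-count thresholds. It only shows how a single set of hashes shared with another species in the same genus can move the lowest common ancestor from species up to genus.

```python
RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


def lowest_common_ancestor(lineages):
    """Return the longest prefix shared by all lineage tuples."""
    prefix = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        prefix.append(names[0])
    return tuple(prefix)


def classify(query_hashes, hash_to_lineages):
    """Collect every lineage tied to a query hash, then take their LCA."""
    seen = set()
    for h in query_hashes:
        seen.update(hash_to_lineages.get(h, ()))
    return lowest_common_ancestor(sorted(seen))


# Lineage down to genus, taken from the classify output in this issue.
genus = (
    "d__Bacteria", "p__Pseudomonadota", "c__Gammaproteobacteria",
    "o__Enterobacterales_A", "f__Pasteurellaceae", "g__Haemophilus",
)
influenzae = genus + ("s__Haemophilus influenzae",)
other = genus + ("s__Haemophilus hypothetical_other",)  # made-up species

# Hypothetical hash -> lineages table; hash 3 is shared by both species.
hash_to_lineages = {
    1: {influenzae},
    2: {influenzae},
    3: {influenzae, other},
}

only_unique = classify([1, 2], hash_to_lineages)
with_shared = classify([1, 2, 3], hash_to_lineages)

print(RANKS[len(only_unique) - 1])  # species
print(RANKS[len(with_shared) - 1])  # genus
```

If something like this is happening for the H. influenzae query, the result would stop at g__Haemophilus even though the best gather match carries a species label.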

@replikation
Author

Great, let me know if you need anything else.
