Extra rows and taxaIDs? #38

susheelbhanu · 2024-09-20T07:28:27Z

Hi,

Firstly, thank you for this amazing tool! I have a question regarding possible duplicates when running name2taxid on a larger list.

My list (column: name) below contains 17836 elements:

>>> agrii_tax
ASV domain phylum class order family genus species name
0 ASV_1 Bacteria Chloroflexi KD4-96 NaN NaN NaN NaN KD4-96
1 ASV_2 Bacteria Verrucomicrobiota Verrucomicrobiae Chthoniobacterales Chthoniobacteraceae Candidatus Udaeobacter NaN Candidatus Udaeobacter
2 ASV_3 Bacteria Firmicutes Bacilli Bacillales NaN NaN NaN Bacillales
3 ASV_4 Bacteria Firmicutes Bacilli Bacillales Bacillaceae NaN NaN Bacillaceae
4 ASV_5 Bacteria Chloroflexi KD4-96 NaN NaN NaN NaN KD4-96
... ... ... ... ... ... ... ... ... ...
17831 ASV_17832 Bacteria NaN NaN NaN NaN NaN NaN Bacteria
17832 ASV_17833 NaN NaN NaN NaN NaN NaN NaN NaN
17833 ASV_17834 NaN NaN NaN NaN NaN NaN NaN NaN
17834 ASV_17835 Bacteria Planctomycetota Planctomycetes Pirellulales Pirellulaceae Pir4 lineage NaN Pir4 lineage
17835 ASV_17836 Bacteria Actinobacteriota Actinobacteria Micrococcales Microbacteriaceae NaN NaN Microbacteriaceae

[17836 rows x 9 columns]

However, when I run the name2taxid conversion on them, I get the following:

>>> taxid_results = pytaxonkit.name2taxid(names)
>>> taxid_results
                         Name    TaxID    Rank
0                      KD4-96     <NA>    <NA>
1      Candidatus Udaeobacter  1921511   genus
2                  Bacillales     1385   order
3                 Bacillaceae   186817  family
4                      KD4-96     <NA>    <NA>
...                       ...      ...     ...
21965                Bacteria   629395   genus
21966                    <NA>     <NA>    <NA>
21967                    <NA>     <NA>    <NA>
21968            Pir4 lineage     <NA>    <NA>
21969       Microbacteriaceae    85023  family

The results have 21970 rows (index 0 to 21969) compared to 17836 in the input. Is it possible that some names are getting more than one TaxID?

Thank you for your help with this,
Susheel

standage · 2024-09-20T13:26:55Z

Hi @susheelbhanu.

What are the contents of the names variable? Can you confirm it is agrii_tax.name and indeed has 17836 elements?

I'm not sure what would cause the results to be larger than the input. Which versions of TaxonKit and PyTaxonKit do you have installed? You can check with pytaxonkit.__version__ and pytaxonkit.__taxonkitversion__.

susheelbhanu · 2024-09-20T13:44:33Z

Hey @standage,

Thanks for the quick reply. Here are the versions:

>>> pytaxonkit.__taxonkitversion__
'taxonkit v0.17.0'
>>> pytaxonkit.__version__
'0.8'

And this is what names contains

>>> names[:10]
['KD4-96', 'Candidatus Udaeobacter', 'Bacillales', 'Bacillaceae', 'KD4-96', 'Candidatus Udaeobacter', 'Candidatus Nitrocosmicus', 'Micrococcaceae', 'MB-A2-108', 'Gaiella']

standage · 2024-09-20T13:51:28Z

What is the length of names?

susheelbhanu · 2024-09-20T13:52:19Z

17835

standage · 2024-09-20T13:54:29Z

So it has one fewer element than agrii_tax, which has 17836 rows?

susheelbhanu · 2024-09-20T13:55:49Z

Sorry, typo.

>>> length_of_names = len(names)
>>>
>>> print("Length of names:", length_of_names)
Length of names: 17836

standage · 2024-09-20T14:01:12Z

This is unexpected behavior indeed. I'm not sure there's much more I can do unless you can share the entire contents of names in a text file. I'll note that there appear to be quite a few NaN names, which will give empty results. But if you are trying to maintain the correct shape of your data, I understand you may not want to drop those values. I don't think this is causing the issue, but again, I don't have enough information at the moment to be sure.

susheelbhanu · 2024-09-20T14:04:02Z

Thank you, I'm happy to share the file later tonight. And yes, I'm trying to keep the shape so as to merge it later with another file.

Appreciate your help with this!

susheelbhanu · 2024-09-20T15:18:25Z

Here's the file and how I get the names list.
TaxaId16s.csv

import pandas as pd
import pytaxonkit

# reading in the 16S taxa
agrii_tax = pd.read_csv("TaxaId16s.csv", header = 0)

# dropping the unnamed column
agrii_tax = agrii_tax.drop(columns=['Unnamed: 0'])

# Rename the 'ASVrank' column to 'ASV'
agrii_tax = agrii_tax.rename(columns={'ASVrank': 'ASV'})

# Move 'ASV' to the first column
cols = ['ASV'] + [col for col in agrii_tax.columns if col != 'ASV']
agrii_tax = agrii_tax[cols]

# Create a new column 'name' by finding the last non-NaN value in each row
agrii_tax['name'] = agrii_tax[['species', 'genus', 'family', 'order', 'class', 'phylum', 'domain']].bfill(axis=1).iloc[:, 0]

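# Replace NaN values in the 'name' column with 'unclassified'
# (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
agrii_tax['name'] = agrii_tax['name'].fillna('unclassified')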

# Extract the 'name' column from your DataFrame
names = agrii_tax['name'].tolist()

# Run pytaxonkit.name2taxid with the names
taxid_results = pytaxonkit.name2taxid(names)

# To view the results
print(taxid_results)

Thank you!

standage · 2024-09-20T16:26:55Z

Ok, I understand the issue a bit better now. It doesn't appear to be an issue with TaxonKit or PyTaxonKit, but an artifact of the NCBI Taxonomy.

To investigate, I discarded all of the unclassified values, kept the remaining unique values, and performed the name2taxid query. As with your example, the output was larger than the input.

>>> mynames = list(set([n for n in names if n != "unclassified"]))
>>> len(mynames)
827
>>> taxid_results = pytaxonkit.name2taxid(mynames)
>>> taxid_results
                     Name    TaxID   Rank
0           Aeromicrobium     2040  genus
1            Pir2 lineage     <NA>   <NA>
2       Streptosporangium     2000  genus
3             Pedosphaera  1032526  genus
4         Polycyclovorans  1274363  genus
..                    ...      ...    ...
843             Duganella    75654  genus
844             Emticicia   312278  genus
845  Pleurocapsa PCC-7319     <NA>   <NA>
846            GWC2-45-44     <NA>   <NA>
847      Cyanobacteriales     <NA>   <NA>

[848 rows x 3 columns]

So there must be some duplicated values. I found them with the following code.

>>> taxid_results[taxid_results.Name.duplicated(keep=False)].sort_values("Name")
               Name    TaxID          Rank
418  Actinobacteria   201174        phylum
417  Actinobacteria   201174        phylum
762         Archaea     2157  superkingdom
761         Archaea     2157  superkingdom
214        Bacillus     1386         genus
215        Bacillus    55087         genus
830        Bacteria        2  superkingdom
831        Bacteria        2  superkingdom
832        Bacteria   629395         genus
82            Bosea    85413         genus
83            Bosea   169215         genus
768     Chloroflexi   200795        phylum
767     Chloroflexi    32061         class
568   Cyanobacteria     1117        phylum
567   Cyanobacteria     1117        phylum
177    Diplosphaera   381755         genus
178    Diplosphaera  1148783         genus
331      Firmicutes     1239        phylum
332      Firmicutes     1239        phylum
736        Gordonia    79255         genus
735        Gordonia     2053         genus
416          Labrys  2066135         genus
415          Labrys   204476         genus
584      Leptothrix       88         genus
585      Leptothrix  1907117         genus
758      Longispora   203522         genus
759      Longispora  2759766         genus
380      Nitrospira     1234         genus
381      Nitrospira   203693         class
187      Paracoccus      265         genus
188      Paracoccus   249411         genus
802  Planctomycetes      112         order
803  Planctomycetes   203682        phylum
804  Planctomycetes   203683         class
792  Proteobacteria     1224        phylum
793  Proteobacteria     1224        phylum
227     Rhodococcus     1827         genus
228     Rhodococcus  1661425         genus
311      Syntrophus  1671858         genus
310      Syntrophus    43773         genus

It turns out that some of these names are associated with multiple entries in the NCBI taxonomy files (names.dmp). Some of these entries are redundant (same name from different sources with the same taxid) while some names actually refer to different taxa. I'm afraid that resolving these nomenclature issues to identify the "correct" taxid for each name is outside the scope of pytaxonkit.
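
If the goal is to keep exactly one row per input name so the result can be merged back into the original table, one option is to collapse the name2taxid output yourself. The following is a minimal sketch, not part of pytaxonkit: the helper one_row_per_name is hypothetical, and the small hits frame is a stand-in for the real name2taxid result, with rows copied from the listings above. It drops redundant rows (same name, same TaxID), leaves names that map to more than one TaxID unresolved, and then left-merges onto the original list so the shape is preserved.

```python
import pandas as pd


def one_row_per_name(names, hits):
    """Align name2taxid-style hits to `names`, one row per input name.

    Hypothetical helper (not a pytaxonkit function). Names that map to
    more than one distinct TaxID are returned separately and left without
    a TaxID, since picking the "correct" one needs manual curation.
    """
    # Redundant entries: the same name reported with the same TaxID.
    unique_hits = hits.drop_duplicates(subset=["Name", "TaxID"])

    # Ambiguous entries: one name, several distinct TaxIDs.
    n_taxids = unique_hits.groupby("Name")["TaxID"].nunique()
    ambiguous = sorted(n_taxids[n_taxids > 1].index)

    # Keep only names with at most one TaxID, then align to the input order.
    resolved = unique_hits[~unique_hits["Name"].isin(ambiguous)]
    aligned = pd.DataFrame({"Name": names}).merge(
        resolved, on="Name", how="left", validate="many_to_one"
    )
    return aligned, ambiguous


# Toy stand-in for the DataFrame returned by pytaxonkit.name2taxid.
hits = pd.DataFrame(
    {
        "Name": ["Firmicutes", "Firmicutes", "Bacillus", "Bacillus", "KD4-96"],
        "TaxID": pd.array([1239, 1239, 1386, 55087, pd.NA], dtype="Int64"),
        "Rank": ["phylum", "phylum", "genus", "genus", pd.NA],
    }
)
names = ["KD4-96", "Firmicutes", "Bacillus", "Firmicutes"]

aligned, ambiguous = one_row_per_name(names, hits)
print(aligned)
print("Needs manual review:", ambiguous)
```

Because the merge is a left join against unique names, aligned has the same length and order as names. The ambiguous names can then be resolved by hand, for example by comparing the reported Rank with the column the name originally came from.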
