Incompatible genome_accno between accession.feather and taxa.feather #36
Comments
I don't get any hit when running your grep command:
Why would you grep a genome accno (001308105) in the list of hmm profiles (hmm_profile_hierarchy.tsv)?
Good question, my mistake.
It seems the
Yes, but the GCF to GCA change also makes the genome accno incompatible between the feather files.
We have four genomes in

Filenames starting with these prefixes will lead to sequences having genome accession numbers that also carry these prefixes, which seems to be the reason they end up in the accessions table like this. In the taxa table, however, such prefixes are removed by the classification. This should explain the mismatch between the two tables.

I have two questions: Why are these four named like this and why are they gone from

The problem is easy enough to fix, but I want to make sure it doesn't just reappear. If you don't remember, I'll just fix the file names and add back the links before rerunning the workflow.
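The prefix removal described in this comment can be illustrated with a minimal Python sketch. The function name `strip_source_prefix` and the regular expression are hypothetical, chosen for illustration; this is not the workflow's actual classification code. It assumes the only prefixes involved are `RS_` and `GB_`, as seen in the accession numbers reported in this issue.

```python
import re

# Assumption: only the RS_ and GB_ prefixes seen in this issue need handling.
_SOURCE_PREFIX = re.compile(r"^(?:RS|GB)_")


def strip_source_prefix(accno: str) -> str:
    """Return accno without a leading RS_ or GB_ prefix (hypothetical helper)."""
    return _SOURCE_PREFIX.sub("", accno)


if __name__ == "__main__":
    for name in ["RS_GCF_900078535", "GB_GCA_900128825", "GCA_001308105"]:
        print(name, "->", strip_source_prefix(name))
```

Stripping the prefix alone turns `RS_GCF_900078535` into `GCF_900078535`, which still differs from `GCA_900078535`, so the GCF to GCA change has to be handled separately.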
1. Why are they gone from /data/gtdb_pfitmap/genomes? 2. Why are these four named like this?
The following genome_accnos appear in accessions.feather but have no corresponding taxonomy in taxa.feather:

| genome_accno_in_accessions | genome_accno_in_taxa |
| --- | --- |
| RS_GCF_900078535 | GCA_900078535 |
| RS_GCF_001308105 | GCA_001308105 |
| GB_GCA_900128825 | GCA_900128825 |

In processed_genomes.txt (and in all_genomes.faa), the genome name matches the one in gtdb_metadata.tsv, in the same form as in accessions.feather (RS_GCF_xxxx or GB_GCA_xxxx).
In the gtdb_metadata.tsv file, however, field 1 holds the genome_accno_in_accessions form and field 55 holds GCA_xxxxx (as in genome_accno_in_taxa).
Example:
$ grep 001308105 gtdb_metadata.tsv |cut -f 1,15,55
RS_GCF_001308105.1 RS_GCF_001308105.1 GCA_001308105.1
As a result, the genome_accno values do not correspond between the different feather files.
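One way to reconcile the two forms is to build a lookup from field 1 to field 55 of gtdb_metadata.tsv. The sketch below is only an illustration under stated assumptions: the helper names `load_accno_map` and `drop_version` are hypothetical, the field positions are taken from the `cut -f 1,15,55` example above, and the version suffix (`.1`) is dropped because the accession numbers listed in this issue carry none. The demo row is synthetic apart from the three values shown in the grep output.

```python
import csv
import io


def drop_version(accno: str) -> str:
    """Remove a trailing version suffix such as '.1' (hypothetical helper)."""
    return accno.split(".", 1)[0]


def load_accno_map(handle, key_field=1, value_field=55):
    """Map field key_field to field value_field, 1-based as in `cut -f`.

    Assumption: the input is tab-separated; rows that are too short are skipped.
    A header row, if present, ends up as an ordinary entry in the mapping.
    """
    mapping = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < max(key_field, value_field):
            continue
        key = drop_version(row[key_field - 1])
        mapping[key] = drop_version(row[value_field - 1])
    return mapping


if __name__ == "__main__":
    # Synthetic 55-field row; only fields 1, 15 and 55 come from the issue.
    fields = ["x"] * 55
    fields[0] = "RS_GCF_001308105.1"
    fields[14] = "RS_GCF_001308105.1"
    fields[54] = "GCA_001308105.1"
    demo = io.StringIO("\t".join(fields) + "\n")
    print(load_accno_map(demo))
```

With such a mapping, accession numbers from accessions.feather could be translated to the form used in taxa.feather before joining, instead of assuming that the GCF and GCA numbers always coincide.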