Incompatible genome_accno between accession.feather and taxa.feather #36

Open
GhadaNOUAIRIA opened this issue Jul 20, 2020 · 7 comments


@GhadaNOUAIRIA

GhadaNOUAIRIA commented Jul 20, 2020

The following genome_accnos appear in accessions.feather but have no corresponding taxonomy entry in taxa.feather:

genome_accno_in_accessions    genome_accno_in_taxa
RS_GCF_900078535              GCA_900078535
RS_GCF_001308105              GCA_001308105
GB_GCA_900128825              GCA_900128825

In processed_genomes.txt (and in all_genomes.faa), the genome name is the one found in gtdb_metadata.tsv, in the same form as in accessions.feather (RS_GCF_xxxx or GB_GCA_xxxx).
In gtdb_metadata.tsv itself, field 1 holds the accession in the form used in accessions.feather (genome_accno_in_accessions), while field 55 holds GCA_xxxxx (the form used in taxa.feather, genome_accno_in_taxa).
Example:
$ grep 001308105 gtdb_metadata.tsv |cut -f 1,15,55
RS_GCF_001308105.1 RS_GCF_001308105.1 GCA_001308105.1

I end up with genome_accno values that do not correspond between the different feather files.
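
One way to list such mismatches is sketched below. It assumes the genome_accno column of each feather file has first been exported to a plain-text file with one accession per line; that export step, the file names in the usage comment, and the function name are hypothetical and not part of the workflow.

```shell
# Print every accession from the first list that has no exact match in the second.
# $1: text file with one genome_accno per line (e.g. exported from accessions.feather)
# $2: text file with one genome_accno per line (e.g. exported from taxa.feather)
# -F: fixed strings, -x: whole-line match, -v: invert, -f: read patterns from file.
# Note: grep exits with status 1 when every accession is matched (nothing printed).
unmatched_accnos() {
  grep -F -x -v -f "$2" "$1"
}

# Usage (hypothetical file names):
#   unmatched_accnos accessions_accno.txt taxa_accno.txt
```

With the three genomes listed above, the prefixed accessions would be reported because none of them occurs verbatim in the taxa list.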

@erikrikarddaniel
Owner

I don't get any hit when running your grep command: grep 001308105 /data/gtdb_pfitmap/hmm_profile_hierarchy.tsv.

@GhadaNOUAIRIA
Author

Why would you grep for a genome accno (001308105) in the list of hmm profiles (hmm_profile_hierarchy.tsv)?

@erikrikarddaniel
Owner

Good question, my mistake.

@erikrikarddaniel
Owner

It seems the RS_ prefix is not removed from the accessions of these genomes.

@GhadaNOUAIRIA
Author

Yes, but the change from GCF to GCA also makes the genome accno incompatible between the feather files.
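
Since the GCF to GCA difference cannot be fixed by trimming a prefix, one option would be to look the accession up in gtdb_metadata.tsv instead. The following is a rough sketch, assuming (as in the cut command above) that field 1 holds the accession used in accessions.feather and field 55 the one used in taxa.feather. The function name is made up for illustration, and the column positions may differ in other versions of the file.

```shell
# Print the field-55 accession of the row whose field 1 equals the given accession.
# $1: accession as it appears in field 1, e.g. RS_GCF_001308105.1
# $2: path to a tab-separated metadata table
# Assumption: column positions 1 and 55 are taken from the cut command in this thread.
taxa_accno_for() {
  awk -F '\t' -v key="$1" '$1 == key { print $55 }' "$2"
}

# Usage:
#   taxa_accno_for RS_GCF_001308105.1 gtdb_metadata.tsv
```

For the row shown in the grep output above, this would be expected to print GCA_001308105.1. Nothing is printed when no row matches.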

@erikrikarddaniel
Owner

erikrikarddaniel commented Aug 10, 2020

We have four genomes in /data/gtdb_pfitmap/results/genomes/processed_genomes.txt that have fasta file names starting with "RS_" or "GB_". None of them are currently in /data/gtdb_pfitmap/genomes/ nor are there links without the prefix. They hence seem to have been removed. Corresponding files and directories are still in /data/gtdb/genomes. Moreover, as they are in the bac120_metadata.tsv, I suppose they should be included.

Sequences from files whose names start with these prefixes get genome accession numbers that carry the same prefixes, which seems to be why they end up in the accessions table like this. In the taxa table, however, such prefixes are removed by the classification. This should explain the mismatch between the two tables.
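
The prefix part of the mismatch could be handled with a small filter like the one below. This is only a sketch of the idea, not the workflow's actual code, and the function name is hypothetical. It does nothing about the GCF versus GCA difference mentioned earlier in the thread.

```shell
# Read accessions on stdin, one per line, and drop a leading "RS_" or "GB_".
# Lines without such a prefix pass through unchanged.
strip_gtdb_prefix() {
  sed -e 's/^RS_//' -e 's/^GB_//'
}

printf '%s\n' RS_GCF_900078535 GB_GCA_900128825 GCA_001308105 | strip_gtdb_prefix
# prints:
#   GCF_900078535
#   GCA_900128825
#   GCA_001308105
```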

I have two questions: Why are these four named like this and why are they gone from /data/gtdb_pfitmap/genomes?

The problem is easy enough to fix, but I want to make sure it doesn't just reappear. If you don't remember, I'll just fix the file names and add back the links before rerunning the workflow.

@GhadaNOUAIRIA
Author

GhadaNOUAIRIA commented Aug 10, 2020

1. Why are they gone from /data/gtdb_pfitmap/genomes?
I remember you told me you used the directory /data/gtdb/genomes/ for the workflow run, not /data/gtdb_pfitmap/genomes.
RS_GCF_900078535.2_genomic.fna.gz, GB_GCA_900128825.1_genomic.fna.gz and RS_GCF_001308105.1_genomic.fna.gz exist in /data/gtdb/genomes/ as you said, and that's why they were run by the workflow.
BTW, I made my latest changes (removing extra genomes that we don't need and that caused error messages in classification) in the directory /data/gtdb/genomes/. This is the most up-to-date list of genomes and the one we should use for the next run. So please either use this directory or update /data/gtdb_pfitmap/genomes before the run.

2. Why are these four named like this?
I don't really remember. But when I look up the three accessions (GCF_900078535, GCA_900128825, GCF_001308105) at NCBI, they are marked as replaced or suppressed. However, all three are among the representatives (which we downloaded directly from the GTDB repository). So maybe the name difference has something to do with this?
