--proteins external database doesn't give expected assignments #719

Open
fconstancias opened this issue Feb 5, 2025 · 0 comments

Thanks for developing all these really helpful tools for the microbial ecology community!

I would like to annotate Streptococcus pneumoniae genomes for genes involved in bacteriocin production (antimicrobial, immunity, regulation). I have downloaded the amino acid sequences of the genes I am interested in from UniProt, and ideally I would like to add annotations from this specific database to the gbff or gff files I downloaded from NCBI.

I am exploring Prokka's --proteins flag to see whether it can help me achieve this.

I formatted my database according to the instructions:

>A0A384ZZZ3 ~~~sliC~~~~~~
MDENKVIIDLSEKVFAKFDEQLKRYAEQPNYDLLTLSSGLPGLILLSSELTSLTSERKYS
ARTGKYVNFMVKQMRNYGVLSDSLFSGVSGIGISILHLVEEHPEYHNLLISFNEYIKYYT
LSKIENIDIKKISPTDYDIIEGVSGVLVYLLSQEQDENDYIINRIINFLSEFSLKNSTLT
GFYVESKNQMSKTESKLYPLGCLNFGLAHGLAGVGAMLSYSKLKGYSNEKSIAAIKKIIM
LYEKHELKNYMWKEGLSDIELKKTEKSNLQYEFIRDAWCYGSPGISLLYLYSSLALEDKK
LKSKACNILKASIRRSNGLEQSILCHGFSGAIEICLFFKKIYKTTDFDDCIKSLKEKLIS
DFREDMTYGFNTTAEFENIKTKDNLGYLDGIIGILLTMIELNNLKVTTNWQRALLLFDDV
IKEVK
>A0A0H2UNX0 ~~~blpB~~~~~~
MNPNLFRSVEFYQRRYHNYATVLIIPLSLLFTFILIFSLVATKEITVTSQGEIAPTSVIA
SIQSTSDNPILANHLVANQVVEKGDLLIKYSETMEESQKTALATQLQRLEKQKEGLGILK
QSLEKATDLFSGEDEFGYHNTFMNFTKQSHDIELGITKTNTEVSNQANLSNSSSSAIEQE
ITKVQQQIGEYQELRDAIINNRARLPTGNPHQSILNRYLVASQGQTQGTAEEPFLSQINQ
SIAGLESSIASLKIQQAGIGSVATYDNSLATKIEVLRTQFLQTASQQQLTVENQLTELKV
QLDQATQRLENNTLTSPSKGIVHLNSEFEGKNRIPTGTEIAQIFPVITDTREVLITYYVS
SDYLPLLDKGQTVRLKLEKIGNHGTTIIGQLQTIDQTPTRTEQGNLFKLTALAKLSNEDS
KLIQYGLQGRVTSVTTKKTYFDYFKDKILTHSD
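
For reference, headers in this layout could be generated and checked with a small script. This is only a sketch: it assumes the four-field description `EC_number~~~gene~~~product~~~COG` that the headers above appear to follow, and the function names `prokka_header` and `parse_prokka_header` are hypothetical helpers, not part of Prokka. The product field is optional here only because the headers above leave it empty.

```python
def prokka_header(seq_id, gene="", product="", ec_number="", cog=""):
    """Build a FASTA header as '>ID EC_number~~~gene~~~product~~~COG'.

    Assumption: the four-field '~~~'-separated layout seen in the
    headers above. Empty fields are allowed and stay empty.
    """
    if not seq_id or any(ch.isspace() for ch in seq_id):
        raise ValueError("sequence ID must be non-empty and contain no whitespace")
    for field in (gene, product, ec_number, cog):
        if "~~~" in field:
            raise ValueError("fields must not contain the '~~~' separator")
    return f">{seq_id} {ec_number}~~~{gene}~~~{product}~~~{cog}"


def parse_prokka_header(line):
    """Split a header built by prokka_header() back into its fields."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    seq_id, _, description = line[1:].rstrip("\n").partition(" ")
    fields = description.split("~~~")
    if len(fields) != 4:
        raise ValueError(f"expected 4 '~~~'-separated fields, got {len(fields)}")
    ec_number, gene, product, cog = fields
    return {
        "id": seq_id,
        "ec_number": ec_number,
        "gene": gene,
        "product": product,
        "cog": cog,
    }


# Reproduces the first header shown above (gene name only, other fields empty).
print(prokka_header("A0A384ZZZ3", gene="sliC"))
```

Parsing each header of the .faa file with `parse_prokka_header` is a quick way to confirm that every entry has exactly four fields before handing the file to --proteins.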

I am surprised to see that only one gene was annotated using this custom database:

[13:45:55] Running: prodigal -i prokka\/PROKKA_02052025\.fna -c -m -g 11 -p single -f sco -q
[13:45:56] Found 1984 CDS
[13:45:56] Connecting features back to sequences
[13:45:56] Not using genus-specific database. Try --usegenus to enable it.
[13:45:56] Preparing user-supplied primary BLAST annotation source: protein_sequences_prokka_ready.faa
[13:45:56] Guessed source was in fasta format.
[13:45:56] Running: makeblastdb -dbtype prot -in protein_sequences_prokka_ready\.faa -out prokka\/proteins -logfile /dev/null
[13:45:56] Using /inference source as 'protein_sequences_prokka_ready.faa'
[13:45:56] Annotating CDS, please be patient.
[13:45:56] Will use 3 CPUs for similarity searching.
[13:45:57] There are still 1984 unannotated CDS left (started with 1984)

Example from the proteins.tmp.blast file:

Query= 259

Length=61
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

A0A062WQJ3 ~~~cibA                                                    118     2e-39

A0A062WQJ3 ~~~cibA~~~~~~
Length=61

 Score = 118 bits (295),  Expect = 2e-39, Method: Compositional matrix adjust.
 Identities = 61/61 (100%), Positives = 61/61 (100%), Gaps = 0/61 (0%)

Query  1   MTNFDILDNQFLSLSENELSDIDGGLAPLVIFGVAVSWKAIAGGTALIGSGLAAGYFLGG  60
           MTNFDILDNQFLSLSENELSDIDGGLAPLVIFGVAVSWKAIAGGTALIGSGLAAGYFLGG
Sbjct  1   MTNFDILDNQFLSLSENELSDIDGGLAPLVIFGVAVSWKAIAGGTALIGSGLAAGYFLGG  60

Query  61  D  61
           D
Sbjct  61  D  61

When I blasted the genomic.faa of that particular genome against a local BLAST database built from the same .faa file, I got quite a few significant hits.

Image
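
One way to compare the two results would be to apply explicit e-value and query-coverage cutoffs to the standalone BLAST hits, since Prokka exposes `--evalue` and `--coverage` options for its own similarity search. The sketch below assumes the hits were saved in the standard 12-column tabular format (`-outfmt 6`) and that query lengths are known. The function name `filter_hits` and the threshold values are illustrative assumptions, not Prokka's internals or its defaults.

```python
def filter_hits(tabular_lines, query_lengths, max_evalue=1e-6, min_query_cov=80.0):
    """Keep hits passing an e-value and a query-coverage cutoff.

    Assumptions: input lines use the default 12-column BLAST tabular
    layout (qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore); query_lengths maps qseqid -> length
    in residues; both thresholds are placeholders to be adjusted.
    """
    kept = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        qseqid, sseqid = cols[0], cols[1]
        qstart, qend = int(cols[6]), int(cols[7])
        evalue = float(cols[10])
        # Percentage of the query covered by this single alignment (HSP).
        coverage = 100.0 * (abs(qend - qstart) + 1) / query_lengths[qseqid]
        if evalue <= max_evalue and coverage >= min_query_cov:
            kept.append((qseqid, sseqid, evalue, round(coverage, 1)))
    return kept


# The cibA hit from the proteins.tmp.blast excerpt, rewritten as a tabular row.
hits = ["259\tA0A062WQJ3\t100.0\t61\t0\t0\t1\t61\t1\t61\t2e-39\t118"]
print(filter_hits(hits, {"259": 61}))
```

Hits that look significant by e-value alone but cover only a short stretch of the query would be dropped by the coverage cutoff, which may be one reason the two searches disagree.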

Do you see what I am missing here?

Many thanks for your input and suggestions!
