Gff file mismatch -- non-printable character #119

Closed
cizydorczyk opened this issue May 4, 2023 · 14 comments
Labels
bug Something isn't working

Comments

@cizydorczyk

cizydorczyk commented May 4, 2023

When I try running AMRFinderPlus on Bakta output (faa, fna, and gff3 files), I get the following error:

GFF file mismatch.
*** ERROR ***
File bakta-unicycler/BI_12_1018/BI_12_1018.gff3, line 229: std::string Common_sp::unpercent(const string&):
Non-printable character: -30
Stack:
/home/anaconda3/envs/amrfinderplus-env/bin/gff_check(+0x15ca4) [0x557b38ba9ca4]
/home/anaconda3/envs/amrfinderplus-env/bin/gff_check(+0x2470c) [0x557b38bb870c]
/home/anaconda3/envs/amrfinderplus-env/bin/gff_check(+0xe9bc) [0x557b38ba29bc]
/home/anaconda3/envs/amrfinderplus-env/bin/gff_check(+0x1fc60) [0x557b38bb3c60]
/home/anaconda3/envs/amrfinderplus-env/bin/gff_check(+0x94b5) [0x557b38b9d4b5]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fd5d81e1d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fd5d81e1e40]
/home/anaconda3/envs/amrfinderplus-env/bin/gff_check(+0x99e1) [0x557b38b9d9e1]

And the offending line looks like:
contig_1 Prodigal CDS 99205 101682 . - 0 ID=BMMFLM_00545;Name=ATPase/5’-3’ helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);locus_tag=BMMFLM_00545;product=ATPase/5’-3’ helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);gene=recD;Parent=BMMFLM_00545_gene;inference=ab initio prediction:Prodigal:2.6;Note=COG:COG0507,COG:L,RefSeq:WP_001283316.1,SO:0001217,UniParc:UPI00003B1729,UniRef:UniRef100_A0A8D9YQT4,UniRef:UniRef50_K9B570,UniRef:UniRef90_A0A0D6H7H2

This has happened for several S. aureus genomes (20 genomes), and the offending line is always this annotated gene.

The command to run AMRFinderPlus was:
$ for i in $(cat bsi-isolate-list.txt); do amrfinder -p bakta-unicycler/${i}/${i}.faa -n bakta-unicycler/${i}/${i}.fna -g bakta-unicycler/${i}/${i}.gff3 --annotation_format bakta --name ${i} -o amrfinderplus-unicycler/${i}-amrfinderplus-output.txt -O Staphylococcus_aureus --threads 6 --protein_output amrfinderplus-unicycler/${i}-protein-seq.fasta --nucleotide_output amrfinderplus-unicycler/${i}-nucl-seq.fasta --plus; done

May be a bakta issue, but perhaps AMRFinderPlus doesn't like the single quotes in the product?

Any help is appreciated.
Thanks.

@vbrover
Contributor

vbrover commented May 4, 2023

Could you send us this gff3 file?

@cizydorczyk
Author

Here it is.

Sample1018.gff3.gz

When I removed the single quotes from the offending line (line 229 in this file), AMRFinderPlus ran just fine. Oddly, single quotes in other lines did not cause any issue.
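One way to see why only this line fails is to look for characters outside the ASCII range, since a typographic apostrophe (’) and an ASCII apostrophe (') look nearly identical in most editors. The following is a minimal sketch, not part of AMRFinderPlus or Bakta; the function name `find_non_ascii` and the sample lines are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def find_non_ascii(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (line_number, column, character) for every non-ASCII character.

    Line numbers and columns are 1-based, matching what most editors show.
    """
    for line_number, line in enumerate(lines, start=1):
        for column, char in enumerate(line, start=1):
            if ord(char) > 127:
                yield line_number, column, char


# Hypothetical sample: the first product uses ASCII apostrophes,
# the second uses U+2019 (right single quotation mark).
sample = [
    "product=5'-3' exonuclease",
    "product=ATPase/5\u2019-3\u2019 helicase",
]

for line_number, column, char in find_non_ascii(sample):
    print(f"line {line_number}, column {column}: U+{ord(char):04X}")
```

To check a real file, the same function can be given an open file handle, for example one created with `open(path, encoding="utf-8")`.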

@vbrover
Contributor

vbrover commented May 4, 2023

Line 229 is
contig_1 Prodigal CDS 99205 101682 . - 0 ID=BMMFLM_00545;Name=ATPase/5’-3’ helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);locus_tag=BMMFLM_00545;product=ATPase/5’-3’ helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);gene=recD;Parent=BMMFLM_00545_gene;inference=ab initio prediction:Prodigal:2.6;Note=COG:COG0507,COG:L,RefSeq:WP_001283316.1,SO:0001217,UniParc:UPI00003B1729,UniRef:UniRef100_A0A8D9YQT4,UniRef:UniRef50_K9B570,UniRef:UniRef90_A0A0D6H7H2

@vbrover
Contributor

vbrover commented May 4, 2023

Is this 5’-3’ normal for your GFF files?

@evolarjun
Contributor

It looks to me like some kind of smart quote substitution may have struck. Did you happen to open/save the file in a word processor or editor that might have substituted the single ticks?

We recently started being more rigorous and correct about our handling of the GFF3 standard for URL-encoded special characters after discovering that BLAST reacts badly to some characters in the sequence identifiers. If this is a common issue with interpreting BAKTA output then we should figure something out with @oschwengers. Either we change things or he does so that we maintain compatibility.
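For context, the GFF3 specification asks that reserved characters in the attributes column (such as `;`, `=`, `,` and tabs) be percent-encoded, for example `%3B` for a semicolon. A rough sketch of decoding such a value and then applying a strict printable-ASCII check, similar in spirit to the check that rejected the line above, might look like this. It is an illustration only: `decode_strict` is a hypothetical name, and this is not the AMRFinderPlus implementation.

```python
from urllib.parse import unquote_to_bytes


def decode_strict(value: str) -> str:
    """Percent-decode a GFF3 attribute value, rejecting non-printable-ASCII bytes.

    Illustrative only: the accepted range (0x20 to 0x7E) is an assumption
    chosen to mimic a strict "printable ASCII" rule.
    """
    # Non-ASCII characters in `value` are encoded as UTF-8 bytes here.
    raw = unquote_to_bytes(value)
    for byte in raw:
        if not 0x20 <= byte <= 0x7E:
            raise ValueError(f"Non-printable character: 0x{byte:02X}")
    return raw.decode("ascii")


print(decode_strict("transposase%3B IS200 family"))
```

Under a rule like this, a product name containing a UTF-8 apostrophe is rejected at its first byte, even though the text is valid UTF-8.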

@vbrover
Contributor

vbrover commented May 4, 2023

Could you send us all 3 files (faa, fna, and gff3 files) to test?

@cizydorczyk
Author

cizydorczyk commented May 4, 2023

Odd that the quotes appear weird -- they don't appear so in gedit or VSCode on my system (Ubuntu 22.04).

I did not open the gff files and re-save them; this happens for 20 files (Unicycler assembly + Bakta annotation) and another 20 (SKESA assembly + Bakta annotation).

Here are the 3 files:

Sample1018.faa.gz
Sample1018.fna.gz
Sample1018.gff3.gz

@vbrover
Contributor

vbrover commented May 4, 2023

Fixed in ver. 3.11.13:

$ amrfinder -g Sample1018.gff3 -p Sample1018.faa -n Sample1018.fna -d publish --plus -o aa -a bakta  -O Staphylococcus_aureus --threads 6 --protein_output aa.prot --nucleotide_output aa.nucl
Running: amrfinder -g Sample1018.gff3 -p Sample1018.faa -n Sample1018.fna -d publish --plus -o aa -a bakta -O Staphylococcus_aureus --threads 6 --protein_output aa.prot --nucleotide_output aa.nucl
Software directory: '/home/brovervv/code/amrfinder/'
Software version: 3.11.13
Database directory: '/home/brovervv/work/AMR/AMRFinder/publish'
Database version: 2023-05-04.1
AMRFinder combined translated and protein and mutation search
Running blastp
Running hmmsearch
Running tblastn
Running blastn
Making report
AMRFinder took 12 seconds to complete

@oschwengers

Thanks @evolarjun for the heads up. So, just to be sure, was this a fix within amrfinder or a workaround for Bakta? If there is anything wrong with the Bakta GFF quoting, please let me know, so I can fix and appropriately handle that in Bakta.

@evolarjun
Contributor

evolarjun commented May 5, 2023

Hi @oschwengers, it looks like the user is reporting BAKTA output containing an extended character that AMRFinderPlus didn't handle well. I thought it was unlikely that BAKTA was naming genes with anything other than basic ASCII, but in case it is, I thought you should know.

We have an internal fix for AMRFinderPlus and will release it soon to handle this. The characters don't actually impact the part of the GFF that AMRFinderPlus needs. Extensive testing of BLAST showed us that BLAST can behave differently when identifiers contain certain characters, so we added some (probably too extensive) input checking to AMRFinderPlus.

@vbrover
Contributor

vbrover commented May 6, 2023

> was this a fix within amrfinder or a workaround for Bakta?

Fix within amrfinder.

@evolarjun
Contributor

@oschwengers I ran a test using the BAKTA web interface and it appears that an annotation of the attached Sample1018.fna.gz does include a line with the UTF-8 character (\xE28099) used for the tick marks in the ATPase/5'-3' helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V). I assume this comes from one of the underlying databases BAKTA uses, and from my reading UTF-8 is formally allowed by the GFF3 spec.

I'm not sure whether you consider this a bug or not, but it's the only place this occurs that I have found. All other 5' and 3' I can find in the GFF output I've seen from BAKTA use the standard ASCII 39 '

Once we finish up the testing we'll release a version of AMRFinderPlus that will handle input like this ok. I just wanted to let you know in case using unicode characters in BAKTA output is something you want to avoid.
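The `-30` in the original error message is consistent with this byte sequence. U+2019 (right single quotation mark) encodes in UTF-8 as the bytes E2 80 99, and 0xE2 read as a signed 8-bit value is -30. The snippet below is a small check of that arithmetic; it assumes the error message printed the byte as a signed `char`, which is typical on x86-64 Linux but is an inference from the message, not something confirmed in this thread.

```python
import struct

# U+2019 RIGHT SINGLE QUOTATION MARK, the character found in the product name.
curly_apostrophe = "\u2019"
utf8_bytes = curly_apostrophe.encode("utf-8")

# The three UTF-8 bytes, shown as hexadecimal.
utf8_hex = utf8_bytes.hex()

# Reinterpret the first byte as a signed 8-bit integer, as a C/C++ `char`
# would be on platforms where `char` is signed.
first_byte_as_signed_char = struct.unpack("b", utf8_bytes[:1])[0]

print(utf8_hex)
print(first_byte_as_signed_char)
```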

@oschwengers

Hi @evolarjun, thanks a lot for letting me know. I'm not quite sure if this should be handled as a bug. If it's covered by the GFF3 spec it should be OK, otherwise, if this is causing issues with several tools, I'll happily do my best to avoid these things. Simplest thing would be to add a product revision rule replacing \xE28099 by a standard ASCII '.
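A replacement rule of the kind described here could be as small as the following sketch. The function name `normalize_apostrophes` is hypothetical and this is not Bakta's actual product revision code; it only shows the substitution of U+2019 with an ASCII apostrophe.

```python
def normalize_apostrophes(product: str) -> str:
    """Replace the typographic apostrophe U+2019 with ASCII 39 (')."""
    return product.replace("\u2019", "'")


# Product name from the issue, written with U+2019 escapes for clarity.
original = "ATPase/5\u2019-3\u2019 helicase helicase subunit RecD"
cleaned = normalize_apostrophes(original)

print(cleaned)
print(cleaned.isascii())
```

A rule like this only covers the one character reported in this issue; other non-ASCII characters coming from upstream databases would pass through unchanged.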

@evolarjun evolarjun added the bug Something isn't working label May 9, 2023
evolarjun added a commit that referenced this issue May 10, 2023
AMRFinderPlus release 3.11.14

This release addresses a few issues brought up on GitHub. We weren't able to solve all of them when we couldn't reproduce them, but we are trying.

Changes:
- On failure no `-o` output file is created - #115
- AMRFinderPlus will now automatically decompress files ending in .gz with gunzip (relies on gunzip being in PATH) - #61
- AMRFinderPlus does not support unicode, but it no longer checks GFF files to prohibit unicode characters specifically - #119
- Add reporting of curl error messages - #120

@evolarjun
Contributor

Release 3.11.14 relaxes the checking on GFF files so GFFs containing "5’-3’ helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V)" should now work out of the box.

UTF-8 in general is not supported, however, so we make no guarantees about how AMRFinderPlus will handle other characters.

evolarjun pushed a commit to bioconda/bioconda-recipes that referenced this issue May 10, 2023
This release addresses a few issues brought up on GitHub.

Changes:
- On failure no `-o` output file is created - ncbi/amr#115
- AMRFinderPlus will now automatically decompress files ending in .gz with gunzip (this relies on gunzip being in PATH) - ncbi/amr#61
- AMRFinderPlus does not support unicode, but it will not check GFF files to prohibit extended ASCII or UTF-8 characters specifically (still prohibits GFF files with ASCII control characters 0x00 and 0x1F) - ncbi/amr#119
- Add reporting of curl error messages - ncbi/amr#120