Gff file mismatch -- non-printable character #119

cizydorczyk · 2023-05-04T03:19:47Z

When I try running AMRFinderPlus on Bakta output (faa, fna, and gff3 files), I get the following error:

GFF file mismatch.
*** ERROR ***
File bakta-unicycler/BI_12_1018/BI_12_1018.gff3, line 229: std::string Common_sp::unpercent(const string&):
Non-printable character: -30
Stack:
/home/anaconda3/envs/amrfinderplus-env/bin/gff_check(+0x15ca4) [0x557b38ba9ca4]
/home/anaconda3/envs/amrfinderplus-env/bin/gff_check(+0x2470c) [0x557b38bb870c]
/home/anaconda3/envs/amrfinderplus-env/bin/gff_check(+0xe9bc) [0x557b38ba29bc]
/home/anaconda3/envs/amrfinderplus-env/bin/gff_check(+0x1fc60) [0x557b38bb3c60]
/home/anaconda3/envs/amrfinderplus-env/bin/gff_check(+0x94b5) [0x557b38b9d4b5]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fd5d81e1d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fd5d81e1e40]
/home/anaconda3/envs/amrfinderplus-env/bin/gff_check(+0x99e1) [0x557b38b9d9e1]

And the offending line looks like:
contig_1 Prodigal CDS 99205 101682 . - 0 ID=BMMFLM_00545;Name=ATPase/5’-3’ helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);locus_tag=BMMFLM_00545;product=ATPase/5’-3’ helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);gene=recD;Parent=BMMFLM_00545_gene;inference=ab initio prediction:Prodigal:2.6;Note=COG:COG0507,COG:L,RefSeq:WP_001283316.1,SO:0001217,UniParc:UPI00003B1729,UniRef:UniRef100_A0A8D9YQT4,UniRef:UniRef50_K9B570,UniRef:UniRef90_A0A0D6H7H2

This has happened for several S. aureus genomes (20 genomes), and the offending line is always this annotated gene.

The command to run AMRFinderPlus was:
$ for i in $(cat bsi-isolate-list.txt); do amrfinder -p bakta-unicycler/${i}/${i}.faa -n bakta-unicycler/${i}/${i}.fna -g bakta-unicycler/${i}/${i}.gff3 --annotation_format bakta --name ${i} -o amrfinderplus-unicycler/${i}-amrfinderplus-output.txt -O Staphylococcus_aureus --threads 6 --protein_output amrfinderplus-unicycler/${i}-protein-seq.fasta --nucleotide_output amrfinderplus-unicycler/${i}-nucl-seq.fasta --plus; done

May be a bakta issue, but perhaps AMRFinderPlus doesn't like the single quotes in the product?

Any help is appreciated.
Thanks.


vbrover · 2023-05-04T13:59:49Z

Could you send us this gff3 file?

cizydorczyk · 2023-05-04T15:32:17Z

Here it is.

Sample1018.gff3.gz

When I removed the single quotes from the offending line (line 229 in this file), AMRFinderPlus ran just fine. Oddly, single quotes in other lines did not cause any issue.
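
One way to check whether the quotes on line 229 are the same bytes as the quotes on other lines is to scan the file for bytes outside the ASCII range. The following is a minimal Python sketch, not part of AMRFinderPlus or Bakta; `find_non_ascii` and the first sample record are hypothetical, and the second record reuses the product text from the offending line.

```python
def find_non_ascii(lines):
    """Yield (line_number, column, byte_value) for every byte >= 0x80.

    `lines` is any iterable of bytes objects, for example a file
    opened in binary mode.
    """
    for lineno, raw in enumerate(lines, start=1):
        for col, byte in enumerate(raw, start=1):
            if byte >= 0x80:
                yield lineno, col, byte


# In-memory sample: line 1 uses ASCII apostrophes (hypothetical record),
# line 2 uses U+2019 as in the product text reported above.
sample = [
    b"contig_1\tProdigal\tCDS\t1\t9\t.\t+\t0\tproduct=5'-3' exonuclease\n",
    "contig_1\tProdigal\tCDS\t10\t18\t.\t-\t0\tproduct=ATPase/5\u2019-3\u2019 helicase\n".encode("utf-8"),
]

hits = list(find_non_ascii(sample))
for lineno, col, byte in hits:
    print(f"line {lineno}, column {col}: byte 0x{byte:02X}")
```

To scan a real file, pass a handle opened in binary mode, for example `find_non_ascii(open("BI_12_1018.gff3", "rb"))`. Lines that only contain the plain ASCII apostrophe produce no hits.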

vbrover · 2023-05-04T15:47:44Z

Line 229 is
contig_1 Prodigal CDS 99205 101682 . - 0 ID=BMMFLM_00545;Name=ATPase/5â€™-3â€™ helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);locus_tag=BMMFLM_00545;product=ATPase/5â€™-3â€™ helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);gene=recD;Parent=BMMFLM_00545_gene;inference=ab initio prediction:Prodigal:2.6;Note=COG:COG0507,COG:L,RefSeq:WP_001283316.1,SO:0001217,UniParc:UPI00003B1729,UniRef:UniRef100_A0A8D9YQT4,UniRef:UniRef50_K9B570,UniRef:UniRef90_A0A0D6H7H2

vbrover · 2023-05-04T15:53:06Z

Is this 5â€™-3â€™ normal for your GFF files?

evolarjun · 2023-05-04T17:40:13Z

It looks to me like some kind of smart quote substitution may have struck. Did you happen to open/save the file in a word processor or editor that might have substituted the single ticks?

We recently started being more rigorous and correct about our handling of the GFF3 standard for URL-encoded special characters after discovering that BLAST reacts badly to some characters in the sequence identifiers. If this is a common issue with interpreting BAKTA output then we should figure something out with @oschwengers. Either we change things or he does so that we maintain compatibility.
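
For readers unfamiliar with the encoding mentioned above: GFF3 expects characters that have a reserved meaning in column 9 (such as `;`, `=`, `&`, and `,`) to be percent-encoded, and a reader decodes them after splitting the attributes. The sketch below illustrates that idea with Python's standard library. It is not the `unpercent` routine from AMRFinderPlus, and `parse_attributes` and the example attribute string are hypothetical.

```python
from urllib.parse import unquote


def parse_attributes(column9: str) -> dict:
    """Split a GFF3 attribute column on ';' and '=', then percent-decode.

    Simplified: multi-valued attributes (comma-separated) are not split.
    """
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[unquote(key)] = unquote(value)
    return attrs


# Hypothetical attribute string with an encoded comma (%2C) and semicolon (%3B).
attrs = parse_attributes(
    "ID=gene1;product=DNA polymerase III%2C alpha subunit%3B truncated"
)
print(attrs["product"])
```

Splitting before decoding matters: an encoded `%3B` inside a value must not be treated as a field separator.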

vbrover · 2023-05-04T19:08:39Z

Could you send us all 3 files (faa, fna, and gff3 files) to test?

cizydorczyk · 2023-05-04T21:30:32Z

Odd that the quotes appear weird -- they don't appear so in gedit or VSCode on my system (Ubuntu 22.04).

I did not open the gff files and re-save them; this happens for 20 files (Unicycler assembly + Bakta annotation) and another 20 (SKESA assembly + Bakta annotation).

Here are the 3 files:

Sample1018.faa.gz
Sample1018.fna.gz
Sample1018.gff3.gz

vbrover · 2023-05-04T21:40:03Z

Fixed in ver. 3.11.13:

$ amrfinder -g Sample1018.gff3 -p Sample1018.faa -n Sample1018.fna -d publish --plus -o aa -a bakta  -O Staphylococcus_aureus --threads 6 --protein_output aa.prot --nucleotide_output aa.nucl
Running: amrfinder -g Sample1018.gff3 -p Sample1018.faa -n Sample1018.fna -d publish --plus -o aa -a bakta -O Staphylococcus_aureus --threads 6 --protein_output aa.prot --nucleotide_output aa.nucl
Software directory: '/home/brovervv/code/amrfinder/'
Software version: 3.11.13
Database directory: '/home/brovervv/work/AMR/AMRFinder/publish'
Database version: 2023-05-04.1
AMRFinder combined translated and protein and mutation search
Running blastp
Running hmmsearch
Running tblastn
Running blastn
Making report
AMRFinder took 12 seconds to complete

oschwengers · 2023-05-05T06:56:15Z

Thanks @evolarjun for the heads up. So, just to be sure, was this a fix within amrfinder or a workaround for Bakta? If there is anything wrong with the Bakta GFF quoting, please let me know, so I can fix and appropriately handle that in Bakta.

evolarjun · 2023-05-05T15:21:17Z

Hi @oschwengers, it looks like the user is reporting output of BAKTA containing an extended character which amrfinderplus didn't handle well. I thought it was unlikely that BAKTA was naming genes with something other than base ASCII, but in case it is, I thought you should know.

We have an internal fix for AMRFinderPlus and will release it soon to handle this. The characters don't actually impact the part of the GFF that AMRFinderPlus needs. Extensive testing of BLAST showed us that BLAST can behave differently when identifiers contain certain characters so we added some, probably too extensive, input checking to AMRFinderPlus.

vbrover · 2023-05-06T01:09:47Z

> was this a fix within amrfinder or a workaround for Bakta?

Fix within amrfinder.

evolarjun · 2023-05-08T20:22:33Z

@oschwengers I ran a test using the BAKTA web interface and it appears that an annotation of the attached Sample1018.fna.gz does include a line with the UTF-8 character (\xE28099) used for the tick marks in the ATPase/5'-3' helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V). I assume this comes from one of the underlying databases BAKTA uses, and from my reading UTF-8 is formally allowed by the GFF3 spec.

I'm not sure whether you consider this a bug or not, but it's the only place this occurs that I have found. All other 5' and 3' I can find in the GFF output I've seen from BAKTA use the standard ASCII 39 '

Once we finish up the testing we'll release a version of AMRFinderPlus that will handle input like this ok. I just wanted to let you know in case using unicode characters in BAKTA output is something you want to avoid.
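
The byte sequence mentioned above appears to tie together the three symptoms in this thread. The short Python check below shows that U+2019 (RIGHT SINGLE QUOTATION MARK) encodes to the bytes E2 80 99 in UTF-8, that the first byte reads as -30 when interpreted as a signed 8-bit value, and that the same bytes decoded as Windows-1252 display as the garbled quotes seen earlier. That the checker reports bytes as signed 8-bit values, and that the garbled display came from a Windows-1252-style decoding, are assumptions that fit the output rather than confirmed details.

```python
import struct

quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK, the "tick" in 5’-3’

# UTF-8 encoding of the character: three bytes.
utf8 = quote.encode("utf-8")
print(utf8.hex())  # e28099

# First byte interpreted as a signed 8-bit integer (assumed to match
# how the "Non-printable character" message reports it).
first_signed = struct.unpack("b", utf8[:1])[0]
print(first_signed)  # -30

# The same three bytes decoded with a single-byte Windows-1252 codec.
mojibake = utf8.decode("cp1252")
print(mojibake)  # â€™
```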

oschwengers · 2023-05-09T14:31:56Z

Hi @evolarjun, thanks a lot for letting me know. I'm not quite sure if this should be handled as a bug. If it's covered by the GFF3 spec it should be OK, otherwise, if this is causing issues with several tools, I'll happily do my best to avoid these things. Simplest thing would be to add a product revision rule replacing \xE28099 by a standard ASCII '.
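
A replacement rule of the kind described above could look roughly like the following Python sketch. This is only an illustration and not Bakta's actual revision code; `TYPOGRAPHIC_TO_ASCII` and `normalize_product` are hypothetical names, and including the left single quote (U+2018) alongside U+2019 is an assumption.

```python
# Hypothetical mapping from typographic quotes to the ASCII apostrophe.
TYPOGRAPHIC_TO_ASCII = str.maketrans({
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK (assumed also unwanted)
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK (bytes E2 80 99 in UTF-8)
})


def normalize_product(product: str) -> str:
    """Return the product description with typographic quotes replaced."""
    return product.translate(TYPOGRAPHIC_TO_ASCII)


original = (
    "ATPase/5\u2019-3\u2019 helicase helicase subunit RecD of the "
    "DNA repair enzyme RecBCD (exonuclease V)"
)
cleaned = normalize_product(original)
print(cleaned)
```

Using `str.translate` keeps the rule a single pass over the string and leaves every other character, including legitimate non-ASCII text, untouched.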

AMRFinderPlus release 3.11.14

This release addresses a few issues brought up on GitHub. We weren't able to solve all of them when we couldn't reproduce them, but we are trying.

Changes:
- On failure no `-o` output file is created - #115
- AMRFinderPlus will now automatically decompress files ending in .gz with gunzip (relies on gunzip being in PATH) - #61
- AMRFinderPlus does not support unicode, but it no longer checks GFF files to prohibit unicode characters specifically - #119
- Add reporting of curl error messages - #120

evolarjun · 2023-05-10T14:43:58Z

Release 3.11.14 relaxes the checking on GFF files so GFFs containing "5â€™-3â€™ helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V)" should now work out of the box.

UTF-8 in general is not supported however, so we don't make guarantees on how AMRFinderPlus will handle other characters.

This release addresses a few issues brought up on GitHub.

Changes:
- On failure no `-o` output file is created - ncbi/amr#115
- AMRFinderPlus will now automatically decompress files ending in .gz with gunzip (this relies on gunzip being in PATH) - ncbi/amr#61
- AMRFinderPlus does not support unicode, but it will not check GFF files to prohibit extended ASCII or UTF-8 characters specifically (still prohibits GFF files with ASCII control characters 0x00 and 0x1F) - ncbi/amr#119
- Add reporting of curl error messages - ncbi/amr#120

evolarjun added the bug Something isn't working label May 9, 2023

evolarjun closed this as completed May 10, 2023

evolarjun mentioned this issue May 10, 2023

Release 3.11.14 #122

Merged
