Gff file mismatch -- non-printable character #119
Comments
Could you send us this gff3 file?
Here it is. When I removed the single quotes from the offending line (line 229 in this file), AMRFinderPlus ran just fine. Oddly, single quotes in other lines did not cause any issue. |
Line 229 is |
Is this |
It looks to me like some kind of smart quote substitution may have struck. Did you happen to open/save the file in a word processor or editor that might have substituted the single ticks? We recently started being more rigorous and correct about our handling of the GFF3 standard for URL-encoded special characters after discovering that BLAST reacts badly to some characters in the sequence identifiers. If this is a common issue with interpreting BAKTA output then we should figure something out with @oschwengers. Either we change things or he does so that we maintain compatibility. |
Could you send us all 3 files (faa, fna, and gff3 files) to test? |
Odd that the quotes appear weird; they don't appear so in gedit or VSCode on my system (Ubuntu 22.04). I did not open the gff files and re-save them. This happens for 20 files (Unicycler assembly + Bakta annotation) and another 20 (SKESA assembly + Bakta annotation). Here are the 3 files:
Fixed in ver. 3.11.13:
|
Thanks @evolarjun for the heads up. So, just to be sure, was this a fix within |
Hi @oschwengers, it looks like the user is reporting BAKTA output containing an extended character which AMRFinderPlus didn't handle well. I thought it was unlikely that BAKTA was naming genes with something other than base ASCII, but in case it is, I thought you should know. We have an internal fix for AMRFinderPlus and will release it soon to handle this. The characters don't actually impact the part of the GFF that AMRFinderPlus needs. Extensive testing of BLAST showed us that BLAST can behave differently when identifiers contain certain characters, so we added some (probably too extensive) input checking to AMRFinderPlus.
Fix within amrfinder. |
@oschwengers I ran a test using the BAKTA web interface, and an annotation of the attached Sample1018.fna.gz does include a line with the UTF-8 character (\xE28099, i.e. U+2019 RIGHT SINGLE QUOTATION MARK) used for the tick marks in the ATPase/5'-3' helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V). I assume this comes from one of the underlying databases BAKTA uses, and from my reading UTF-8 is formally allowed by the GFF3 spec. I'm not sure whether you consider this a bug or not, but it's the only place I have found where this occurs. All other 5' and 3' I can find in the GFF output I've seen from BAKTA use the standard ASCII 39 ('). Once we finish up the testing we'll release a version of AMRFinderPlus that handles input like this correctly. I just wanted to let you know in case using Unicode characters in BAKTA output is something you want to avoid.
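For anyone wanting to check their own annotation files, here is a minimal sketch of how the non-ASCII characters could be located. It is not part of AMRFinderPlus or BAKTA; the function name `find_non_ascii` and the shortened sample lines are made up for illustration.

```python
def find_non_ascii(lines):
    """Return (line_number, column, character, utf8_hex) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # anything outside 7-bit ASCII
                hits.append((line_no, col, ch, ch.encode("utf-8").hex()))
    return hits


# Shortened, hypothetical attribute text modelled on the line reported in this issue.
sample = [
    "ID=A;product=5'-3' exonuclease",
    "ID=B;product=ATPase/5\u2019-3\u2019 helicase",
]

for hit in find_non_ascii(sample):
    print(hit)
```

For a real file, the same function can be given an open file handle, e.g. `find_non_ascii(open(path, encoding="utf-8"))`, where `path` points at the gff3 file.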
Hi @evolarjun, thanks a lot for letting me know. I'm not quite sure if this should be handled as a bug. If it's covered by the GFF3 spec it should be OK, otherwise, if this is causing issues with several tools, I'll happily do my best to avoid these things. Simplest thing would be to add a product revision rule replacing \xE28099 by a standard ASCII |
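A replacement rule of the kind described could look roughly like the sketch below. This is an assumption about the general approach, not BAKTA's actual revision code; `QUOTE_MAP` and `asciify_quotes` are hypothetical names, and the set of characters mapped is only an illustrative choice.

```python
# Map typographic marks to the ASCII apostrophe (illustrative selection).
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (UTF-8 bytes E2 80 99)
    "\u2032": "'",  # prime
})


def asciify_quotes(product):
    """Return the product string with typographic single quotes replaced by ASCII 39."""
    return product.translate(QUOTE_MAP)


product = "ATPase/5\u2019-3\u2019 helicase helicase subunit RecD"
print(asciify_quotes(product))
```

A translation table keeps the rule to a single pass over the string and makes it easy to add further characters if other databases turn out to contain them.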
AMRFinderPlus release 3.11.14
This release addresses a few issues brought up on GitHub. We weren't able to solve all of them when we couldn't reproduce them, but we are trying.
Changes:
- On failure no `-o` output file is created - #115
- AMRFinderPlus will now automatically decompress files ending in .gz with gunzip (relies on gunzip being in PATH) - #61
- AMRFinderPlus does not support unicode, but it no longer checks GFF files to prohibit unicode characters specifically - #119
- Add reporting of curl error messages - #120
Release 3.11.14 relaxes the checking on GFF files so GFFs containing "5’-3’ helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V)" should now work out of the box. UTF-8 in general is not supported however, so we don't make guarantees on how AMRFinderPlus will handle other characters. |
This release addresses a few issues brought up on GitHub.
Changes:
- On failure no `-o` output file is created - ncbi/amr#115
- AMRFinderPlus will now automatically decompress files ending in .gz with gunzip (this relies on gunzip being in PATH) - ncbi/amr#61
- AMRFinderPlus does not support unicode, but it will not check GFF files to prohibit extended ASCII or UTF-8 characters specifically (still prohibits GFF files with ASCII control characters 0x00 and 0x1F) - ncbi/amr#119
- Add reporting of curl error messages - ncbi/amr#120
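To illustrate the relaxed behaviour, the following is a rough sketch of a check that tolerates UTF-8 bytes but still flags ASCII control characters. It is not the AMRFinderPlus implementation (which is not shown in this thread). It assumes the note refers to the range 0x00 through 0x1F, and it assumes tab, line feed and carriage return must be exempt because GFF3 is tab-delimited text; `control_bytes` and `ALLOWED` are hypothetical names.

```python
# Assumption: tab, LF and CR are permitted, since GFF3 lines are tab-delimited.
ALLOWED = {0x09, 0x0A, 0x0D}


def control_bytes(data):
    """Return (offset, byte_value) for each disallowed control byte in `data`."""
    return [
        (offset, byte)
        for offset, byte in enumerate(data)
        if byte < 0x20 and byte not in ALLOWED
    ]


# UTF-8 quotes encode to bytes >= 0x80, so they are not flagged.
ok = "contig_1\tProdigal\tCDS\tproduct=5\u2019-3\u2019 helicase\n".encode("utf-8")
# A NUL byte inside the line is flagged.
bad = b"contig_1\tProdigal\x00CDS\n"

print(control_bytes(ok))
print(control_bytes(bad))
```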
When I try running AMRFinderPlus on Bakta output (faa, fna, and gff3 files), I get the following error:
And the offending line looks like:
contig_1 Prodigal CDS 99205 101682 . - 0 ID=BMMFLM_00545;Name=ATPase/5’-3’ helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);locus_tag=BMMFLM_00545;product=ATPase/5’-3’ helicase helicase subunit RecD of the DNA repair enzyme RecBCD (exonuclease V);gene=recD;Parent=BMMFLM_00545_gene;inference=ab initio prediction:Prodigal:2.6;Note=COG:COG0507,COG:L,RefSeq:WP_001283316.1,SO:0001217,UniParc:UPI00003B1729,UniRef:UniRef100_A0A8D9YQT4,UniRef:UniRef50_K9B570,UniRef:UniRef90_A0A0D6H7H2
This has happened for several S. aureus genomes (20 genomes), and the offending line is always this annotated gene.
The command to run AMRFinderPlus was:
$ for i in $(cat bsi-isolate-list.txt); do
    amrfinder \
      -p bakta-unicycler/${i}/${i}.faa \
      -n bakta-unicycler/${i}/${i}.fna \
      -g bakta-unicycler/${i}/${i}.gff3 \
      --annotation_format bakta \
      --name ${i} \
      -o amrfinderplus-unicycler/${i}-amrfinderplus-output.txt \
      -O Staphylococcus_aureus \
      --threads 6 \
      --protein_output amrfinderplus-unicycler/${i}-protein-seq.fasta \
      --nucleotide_output amrfinderplus-unicycler/${i}-nucl-seq.fasta \
      --plus
  done
May be a bakta issue, but perhaps AMRFinderPlus doesn't like the single quotes in the product?
Any help is appreciated.
Thanks.