fastq groomer does not parse sequence ids that contain spaces properly #130
Comments
Is this - https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy_utils/sequence/fastq.py - the code? I think it has been re-packaged and is available as a tool shed package. If yes, I guess Galaxy should be considered the canonical home and "we" will have to create new recipes for updates. |
That is likely the place, though that is probably legacy code. The toolshed version gets the source from http://depot.galaxyproject.org/package/source/galaxy_sequence_utils/galaxy_sequence_utils-1.0.0.tgz |
The error message is correct. Those files are counter to the "spec" as written in http://dx.doi.org/10.1093/nar/gkp1137 The Is there a bug tracker for the iMicrobe Project to report this? |
Actually, re-reading, the message is confusing (but it should still be an error). |
Hmmm... if that paper is the best we have as a specification, it seems unclear:
Given that ambiguity, I would say this is less of a bug and more of a feature request, then. It seems that if the ID matches but the description does not, it would make sense to accept it. "Be conservative in what you send, be liberal in what you accept." Though I agree that they are poorly formatted files, and I'll see what I can do to submit a bug report. |
The "groomer" is meant to sanitize FASTQ input, so this would be a reasonable feature request. (And sorry if the paper wasn't as clear as it could have been on this.) |
No problem, the paper is actually quite a nice description of the formats used. It's pretty hard to write a specification for a file format that already exists... ;-) Also, I was able to submit feedback to the iMicrobe site regarding those files, so we'll see what sort of response I get, but I don't expect them to actually change the existing "archived" files. |
I suppose that the English is slightly off and the error could perhaps be more readable as Because 'there is no explicit limitation on the characters expected' for sequence content, and because line-wrapping may be uneven between and within blocks, we cannot actually parse the provided example as FASTQ while adhering to the spec unless the user explicitly makes additional declarations, such as 'simple 4 line blocks' and 'ignore inconsistent identifiers'. |
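To illustrate what those two declarations could mean in practice, here is a minimal sketch of a reader that assumes strict four-line blocks and can optionally tolerate a "+" line that differs from the "@" line. The function name `read_simple_fastq` and the flag `ignore_inconsistent_identifiers` are hypothetical and are not taken from the Galaxy code; the read names in the demo are made up.

```python
import io


def read_simple_fastq(handle, ignore_inconsistent_identifiers=False):
    """Yield (title, sequence, quality) tuples from strict 4-line FASTQ blocks.

    Assumes no line-wrapping: every record is exactly four lines long.
    """
    while True:
        title = handle.readline()
        if not title:
            return  # end of input
        title = title.rstrip("\r\n")
        if not title:
            continue  # skip blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("Not a simple 4-line FASTQ block: %r" % title)
        if len(sequence) != len(quality):
            raise ValueError("Sequence and quality lengths differ: %r" % title)
        # A bare '+' is always fine; a repeated title must match unless relaxed.
        repeated = plus[1:]
        if repeated and repeated != title[1:] and not ignore_inconsistent_identifiers:
            raise ValueError("Inconsistent identifiers: %r vs %r" % (title, plus))
        yield title[1:], sequence, quality


# Hypothetical record whose '+' line carries a different description.
example = io.StringIO("@read1 first\nACGT\n+read1 second\nIIII\n")
records = list(read_simple_fastq(example, ignore_inconsistent_identifiers=True))
```

Without the flag, the same input raises `ValueError`, which corresponds to the current strict behaviour described in this thread.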
For what it's worth, EMBOSS 6.6.0.0
I didn't look through the code to see how they handle things, but it seems reasonable to make some assumptions in this case. One option would be to parse the READID and DESCRIPTION separately, and match on READID. This is extremely unlikely to be erroneous. BTW, where is the code for this, besides being in Galaxy core? |
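A minimal sketch of that option, assuming the read ID is everything up to the first whitespace on the "@" or "+" line and the description is whatever follows. The helper names `split_header` and `identifiers_match` are hypothetical and do not come from the Galaxy utilities.

```python
def split_header(line):
    """Split a FASTQ '@' or '+' line into (read_id, description)."""
    body = line.rstrip("\r\n")[1:]   # drop the leading '@' or '+'
    parts = body.split(None, 1)      # split on the first run of whitespace
    read_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return read_id, description


def identifiers_match(title_line, plus_line):
    """Return True if the '+' line is bare or repeats the read ID of the '@' line.

    Descriptions are deliberately ignored, so only the read IDs are compared.
    """
    plus_id, _ = split_header(plus_line)
    if not plus_id:
        return True  # a bare '+' line is always acceptable
    title_id, _ = split_header(title_line)
    return plus_id == title_id
```

With this comparison, a record whose two header lines share a read ID but carry different descriptions would be accepted, while a record with different read IDs would still be rejected.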
@lparsons we are currently working on EMBOSS wrappers. We already have the binaries: galaxyproject/tools-iuc#65 and @erasche and I are working to get these automatically converted with different strategies. EMBOSS ACD is a really difficult format to deal with. |
@lparsons Yep. Packages are available here: http://gx.hx42.org/job/Docker-Build/ and @natefoo is (hopefully going to be) mirroring some of the docker-build stuff onto depot as noted in galaxyproject/starforge#10 My WIP for automated emboss wrappers will be in this PR if you wish to follow that. galaxyproject/tools-iuc#131 |
Any chance of this feature being implemented or perhaps the EMBOSS |
The issue is that the FastqReader class parses the entire "@" line as a sequence id. The specification actually allows for an "optional" description after a space, similar to FASTA. Some files I have seen (such as those from the iMicrobe Project) contain lines like:
These result in the following error:
I believe this would be a simple fix, however, I'm not actually sure where the code for those utility classes is hosted.