Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fastq groomer does not parse sequence ids that contain spaces properly #130

Open
lparsons opened this issue May 8, 2015 · 12 comments
Open
Labels

Comments

@lparsons
Copy link
Contributor

lparsons commented May 8, 2015

The issue is that the FastqReader class parses the entire "@" line as a sequence id. The specification actually allows for an "optional" description, similar to FASTA which comes after a space. Some files I have seen (such as those from the iMicrobe Project) contain lines like:

@<READID> <DESCRIPTION>
ATCGGTTTCGTTGTGTTATTCGCGGCCAAGGGTTTTGTCGTCGTTATATT
+<READID>
^__ec\cccgce]`J[R`[[`ee][ccecag__aafU\baT_WLY^\Xac

These result in the following error:

Exception: Invalid FASTQ file: could not find quality score of sequence identifier @<READID> <DESCRIPTION>.

I believe this would be a simple fix, however, I'm not actually sure where the code for those utility classes is hosted.

@jmchilton
Copy link
Member

Is this - https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy_utils/sequence/fastq.py - the code? I think it has been re-packaged and is available as a tool shed package. If yes, I guess Galaxy should be considered the canonical home and "we" will have to create new recipes for updates.

@lparsons
Copy link
Contributor Author

lparsons commented May 8, 2015

@peterjc
Copy link
Contributor

peterjc commented May 8, 2015

The error message is correct. Those files are counter to the "spec" as written in http://dx.doi.org/10.1093/nar/gkp1137

The + line should be just + or include the FULL string as used on the @ line.

Is there a bug tracker for the iMicrobe Project to report this?

@peterjc peterjc closed this as completed May 8, 2015
@peterjc peterjc reopened this May 8, 2015
@peterjc
Copy link
Contributor

peterjc commented May 8, 2015

Actually, re-reading, the message is confusing (but it should still be an error).

@lparsons
Copy link
Contributor Author

lparsons commented May 8, 2015

Hmmm... if that paper is the best we have as a specification, it seems unclear:

Third, to signal the end of the sequence lines and the start of the quality string, comes the ‘+’ line.
Originally this also included a full repeat of the title line text (as shown in the NCBI example above);
however, by common usage and the MAQ tool convention, this is optional and the ‘+’ line can
contain just this one character, reducing the file size significantly. The OBF tools follow this MAQ
convention on output, and omit the optional repeated title text.

Given that ambiguity, I would say this is less of a bug and more of feature request then. It seems, that if the ID matches, but the description does not, it would make sense to accept it. "Be conservative in what you send, be liberal in what you accept"

Though, I agree, they are poorly formatted files, and I'll see what I can do to submit a bug report.

@peterjc
Copy link
Contributor

peterjc commented May 8, 2015

The "groomer" is meant to sanitize FASTQ input, so this would be a reasonable feature request.

(And sorry if the paper wasn't as clear as it could have been on this)

@lparsons
Copy link
Contributor Author

lparsons commented May 8, 2015

No problem, the paper is actually quite a nice description of the formats used. It's pretty hard to write a specification for a file format that already exists... ;-)

Also, I was able to submit feedback to the iMicrobe site regarding those files, so we'll see what sort of response I get, but I don't expect them to actually change the existing "archived" files.

@blankenberg
Copy link
Member

I suppose that the english is slightly off and the error could perhaps be more readable as could not find quality score(s) for sequence identified as "@<READID> <DESCRIPTION>"., as this is literally what is happening. The entirety of the file from this read onward is being parsed as the sequence content.

Due to 'there is no explicit limitation on the characters expected' for sequence content and since we can have possibly uneven line-wrapping between and within blocks, we actually cannot parse the provided example as FASTQ and adhere to the spec, without a user explicitly making additional declarations, such as 'simple 4 line blocks' and 'ignore inconsistent identifiers'.

@lparsons
Copy link
Contributor Author

lparsons commented May 8, 2015

For what it's worth, EMBOSS 6.6.0.0 seqret worked like a charm on these (and pretty fast). Is there any work being done on the EMBOSS wrapper to include this functionality?

seqret fastq-illumina::SAMPLE.fastq fastq-sanger::SAMPLE.sanger.fastq

I didn't look through the code to see how they handle things, but it seems reasonable to make some assumptions in this case. One option would be to parse the READID and DESCRIPTION separately, and match on READID. This is extremely unlikely to be erroneous.

BTW, where is the code for this, besides being in Galaxy core?

@bgruening
Copy link
Member

@lparsons we are working currently on EMBOSS wrappers. We have already the binaries: galaxyproject/tools-iuc#65 and @erasche and me are working to get these automatically converted with different strategies. EMBOSS ACD is a really difficult format to deal with.

@hexylena
Copy link
Member

hexylena commented May 8, 2015

@lparsons Yep. Packages are available here: http://gx.hx42.org/job/Docker-Build/ and @natefoo is (hopefully going to be) mirroring some of the docker-build stuff onto depot as noted in galaxyproject/starforge#10

My WIP for automated emboss wrappers will be in this PR if you wish to follow that. galaxyproject/tools-iuc#131

@nsoranzo nsoranzo added the bug label May 20, 2015
@lparsons
Copy link
Contributor Author

lparsons commented Aug 2, 2017

Any chance of this feature being implemented or perhaps the EMBOSS seqret wrapper being updated to handle fastq format?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants