fastq groomer does not parse sequence ids that contain spaces properly #130

lparsons · 2015-05-08T15:36:44Z

The issue is that the FastqReader class parses the entire "@" line as a sequence id. The specification actually allows for an "optional" description, similar to FASTA which comes after a space. Some files I have seen (such as those from the iMicrobe Project) contain lines like:

@<READID> <DESCRIPTION>
ATCGGTTTCGTTGTGTTATTCGCGGCCAAGGGTTTTGTCGTCGTTATATT
+<READID>
^__ec\cccgce]`J[R`[[`ee][ccecag__aafU\baT_WLY^\Xac

These result in the following error:

Exception: Invalid FASTQ file: could not find quality score of sequence identifier @<READID> <DESCRIPTION>.

I believe this would be a simple fix, however, I'm not actually sure where the code for those utility classes is hosted.

The text was updated successfully, but these errors were encountered:

jmchilton · 2015-05-08T15:42:36Z

Is this - https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy_utils/sequence/fastq.py - the code? I think it has been re-packaged and is available as a tool shed package. If yes, I guess Galaxy should be considered the canonical home and "we" will have to create new recipes for updates.

lparsons · 2015-05-08T15:46:18Z

That is likely the place, though that is probably legacy code. The toolshed version gets the source from http://depot.galaxyproject.org/package/source/galaxy_sequence_utils/galaxy_sequence_utils-1.0.0.tgz

See: https://github.com/galaxyproject/tools-devteam/blob/master/packages/package_galaxy_utils_1_0/tool_dependencies.xml#L6

peterjc · 2015-05-08T15:47:35Z

The error message is correct. Those files are counter to the "spec" as written in http://dx.doi.org/10.1093/nar/gkp1137

The + line should be just + or include the FULL string as used on the @ line.

Is there a bug tracker for the iMicrobe Project to report this?

peterjc · 2015-05-08T15:48:21Z

Actually, re-reading, the message is confusing (but it should still be an error).

lparsons · 2015-05-08T16:01:01Z

Hmmm... if that paper is the best we have as a specification, it seems unclear:

Third, to signal the end of the sequence lines and the start of the quality string, comes the ‘+’ line.
Originally this also included a full repeat of the title line text (as shown in the NCBI example above);
however, by common usage and the MAQ tool convention, this is optional and the ‘+’ line can
contain just this one character, reducing the file size significantly. The OBF tools follow this MAQ
convention on output, and omit the optional repeated title text.

Given that ambiguity, I would say this is less of a bug and more of feature request then. It seems, that if the ID matches, but the description does not, it would make sense to accept it. "Be conservative in what you send, be liberal in what you accept"

Though, I agree, they are poorly formatted files, and I'll see what I can do to submit a bug report.

peterjc · 2015-05-08T16:04:38Z

The "groomer" is meant to sanitize FASTQ input, so this would be a reasonable feature request.

(And sorry if the paper wasn't as clear as it could have been on this)

lparsons · 2015-05-08T16:10:15Z

No problem, the paper is actually quite a nice description of the formats used. It's pretty hard to write a specification for a file format that already exists... ;-)

Also, I was able to submit feedback to the iMicrobe site regarding those files, so we'll see what sort of response I get, but I don't expect them to actually change the existing "archived" files.

blankenberg · 2015-05-08T20:52:03Z

I suppose that the english is slightly off and the error could perhaps be more readable as could not find quality score(s) for sequence identified as "@<READID> <DESCRIPTION>"., as this is literally what is happening. The entirety of the file from this read onward is being parsed as the sequence content.

Due to 'there is no explicit limitation on the characters expected' for sequence content and since we can have possibly uneven line-wrapping between and within blocks, we actually cannot parse the provided example as FASTQ and adhere to the spec, without a user explicitly making additional declarations, such as 'simple 4 line blocks' and 'ignore inconsistent identifiers'.

lparsons · 2015-05-08T21:01:13Z

For what it's worth, EMBOSS 6.6.0.0 seqret worked like a charm on these (and pretty fast). Is there any work being done on the EMBOSS wrapper to include this functionality?

seqret fastq-illumina::SAMPLE.fastq fastq-sanger::SAMPLE.sanger.fastq

I didn't look through the code to see how they handle things, but it seems reasonable to make some assumptions in this case. One option would be to parse the READID and DESCRIPTION separately, and match on READID. This is extremely unlikely to be erroneous.

BTW, where is the code for this, besides being in Galaxy core?

bgruening · 2015-05-08T22:14:45Z

@lparsons we are working currently on EMBOSS wrappers. We have already the binaries: galaxyproject/tools-iuc#65 and @erasche and me are working to get these automatically converted with different strategies. EMBOSS ACD is a really difficult format to deal with.

hexylena · 2015-05-08T22:19:26Z

@lparsons Yep. Packages are available here: http://gx.hx42.org/job/Docker-Build/ and @natefoo is (hopefully going to be) mirroring some of the docker-build stuff onto depot as noted in galaxyproject/starforge#10

My WIP for automated emboss wrappers will be in this PR if you wish to follow that. galaxyproject/tools-iuc#131

lparsons · 2017-08-02T16:49:32Z

Any chance of this feature being implemented or perhaps the EMBOSS seqret wrapper being updated to handle fastq format?

peterjc closed this as completed May 8, 2015

peterjc reopened this May 8, 2015

nsoranzo added the bug label May 20, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fastq groomer does not parse sequence ids that contain spaces properly #130

fastq groomer does not parse sequence ids that contain spaces properly #130

lparsons commented May 8, 2015

jmchilton commented May 8, 2015

lparsons commented May 8, 2015

peterjc commented May 8, 2015

peterjc commented May 8, 2015

lparsons commented May 8, 2015

peterjc commented May 8, 2015

lparsons commented May 8, 2015

blankenberg commented May 8, 2015

lparsons commented May 8, 2015

bgruening commented May 8, 2015

hexylena commented May 8, 2015

lparsons commented Aug 2, 2017

fastq groomer does not parse sequence ids that contain spaces properly #130

fastq groomer does not parse sequence ids that contain spaces properly #130

Comments

lparsons commented May 8, 2015

jmchilton commented May 8, 2015

lparsons commented May 8, 2015

peterjc commented May 8, 2015

peterjc commented May 8, 2015

lparsons commented May 8, 2015

peterjc commented May 8, 2015

lparsons commented May 8, 2015

blankenberg commented May 8, 2015

lparsons commented May 8, 2015

bgruening commented May 8, 2015

hexylena commented May 8, 2015

lparsons commented Aug 2, 2017