Accept piped input FastqToSam #915

Closed
yesimon opened this issue Aug 29, 2017 · 7 comments


@yesimon

yesimon commented Aug 29, 2017

Feature request

Tool(s) involved

SAM/BAM, Metrics etc.

Description

Allow Picard to accept piped input for commands working with large files

Currently this is not possible for most tools operating on SAM/BAM input, because file type autodetection reads the beginning of the file and then calls .reset() on the InputStream, which cannot be done on /dev/stdin or named pipe input. Similarly, FastqToSam autodetects the quality score format and then reopens the stream. However, by using something like RereadableInputStream (https://tika.apache.org/1.2/api/org/apache/tika/utils/RereadableInputStream.html), Picard can support input pipes combined with file autodetection.
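
To make the idea concrete, here is a minimal sketch of peeking at the head of a non-rewindable stream by buffering it. It uses only the JDK's BufferedInputStream rather than Tika's RereadableInputStream, and the class and method names (PeekableInput, markable, peek) are hypothetical, not part of Picard or htsjdk.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper, not Picard/htsjdk code.
class PeekableInput {
    /** Returns the stream unchanged if it can mark/reset, otherwise wraps it in a buffer. */
    static InputStream markable(InputStream in, int bufferSize) {
        return in.markSupported() ? in : new BufferedInputStream(in, bufferSize);
    }

    /** Reads up to n bytes for format sniffing, then rewinds so the caller sees them again. */
    static byte[] peek(InputStream in, int n) {
        try {
            in.mark(n);
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int r = in.read(buf, total, n - total);
                if (r < 0) break; // stream ended before n bytes were available
                total += r;
            }
            in.reset();
            return Arrays.copyOf(buf, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Works when stdin is a pipe, e.g. `zcat reads.fastq.gz | java PeekableInput`.
        InputStream in = markable(System.in, 8192);
        byte[] head = peek(in, 1);
        System.out.println(head.length == 1 && head[0] == '@' ? "looks like FASTQ" : "unknown");
    }
}
```

The peeked bytes are held in memory only up to the mark limit, so this only helps when detection needs a bounded prefix of the input.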

@nh13
Collaborator

nh13 commented Aug 29, 2017

In some cases, special-purpose tools were built to allow streaming (e.g. MarkDuplicatesWithMateCigar versus MarkDuplicates), while in other cases, the tools could be modified on a case-by-case basis if there is a valid use case and enough developer interest. @yesimon, was FastqToSam the only one you were interested in? Happy to give some direction if you are interested in submitting a PR.

@yesimon
Author

yesimon commented Aug 30, 2017

The most basic step would be letting FastqToSam support piped input. Right now it has to do quality score format autodetection (one non-backwards-compatible change would be for QUALITY_FORMAT=Standard to actually set the quality format instead of just providing something to sanity-check against). More broadly, the SamReaderFactory of htsjdk can wrap around RereadableInputStream with a very small performance penalty (probably insignificant), allowing piped SAM. VCFs tend to be smaller, so there's less of a need there.
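
For readers unfamiliar with why detection needs a look at the data first, the following is a simplified sketch of quality-format guessing based on the lowest quality character seen. It is not Picard's actual detector; the class name, enum, and thresholds (ASCII 59 and 64, the usual lower bounds for Solexa and Phred+64 encodings) are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical, simplified heuristic. Not Picard's detection code.
class QualityGuess {
    enum Format { STANDARD, SOLEXA, ILLUMINA }

    /** Guesses the encoding from the smallest quality character across the sampled lines. */
    static Format guess(Iterable<String> qualityLines) {
        int min = Integer.MAX_VALUE;
        for (String q : qualityLines) {
            for (int i = 0; i < q.length(); i++) {
                min = Math.min(min, q.charAt(i));
            }
        }
        if (min == Integer.MAX_VALUE) {
            throw new IllegalArgumentException("no quality values seen");
        }
        if (min < 59) return Format.STANDARD; // below ';' is only valid with a +33 offset
        if (min < 64) return Format.SOLEXA;   // ';'..'?' implies negative Solexa scores
        // Ambiguous: a uniformly high-quality +33 file would also land here.
        return Format.ILLUMINA;
    }

    public static void main(String[] args) {
        System.out.println(guess(Arrays.asList("IIII#", "IIIII")));
    }
}
```

The ambiguous last branch is why a user-supplied QUALITY_FORMAT is useful as more than a sanity check when the input cannot be read twice.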

I don't see how any of the MarkDuplicates... tools allow piped input right now. They all seem to use the SamReaderFactory class, which does autodetection. MarkDuplicates is a one-pass program, which means it should be quite amenable to piped input.

@nh13
Collaborator

nh13 commented Aug 30, 2017

MarkDuplicates is not one pass... sorry. Also, a lot of tools allow piped SAM. I'd be excited to see some more benchmarks on the implementation you propose. Folks are likely busy with their own work (i.e. they likely don't get paid to support Picard), so if you're up for a PR, I am sure I am not out of line in suggesting that folks would be willing to review. But that's better left to the Broad Picard folks.

@yfarjoun
Contributor

yfarjoun commented Aug 30, 2017 via email

yesimon changed the title from "Accept piped input" to "Accept piped input FastqToSam" on Sep 5, 2017
@yesimon
Author

yesimon commented Sep 5, 2017

It turns out that most tools accepting SAM input do work on piped input (it all boils down to checking InputStream.markSupported() on the input). That narrows this request to just FastqToSam. My thought is to skip quality score sanity-checking if QUALITY_FORMAT is provided and the input is a pipe.
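
The proposed rule can be sketched as a small decision table. Everything here is hypothetical (QualityCheckPolicy, Action, and the choice to reject a pipe with no QUALITY_FORMAT are assumptions, not existing Picard behavior); the only real API used is java.nio.file.Files.isRegularFile, which returns false for named pipes.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical policy sketch. Not Picard code.
class QualityCheckPolicy {
    enum Action { DETECT, DETECT_AND_CHECK, TRUST_GIVEN_FORMAT, REJECT }

    /** Decides how to handle quality format given what the user supplied and the input type. */
    static Action decide(boolean formatProvided, boolean rereadable) {
        if (rereadable) {
            // Regular file: the input can be opened twice, so detection is always possible.
            return formatProvided ? Action.DETECT_AND_CHECK : Action.DETECT;
        }
        // Pipe: one pass only, so either trust the user or give up (an assumed choice).
        return formatProvided ? Action.TRUST_GIVEN_FORMAT : Action.REJECT;
    }

    /** Treats anything that is not a regular file (named pipe, missing path) as single-pass. */
    static boolean isRereadable(Path input) {
        return Files.isRegularFile(input);
    }

    public static void main(String[] args) {
        Path input = Paths.get(args.length > 0 ? args[0] : "/dev/stdin");
        boolean formatProvided = args.length > 1; // stand-in for QUALITY_FORMAT being set
        System.out.println(decide(formatProvided, isRereadable(input)));
    }
}
```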

@yfarjoun
Contributor

yfarjoun commented Sep 5, 2017 via email
