CNV plotting tools may fail within GATK Docker for large inputs because of limited tmpfs space. #4140
Comments
You would not have access to Docker container options when using the Google backend, because the running of your image is entirely controlled by the Pipelines API. You would be able to set that value when running on a local backend, but that's probably not portable enough for your workflow.
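For reference, on a local backend the option in question would live in the backend's `submit-docker` hook in the Cromwell configuration. The following is a minimal sketch, not a tested configuration: the `--shm-size=2g` value is illustrative, and the exact set of `docker run` flags your deployment needs may differ.

```hocon
backend {
  providers {
    Local {
      config {
        # Illustrative: pass a larger --shm-size to every docker run
        # issued by this backend. The ${...} placeholders are Cromwell's
        # standard submit-docker template variables.
        submit-docker = """
          docker run --rm --shm-size=2g \
            -v ${cwd}:${docker_cwd} \
            ${docker} ${job_shell} ${docker_script}
        """
      }
    }
  }
}
```

As the comment above notes, no equivalent hook is exposed for the Google backend, which is what makes this workaround non-portable.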
OK. @LeeTL1220 How do you want to handle this? I'd strongly prefer to stick with data.table despite the GitHub issue above, since it's much faster than the usual read.table (e.g., 4 seconds vs. 45 seconds for your ~9.7M-row WGS copy-ratio TSV that originally caused the error).

Is there any other way we can increase /dev/shm size? @droazen @jamesemery @lbergelson Any thoughts?
I hate to say it: "45 seconds but works with WGS" is better than "4 seconds but doesn't work with WGS". I'm open to suggestions, though.
--
Lee Lichtenstein
Broad Institute
75 Ames Street, Room 8011A
Cambridge, MA 02142
617 714 8632
To be clear, this will work perfectly fine as long as you have enough space in /dev/shm, which is typically true everywhere outside of our default Docker container. I'm loath to cripple a tool just because of limitations that are fundamentally elsewhere; let's just address those in the appropriate places. (Furthermore, I'm especially loath to write a plotting tool that takes ~5 minutes to generate a plot!) And yes, while it is not great that data.table forces us to use /dev/shm, I think fread("grep ...") is relatively standard.

If --shm-size is indeed not exposed, why doesn't the Google backend scale /dev/shm or other tmpfs space with requested machine memory?

If there really is no other way around it, then all we're doing is filtering out the lines beginning with @. We could do this first by calling system commands within R to write to a temporary file, and then reading that back in with fread. This seems hacky to me, but I've confirmed that it works within the Docker. This will solve our immediate problem, but I still think it's worth taking a look at those other limitations elsewhere now as well.

We've had to do that in other places...
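For concreteness, the temp-file workaround described above amounts to something like the following. This is a sketch, not the actual script change: the file names and the toy input are illustrative, and in the R script the filtering step would be invoked via system() before calling fread on the temporary file.

```shell
# Pre-filter the input outside of fread, so that fread reads a plain
# file rather than spawning a shell pipe (which stages its output in
# /dev/shm and hits the 64MB limit inside the default container).
tmp_in=$(mktemp); tmp_out=$(mktemp)

# Toy input: SAM-style "@"-prefixed header lines followed by TSV data.
printf '@HD\tVN:1.6\nCONTIG\tSTART\tEND\n1\t1000\t2000\n' > "$tmp_in"

# Drop the header lines beginning with "@".
grep -v '^@' "$tmp_in" > "$tmp_out"

# In R, one would now call fread(tmp_out) instead of
# fread("grep -v '^@' input.tsv").
cat "$tmp_out"

rm -f "$tmp_in" "$tmp_out"
```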
Opened a PR. However, I think it's reasonable to ask why we don't have the equivalent of submit-docker for the Google backend.
@samuelklee What do you mean by "submit-docker"? Do you want something baked into gatk-launch?
@lbergelson Sorry to be unclear---this isn't a GATK issue. For Cromwell, you can configure various options for each backend. For example, if you are running on a local backend with Docker, you can set a submit-docker command that controls how the container is launched; however, it sounds like this is not exposed for the Google backend.

If that's the case, then this is more of an issue with the Cromwell/Google Pipelines interface than the data.table package (although, as the discussion in the GitHub issue above shows, it'd be a simple fix on the data.table end, so I'm not sure why it's not addressed yet...).

Changing the R script to get around the issue in this particular case is not unacceptably ugly, but you could imagine we might run into a similar problem in the future if anything else exceeds the 64MB /dev/shm limit and also cannot specify tmpfs. So perhaps we should take a look at the underlying issue.
These tools use fread + grep preprocessing (fread("grep ...")) to quickly read in large TSVs in the backend R scripts. Unfortunately, because of Rdatatable/data.table#1139 and the fact that /dev/shm is limited to 64MB in a standard GATK Docker container, this can yield an error when running within Docker.

Starting a Docker container with a sufficiently large --shm-size resolves this, but I am not sure if we can access this via standard runtime attributes in Cromwell. Not sure if we'd want to increase this in the Docker image itself either.
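As a rough illustration of the constraint under discussion, the tmpfs headroom can be inspected from inside the container before relying on a shell-pipe read. This is a hypothetical diagnostic snippet, not part of the tools; the 64MB figure it guards against is Docker's default `--shm-size`.

```shell
# Report the available space on /dev/shm in KB (Docker's default shm
# size is 64MB unless overridden with --shm-size). Fall back gracefully
# if /dev/shm does not exist on this system.
shm_avail_kb=$(df -Pk /dev/shm 2>/dev/null | awk 'NR==2 {print $4}')
echo "avail_kb=${shm_avail_kb:-unknown}"
```

A value well below the size of the input TSV would predict exactly the failure reported in this issue.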