Run in local with docker - error waiting for container read: operation timed out #3370
Comments
I think this is related to docker/machine#2517, but I believe that Cromwell could be more robust when a container is still running but has been detached because of a timeout.
This is interesting @magicDGS - thanks for the report. We'll try to take a look at it in the not too distant future.
@geoffjentry - thanks for the quick answer. Looking forward to having this fixed!
I came up with a custom, quick-and-dirty way of working around this issue. In my configuration file, I changed the command that runs the container:
# run as in the original configuration without --rm flag (will remove later)
docker run \
--cidfile ${docker_cid} \
-i \
${"--user " + docker_user} \
--entrypoint /bin/bash \
-v ${cwd}:${docker_cwd} \
${docker} ${script}
# get the return code (working even if the container was detached)
rc=$(docker wait `cat ${docker_cid}`)
# remove the container after waiting
docker rm `cat ${docker_cid}`
# return exit code
exit $rc
Maybe this could be the default value in the reference configuration file to solve the problem, but maybe it is better to have a [...]. For now, I will use the following local configuration to continue my work with the Cromwell runner:
By the way, it looks like the documentation for configuring the local backend is still under development (http://cromwell.readthedocs.io/en/develop/tutorials/LocalBackendIntro/). I think that this kind of thing could be part of the docs if it is not included as the default in the source code. Let me know if I can do something to help document the local backend, which is the one I use by default.
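The workaround in the comment above can be wrapped in a small helper. This is a minimal sketch, not Cromwell's implementation: the function name `wait_for_container` and the sentinel value 125 are hypothetical choices, and it assumes the container was started with `--cidfile` and without `--rm`, as in the command above.

```shell
# Hypothetical helper: wait for the container recorded in a cid file,
# remove it, and return its exit code. Assumes `docker run` was started
# with --cidfile and without --rm, as in the workaround above.
wait_for_container() {
    cid_file=$1

    # An empty or missing cid file means the container never started.
    # 125 is an arbitrary sentinel chosen for this sketch.
    if [ ! -s "$cid_file" ]; then
        echo "no container id in $cid_file" >&2
        return 125
    fi

    cid=$(cat "$cid_file")

    # `docker wait` blocks until the container stops and prints its exit
    # code, even if the original `docker run` client was disconnected.
    rc=$(docker wait "$cid") || return 125

    # Remove the stopped container now that its exit code is known.
    docker rm "$cid" >/dev/null

    return "$rc"
}
```

Quoting the file name and checking that the cid file is non-empty avoids calling `docker wait` with no arguments when the container failed to start.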
@magicDGS Thanks for reporting! The fix will be out with Cromwell 32.
Thank you for including my fix. Just to let you know, I realized that the stderr/stdout would not be included in the Cromwell output for the task if the container is detached, so a better option would be to re-attach somehow (I didn't explore the idea). It may be worth looking at for Cromwell 33 (should I open a new issue for that?)
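One possible alternative to re-attaching is to read the output back with `docker logs` once the container has stopped, which works as long as the container has not been removed yet. A minimal sketch, assuming a logging driver that `docker logs` can read (such as the default `json-file`) and a container started without a TTY so the two streams stay separate; the function name and its file arguments are hypothetical:

```shell
# Hypothetical helper: after a container has stopped, copy what it wrote
# to stdout and stderr into two files. Assumes the container still exists
# (no --rm), was started without -t, and uses a logging driver that
# `docker logs` can read, such as the default json-file driver.
collect_container_output() {
    cid=$1
    stdout_file=$2
    stderr_file=$3

    # `docker logs` replays the container's stdout on its own stdout and
    # the container's stderr on its own stderr.
    docker logs "$cid" >"$stdout_file" 2>"$stderr_file"
}
```

In the workaround this would have to run after `docker wait` and before `docker rm`, since the logs disappear with the container.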
…ocker cid file indicates completion wait to remove docker images until docker cid file indicates completion see: broadinstitute/cromwell#3370 (comment)
When running a workflow locally with Docker containers, I sometimes get an operation timed-out error. It does not happen every time, it depends on the context, and it does not always occur in the same part of the pipeline.
It looks like some longer-running tasks stop getting a response from the container (although it is still running), so Cromwell assumes a failure (because docker returns -1 even though the container is still running) and the workflow finishes with errors. In the logs for the task, embedded in the standard error from the operations, I get the following signature:
And the rc file is marked with -1. I cannot continue on this return code, because the task is still running on the container and continuing assumes that the operation is finished.
My local configuration file looks like this:
And the cromwell command is (using a brew installed wrapper):
JAVA_OPTS="-Dconfig.file=local.conf" cromwell run --inputs inputs.json --metadata-output metadata-output.json workflow.wdl
This error happens for different workflows and tasks, so it is very difficult to predict. In addition, a long-running workflow stops because of it and, on my system, the whole pipeline has to be retried, so it is a real problem when a time-consuming workflow has to be restarted because of failures that are not real.
Is there any way for the local backend (or any backend) to catch the docker timeout failures and re-attach? Or maybe script.submit or script.background could check that the container has really stopped and finished before returning a misleading error code?
Thank you in advance!
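A check of this kind could be sketched with `docker inspect`, which reports whether a container is still running. This only illustrates the idea and is not what `script.submit` does; the function name is hypothetical:

```shell
# Hypothetical helper: succeed (exit status 0) only if the Docker daemon
# reports that the container is still running.
container_is_running() {
    cid=$1

    # .State.Running is "true" while the container's main process is alive.
    # If the container is unknown, `docker inspect` fails and so do we.
    state=$(docker inspect --format '{{.State.Running}}' "$cid" 2>/dev/null) || return 1
    [ "$state" = "true" ]
}
```

A wrapper could call this when `docker run` returns an error and, if the container is still running, fall back to `docker wait` instead of reporting the failure.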