
Run in local with docker - error waiting for container read: operation timed out #3370

Closed
magicDGS opened this issue Mar 7, 2018 · 6 comments

Comments

@magicDGS

magicDGS commented Mar 7, 2018

When trying to run a workflow locally with Docker containers, I found that sometimes (not always, and not always in the same part of the pipeline; it depends on the context) there is an error about a timed-out operation.

It looks like some tasks that take longer do not get a response from the container (although it is still running), and thus Cromwell assumes a failure (because docker returns -1 even though the container is still running) and the workflow finishes with errors. In the logs for the task, embedded in the standard error from the operation, I get the following signature:

time="2018-03-07T14:17:55+01:00" level=error msg="error waiting for container: read tcp 192.168.99.1:56961->192.168.99.101:2376: read: operation timed out"

And the rc file is marked with -1. I cannot continue on this return code, because the task is still running in the container, and continuing would assume that the operation has finished.
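A minimal sketch of how to confirm that the container is indeed still alive after the -1 appears in the rc file (assuming the submit-docker script wrote the container id via --cidfile; the cid file path below is illustrative):

# read the container id written by --cidfile (path is illustrative)
cid=$(cat path/to/call-dir/execution/docker_cid)

# show the container state and the exit code docker has recorded for it, if any
docker inspect --format '{{.State.Status}} {{.State.ExitCode}}' "$cid"

# or simply check whether it still shows up among the running containers
docker ps --no-trunc --filter "id=$cid"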

My local configuration file looks like this:

include required(classpath("application"))

## always keep the workflow logs
workflow-options.workflow-log-temporary: false

backend.providers.Local.config {
    ## limit the number of jobs
    concurrent-job-limit = 10
    filesystems.local {
        ## do not allow copy (huge files)
        ## prefer hard-links
        localization: ["hard-link", "soft-link"]
        caching.duplication-strategy: ["hard-link", "soft-link"]
    }
}

And the Cromwell command is (using a brew-installed wrapper):

JAVA_OPTS="-Dconfig.file=local.conf" cromwell run --inputs inputs.json --metadata-output metadata-output.json workflow.wdl

This error happens for different workflows and tasks, so it is very difficult to account for. In addition, a long-running workflow stops because of it and requires a retry of the whole pipeline on my system, so it is a real problem when a time-consuming workflow has to be restarted for failures that are not real.

Is there any way for the local backend (or any backend) to catch the Docker timeout failures and re-attach? Or maybe the script.submit or script.background could check that the container has really stopped and finished before returning a misleading error code?

Thank you in advance!

@magicDGS
Author

magicDGS commented Mar 7, 2018

I think this is related to docker/machine#2517, but I believe that Cromwell could be more robust against a container that is still running but was detached due to a timeout.

@geoffjentry
Contributor

This is interesting @magicDGS - thanks for the report. We'll try to take a look at it in the not too distant future.

@magicDGS
Author

magicDGS commented Mar 7, 2018

@geoffjentry - thanks for the quick answer. Looking forward to having this fixed!

@danbills danbills self-assigned this Mar 7, 2018
@magicDGS
Author

magicDGS commented Mar 8, 2018

I came up with a custom (and dirty) workaround for this issue. In my configuration file, I changed the backend.providers.Local.config.submit-docker script to the following:

# run as in the original configuration, but without the --rm flag (the container is removed later)
docker run \
  --cidfile ${docker_cid} \
  -i \
  ${"--user " + docker_user} \
  --entrypoint /bin/bash \
  -v ${cwd}:${docker_cwd} \
  ${docker} ${script}

# get the return code (works even if the container was detached)
rc=$(docker wait `cat ${docker_cid}`)

# remove the container after waiting
docker rm `cat ${docker_cid}`

# return exit code
exit $rc

Maybe this could be the default value in the reference configuration file to solve the problem, but it might be better to have a post-docker configuration that is added to the pipeline, similar to script-epilogue. This would make configuring Docker runs easier by separating submission from checks.
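To make the proposal concrete, a hypothetical sketch of what such a configuration could look like (script-epilogue is an existing option that appends a command to the end of the generated task script; the post-docker key does not exist in Cromwell and is shown only as an illustration of the idea):

backend.providers.Local.config {
    # existing option: command appended at the end of the generated task script
    script-epilogue = "sync"

    # hypothetical option (not implemented): commands run after docker returns and
    # before the rc file is read, separating submission from post-run checks
    # post-docker = """
    #     rc=$(docker wait `cat ${docker_cid}`)
    #     docker rm `cat ${docker_cid}`
    #     exit $rc
    # """
}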

For now, I will use the following local configuration to continue my work with the Cromwell runner:

include required(classpath("application"))

## always keep the workflow logs
workflow-options.workflow-log-temporary: false

backend.providers.Local.config {
    ## limit the number of jobs
    concurrent-job-limit = 15
    # set the root directory to the run
    filesystems.local {
        ## do not allow copy (huge files)
        localization: ["hard-link", "soft-link"]
        caching.duplication-strategy: ["hard-link", "soft-link"]
    }
    # custom submit-docker to work around a container detached due to a timeout in the virtual machine
    # first, the container is not removed when it finishes (no --rm flag)
    # then, docker wait blocks until the container really finishes and returns the real exit code, even if the client was detached
    # once it finishes, the container is removed with docker rm
    # finally, the stored "real return code" is returned
    submit-docker = """
        docker run \
          --cidfile ${docker_cid} \
          -i \
          ${"--user " + docker_user} \
          --entrypoint /bin/bash \
          -v ${cwd}:${docker_cwd} \
          ${docker} ${script}
        rc=$(docker wait `cat ${docker_cid}`)
        docker rm `cat ${docker_cid}`
        exit $rc
    """
}

By the way, it looks like the documentation for configuring the local backend is still under development (http://cromwell.readthedocs.io/en/develop/tutorials/LocalBackendIntro/). I think this kind of thing could be part of the docs if it is not included as a default in the source code. Let me know if I can do something to help document the local backend, which I am using as my default one.

@danbills
Contributor

@magicDGS Thanks for reporting! The fix will be out with cromwell 32.

@magicDGS
Author

Thank you for including my fix.

Just to let you know, I realized that stderr/stdout would not be included in the Cromwell output for the task if the container is detached; thus, a better option would be to re-attach somehow (I didn't explore the idea). Maybe worth looking at for Cromwell 33 (should I open a new issue for that?)
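A minimal sketch of that direction (untested; the output file names are illustrative and not what Cromwell actually expects): the submit-docker script could dump the container's captured streams with docker logs after docker wait, before removing the container:

cid=$(cat ${docker_cid})

# block until the container really finishes and capture its true exit code,
# even if the original docker run client was detached by a timeout
rc=$(docker wait "$cid")

# recover whatever the container wrote to its stdout/stderr streams from the log driver
# (file names are illustrative; Cromwell's own stdout/stderr handling may differ)
docker logs "$cid" > recovered-stdout.log 2> recovered-stderr.log

docker rm "$cid"
exit "$rc"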

tomkinsc added a commit to broadinstitute/viral-pipelines that referenced this issue Mar 18, 2024

wait to remove docker images until docker cid file indicates completion
see: broadinstitute/cromwell#3370 (comment)