
Run in local with docker - error waiting for container read: operation timed out #3370

Closed
magicDGS opened this issue Mar 7, 2018 · 6 comments

Comments

@magicDGS

magicDGS commented Mar 7, 2018

When trying to run a workflow locally with Docker containers, I found that sometimes (not always, and not always in the same part of the pipeline; it depends on the context) there is an error about a timed-out operation.

It looks like some tasks that take longer do not get a response from the container (although it is still running), and thus Cromwell assumes a failure (because docker returns -1 even though the container is still running) and the workflow finishes with errors. In the logs for the task, embedded in the standard error from the operation, I get the following signature:

time="2018-03-07T14:17:55+01:00" level=error msg="error waiting for container: read tcp 192.168.99.1:56961->192.168.99.101:2376: read: operation timed out"

And the rc file is marked with -1. I cannot continue on this return code, because the task is still running in the container, and continuing would assume that the operation has finished.
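A minimal sketch of how to confirm that the container is indeed still alive after the -1 appears in the rc file (assuming the submit-docker script wrote the container id via --cidfile; the cid file path below is illustrative):

# read the container id written by --cidfile (path is illustrative)
cid=$(cat path/to/call-dir/execution/docker_cid)

# show the container state and the exit code docker has recorded for it, if any
docker inspect --format '{{.State.Status}} {{.State.ExitCode}}' "$cid"

# or simply check whether it still shows up among the running containers
docker ps --no-trunc --filter "id=$cid"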

My local configuration file looks like this:

include required(classpath("application"))

## always keep the workflow logs
workflow-options.workflow-log-temporary: false

backend.providers.Local.config {
    ## limit the number of jobs
    concurrent-job-limit = 10
    filesystems.local {
        ## do not allow copy (huge files)
        ## prefer hard-links
        localization: ["hard-link", "soft-link"]
        caching.duplication-strategy: ["hard-link", "soft-link"]
    }
}

And the Cromwell command is (using a brew-installed wrapper):

JAVA_OPTS="-Dconfig.file=local.conf" cromwell run --inputs inputs.json --metadata-output metadata-output.json workflow.wdl

This error happens for different workflows and tasks, so it is very difficult to account for. In addition, a long-running workflow stops because of it and requires a retry of the whole pipeline on my system, so it is a real problem when a time-consuming workflow has to be restarted for failures that are not real.

Is there any way for the local backend (or any backend) to catch the Docker timeout failures and re-attach? Or maybe the script.submit or script.background could check that the container has really stopped and finished before returning a misleading error code?

Thank you in advance!

@magicDGS
Author

magicDGS commented Mar 7, 2018

I think this is related to docker/machine#2517, but I believe that Cromwell could be more robust against a container that is still running but was detached due to a timeout.

@geoffjentry
Contributor

This is interesting @magicDGS - thanks for the report. We'll try to take a look at it in the not too distant future.

@magicDGS
Author

magicDGS commented Mar 7, 2018

@geoffjentry - thanks for the quick answer. Looking forward to having this fixed!

@danbills danbills self-assigned this Mar 7, 2018
@magicDGS
Author

magicDGS commented Mar 8, 2018

I came up with a custom (and dirty) workaround for this issue. In my configuration file, I changed the backend.providers.Local.config.submit-docker script to the following:

# run as in the original configuration, but without the --rm flag (the container is removed later)
docker run \
  --cidfile ${docker_cid} \
  -i \
  ${"--user " + docker_user} \
  --entrypoint /bin/bash \
  -v ${cwd}:${docker_cwd} \
  ${docker} ${script}

# get the return code (works even if the container was detached)
rc=$(docker wait `cat ${docker_cid}`)

# remove the container after waiting
docker rm `cat ${docker_cid}`

# return exit code
exit $rc

Maybe this could be the default value in the reference configuration file to solve the problem, but it might be better to have a post-docker configuration that is added to the pipeline, similar to script-epilogue. This would make configuring Docker runs easier by separating submission from checks.
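To make the proposal concrete, a hypothetical sketch of what such a configuration could look like (script-epilogue is an existing option that appends a command to the end of the generated task script; the post-docker key does not exist in Cromwell and is shown only as an illustration of the idea):

backend.providers.Local.config {
    # existing option: command appended at the end of the generated task script
    script-epilogue = "sync"

    # hypothetical option (not implemented): commands run after docker returns and
    # before the rc file is read, separating submission from post-run checks
    # post-docker = """
    #     rc=$(docker wait `cat ${docker_cid}`)
    #     docker rm `cat ${docker_cid}`
    #     exit $rc
    # """
}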

For now, I will use the following local configuration to continue my work with the Cromwell runner:

include required(classpath("application"))

## always keep the workflow logs
workflow-options.workflow-log-temporary: false

backend.providers.Local.config {
    ## limit the number of jobs
    concurrent-job-limit = 15
    # set the root directory to the run
    filesystems.local {
        ## do not allow copy (huge files)
        localization: ["hard-link", "soft-link"]
        caching.duplication-strategy: ["hard-link", "soft-link"]
    }
    # custom submit-docker to work around a container detached due to a timeout in the virtual machine
    # first, the container is not removed when it finishes (no --rm flag)
    # then, docker wait blocks until the container really finishes and returns the real exit code, even if the client was detached
    # once it finishes, the container is removed with docker rm
    # finally, the stored "real return code" is returned
    submit-docker = """
        docker run \
          --cidfile ${docker_cid} \
          -i \
          ${"--user " + docker_user} \
          --entrypoint /bin/bash \
          -v ${cwd}:${docker_cwd} \
          ${docker} ${script}
        rc=$(docker wait `cat ${docker_cid}`)
        docker rm `cat ${docker_cid}`
        exit $rc
    """
}

By the way, it looks like the documentation for configuring the local backend is still under development (http://cromwell.readthedocs.io/en/develop/tutorials/LocalBackendIntro/). I think this kind of thing could be part of the docs if it is not included as a default in the source code. Let me know if I can do something to help document the local backend, which I am using as my default one.

@danbills
Contributor

@magicDGS Thanks for reporting! The fix will be out with cromwell 32.

@magicDGS
Author

Thank you for including my fix.

Just to let you know, I realized that stderr/stdout would not be included in the Cromwell output for the task if the container is detached; thus, a better option would be to re-attach somehow (I didn't explore the idea). Maybe worth looking at for Cromwell 33 (should I open a new issue for that?)
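A minimal sketch of that direction (untested; the output file names are illustrative and not what Cromwell actually expects): the submit-docker script could dump the container's captured streams with docker logs after docker wait, before removing the container:

cid=$(cat ${docker_cid})

# block until the container really finishes and capture its true exit code,
# even if the original docker run client was detached by a timeout
rc=$(docker wait "$cid")

# recover whatever the container wrote to its stdout/stderr streams from the log driver
# (file names are illustrative; Cromwell's own stdout/stderr handling may differ)
docker logs "$cid" > recovered-stdout.log 2> recovered-stderr.log

docker rm "$cid"
exit "$rc"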

tomkinsc added a commit to broadinstitute/viral-pipelines that referenced this issue Mar 18, 2024

wait to remove docker images until docker cid file indicates completion
see: broadinstitute/cromwell#3370 (comment)