WX-1828 Private Docker Hub repos in GCP Batch #7515

Merged: 6 commits, Sep 27, 2024


mcovarr
Contributor

@mcovarr mcovarr commented Aug 26, 2024

Description

UPDATE: issues with special characters in passwords appear to be resolved

PR to demo broken private Docker repo support in GCP Batch. There are actually multiple existing PAPI v2 Centaur tests in this vein; the one test enabled here for GCP Batch seems to be the simplest and demonstrates the issues clearly enough.

The crux of this test is that the Docker image that is specified for the task is in a private repo to which the Centaur service account has been granted access. This test passes on PAPI v2 but on GCP Batch jobs fail with messages like the following visible in gcloud batch jobs describe:

Job state is set from RUNNING to FAILED for job projects/1005074806481/locations/us-central1/jobs/job-27607753-d2d5-404d-89af-a786da8ad383.Job
      failed due to task failure. Specifically, task with index 0 failed due to the
      following task event: "Task state is updated from RUNNING to FAILED on zones/us-central1-b/instances/8098872438472929780
      with exit code 125."

Exit code 125 being a typical "something's wrong with that Docker invocation" error.
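For readers less familiar with that convention: the Docker CLI documentation reserves a few exit statuses for failures of `docker run` itself, as opposed to failures of the command inside the container. A minimal sketch of that mapping follows; the helper name `explain_exit_code` is hypothetical and not part of Cromwell or GCP Batch.

```python
# Exit statuses that `docker run` reserves for its own failures, per the
# Docker CLI documentation. Any other status comes from the contained command.
DOCKER_RUN_EXIT_CODES = {
    125: "the Docker daemon reported an error (for example a bad flag or a failed pull)",
    126: "the contained command could not be invoked",
    127: "the contained command could not be found",
}


def explain_exit_code(code: int) -> str:
    """Return a human-readable hint for a `docker run` exit status (hypothetical helper)."""
    if code in DOCKER_RUN_EXIT_CODES:
        return f"docker run error: {DOCKER_RUN_EXIT_CODES[code]}"
    return f"container command exited with status {code}"


if __name__ == "__main__":
    print(explain_exit_code(125))
```

A contained command can in principle also exit with 125 on its own, so this is a hint rather than proof that the pull or login step failed.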

In Cloud Logging I see the following, including what looks like a plaintext password which I have x'd out below:

Executing runnable container:{image_uri:"broadinstitute/cloud-cromwell@sha256:0d51f90e1dd6a449d4587004c945e43f2a7bbf615151308cff40c15998cc3ad4" commands:"/mnt/disks/cromwell_root/script" entrypoint:"/bin/bash" volumes:"/mnt/disks/cromwell_root:/mnt/disks/cromwell_root" username:"firecloud" password:"xxxxx"} labels:{key:"tag" value:"UserRunnable"} for Task task/job-27607753-d2d5-132dc052-df92-4db100-group0-0/0/0 in TaskGroup group0 of Job job-27607753-d2d5-132dc052-df92-4db100.

So it looks like the GCP Batch backend has acquired and plumbed through the required Docker credentials, but the login to Docker Hub doesn't seem to have happened.

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

@mcovarr mcovarr changed the title WX-1828 Demo broken private Docker repos in GCP Batch WX-1828 Demo broken private Docker Hub repos in GCP Batch Aug 26, 2024
@dspeck1
Collaborator

dspeck1 commented Aug 29, 2024

We had to prepend docker.io to the image name to get authentication to work, so the image becomes docker.io/broadinstitute/cloud-cromwell

@mcovarr
Contributor Author

mcovarr commented Aug 29, 2024

I'm still seeing the same error even when I add the docker.io/ prefix. Confirming the correct command in gcloud batch jobs describe:

"printf '%s %s\\n' \"$(date -u '+%Y/%m/%d %H:%M:%S')\" Running\\ user\\ runnable:\\ docker\\ run\\ -v\\ /mnt/disks/cromwell_root:/mnt/disks/cromwell_root\\ --entrypoint\\=/bin/bash\\ docker.io/broadinstitute/cloud-cromwell@sha256:0d51f90e1dd6a449d4587004c945e43f2a7bbf615151308cff40c15998cc3ad4\\ /mnt/disks/cromwell_root/script"

Also it would be good not to require a docker.io/ prefix as our users would need to edit all of their WDLs referencing Docker Hub images to be able to run on GCP Batch.
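One way to avoid asking users to edit their WDLs would be for the backend to add the prefix itself before submitting the job. The following is only a sketch of that idea, not what Cromwell or GCP Batch actually does; the function name `with_docker_hub_prefix` is hypothetical, and the registry detection follows the usual Docker convention that a first path component containing `.` or `:` (or equal to `localhost`) names a registry host.

```python
def with_docker_hub_prefix(image: str) -> str:
    """Return `image` with an explicit docker.io/ registry if none is present.

    Hypothetical helper: images that already name a registry are returned
    unchanged; single-component names are treated as official Docker Hub
    images under library/.
    """
    first, slash, _rest = image.partition("/")
    if not slash:
        # No slash at all, e.g. "ubuntu:latest": an official Docker Hub image.
        return f"docker.io/library/{image}"
    if "." in first or ":" in first or first == "localhost":
        # First component looks like a registry host, e.g. "gcr.io" or "localhost:5000".
        return image
    # Something like "broadinstitute/cloud-cromwell": a Docker Hub user/org repository.
    return f"docker.io/{image}"


if __name__ == "__main__":
    print(with_docker_hub_prefix("broadinstitute/cloud-cromwell:dev"))
```

Applying this only at submission time would leave the image string in the WDL untouched.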

@dspeck1
Collaborator

dspeck1 commented Aug 29, 2024

I will run a test from Cromwell. In testing the GCP Batch SDK directly it will only do authentication with the docker.io prefix.

@dspeck1
Collaborator

dspeck1 commented Aug 30, 2024

With docker.io prepended, the job succeeds in Cromwell and GCP Batch with the config below. Without the docker.io prefix it fails, because GCP Batch requires it. There is also a Docker hash lookup error from the WorkflowDockerLookupActor (error below). We can discuss more in our meeting today.

config {
        dockerhub {
          token = "base64-encoded-docker-hub-username:password"
        }
}

[2024-08-30 13:56:20,10] [warn] BackendPreparationActor_for_c5f3f88a:myWorkflow.myTask:-1:1 [c5f3f88a]: Docker lookup failed
java.lang.Exception: Failed to get docker hash for docker.io/dspeck/pull-test1:v1 Request failed with status 401 and body {"details":"incorrect username or password"}
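For reference, the `dockerhub.token` value in the config above is described as the base64 encoding of `username:password`. A minimal sketch of building and checking such a token follows; the helper names and sample credentials are made up for illustration.

```python
import base64


def dockerhub_token(username: str, password: str) -> str:
    """Base64-encode "username:password" (hypothetical helper)."""
    raw = f"{username}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_dockerhub_token(token: str) -> tuple[str, str]:
    """Inverse of dockerhub_token, handy for checking what a config really holds."""
    decoded = base64.b64decode(token).decode("utf-8")
    # Split on the first colon only, so passwords containing ":" survive.
    username, _, password = decoded.partition(":")
    return username, password


if __name__ == "__main__":
    token = dockerhub_token("example-user", "example-password")  # made-up credentials
    print(token)
    print(decode_dockerhub_token(token))
```

Decoding the token from a config is a quick way to rule out an encoding mistake when a registry answers 401.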

@mcovarr
Contributor Author

mcovarr commented Aug 30, 2024

Thank you Dan. I checked my config and it appears to be okay, also the correct Docker Hub username and password are being printed out in the Cloud Logs (which they probably shouldn't be, but that's a separate issue).

When I log in with these credentials locally using Docker engine v27.1.1 and try to pull the image from our test WDL I get the following output, exit code 1, and the image is not pulled:

% docker pull "broadinstitute/cloud-cromwell:dev"
dev: Pulling from broadinstitute/cloud-cromwell

What's next:
    View a summary of image vulnerabilities and recommendations → docker scout quickview broadinstitute/cloud-cromwell:dev
[DEPRECATION NOTICE] Docker Image Format v1 and Docker Image manifest version 2, schema 1 support is disabled by default and will be removed in an upcoming release. Suggest the author of docker.io/broadinstitute/cloud-cromwell:dev to upgrade the image to the OCI Format or Docker Image manifest v2, schema 2. More information at https://docs.docker.com/go/deprecated-image-specs/

I will try to find a newer private image to test with, but from your output above I'm guessing that would work.

So a few concerns here:

  • My local machine doesn't appear to be able to pull the particular broadinstitute/cloud-cromwell:dev Docker image from Cromwell's CI test. This may be related to the deprecation message implying that the image uses an outdated format.
  • From the last line of your output, it looks as if the Batch backend is failing to get Docker image hashes for your private image, which is something that would break Cromwell's call caching.
  • The aforementioned issue with plaintext Docker u/p going to the logs.
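On the third concern, a log line like the "Executing runnable" message quoted in the description could be masked before it is written. This is only a sketch of the idea: the `redact` function and the regular expression are hypothetical, and where such a filter would actually belong (Batch agent, Cromwell, or the log sink) is not established in this thread.

```python
import re

# Matches fields of the form username:"..." or password:"..." as they appear
# in the quoted runnable description, allowing escaped quotes inside the value.
_CREDENTIAL_FIELD = re.compile(r'(\b(?:username|password)\s*:\s*")((?:[^"\\]|\\.)*)(")')


def redact(line: str) -> str:
    """Replace the values of username/password fields with *** (hypothetical helper)."""
    return _CREDENTIAL_FIELD.sub(r"\1***\3", line)


if __name__ == "__main__":
    sample = 'entrypoint:"/bin/bash" username:"someuser" password:"s3cr$t!"} labels:{key:"tag"}'
    print(redact(sample))
```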

@mcovarr
Contributor Author

mcovarr commented Aug 30, 2024

New info since our meeting earlier today:

I was able to confirm that Batch actually can pull and run Docker Image Format v1 images (PR to explicitly assert this here). So that does not appear to be the source of my private Docker woes.

I also pushed a new image that is just a re-tag of ubuntu:latest to broadinstitute/cloud-cromwell:2024-08-30. Trying to run with that, with or without the docker.io/ prefix, results in the error:

docker: Error response from daemon: pull access denied for broadinstitute/cloud-cromwell, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.

which is a complaint about not being able to access the repository, not about the format of a particular image within the repository. Not sure what's going on here.

@dspeck1
Collaborator

dspeck1 commented Aug 30, 2024

Asked some colleagues. The image manifest v1 has been deprecated for a while. Lifesciences must be running an older docker client. A rebuild of the docker image should just fix it as long as the docker client version is not a few years old. Could a rebuild of that image be run?

Missed your earlier comment. You can strike the suggestion above.

@mcovarr
Contributor Author

mcovarr commented Sep 4, 2024

I finally figured out that the problem has to do with special characters in a password. If I use an all-alpha password, everything works fine. If I use a password with shell metacharacters like $, ! or * then the Docker login seems to silently fail and consequently the private image pull fails as well.
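To illustrate the general mechanism (not the specific code path inside GCP Batch, which this thread does not identify): if a password is interpolated unquoted into a shell command line, the shell expands characters such as `$` and `*` before the program ever sees them. The sketch below uses a made-up password and `printf` as a stand-in for the program receiving it, and assumes a POSIX `/bin/sh` is available.

```python
import shlex
import subprocess


def echo_via_shell(argument_text: str) -> str:
    """Run `printf '%s' <argument_text>` through /bin/sh and return what printf received."""
    result = subprocess.run(
        ["/bin/sh", "-c", f"printf '%s' {argument_text}"],
        capture_output=True,
        text=True,
        check=True,
        env={"PATH": "/usr/bin:/bin"},  # minimal environment so no stray variables expand
    )
    return result.stdout


password = "p$ssw*rd"  # made-up password containing shell metacharacters

# Interpolated as-is, "$ssw" is treated as a variable reference and "*" as a glob.
unquoted = echo_via_shell(password)

# shlex.quote wraps the value so the shell passes it through literally.
quoted = echo_via_shell(shlex.quote(password))

if __name__ == "__main__":
    print("unquoted:", unquoted)
    print("quoted:  ", quoted)
```

Quoting is one mitigation; passing the secret on standard input, as `docker login --password-stdin` allows, avoids the shell command line entirely.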

@aednichols
Collaborator

Wow, great find!

@dspeck1
Collaborator

dspeck1 commented Sep 4, 2024

Great find! Our docker hub token does not have special characters. I had docker hub generate two more and neither of them contains special characters.

@mcovarr mcovarr changed the title WX-1828 Demo broken private Docker Hub repos in GCP Batch WX-1828 Private Docker Hub repos in GCP Batch Sep 26, 2024
@mcovarr mcovarr marked this pull request as ready for review September 27, 2024 13:41
@mcovarr mcovarr requested a review from a team as a code owner September 27, 2024 13:41
@mcovarr mcovarr merged commit d3ded6f into develop Sep 27, 2024
37 checks passed
@mcovarr mcovarr deleted the wx_1828_private_docker_repos_in_batch branch September 27, 2024 15:48
4 participants