-
Notifications
You must be signed in to change notification settings - Fork 216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert horovod example to v1 #374
Conversation
/assign @Jeffwan @terrytangyuan @carmark |
@@ -11,7 +11,7 @@ commonLabels: | |||
app.kubernetes.io/component: mpijob | |||
images: | |||
- name: mpioperator/mpi-operator | |||
newName: gcr.io/acondor-gke-dev/mpi-operator | |||
newName: dev-registry/mpi-operator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change intentional? Does this exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neither exist (my GCR is private). It's intended for developers to use their preferred registry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Others PTAL
examples/horovod/entrypoint.sh
Outdated
@@ -0,0 +1,10 @@ | |||
#!/bin/sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if it is needed for every MPIJob. Should we add some comments about why to introduce such a entrypoint script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, something like this will always be needed.
The problem is that SSH is very picky about not just file permissions, but also the permissions of the directory. Unfortunately, kubernetes doesn't allow to change the permissions of the mount folder.
I'll update the README once I'm done with all development (tests included). The idea is: the controller will mount the keys in /mnt/ssh. It's up to the container image to set those files in the appropriate location for their user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FTR, I tried to mount the files in /root/.ssh or use them from /mnt/ssh by setting AuthorizedKeysFile
and IdentityFile
in the ssh configuration. However, SSH complained about the folder permissions :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hold on, I'm investigating if I can do the same in an init container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised, but it worked. Now, I'm not convinced that it will work if the containers don't run as root, but I'll check that later. I added a bullet point in #373.
949c75b
to
9c12aa2
Compare
Btw, I found a bug where the incorrect scheme was being used. As a consequence, defaults were not applied and events fail to record. See latest commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
This PR removes the CLI argument kubectl-delivery-image
, does it work for you? @carmark @terrytangyuan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for your contribution! 🎉 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @zw0610 take another look?
uhm.... #375 was already merged, which contain these changes... should we revert? @terrytangyuan |
Then we can close this one? |
I wasn't sure if there were any changes here that I didn't put in the other branch. But it looks like it was all there. |
/reopen |
@alculquicondor: Reopened this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This commit was missing |
Looks like there’s only one changed file? |
correct |
Could you update the PR title then? |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: terrytangyuan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
Is there a way to retrigger the test? It looks like the event watcher is not working or something. I can't reproduce locally. |
Just re-triggered it |
The controller generates keys and mounts them to the containers. The container images must know how to place the credentials and set file permissions.
First part of #373 (Replace kubectl and support roles with ssh and secrets)
Tested with the horovod CPU image