Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert horovod example to v1 #374

Merged
merged 1 commit into from
Jul 15, 2021

Conversation

alculquicondor
Copy link
Collaborator

The controller generates keys and mounts them to the containers. The container images must know how to place the credentials and set file permissions.

First part of #373 (Replace kubectl and support roles with ssh and secrets)

Tested with the horovod CPU image

@gaocegege
Copy link
Member

/assign @Jeffwan @terrytangyuan @carmark

@@ -11,7 +11,7 @@ commonLabels:
app.kubernetes.io/component: mpijob
images:
- name: mpioperator/mpi-operator
newName: gcr.io/acondor-gke-dev/mpi-operator
newName: dev-registry/mpi-operator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change intentional? Does this exist?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither exist (my GCR is private). It's intended for developers to use their preferred registry.

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Others PTAL

@@ -0,0 +1,10 @@
#!/bin/sh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if it is needed for every MPIJob. Should we add some comments about why to introduce such a entrypoint script?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, something like this will always be needed.
The problem is that SSH is very picky about not just file permissions, but also the permissions of the directory. Unfortunately, kubernetes doesn't allow to change the permissions of the mount folder.

I'll update the README once I'm done with all development (tests included). The idea is: the controller will mount the keys in /mnt/ssh. It's up to the container image to set those files in the appropriate location for their user.

Copy link
Collaborator Author

@alculquicondor alculquicondor Jul 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FTR, I tried to mount the files in /root/.ssh or use them from /mnt/ssh by setting AuthorizedKeysFile and IdentityFile in the ssh configuration. However, SSH complained about the folder permissions :(

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hold on, I'm investigating if I can do the same in an init container.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised, but it worked. Now, I'm not convinced that it will work if the containers don't run as root, but I'll check that later. I added a bullet point in #373.

@alculquicondor
Copy link
Collaborator Author

Btw, I found a bug where the incorrect scheme was being used. As a consequence, defaults were not applied and events fail to record. See latest commit.

@alculquicondor
Copy link
Collaborator Author

@carmark @Jeffwan any other comments?

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL

This PR removes the CLI argument kubectl-delivery-image, does it work for you? @carmark @terrytangyuan

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks for your contribution! 🎉 👍

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @zw0610 take another look?

@alculquicondor
Copy link
Collaborator Author

uhm.... #375 was already merged, which contain these changes... should we revert? @terrytangyuan

@terrytangyuan
Copy link
Member

Then we can close this one?

@alculquicondor
Copy link
Collaborator Author

I wasn't sure if there were any changes here that I didn't put in the other branch. But it looks like it was all there.

@alculquicondor
Copy link
Collaborator Author

/reopen

@google-oss-robot
Copy link

@alculquicondor: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alculquicondor
Copy link
Collaborator Author

This commit was missing

@terrytangyuan
Copy link
Member

Looks like there’s only one changed file?

@alculquicondor
Copy link
Collaborator Author

correct

@terrytangyuan
Copy link
Member

Could you update the PR title then?

@alculquicondor alculquicondor changed the title Do inter-pod communication through SSH Revert horovod example to v1 Jul 15, 2021
@terrytangyuan
Copy link
Member

/lgtm
/approve

@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alculquicondor
Copy link
Collaborator Author

/retest

@alculquicondor
Copy link
Collaborator Author

Is there a way to retrigger the test? It looks like the event watcher is not working or something. I can't reproduce locally.

@terrytangyuan
Copy link
Member

Just re-triggered it

@google-oss-robot google-oss-robot merged commit e9547bf into kubeflow:master Jul 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants