Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow running MPI applications as non-root #383

Merged
merged 2 commits into from
Jul 27, 2021

Conversation

alculquicondor
Copy link
Collaborator

@alculquicondor alculquicondor commented Jul 21, 2021

Part of #373

  • Adds the spec field sshAuthMountPath for MPIJob.
  • The init script sets the permissions and ownership based on the securityContext of the launcherPod
  • Adds a pure-MPI example to show how to create and use an image as non-root

Tested: unit and integration suites as well as manual runs.

@alculquicondor
Copy link
Collaborator Author

/hold
Does this work?

@terrytangyuan
Copy link
Member

Yes it should work

Copy link

@kawych kawych left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far, but it's a fairly large change and it's easy to miss things. Could you confirm that it was tested end-to-end in addition to unit tests?

RUN useradd -m mpiuser
WORKDIR /home/mpiuser
COPY --chown=mpiuser sshd_config .sshd_config
RUN cat /etc/ssh/ssh_config | grep -Ev 'StrictHostKeyChecking' > /etc/ssh/ssh_config.new && \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe something like this would be more explicit and simpler?

sed -i 's/( )StrictHostKeyChecking ./\1StrictHostKeyChecking no/g' /etc/ssh/ssh_config.new

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified it a bit differently, but thanks for the suggestion :)

@@ -74,21 +74,19 @@ const (
sshAuthVolume = "ssh-auth"
sshAuthMountPath = "/mnt/ssh"
sshHomeVolume = "ssh-home"
// TODO(alculquicondor): Make home directory configurable through the API.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thing to consider: would it make sense to make the whole home directory configurable as opposed to just .ssh directory?

Copy link
Collaborator Author

@alculquicondor alculquicondor Jul 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that. But someone creating a container image might want not to use $HOME/.ssh but a completely different location. And, at the moment, I can't think of other uses of the home folder from the controller perspective.

RestartPolicy: corev1.RestartPolicyNever,
Containers: []corev1.Container{
{
Env: func() []corev1.EnvVar {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can all of this be replaced by "append(ompiEnvVars, nvidiaDisableEnvVars...)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered it, but not sure if it will have consequences on the allocation of ompiEnvVars

another alternative would be:

append(append([]corev1.EnvVar(nil), ompiEnvVars...), nvidiaDisableEnvVars...)

WDYT?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've just played with append a little and it's a little funny. [1]: "If it has sufficient capacity, the destination is resliced to accommodate the new elements. If it does not, a new underlying array will be allocated."

[1] https://pkg.go.dev/builtin#append

So in this case "append(ompiEnvVars, nvidiaDisableEnvVars...)" will work the same as "append(append(...)", but if you allocate additional memory for the ompiEnvVars slice, it will have a different effect. Examples:

a := []int{1} b := append(a[:1], []int{3}...) a[0] = 2 b[0] = 3 fmt.Printf("%d %d\n", a[0], b[0]) <- this prints "2 3"

a := []int{1, 2} b := append(a[:1], []int{3}...) a[0] = 2 b[0] = 3 fmt.Printf("%d %d\n", a[0], b[0]) <- this prints "3 3"

So append(append([]corev1.EnvVar(nil), ompiEnvVars...), nvidiaDisableEnvVars...) would be better here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that will be more predictable. I decided to add a helper method.

Copy link
Collaborator Author

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a note in the description on what I did to test the change.

@@ -74,21 +74,19 @@ const (
sshAuthVolume = "ssh-auth"
sshAuthMountPath = "/mnt/ssh"
sshHomeVolume = "ssh-home"
// TODO(alculquicondor): Make home directory configurable through the API.
Copy link
Collaborator Author

@alculquicondor alculquicondor Jul 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that. But someone creating a container image might want not to use $HOME/.ssh but a completely different location. And, at the moment, I can't think of other uses of the home folder from the controller perspective.

RestartPolicy: corev1.RestartPolicyNever,
Containers: []corev1.Container{
{
Env: func() []corev1.EnvVar {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered it, but not sure if it will have consequences on the allocation of ompiEnvVars

another alternative would be:

append(append([]corev1.EnvVar(nil), ompiEnvVars...), nvidiaDisableEnvVars...)

WDYT?

@alculquicondor
Copy link
Collaborator Author

/hold cancel

@gaocegege
Copy link
Member

/assign @terrytangyuan @zw0610 @carmark

Copy link
Member

@zw0610 zw0610 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this pr does more than allowing non-root mode. If so, I would prefer this pr comes with multiple commits. What do you think? @terrytangyuan

@@ -56,6 +56,9 @@ func SetDefaults_MPIJob(mpiJob *MPIJob) {
if mpiJob.Spec.SlotsPerWorker == nil {
mpiJob.Spec.SlotsPerWorker = newInt32(1)
}
if mpiJob.Spec.SSHAuthMountPath == "" {
mpiJob.Spec.SSHAuthMountPath = "/root/.ssh"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you make the default path ("/root/.ssh") a constant?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense for controller code, where a value might be used more than once. However, this file is all about defaults. Creating a constant would just make the reader have to jump from one line to another to see what the default for a field is.

@@ -46,6 +46,9 @@ func validateMPIJobSpec(spec *kubeflow.MPIJobSpec, path *field.Path) field.Error
} else if !validCleanPolicies.Has(string(*spec.CleanPodPolicy)) {
errs = append(errs, field.NotSupported(path.Child("cleanPodPolicy"), *spec.CleanPodPolicy, validCleanPolicies.List()))
}
if spec.SSHAuthMountPath == "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this validation contradict with the defaulting here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to keep validation independent from defaulting. But yes, if you put them together, this line of code would not be reachable.

}
podSpec.InitContainers = append(podSpec.InitContainers, corev1.Container{
Name: "init-ssh",
Image: "alpine:3.14",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we make this image configurable? some cluster environment is isolated from docker registry.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Name: sshHomeVolume,
MountPath: sshHomeMountPath,
MountPath: "/mnt/home-ssh",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we replace constant with hardcoded string here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a different value now (it was /root/.ssh). But sure, putting it as a constant.

Adds the spec field sshAuthMountPath for MPIJob.
The init script sets the permissions and ownership based on the securityContext of the launcherPod
Copy link
Collaborator Author

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put the sample in a separate commit

@@ -56,6 +56,9 @@ func SetDefaults_MPIJob(mpiJob *MPIJob) {
if mpiJob.Spec.SlotsPerWorker == nil {
mpiJob.Spec.SlotsPerWorker = newInt32(1)
}
if mpiJob.Spec.SSHAuthMountPath == "" {
mpiJob.Spec.SSHAuthMountPath = "/root/.ssh"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense for controller code, where a value might be used more than once. However, this file is all about defaults. Creating a constant would just make the reader have to jump from one line to another to see what the default for a field is.

@@ -46,6 +46,9 @@ func validateMPIJobSpec(spec *kubeflow.MPIJobSpec, path *field.Path) field.Error
} else if !validCleanPolicies.Has(string(*spec.CleanPodPolicy)) {
errs = append(errs, field.NotSupported(path.Child("cleanPodPolicy"), *spec.CleanPodPolicy, validCleanPolicies.List()))
}
if spec.SSHAuthMountPath == "" {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to keep validation independent from defaulting. But yes, if you put them together, this line of code would not be reachable.

Name: sshHomeVolume,
MountPath: sshHomeMountPath,
MountPath: "/mnt/home-ssh",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a different value now (it was /root/.ssh). But sure, putting it as a constant.

}
podSpec.InitContainers = append(podSpec.InitContainers, corev1.Container{
Name: "init-ssh",
Image: "alpine:3.14",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@zw0610
Copy link
Member

zw0610 commented Jul 27, 2021

/lgtm

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

/lgtm
/approve

@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kawych, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-robot google-oss-robot merged commit 9ce6467 into kubeflow:master Jul 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants