Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't overwrite PYTHONUNBUFFERED in PyTorchJob (and others) #1921

Open
brannondorsey opened this issue Sep 26, 2023 · 17 comments
Open

Don't overwrite PYTHONUNBUFFERED in PyTorchJob (and others) #1921

brannondorsey opened this issue Sep 26, 2023 · 17 comments
Assignees

Comments

@brannondorsey
Copy link

brannondorsey commented Sep 26, 2023

The PyTorch controller sets PYTHONUNBUFFERED=0 below, independent of wether or not the user has configured this variable themselves.

// Set PYTHONUNBUFFERED to true, to disable output buffering.
// Ref https://stackoverflow.com/questions/59812009/what-is-the-use-of-pythonunbuffered-in-docker-file.
podTemplateSpec.Spec.Containers[i].Env = append(
podTemplateSpec.Spec.Containers[i].Env, corev1.EnvVar{
Name: "PYTHONUNBUFFERED",
Value: "1",
})

When I set this value myself in a PyTorchJob as shown below, the resulting pod created by the controller has a second PYTHONUNBUFFERED environment variable set by the controller, which overwrites the user-defined value.

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: example
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: ubuntu
              env:
                - name: PYTHONUNBUFFERED
                  value: "1"
              #...
# output snippet from `kubectl describe pod <master|worker>`
    Environment:
      PYTHONUNBUFFERED:             1
      PYTHONUNBUFFERED:             0

The behavior that I would expect is that the PyTorch job controller sets PYTHONUNBUFFERED if and only if the user has not already set it in their own env definition.

Based on a quick git grep, I also expect this behavior to be occurring in the PaddleJob and XGBoostJob controllers as well.

NOTE: The value of this variable set in master is now 0 instead of 1 (see PR), however my code snippet is from a controller with version v1.6.0, where it is still set to 0. Even though v1.7.0 will fix my use case, I'd argue this is something which shouldn't be overwritten by the controller, if the user defines the environment variable in the pod spec themselves.

@tenzen-y
Copy link
Member

It makes sense.
/kind feature

@tenzen-y
Copy link
Member

Feel free to open PR.

@sxwl-donggang
Copy link

sxwl-donggang commented Nov 29, 2023

If users can config env "PYTHONUNBUFFERED",they can also config other env like "WORLD_SIZE"、“RANK”. This is a common problem. Whether all the env setting place should keep the user configured value.

@tenzen-y
Copy link
Member

If users can config env "PYTHONUNBUFFERED",they can also config other env like "WORLD_SIZE"、“RANK”. This is a common problem. Whether all the env setting place should keep the user configured value.

I don't think that we should allow users to modify the WORLD_SIZE and RANK since those envs are managed by the training-operator.
If users wan to modify those envs, users need not to use the training-operator. The batch/v1 job should be enough.

@sxwl-donggang
Copy link

sxwl-donggang commented Nov 29, 2023

If users can config env "PYTHONUNBUFFERED",they can also config other env like "WORLD_SIZE"、“RANK”. This is a common problem. Whether all the env setting place should keep the user configured value.

I don't think that we should allow users to modify the WORLD_SIZE and RANK since those envs are managed by the training-operator. If users wan to modify those envs, users need not to use the training-operator. The batch/v1 job should be enough.

So training-operator setPodEnv function should overwrite the WORLD_SIZE and RANK if user set instead of append .

			podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, corev1.EnvVar{
				Name:  "WORLD_SIZE",
				Value: strconv.Itoa(worldSize),
			})
			podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, corev1.EnvVar{
				Name:  "RANK",
				Value: strconv.Itoa(rank),
			})

@tenzen-y
Copy link
Member

tenzen-y commented Nov 29, 2023

If users can config env "PYTHONUNBUFFERED",they can also config other env like "WORLD_SIZE"、“RANK”. This is a common problem. Whether all the env setting place should keep the user configured value.

I don't think that we should allow users to modify the WORLD_SIZE and RANK since those envs are managed by the training-operator. If users wan to modify those envs, users need not to use the training-operator. The batch/v1 job should be enough.

So training-operator setPodEnv function should overwrite the WORLD_SIZE and RANK if user set instead of append .

			podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, corev1.EnvVar{
				Name:  "WORLD_SIZE",
				Value: strconv.Itoa(worldSize),
			})
			podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, corev1.EnvVar{
				Name:  "RANK",
				Value: strconv.Itoa(rank),
			})

Or, we can add validations to reject Jobs with managed envs like RANK.

@sxwl-donggang
Copy link

If users can config env "PYTHONUNBUFFERED",they can also config other env like "WORLD_SIZE"、“RANK”. This is a common problem. Whether all the env setting place should keep the user configured value.

I don't think that we should allow users to modify the WORLD_SIZE and RANK since those envs are managed by the training-operator. If users wan to modify those envs, users need not to use the training-operator. The batch/v1 job should be enough.

So training-operator setPodEnv function should overwrite the WORLD_SIZE and RANK if user set instead of append .

			podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, corev1.EnvVar{
				Name:  "WORLD_SIZE",
				Value: strconv.Itoa(worldSize),
			})
			podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, corev1.EnvVar{
				Name:  "RANK",
				Value: strconv.Itoa(rank),
			})

Or, we can add validations to reject Jobs with managed envs like RANK.

Yep! Using webhooks to validate managed environment variables should work.

Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Copy link
Member

/remove-lifecycle stale

@tariq-hasan
Copy link
Contributor

@tenzen-y I am interested to take on this ticket.

@tenzen-y
Copy link
Member

tenzen-y commented Mar 2, 2024

@tenzen-y I am interested to take on this ticket.

@tariq-hasan Sorry, this depends on some other issues. So, this is not a good first issue...
/remove-help
/remove-good-first-issue

Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Copy link
Member

tenzen-y commented Jun 3, 2024

/remove lifecycle-stale

@tenzen-y
Copy link
Member

tenzen-y commented Jun 3, 2024

/remove-lifecycle stale

Copy link

github-actions bot commented Sep 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Copy link
Member

tenzen-y commented Sep 2, 2024

/remove-lifecycle stale

@tariq-hasan Hi, this was unblocked by #1993.
So, if you are still interested in this issue, it would be great.

@tariq-hasan
Copy link
Contributor

Sounds good. Thank you so much.

/assign

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants