
Some suggestions about engineering optimization #1703

Open
HeGaoYuan opened this issue Dec 22, 2022 · 10 comments

@HeGaoYuan
Contributor

Hi, Kubeflow team
Thanks for your great work, first of all. I have used your training-operator in a production environment for a period of time, and I found some engineering problems during my usage.

  1. The log format is mixed up. There are at least three different log formats/implementations: the controller-runtime itself, the training-operator project, and the kubeflow/common module.
  2. The whole architecture is not clean. Specifically, as far as I know, the kubeflow/common module was designed for the old controller architecture. Now you are using controller-runtime to implement all kinds of Job controllers, but you have not reconstructed kubeflow/common, which causes a lot of redundant and inconsistent code.
  3. The conditions under which the job status changes are not told to users. Specifically, in real usage, the changing of the job status is very important. Users usually want to know clearly when the job will fail and when the job should get started; none of these details are documented, so users have to read the code to figure them out. Additionally, whether the implementation of this part is totally right should be discussed.
  4. There are some other bugs and optimizations for which I can file specific issues. For example, the default worker thread number in every Job controller is 1 and can't be changed, which is unacceptable in a production environment (see the sketch right after this list).
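
A minimal sketch of what point 4 is asking for, assuming controller-runtime's builder API; the setup function name, the `threads` parameter, and the kubeflowv1 import path are illustrative, not the project's actual wiring:

```go
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	// Import path assumed for illustration; the real training-operator layout may differ.
	kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
)

// setupPyTorchController registers a reconciler with a configurable number of
// reconcile workers instead of the hard-coded default of 1.
func setupPyTorchController(mgr ctrl.Manager, r reconcile.Reconciler, threads int) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&kubeflowv1.PyTorchJob{}).
		// MaxConcurrentReconciles is controller-runtime's knob for worker concurrency.
		WithOptions(controller.Options{MaxConcurrentReconciles: threads}).
		Complete(r)
}
```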

Looking forward to your reply.

@johnugeorge
Member

johnugeorge commented Dec 22, 2022

Thanks @HeGaoYuan for being a power user of the training operator. We really appreciate your feedback.

  1. Agree. This needs to be fixed.
  2. The motivation for kubeflow/common was to bring a common framework when there were multiple operator repos that tried to use common features. It is true that it is harder to keep consistent. However, it is currently being used by the training-operator and v2 mpi-operator repos. Hence, I am not sure if we have a better solution. In the future, when the v2 MPI operator is merged into the training operator(?), we can think of merging the common code into the training operator as well. Thoughts?
  3. Can you explain which changes in JobStatus you are referring to? What extra details do you expect?
  4. We are happy to hear more, so please file issues. We can expose parameters if they need to be configurable (a sketch follows after this list).
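
A minimal sketch, under the assumption that the worker count from point 4 above is exposed as a command-line flag in the operator's main package; the flag name `controller-threads` is hypothetical, not an existing training-operator option:

```go
package main

import (
	"flag"
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Hypothetical flag; the point is only that the reconcile worker
	// count becomes configurable instead of being fixed at 1.
	controllerThreads := flag.Int("controller-threads", 1,
		"number of concurrent reconcile workers per job controller")
	flag.Parse()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		log.Fatalf("unable to create manager: %v", err)
	}

	// *controllerThreads would be handed to each job controller's setup
	// (see the setupPyTorchController sketch earlier in the thread).
	log.Printf("starting with %d reconcile workers per controller", *controllerThreads)

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatalf("manager exited: %v", err)
	}
}
```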

We love contributions, and it would be great if you could upstream some of these fixes.

@johnugeorge
Member

johnugeorge commented Dec 22, 2022

/cc @kubeflow/wg-training-leads @kubeflow/common-team

/cc @alculquicondor

@HeGaoYuan
Contributor Author

HeGaoYuan commented Dec 22, 2022

@johnugeorge Thanks for your quick and constructive reply.

  1. This question is hard to answer. I will reply to it later.
  2. There are 5 JobConditionTypes (Created, Running, Restarting, Succeeded, Failed), so it is like a state machine. Users always want to know the conditions for state transitions (yes, it is the state transition table), for example, when a Job transitions from Created to Running. First, we are missing the state transition table as a document. Second, there may be some bugs in the implementation. For instance, in the code below, the PyTorchJob transitions from other states (mostly Created) to Running when the PyTorchJob includes a master and the master Pod is running. But in a real case, the PyTorchJob only starts real training when the master and all worker Pods are running; I tested this at least on PyTorch 1.9. So imagine this situation: there is a PyTorchJob with one master and three workers. If the master Pod is running but one worker Pod is blocked for some reason, the PyTorchJob transitions to the Running state, but actually it is not training at all. A sketch of a stricter check follows the excerpt below.

```go
if ContainsMasterSpec(replicas) {
	if rtype == commonv1.ReplicaType(kubeflowv1.PyTorchJobReplicaTypeMaster) {
		if running > 0 {
			msg := fmt.Sprintf("PyTorchJob %s is running.", pytorchjob.Name)
			err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobRunning, commonutil.JobRunningReason, msg)
			// ... (rest of the excerpt omitted)
		}
	}
}
```
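
A minimal sketch (not the project's actual code) of the stricter check argued for above: treat the Job as Running only when every replica type, master and workers alike, has all of its expected Pods active. The import path and field names follow kubeflow/common's commonv1 API as used in the excerpt and should be treated as assumptions about the exact shape:

```go
import (
	commonv1 "github.com/kubeflow/common/pkg/apis/common/v1" // path assumed
)

// allReplicasRunning reports whether every replica type (master and all
// workers) currently has all of its expected pods active.
func allReplicasRunning(replicas map[commonv1.ReplicaType]*commonv1.ReplicaSpec,
	jobStatus *commonv1.JobStatus) bool {
	for rtype, spec := range replicas {
		expected := int32(1) // replicas default to 1 when unset
		if spec.Replicas != nil {
			expected = *spec.Replicas
		}
		status, ok := jobStatus.ReplicaStatuses[rtype]
		if !ok || status.Active < expected {
			return false
		}
	}
	return true
}
```

With a helper like this, the excerpt above would set commonv1.JobRunning only when allReplicasRunning(replicas, jobStatus) returns true, instead of as soon as the master alone has running > 0.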

@johnugeorge
Member

To your point 3,

Currently, the Job goes into the Running state if at least one pod is running (the master in the master-worker case, or any worker in the worker-only case).

To handle your case of one worker pod getting blocked, would using a gang scheduler like Volcano help?
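
For reference, a minimal sketch of what attaching a gang-scheduling hint to a PyTorchJob might look like, assuming kubeflow/common's RunPolicy/SchedulingPolicy types and an operator started with gang scheduling enabled; the exact field names and import paths are assumptions:

```go
package example

import (
	commonv1 "github.com/kubeflow/common/pkg/apis/common/v1"                // path assumed
	kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1" // path assumed
)

// gangScheduledJob sketches a PyTorchJob that asks the gang scheduler to admit
// it only when all 4 pods (1 master + 3 workers) can be scheduled together.
func gangScheduledJob() *kubeflowv1.PyTorchJob {
	minAvailable := int32(4)
	return &kubeflowv1.PyTorchJob{
		Spec: kubeflowv1.PyTorchJobSpec{
			RunPolicy: commonv1.RunPolicy{
				SchedulingPolicy: &commonv1.SchedulingPolicy{
					MinAvailable: &minAvailable,
					Queue:        "default", // Volcano queue name, illustrative
				},
			},
			// Master/Worker replica specs omitted for brevity.
		},
	}
}
```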

@HeGaoYuan
Contributor Author

I think a gang scheduler can only guarantee that there are enough resources for all Pods of the Job to be scheduled. After all Pods of the Job are scheduled, some Pods may still be blocked for many reasons; for example, the volume setup of one Pod may fail.

@alculquicondor

Re 2: The mpi-operator has minimal usage of kf/common. Primarily, it just uses constants. So I would generally welcome cleanups in the repo. Keep me in the loop, please.

> MPI operator is merged into training operator

While it could be useful for some, I have anecdotal evidence that HPC users would prefer a simple controller for their MPI workloads, without having to install all of kf. Still, this is something we can discuss.

@johnugeorge
Member

@alculquicondor I was referring to training-operator, not all of KF

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Member

/remove-lifecycle stale

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
