Implement v2 controller that sets up SSH for communication #373

Open
12 of 16 tasks
alculquicondor opened this issue Jul 6, 2021 · 9 comments

Comments

@alculquicondor
Collaborator

alculquicondor commented Jul 6, 2021

Implementation for https://github.com/kubeflow/mpi-operator/blob/master/proposals/scalable-robust-operator.md

@gaocegege
Member

https://www.kubeflow.org/docs/about/contributing/#joining-the-kubeflow-github-org

Hi, could you please join the Kubeflow org? Then we won't need to trigger CI/CD for your PRs manually.

@alculquicondor
Collaborator Author

Sent PR kubeflow/internal-acls#473

Thanks for the suggestion

@alculquicondor
Collaborator Author

I verified that images docker.io/kubeflow/mpi-horovod-mnist and docker.io/mpioperator/tensorflow-benchmarks just work with the new controller. Marking that as done.

@Jeffwan
Member

Jeffwan commented Aug 12, 2021

@alculquicondor Has the community discussed the tradeoffs of Job vs. Pod for the launcher, and StatefulSets vs. plain Pods for the workers?

@alculquicondor
Collaborator Author

Yes for the launcher. See the discussion in #386.

For the workers, it's still open for discussion. We could use StatefulSets, but I think plain Pods might be fine for now. We might migrate to Indexed Jobs at some point, but since they're only available as of k8s 1.22, it's kind of early to discuss.

@alculquicondor
Collaborator Author

alculquicondor commented Aug 17, 2021

I think this is pretty much ready. The last things I would like to do are:

@terrytangyuan
Member

* Add documentation (is there a website, or should I just do it on readmes)?

There's this page https://www.kubeflow.org/docs/components/training/mpi/

@tenzen-y
Member

Maybe we can introduce Indexed Jobs to mpi-operator v2 once kubernetes/enhancements#3715 graduates to beta.

@tenzen-y
Member

Consider introducing JobSet instead of managing raw pods for the workers: https://github.com/kubernetes-sigs/jobset


5 participants