
DGL Operator: Leverage DGL on K8s #2843

Closed
ryantd opened this issue Apr 15, 2021 · 6 comments

@ryantd

ryantd commented Apr 15, 2021

This is Xiaoyu Zhai, from Qihoo 360 AI Infra. Recently there has been internal demand for the DGL/DGL-KE frameworks from our AI/ML teams, so we have kicked off research on distributed DGL training.

Native distributed DGL training operates at the machine level: you need to manually set up the IP config, grant passwordless SSH access, use copy_files.py to dispatch your partitioned data, and use launch.py to start your training. What we want to offer our users is automated distributed training, and, most importantly, workloads that can be orchestrated on K8s. So we decided to develop a "DGL Operator" to run DGL training on K8s. It covers the distributed scaffolding for ML engineers, who then only need to work on the partition script and the training script.
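For context, the manual "IP config" step described above can be sketched in a few lines of Python. This is a hedged illustration, not DGL's actual tooling: it assumes the launcher expects a plain-text file with one "ip port" line per machine, and the hosts, port, and file name below are made up for the example.

```python
# Sketch of the manual ip-config step: write one "ip port" line per
# training machine. The file format, hosts, and port are assumptions
# for illustration, not code from DGL or the DGL Operator.

def write_ip_config(hosts, port=30050, path="ip_config.txt"):
    """Write a text file with one 'ip port' line per machine and
    return the lines that were written."""
    lines = [f"{ip} {port}" for ip in hosts]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines

if __name__ == "__main__":
    for line in write_ip_config(["192.168.0.1", "192.168.0.2"]):
        print(line)
```

The point of the Operator is that nobody has to maintain a file like this by hand (or SSH keys, or data dispatch) once the workload runs on K8s.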

The first version of the DGL Operator will be finished by the end of this month, and we are glad to open-source the project so that more developers can get involved with DGL or use DGL on K8s. However, I have a question, which is the main subject of this issue: is dmlc willing to host our project? I noticed that dmlc usually does not host Golang projects, but that's OK; we can also contribute this Operator to the Kubeflow community (the XGBoost Operator is hosted by Kubeflow).

Looking forward to your response; feel free to ping me.

Ref: XGBoost Operator Design

@zheng-da
Collaborator

It's great that you would like to contribute to our ecosystem. I'm not sure where the best place to host your project is; this is something we can discuss.

@ryantd
Author

ryantd commented Apr 19, 2021

More on the "DGL Operator":

  1. It is a Golang project, so it may need a new repo.
  2. It runs only on the Kubernetes infra stack, like XGBoost Operator vs. XGBoost and Paddle Operator vs. PaddlePaddle.
  3. Users (ML engineers) only need to add a partition script and a training script on top of the base image we provide, or build an image themselves. copy_files.py and launch.py are included in our base image.
  4. It auto-generates the IP config and automatically sets up the distributed workload (K8s pods); later on it may support elastic workloads.
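To make item 4 concrete, here is a hypothetical sketch of how an operator-side helper could auto-generate the DGL IP config from K8s pod identities instead of a hand-written file. The StatefulSet-style DNS pattern (`<pod>.<headless-service>`), the naming scheme, and the port are assumptions for illustration only, not the DGL Operator's actual design.

```python
# Hypothetical sketch of auto-generating the ip config on K8s.
# Assumption: worker pods get stable, predictable hostnames via a
# headless service (StatefulSet-style "<pod>.<service>" DNS names).
# All names and the port below are illustrative, not real CRD fields.

def generate_ip_config(job_name, num_workers, service, port=30050):
    """Return one 'host port' line per worker pod in the job."""
    return [
        f"{job_name}-worker-{i}.{service} {port}"
        for i in range(num_workers)
    ]

if __name__ == "__main__":
    for line in generate_ip_config("demo", 2, "demo-svc"):
        print(line)
```

Because the pod names are deterministic, the operator can render this config before the pods are even scheduled, which is what removes the manual IP-config and SSH setup from the user's workflow.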

@ryantd
Author

ryantd commented Apr 23, 2021

After talking with @zheng-da and having an internal discussion in our team, we decided to contribute the DGL Operator repo to the Kubeflow community, because 1) the DGL Operator is a Golang project built on Kubernetes infra, and contributing it to Kubeflow may reach more Golang and Kubernetes engineers; 2) the Kubeflow community has a lot of experienced Golang and Kubernetes engineers, and we can work together on improving the stability and the high-level design.

We have already submitted the proposal to the Kubeflow community; please let me know if there are any issues or concerns.

Proposal PR: kubeflow/community#512
Proposal (reader-friendly): https://github.com/ryantd/community/blob/dgl-operator/proposals/dgl-operator-proposal.md

@terrytangyuan
Member

Great to see the proposal! I just cc'ed the Kubeflow Training WG leads. We will review it soon.

@github-actions

github-actions bot commented Mar 2, 2022

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.

@github-actions

This issue is closed due to lack of activity. Feel free to reopen it if you still have questions.
