
DGL Operator: Leverage DGL on K8s #2843

Closed
ryantd opened this issue Apr 15, 2021 · 6 comments

@ryantd

ryantd commented Apr 15, 2021

This is Xiaoyu Zhai, from Qihoo 360 AI Infra. Recently there has been internal demand for the DGL/DGL-KE frameworks from our AI/ML teams, so we have kicked off research on distributed DGL training.

Native distributed DGL training operates at the machine level: you need to manually set up the IP config, grant passwordless SSH access, use copy_files.py to dispatch your partitioned data, and use launch.py to start your training. What we want to offer our users is automated distributed training, and, most importantly, workloads that can be orchestrated on K8s. So we decided to develop a "DGL Operator" to run DGL training on K8s. It covers the distributed scaffolding for ML engineers, who then only need to work on the partition script and the training script.
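For context, the manual "IP config" step described above can be sketched in a few lines of Python. This is a hedged illustration, not DGL's actual tooling: it assumes the launcher expects a plain-text file with one "ip port" line per machine, and the hosts, port, and file name below are made up for the example.

```python
# Sketch of the manual ip-config step: write one "ip port" line per
# training machine. The file format, hosts, and port are assumptions
# for illustration, not code from DGL or the DGL Operator.

def write_ip_config(hosts, port=30050, path="ip_config.txt"):
    """Write a text file with one 'ip port' line per machine and
    return the lines that were written."""
    lines = [f"{ip} {port}" for ip in hosts]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines

if __name__ == "__main__":
    for line in write_ip_config(["192.168.0.1", "192.168.0.2"]):
        print(line)
```

The point of the Operator is that nobody has to maintain a file like this by hand (or SSH keys, or data dispatch) once the workload runs on K8s.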

The first version of the DGL Operator will be finished by the end of this month, and we are glad to open-source the project so that more developers can get involved with DGL or use DGL on K8s. However, I have a question, which is the main subject of this issue: is dmlc willing to host our project? I noticed that dmlc usually does not host Golang projects, but that's OK; we can also contribute this Operator to the Kubeflow community (the XGBoost Operator is hosted by Kubeflow).

Looking forward to your response; feel free to ping me.

Ref: XGBoost Operator Design

@zheng-da
Collaborator

It's great that you would like to contribute to our ecosystem. I'm not sure where the best place to host your project is; this is something we can discuss.

@ryantd
Author

ryantd commented Apr 19, 2021

More on the "DGL Operator":

  1. It is a Golang project, so it may need a new repo.
  2. It runs only on the Kubernetes infra stack, like XGBoost Operator vs. XGBoost and Paddle Operator vs. PaddlePaddle.
  3. Users (ML engineers) only need to add a partition script and a training script on top of the base image we provide, or build an image themselves. copy_files.py and launch.py are included in our base image.
  4. It auto-generates the IP config and automatically sets up the distributed workload (K8s pods); later on it may support elastic workloads.
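To make item 4 concrete, here is a hypothetical sketch of how an operator-side helper could auto-generate the DGL IP config from K8s pod identities instead of a hand-written file. The StatefulSet-style DNS pattern (`<pod>.<headless-service>`), the naming scheme, and the port are assumptions for illustration only, not the DGL Operator's actual design.

```python
# Hypothetical sketch of auto-generating the ip config on K8s.
# Assumption: worker pods get stable, predictable hostnames via a
# headless service (StatefulSet-style "<pod>.<service>" DNS names).
# All names and the port below are illustrative, not real CRD fields.

def generate_ip_config(job_name, num_workers, service, port=30050):
    """Return one 'host port' line per worker pod in the job."""
    return [
        f"{job_name}-worker-{i}.{service} {port}"
        for i in range(num_workers)
    ]

if __name__ == "__main__":
    for line in generate_ip_config("demo", 2, "demo-svc"):
        print(line)
```

Because the pod names are deterministic, the operator can render this config before the pods are even scheduled, which is what removes the manual IP-config and SSH setup from the user's workflow.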

@ryantd
Author

ryantd commented Apr 23, 2021

After talking with @zheng-da and having an internal discussion in our team, we decided to contribute the DGL Operator repo to the Kubeflow community, because 1) the DGL Operator is a Golang project built on Kubernetes infra, and contributing it to Kubeflow may reach more Golang and Kubernetes engineers; 2) the Kubeflow community has a lot of experienced Golang and Kubernetes engineers, and we can work together on improving the stability and the high-level design.

We have already submitted the proposal to the Kubeflow community; please let me know if there are any issues or concerns.

Proposal PR: kubeflow/community#512
Proposal (reader-friendly): https://github.com/ryantd/community/blob/dgl-operator/proposals/dgl-operator-proposal.md

@terrytangyuan
Member

Great to see the proposal! I just cc'ed the Kubeflow Training WG leads. We will review it soon.

@github-actions

github-actions bot commented Mar 2, 2022

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.

@github-actions

This issue is closed due to lack of activity. Feel free to reopen it if you still have questions.
