DGL Operator: Leverage DGL on K8s #2843
Comments
It's great that you would like to contribute to our ecosystem. I'm not sure where the best place to host your project is. This is something we can discuss.

More on this "DGL Operator":
After talking with @zheng-da and having an internal discussion in our team, we decided to contribute the DGL Operator repo to the Kubeflow community, because 1) the DGL Operator is a Golang project and Kubernetes infrastructure, so contributing to Kubeflow may reach more Golang and Kubernetes engineers; 2) the Kubeflow community has many experienced Golang and Kubernetes engineers, so we can work together on improving the stability and high-level design. We have already submitted the proposal to the Kubeflow community; please let me know if there is any issue or concern. Proposal PR: kubeflow/community#512
Great to see the proposal! I just cc'ed the Kubeflow Training WG leads. We will review it soon.
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.
This issue is closed due to lack of activity. Feel free to reopen it if you still have questions. |
This is Xiaoyu Zhai, from Qihoo 360 AI Infra. Recently there has been some internal demand for the DGL/DGL-KE frameworks from our AI/ML teams, so we have kicked off research on distributed DGL training.

Native distributed DGL training operates at the machine level: you need to manually set up the IP config, grant passwordless SSH access, use `copy_files.py` to dispatch your partitioned data, and use `launch.py` to invoke your training. What we want to offer our users is automated distributed training, and most importantly, workloads that can be orchestrated on K8s. So we decided to develop a "DGL Operator" to leverage DGL training on K8s. It covers the distributed scaffolding for ML engineers, so they only need to work on the partition script and the train script.

The first version of the DGL Operator will be finished by the end of this month, and we are glad to open-source our project so that more developers can get involved in DGL or use DGL on K8s. However, I have a question, which is the main subject of this issue: is `dmlc` willing to host our project? I noticed that `dmlc` usually does not host Golang projects, but that's OK; we can also contribute this Operator to the Kubeflow community (the XGBoost Operator is hosted by Kubeflow). Looking forward to any response from you; feel free to ping me.
Ref: XGBoost Operator Design
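To make the "partition script + train script" division of labor concrete, here is a minimal Go sketch of what a `DGLJob` custom-resource spec *might* look like, modeled loosely on the style of Kubeflow operators such as the XGBoost Operator. All type names, field names, and values below are hypothetical illustrations, not the actual DGL Operator API; the point is only that the operator, rather than the user, would own the IP config, SSH setup, and partition dispatch described above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ReplicaSpec describes one role in the job (e.g. a launcher that runs
// the partition script, and workers that run the train script).
// Field names here are illustrative, not the real DGL Operator API.
type ReplicaSpec struct {
	Replicas int      `json:"replicas"`
	Image    string   `json:"image"`
	Command  []string `json:"command"`
}

// DGLJobSpec groups the per-role specs. A real operator reconciling this
// object would generate the ip config, passwordless SSH wiring, and
// partition-data dispatch that users otherwise set up by hand.
type DGLJobSpec struct {
	PartitionMode string                 `json:"partitionMode"`
	Replicas      map[string]ReplicaSpec `json:"replicas"`
}

// NewDemoJob builds a sample job: one launcher partitioning the graph,
// two workers training on the resulting partitions.
func NewDemoJob() DGLJobSpec {
	return DGLJobSpec{
		PartitionMode: "DGL-API",
		Replicas: map[string]ReplicaSpec{
			"Launcher": {Replicas: 1, Image: "dgl:latest", Command: []string{"python3", "partition.py"}},
			"Worker":   {Replicas: 2, Image: "dgl:latest", Command: []string{"python3", "train.py"}},
		},
	}
}

func main() {
	// Serialize the spec as it might appear inside a DGLJob manifest.
	out, err := json.Marshal(NewDemoJob())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Under this shape, an ML engineer would only supply the container image plus the partition and train scripts, which matches the scaffolding goal stated in the issue.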