
Train/Fine-tune API Proposal for LLMs #1945

Merged (17 commits) on Dec 5, 2023

Conversation

@deepanker13 (Contributor) commented Nov 10, 2023:

Adding a proposal for a new Train API in the training-operator SDK for training/fine-tuning LLMs.
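
To make the proposal concrete, below is a minimal Python sketch of what such a high-level call could look like from the training-operator SDK. The exact signature, argument names, and parameter shapes (the dict-based model/dataset providers, the resource keys) are illustrative assumptions, not the API defined in the proposal document.

```python
# Hypothetical usage sketch of the proposed train/fine-tune API.
# The train() signature and the dict-shaped provider parameters below are
# assumptions for illustration; the proposal document defines the actual API.
from kubeflow.training import TrainingClient

client = TrainingClient()
client.train(
    name="llm-finetune-example",
    num_workers=2,                     # distributed worker replicas
    num_procs_per_worker=1,            # processes (GPUs) per worker
    model_provider_parameters={        # e.g. a Hugging Face model to fine-tune
        "repo_id": "hf-internal-testing/tiny-random-gpt2",
        "access_token": "<token>",
    },
    dataset_provider_parameters={      # e.g. a dataset reference (HF hub / S3)
        "repo_id": "imdb",
    },
    parameters={"num_train_epochs": 1, "learning_rate": 2e-5},
    resources_per_worker={"gpu": 1, "cpu": 4, "memory": "16Gi"},
)
```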

@deepanker13 (Contributor, Author) commented:

@johnugeorge

@johnugeorge (Member) commented:

/cc @kubeflow/wg-training-leads @kuizhiqing @tenzen-y

@google-oss-prow google-oss-prow bot requested review from tenzen-y and a team November 10, 2023 19:37
@coveralls commented Nov 15, 2023:

Pull Request Test Coverage Report for Build 6826741030

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 16 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.2%) to 42.722%

File with coverage reduction | New missed lines | Coverage %
pkg/controller.v1/pytorch/hpa.go | 4 | 81.36%
pkg/controller.v1/mpi/mpijob_controller.go | 12 | 80.29%

Totals (Coverage Status):
  Change from base Build 6723816999: -0.2%
  Covered Lines: 3739
  Relevant Lines: 8752

💛 - Coveralls

@tenzen-y (Member) commented:

Sorry for the delay; I've finally gotten to this.

@tenzen-y (Member) left a review comment:


At first glance, this approach looks good to me.

By the way, can you add goals and non-goals to clarify the objective?

Also, it isn't clear how other frameworks would be handled, since only PyTorchJob is mentioned in this proposal. Can you add proposals for other frameworks as well?

I appreciate @johnugeorge's and @deepak-muley's efforts.

(Inline review comments on docs/proposals/train_api_proposal.md)
@johnugeorge (Member) commented:

> Also, it isn't clear how other frameworks would be handled, since only PyTorchJob is mentioned in this proposal. Can you add proposals for other frameworks as well?

@tenzen-y For this new higher-level API over different model providers, the framework abstraction is hidden from the user. From the user's perspective, when using a specific model provider it doesn't matter whether the distributed training is deployed using PyTorchJob or not. For a future use case where a new model provider requires TensorFlow training, we would need to extend the SDK to deploy distributed TF training; I think we can handle that on a case-by-case basis. For Hugging Face, PyTorchJob support should be sufficient to deploy distributed training.
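
To illustrate the point about hiding the framework from the user, here is a minimal Python sketch of how the SDK could map a model provider to the job kind it deploys underneath. The mapping table and helper name are assumptions made for illustration, not part of the proposal text.

```python
# Sketch: the SDK resolves the underlying job kind from the model provider,
# so users never choose between PyTorchJob, TFJob, etc. themselves.
# (Illustrative assumption; not the proposal's actual implementation.)
PROVIDER_TO_JOB_KIND = {
    "huggingface": "PyTorchJob",  # HF training runs via torchrun / PyTorchJob
    # a future TensorFlow-based provider could map to "TFJob" here
}

def resolve_job_kind(model_provider: str) -> str:
    """Return the training-operator job kind used for a given model provider."""
    try:
        return PROVIDER_TO_JOB_KIND[model_provider]
    except KeyError as err:
        raise ValueError(f"Unsupported model provider: {model_provider}") from err
```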

@tenzen-y (Member) commented:

> For this new higher-level API over different model providers, the framework abstraction is hidden from the user. [...] For Hugging Face, PyTorchJob support should be sufficient to deploy distributed training.

@johnugeorge That makes sense. Can you put it in the proposal?

@deepanker13 (Contributor, Author) commented:

> @johnugeorge That makes sense. Can you put it in the proposal?

Added it under Limitations.

johnugeorge and others added 4 commits November 21, 2023 15:50
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@kuizhiqing (Member) commented:

@johnugeorge @deepanker13

Thanks for your nice proposal.

Before diving into the details, I have two questions I would like to discuss with you.

  • AFAIK, LLM training is actually dominated by DeepSpeed, Megatron, and Hugging Face. If we are talking about training modes, there are two options: the command-delivery mode that is the default in DeepSpeed (like mpirun/pdsh), and the rendezvous mode usually used with Megatron/Hugging Face (like torchrun/accelerate). If I understand correctly, we are focusing on the torchrun/Hugging Face training mode here, or do you have a plan bigger than that?

  • You are planning something outside the operator's scope in Kubernetes, right? The high-level pieces may deserve a new project above the operator, so we have more flexibility (I'd like to do something like that myself, but I don't have the bandwidth yet). I agree if you are planning to incubate it here, but it would be appreciated to know your long-term idea.

@johnugeorge (Member) commented Nov 21, 2023:

@kuizhiqing Thanks for the comments

> • AFAIK, LLM training is actually dominated by DeepSpeed, Megatron, and Hugging Face. [...] If I understand correctly, we are focusing on the torchrun/Hugging Face training mode here, or do you have a plan bigger than that?

Did you mean DeepSpeed as a separate launcher, or used together with the other supported frameworks? For example, using DeepSpeed with HF should be straightforward with torchrun. If we want a DeepSpeed launcher inside the training-operator, we might have to set up a few things that the mpi-operator has already done. I would love to add DeepSpeed support with this API as well. One possible option is to add a launcher field: in the default torchrun case it will use PyTorchJob, and for launchers like DeepSpeed we use MPIJob underneath instead of PyTorchJob. Any thoughts?

> • You are planning something outside the operator's scope in Kubernetes, right? [...] It would be appreciated to know your long-term idea.

Good point. I had similar thoughts, and I agree with the flexibility argument. But I feel there would be significant overhead if we separated this into a new project now (keeping the projects in sync, etc.). When the higher layer becomes really large, we can split it out.

@andreyvelich (Member) commented:

> But I feel there will be significant overhead if we separate it into a new project now (keeping the projects in sync, etc.).

I agree with @johnugeorge. I think we should not separate the SDK out of the Training Operator initially, since the SDK is the interface for data scientists to interact with the Kubeflow Training Operator.

Ideally, in the long term we can create a Kubeflow Python library for Kubeflow users, so that the user story for working with a Kubeflow cluster looks like this:

```python
!pip install kubeflow  # Kubeflow cluster should be pre-deployed

from kubeflow import KubeflowClient

client = KubeflowClient()
client.train()  # To train my model
client.tune()   # To tune my model
client.serve()  # To serve my model
```

The question with this approach is how we should support the various component versions (e.g. Training Operator, Katib, and KServe (cc @yuzisun) each have their own control-plane releases).

I think we should discuss this separately and figure out a plan to simplify Kubeflow usage.

WDYT @kuizhiqing @tenzen-y @johnugeorge ?

@andreyvelich (Member) left a review comment:


Thank you for this effort, @deepanker13 and @johnugeorge!
I left a few comments.

(Inline review comments on docs/proposals/train_api_proposal.md)
@kuizhiqing (Member) commented:

> One possible option is to add a launcher field: in the default torchrun case it will use PyTorchJob, and for launchers like DeepSpeed we use MPIJob underneath instead of PyTorchJob. Any thoughts?

Yes, I mean we can focus on two approaches for LLM training: torchrun mode with PyTorchJob and mpirun/pdsh mode with MPIJob.

@johnugeorge @andreyvelich I agree with you that we should do it as an SDK here.

@johnugeorge (Member) commented:

> Yes, I mean we can focus on two approaches for LLM training: torchrun mode with PyTorchJob and mpirun/pdsh mode with MPIJob.
>
> @johnugeorge @andreyvelich I agree with you that we should do it as an SDK here.

We can add a launcher field in the arguments. By default it will be torchrun, and we can support the DeepSpeed launcher by invoking MPIJob in the backend. What do you think? @kuizhiqing
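
As a sketch of the launcher idea being discussed, the dispatch could look like the following. The `launcher` argument name, its default, and the mapping are assumptions made for illustration rather than an agreed-upon design.

```python
# Sketch of launcher-based dispatch (illustrative assumption):
#   torchrun  -> rendezvous-style launch, backed by PyTorchJob (default)
#   deepspeed -> mpirun/pdsh-style launch, backed by MPIJob
LAUNCHER_TO_JOB_KIND = {
    "torchrun": "PyTorchJob",
    "deepspeed": "MPIJob",
}

def job_kind_for_launcher(launcher: str = "torchrun") -> str:
    """Pick the job kind the SDK would create for the requested launcher."""
    if launcher not in LAUNCHER_TO_JOB_KIND:
        raise ValueError(f"Unsupported launcher: {launcher}")
    return LAUNCHER_TO_JOB_KIND[launcher]
```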

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@tenzen-y (Member) commented Nov 24, 2023:

> Ideally, in the long term we can create a Kubeflow Python library for Kubeflow users. [...] I think we should discuss this separately and figure out a plan to simplify Kubeflow usage.

That makes sense. Ideally, we should provide an all-in-one Kubeflow SDK. We can discuss it outside of this proposal.

@johnugeorge (Member) commented:

Thanks @deepanker13
/lgtm

(Inline review comments on docs/proposals/train_api_proposal.md)
@google-oss-prow google-oss-prow bot removed the lgtm label Dec 5, 2023
@deepanker13 (Contributor, Author) commented:

@andreyvelich @tenzen-y I have made the suggested changes.

@tenzen-y (Member) left a review comment:


Thanks!
/approve
/assign @andreyvelich


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepanker13, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich (Member) commented:

Thank you for doing this @deepanker13!
Excited to see this feature live.
/hold for Johnu LGTM
/lgtm
/assign @johnugeorge

@johnugeorge (Member) commented:

Thanks @deepanker13
/lgtm

@johnugeorge (Member) commented:

/hold cancel
