
Train/Fine-tune API Proposal for LLMs #1945

Merged (17 commits) on Dec 5, 2023

Conversation

@deepanker13 (Contributor) commented Nov 10, 2023:

Adding a proposal for a new Train API in the training-operator SDK for training/fine-tuning LLMs.
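
To make the proposal concrete, below is a minimal Python sketch of what such a high-level call could look like from the training-operator SDK. The exact signature, argument names, and parameter shapes (the dict-based model/dataset providers, the resource keys) are illustrative assumptions, not the API defined in the proposal document.

```python
# Hypothetical usage sketch of the proposed train/fine-tune API.
# The train() signature and the dict-shaped provider parameters below are
# assumptions for illustration; the proposal document defines the actual API.
from kubeflow.training import TrainingClient

client = TrainingClient()
client.train(
    name="llm-finetune-example",
    num_workers=2,                     # distributed worker replicas
    num_procs_per_worker=1,            # processes (GPUs) per worker
    model_provider_parameters={        # e.g. a Hugging Face model to fine-tune
        "repo_id": "hf-internal-testing/tiny-random-gpt2",
        "access_token": "<token>",
    },
    dataset_provider_parameters={      # e.g. a dataset reference (HF hub / S3)
        "repo_id": "imdb",
    },
    parameters={"num_train_epochs": 1, "learning_rate": 2e-5},
    resources_per_worker={"gpu": 1, "cpu": 4, "memory": "16Gi"},
)
```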

@deepanker13 (Contributor, Author) commented:

@johnugeorge

@johnugeorge (Member) commented:

/cc @kubeflow/wg-training-leads @kuizhiqing @tenzen-y

@google-oss-prow google-oss-prow bot requested review from tenzen-y and a team November 10, 2023 19:37
@coveralls commented Nov 15, 2023:

Pull Request Test Coverage Report for Build 6826741030

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 16 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.2%) to 42.722%

File with coverage reduction | New missed lines | Coverage %
pkg/controller.v1/pytorch/hpa.go | 4 | 81.36%
pkg/controller.v1/mpi/mpijob_controller.go | 12 | 80.29%

Totals (Coverage Status):
  Change from base Build 6723816999: -0.2%
  Covered Lines: 3739
  Relevant Lines: 8752

💛 - Coveralls

@tenzen-y (Member) commented:

Sorry for the delay; I've finally gotten to this.

@tenzen-y (Member) left a review comment:


At first glance, this approach looks good to me.

By the way, can you add goals and non-goals to clarify the objective?

Also, it isn't clear how other frameworks would be handled, since only PyTorchJob is mentioned in this proposal. Can you add proposals for other frameworks as well?

I appreciate @johnugeorge's and @deepak-muley's efforts.

(Inline review comments on docs/proposals/train_api_proposal.md)
@johnugeorge (Member) commented:

> Also, it isn't clear how other frameworks would be handled, since only PyTorchJob is mentioned in this proposal. Can you add proposals for other frameworks as well?

@tenzen-y For this new higher-level API over different model providers, the framework abstraction is hidden from the user. From the user's perspective, when using a specific model provider it doesn't matter whether the distributed training is deployed using PyTorchJob or not. For a future use case where a new model provider requires TensorFlow training, we would need to extend the SDK to deploy distributed TF training; I think we can handle that on a case-by-case basis. For Hugging Face, PyTorchJob support should be sufficient to deploy distributed training.
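
To illustrate the point about hiding the framework from the user, here is a minimal Python sketch of how the SDK could map a model provider to the job kind it deploys underneath. The mapping table and helper name are assumptions made for illustration, not part of the proposal text.

```python
# Sketch: the SDK resolves the underlying job kind from the model provider,
# so users never choose between PyTorchJob, TFJob, etc. themselves.
# (Illustrative assumption; not the proposal's actual implementation.)
PROVIDER_TO_JOB_KIND = {
    "huggingface": "PyTorchJob",  # HF training runs via torchrun / PyTorchJob
    # a future TensorFlow-based provider could map to "TFJob" here
}

def resolve_job_kind(model_provider: str) -> str:
    """Return the training-operator job kind used for a given model provider."""
    try:
        return PROVIDER_TO_JOB_KIND[model_provider]
    except KeyError as err:
        raise ValueError(f"Unsupported model provider: {model_provider}") from err
```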

@tenzen-y (Member) commented:

> For this new higher-level API over different model providers, the framework abstraction is hidden from the user. [...] For Hugging Face, PyTorchJob support should be sufficient to deploy distributed training.

@johnugeorge That makes sense. Can you put it in the proposal?

@deepanker13 (Contributor, Author) commented:

> @johnugeorge That makes sense. Can you put it in the proposal?

Added it under Limitations.

johnugeorge and others added 4 commits November 21, 2023 15:50
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@kuizhiqing (Member) commented:

@johnugeorge @deepanker13

Thanks for your nice proposal.

Before diving into the details, I have two questions I would like to discuss with you.

  • AFAIK, LLM training is actually dominated by DeepSpeed, Megatron, and Hugging Face. If we are talking about training modes, there are two options: the command-delivery mode that is the default in DeepSpeed (like mpirun/pdsh), and the rendezvous mode usually used with Megatron/Hugging Face (like torchrun/accelerate). If I understand correctly, we are focusing on the torchrun/Hugging Face training mode here, or do you have a plan bigger than that?

  • You are planning something outside the operator's scope in Kubernetes, right? The high-level pieces may deserve a new project above the operator, so we have more flexibility (I'd like to do something like that myself, but I don't have the bandwidth yet). I agree if you are planning to incubate it here, but it would be appreciated to know your long-term idea.

@johnugeorge (Member) commented Nov 21, 2023:

@kuizhiqing Thanks for the comments

> • AFAIK, LLM training is actually dominated by DeepSpeed, Megatron, and Hugging Face. [...] If I understand correctly, we are focusing on the torchrun/Hugging Face training mode here, or do you have a plan bigger than that?

Did you mean DeepSpeed as a separate launcher, or used together with the other supported frameworks? For example, using DeepSpeed with HF should be straightforward with torchrun. If we want a DeepSpeed launcher inside the training-operator, we might have to set up a few things that the mpi-operator has already done. I would love to add DeepSpeed support with this API as well. One possible option is to add a launcher field: in the default torchrun case it will use PyTorchJob, and for launchers like DeepSpeed we use MPIJob underneath instead of PyTorchJob. Any thoughts?

> • You are planning something outside the operator's scope in Kubernetes, right? [...] It would be appreciated to know your long-term idea.

Good point. I had similar thoughts, and I agree with the flexibility argument. But I feel there would be significant overhead if we separated this into a new project now (keeping the projects in sync, etc.). When the higher layer becomes really large, we can split it out.

@andreyvelich (Member) commented:

> But I feel there will be significant overhead if we separate it into a new project now (keeping the projects in sync, etc.).

I agree with @johnugeorge. I think we should not separate the SDK out of the Training Operator initially, since the SDK is the interface for data scientists to interact with the Kubeflow Training Operator.

Ideally, in the long term we can create a Kubeflow Python library for Kubeflow users, so that the user story for working with a Kubeflow cluster looks like this:

```python
!pip install kubeflow  # Kubeflow cluster should be pre-deployed

from kubeflow import KubeflowClient

client = KubeflowClient()
client.train()  # To train my model
client.tune()   # To tune my model
client.serve()  # To serve my model
```

The question with this approach is how we should support the various component versions (e.g. Training Operator, Katib, and KServe (cc @yuzisun) each have their own control-plane releases).

I think we should discuss this separately and figure out a plan to simplify Kubeflow usage.

WDYT @kuizhiqing @tenzen-y @johnugeorge ?

@andreyvelich (Member) left a review comment:


Thank you for this effort, @deepanker13 and @johnugeorge!
I left a few comments.

(Inline review comments on docs/proposals/train_api_proposal.md)
@kuizhiqing (Member) commented:

> One possible option is to add a launcher field: in the default torchrun case it will use PyTorchJob, and for launchers like DeepSpeed we use MPIJob underneath instead of PyTorchJob. Any thoughts?

Yes, I mean we can focus on two approaches for LLM training: torchrun mode with PyTorchJob and mpirun/pdsh mode with MPIJob.

@johnugeorge @andreyvelich I agree with you that we should do it as an SDK here.

@johnugeorge (Member) commented:

> Yes, I mean we can focus on two approaches for LLM training: torchrun mode with PyTorchJob and mpirun/pdsh mode with MPIJob.
>
> @johnugeorge @andreyvelich I agree with you that we should do it as an SDK here.

We can add a launcher field in the arguments. By default it will be torchrun, and we can support the DeepSpeed launcher by invoking MPIJob in the backend. What do you think? @kuizhiqing
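
As a sketch of the launcher idea being discussed, the dispatch could look like the following. The `launcher` argument name, its default, and the mapping are assumptions made for illustration rather than an agreed-upon design.

```python
# Sketch of launcher-based dispatch (illustrative assumption):
#   torchrun  -> rendezvous-style launch, backed by PyTorchJob (default)
#   deepspeed -> mpirun/pdsh-style launch, backed by MPIJob
LAUNCHER_TO_JOB_KIND = {
    "torchrun": "PyTorchJob",
    "deepspeed": "MPIJob",
}

def job_kind_for_launcher(launcher: str = "torchrun") -> str:
    """Pick the job kind the SDK would create for the requested launcher."""
    if launcher not in LAUNCHER_TO_JOB_KIND:
        raise ValueError(f"Unsupported launcher: {launcher}")
    return LAUNCHER_TO_JOB_KIND[launcher]
```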

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@tenzen-y (Member) commented Nov 24, 2023:

> Ideally, in the long term we can create a Kubeflow Python library for Kubeflow users. [...] I think we should discuss this separately and figure out a plan to simplify Kubeflow usage.

That makes sense. Ideally, we should provide an all-in-one Kubeflow SDK. We can discuss it outside of this proposal.

@johnugeorge (Member) commented:

Thanks @deepanker13
/lgtm

(Inline review comments on docs/proposals/train_api_proposal.md)
@google-oss-prow google-oss-prow bot removed the lgtm label Dec 5, 2023
@deepanker13 (Contributor, Author) commented:

@andreyvelich @tenzen-y I have made the suggested changes.

@tenzen-y (Member) left a review comment:


Thanks!
/approve
/assign @andreyvelich


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepanker13, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich (Member) commented:

Thank you for doing this @deepanker13!
Excited to see this feature live.
/hold for Johnu LGTM
/lgtm
/assign @johnugeorge

@johnugeorge (Member) commented:

Thanks @deepanker13
/lgtm

@johnugeorge (Member) commented:

/hold cancel
