Modify LLM Trainer to support BERT and Tiny LLaMA #2031
Conversation
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: andreyvelich.
Pull Request Test Coverage Report for Build 8300892008 (Coveralls)
```diff
@@ -394,6 +395,10 @@ def get_pvc_spec(
         ),
     )
+
+    # If PyTorchJob has 1 worker, ReadWriteOnce access mode is sufficient for PVC.
+    if num_workers == 1:
+        pvc_spec.spec.access_modes = ["ReadWriteOnce"]
```
can we take this in storage config? And use it in line 391 instead of this
Good point, let me change it.
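For reference, "taking the access modes from the storage config" could look roughly like the sketch below. All names here (`get_access_modes`, `storage_config`, the `"access_modes"` key) are illustrative, not the PR's actual code:

```python
def get_access_modes(num_workers, storage_config=None):
    """Sketch of the reviewer's suggestion: prefer access modes from the
    storage config when provided, otherwise derive them from worker count.
    (All names here are illustrative, not the PR's actual code.)"""
    if storage_config and storage_config.get("access_modes"):
        return storage_config["access_modes"]
    # A single-worker PyTorchJob mounts the PVC from one node only,
    # so ReadWriteOnce is sufficient; multiple workers need ReadWriteMany.
    return ["ReadWriteOnce"] if num_workers == 1 else ["ReadWriteMany"]
```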
```python
if "train" in dataset:
    train_data = dataset["train"]
else:
    train_data = dataset

try:
    eval_data = dataset["eval"]
```
Is it always `dataset["eval"]`?
@johnugeorge It depends on the dataset.
If the dataset doesn't have eval data, we can use `dataset.train_test_split(test_size=0.1, stratify_by_column="label")`. In that case the train and eval datasets will be stored under the `train` and `test` keys.
Should we think about various use cases in follow-up PRs @johnugeorge?
SGTM
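The split behavior discussed above can be sketched with a stdlib stand-in for `train_test_split` (the `split_dataset` helper below is hypothetical, not the HuggingFace API); the point is that the resulting eval data lands under a `test` key, not `eval`:

```python
import random

def split_dataset(samples, test_size=0.1, seed=42):
    """Hypothetical stand-in for datasets.Dataset.train_test_split:
    shuffles the samples and returns the split under "train" and "test"
    keys, mirroring the HuggingFace naming."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_size))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}

splits = split_dataset(range(20), test_size=0.2)
train_data, eval_data = splits["train"], splits["test"]  # no "eval" key
```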
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Generally lgtm
I'd like to merge this PR ASAP since this is a blocker for all PRs.
```diff
@@ -39,6 +39,8 @@ def load_config(self, serialised_args):
         self.config = S3DatasetParams(**json.loads(serialised_args))

+    def download_dataset(self):
+        import boto3
```
Should we put this import on the top?
@tenzen-y I did this on purpose so the Training Operator SDK won't be dependent on boto3 while importing the S3 storage init: https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/training/api/training_client.py#L125
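The deferred-import pattern under discussion can be sketched as follows. The class name and method body are illustrative, not the SDK's actual code; the point is only that `import boto3` runs when the method is called, not when the module is imported:

```python
class S3DatasetProcessor:
    """Illustrative sketch of the lazy-import pattern: importing this
    module (or instantiating the class) never touches boto3."""

    def download_dataset(self, bucket, key, dest):
        # Deferred import: boto3 is only required if an S3 download actually runs.
        import boto3

        client = boto3.client("s3")
        client.download_file(bucket, key, dest)
```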
```python
)

# TODO (andreyvelich): Currently, data collator is supported only for casual LM Transformer.
```
Suggested change:
```diff
-# TODO (andreyvelich): Currently, data collator is supported only for casual LM Transformer.
+# TODO (andreyvelich): Currently, data collector is supported only for casual LM Transformer.
```
What is the TODO for? I guess you'd like to support a data collator for Transformers other than the causal LM one, right?
If so, could we open an issue?
I think it's called Data Collator in HuggingFace: https://huggingface.co/docs/transformers/en/main_classes/data_collator
We need to investigate if we want to apply Data Collator for other transformers. I will create an issue.
I see. Thanks.
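For context, a data collator turns a list of variable-length samples into a padded batch. A minimal stdlib sketch of the idea (not HuggingFace's actual `DataCollatorWithPadding` implementation):

```python
def pad_collate(batch, pad_id=0):
    """Pad each token-id sequence in the batch to the longest length and
    build the matching attention mask (1 = real token, 0 = padding)."""
    max_len = max(len(seq) for seq in batch)
    return {
        "input_ids": [seq + [pad_id] * (max_len - len(seq)) for seq in batch],
        "attention_mask": [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch],
    }

out = pad_collate([[5, 6, 7], [8]])
# out["input_ids"] == [[5, 6, 7], [8, 0, 0]]
```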
Fix access modes in storage config
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
/lgtm
```diff
@@ -77,11 +94,19 @@ def load_config(self, serialised_args):
         self.config = HfDatasetParams(**json.loads(serialised_args))

     def download_dataset(self):
-        print("downloading dataset")
+        logger.info("Downloading dataset")
+        logger.info("-" * 40)
+        import huggingface_hub
+        from datasets import load_dataset
+
+        if self.config.access_token:
+            huggingface_hub.login(self.config.access_token)
+
+        load_dataset(self.config.repo_id, cache_dir=VOLUME_PATH_DATASET)
```
@andreyvelich why are we downloading the dataset again?
It's a great catch @deepanker13! We should remove it.
/hold cancel
/assign @deepanker13 @johnugeorge @tenzen-y
/lgtm
* Modify LLM Trainer to support BERT and Tiny LLaMA
* Access PVC access modes to storage config
* Format Python files
* Distribute datasets. Fix access modes in storage config
* Update example to fine tune BERT with Train API
* Remove dataset download twice

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
These changes should allow us to use the `train` API with BERT and Tiny LLaMA models, and we can demo our Notebook during KubeCon. I will update the Fine-tune BERT Notebook in this PR soon.

List of changes:

- Use the `save_to_disk` and `load_from_disk` APIs to download and upload the HuggingFace dataset. That will allow us to introduce a `split` parameter to reduce the number of samples before saving the dataset to disk. I understand that `save_to_disk` might not work with an `IterableDataset` HuggingFace dataset, but we can discuss further what we can do with that.
- Remove the `device_map`, `pad_token`, and `add_pad_token` settings from `AutoTokenizer`. Some of these settings don't work with BERT (e.g. `device_map`): BertForSequenceClassification does not support `device_map="auto"` yet (huggingface/transformers#25296). For the long term we should discuss whether we need to introduce Tokenizer settings where users can set the appropriate params.
- If PyTorchJob has 1 worker, `ReadWriteOnce` mode should be sufficient for the PVC.

Please take a look at these changes.
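The save-to-disk flow with a sample-reducing `split` parameter, as described above, can be sketched with stdlib stand-ins (`save_subset`/`load_subset` and the JSON format are hypothetical, not the `datasets` library's `save_to_disk`/`load_from_disk` API):

```python
import json
import os
import tempfile

def save_subset(samples, path, split=None):
    """Hypothetical stand-in for Dataset.save_to_disk: optionally trim the
    dataset to the first `split` samples before materializing it on disk."""
    subset = list(samples)[:split] if split else list(samples)
    with open(path, "w") as f:
        json.dump(subset, f)
    return subset

def load_subset(path):
    """Hypothetical stand-in for datasets.load_from_disk."""
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "dataset.json")
save_subset(range(1000), path, split=100)
loaded = load_subset(path)  # only the trimmed 100 samples survive the round trip
```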
/assign @johnugeorge @deepanker13 @tenzen-y @kuizhiqing
/hold for the review