Modify LLM Trainer to support BERT and Tiny LLaMA #2031
Conversation
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: andreyvelich.
Pull Request Test Coverage Report for Build 8300892008 (Coveralls)
```diff
@@ -394,6 +395,10 @@ def get_pvc_spec(
         ),
     )
+
+    # If PyTorchJob has 1 worker, ReadWriteOnce access mode is sufficient for PVC.
+    if num_workers == 1:
+        pvc_spec.spec.access_modes = ["ReadWriteOnce"]
```
can we take this in storage config? And use it in line 391 instead of this
Good point, let me change it.
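For reference, "taking the access modes from the storage config" could look roughly like the sketch below. All names here (`get_access_modes`, `storage_config`, the `"access_modes"` key) are illustrative, not the PR's actual code:

```python
def get_access_modes(num_workers, storage_config=None):
    """Sketch of the reviewer's suggestion: prefer access modes from the
    storage config when provided, otherwise derive them from worker count.
    (All names here are illustrative, not the PR's actual code.)"""
    if storage_config and storage_config.get("access_modes"):
        return storage_config["access_modes"]
    # A single-worker PyTorchJob mounts the PVC from one node only,
    # so ReadWriteOnce is sufficient; multiple workers need ReadWriteMany.
    return ["ReadWriteOnce"] if num_workers == 1 else ["ReadWriteMany"]
```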
```python
if "train" in dataset:
    train_data = dataset["train"]
else:
    train_data = dataset

try:
    eval_data = dataset["eval"]
```
Is it always `dataset["eval"]`?
@johnugeorge It depends on the dataset.
If the dataset doesn't have eval data, we can use `dataset.train_test_split(test_size=0.1, stratify_by_column="label")`. In that case the train and eval datasets will be stored under the `train` and `test` keys.
Should we think about various use cases in follow-up PRs @johnugeorge?
SGTM
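The split behavior discussed above can be sketched with a stdlib stand-in for `train_test_split` (the `split_dataset` helper below is hypothetical, not the HuggingFace API); the point is that the resulting eval data lands under a `test` key, not `eval`:

```python
import random

def split_dataset(samples, test_size=0.1, seed=42):
    """Hypothetical stand-in for datasets.Dataset.train_test_split:
    shuffles the samples and returns the split under "train" and "test"
    keys, mirroring the HuggingFace naming."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_size))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}

splits = split_dataset(range(20), test_size=0.2)
train_data, eval_data = splits["train"], splits["test"]  # no "eval" key
```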
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Generally lgtm
I'd like to merge this PR ASAP since this is a blocker for all PRs.
```diff
@@ -39,6 +39,8 @@ def load_config(self, serialised_args):
         self.config = S3DatasetParams(**json.loads(serialised_args))

+    def download_dataset(self):
+        import boto3
```
Should we put this import on the top?
@tenzen-y I did this on purpose so the Training Operator SDK won't be dependent on boto3 while importing the S3 storage init: https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/training/api/training_client.py#L125
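The deferred-import pattern under discussion can be sketched as follows. The class name and method body are illustrative, not the SDK's actual code; the point is only that `import boto3` runs when the method is called, not when the module is imported:

```python
class S3DatasetProcessor:
    """Illustrative sketch of the lazy-import pattern: importing this
    module (or instantiating the class) never touches boto3."""

    def download_dataset(self, bucket, key, dest):
        # Deferred import: boto3 is only required if an S3 download actually runs.
        import boto3

        client = boto3.client("s3")
        client.download_file(bucket, key, dest)
```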
```python
)

# TODO (andreyvelich): Currently, data collator is supported only for casual LM Transformer.
```
Suggested change:
```diff
-# TODO (andreyvelich): Currently, data collator is supported only for casual LM Transformer.
+# TODO (andreyvelich): Currently, data collector is supported only for casual LM Transformer.
```
What is the TODO for? I guess you'd like to support a data collator for Transformers other than the causal LM one, right?
If so, could we open an issue?
I think it's called Data Collator in HuggingFace: https://huggingface.co/docs/transformers/en/main_classes/data_collator
We need to investigate if we want to apply Data Collator for other transformers. I will create an issue.
I see. Thanks.
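For context, a data collator turns a list of variable-length samples into a padded batch. A minimal stdlib sketch of the idea (not HuggingFace's actual `DataCollatorWithPadding` implementation):

```python
def pad_collate(batch, pad_id=0):
    """Pad each token-id sequence in the batch to the longest length and
    build the matching attention mask (1 = real token, 0 = padding)."""
    max_len = max(len(seq) for seq in batch)
    return {
        "input_ids": [seq + [pad_id] * (max_len - len(seq)) for seq in batch],
        "attention_mask": [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch],
    }

out = pad_collate([[5, 6, 7], [8]])
# out["input_ids"] == [[5, 6, 7], [8, 0, 0]]
```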
Fix access modes in storage config
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
/lgtm
```diff
@@ -77,11 +94,19 @@ def load_config(self, serialised_args):
         self.config = HfDatasetParams(**json.loads(serialised_args))

     def download_dataset(self):
-        print("downloading dataset")
+        logger.info("Downloading dataset")
+        logger.info("-" * 40)
+        import huggingface_hub
+        from datasets import load_dataset
+
+        if self.config.access_token:
+            huggingface_hub.login(self.config.access_token)
+
+        load_dataset(self.config.repo_id, cache_dir=VOLUME_PATH_DATASET)
```
@andreyvelich why are we downloading the dataset again?
It's a great catch @deepanker13! We should remove it.
/hold cancel
/assign @deepanker13 @johnugeorge @tenzen-y
/lgtm
* Modify LLM Trainer to support BERT and Tiny LLaMA
* Access PVC access modes to storage config
* Format Python files
* Distribute datasets. Fix access modes in storage config
* Update example to fine tune BERT with Train API
* Remove dataset download twice

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
These changes should allow us to use the `train` API with BERT and Tiny LLaMA models, and we can demo our Notebook during KubeCon. I will update the Fine-tune BERT Notebook in this PR soon.

List of changes:

- Use the `save_to_disk` and `load_from_disk` APIs to download and upload the HuggingFace dataset. That will allow us to introduce a `split` parameter to reduce the number of samples before saving the dataset to disk. I understand that `save_to_disk` might not work with an `IterableDataset` HuggingFace dataset, but we can discuss further what we can do with that.
- Remove the `device_map`, `pad_token`, and `add_pad_token` settings from `AutoTokenizer`. Some of these settings don't work with BERT (e.g. `device_map`): BertForSequenceClassification does not support `device_map="auto"` yet (huggingface/transformers#25296). For the long term we should discuss whether we need to introduce Tokenizer settings where users can set the appropriate params.
- If PyTorchJob has 1 worker, `ReadWriteOnce` mode should be sufficient for the PVC.

Please take a look at these changes.
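The save-to-disk flow with a sample-reducing `split` parameter, as described above, can be sketched with stdlib stand-ins (`save_subset`/`load_subset` and the JSON format are hypothetical, not the `datasets` library's `save_to_disk`/`load_from_disk` API):

```python
import json
import os
import tempfile

def save_subset(samples, path, split=None):
    """Hypothetical stand-in for Dataset.save_to_disk: optionally trim the
    dataset to the first `split` samples before materializing it on disk."""
    subset = list(samples)[:split] if split else list(samples)
    with open(path, "w") as f:
        json.dump(subset, f)
    return subset

def load_subset(path):
    """Hypothetical stand-in for datasets.load_from_disk."""
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "dataset.json")
save_subset(range(1000), path, split=100)
loaded = load_subset(path)  # only the trimmed 100 samples survive the round trip
```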
/assign @johnugeorge @deepanker13 @tenzen-y @kuizhiqing
/hold for the review