refactor: Removal of pad_token when padding_free is set #368

Open · wants to merge 2 commits into base: main

Conversation

@Abhishek-TAMU (Collaborator) commented Oct 8, 2024

Description of the change

1- When padding_free is set, pad_token is not assigned to the tokenizer or to special_tokens_dict (a rough sketch of the resulting logic follows below).

2- Removal of the pad_token addition for GPT2Tokenizer and GPTNeoXTokenizerFast here, as this is redundant and already covered by the later condition here.
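
For illustration only, a minimal sketch of the behavior described in point 1. The helper name and default token value below are hypothetical and are not the PR's actual train() code; the sketch only assumes a padding_free flag and a tokenizer with a pad_token attribute:

```python
# Hypothetical sketch of the behavior described above; not the PR's actual code.
def maybe_set_pad_token(tokenizer, special_tokens_dict, padding_free,
                        default_pad_token="<pad>"):
    """Assign a pad token only when padding is actually used."""
    if padding_free:
        # Padding-free tuning packs sequences, so neither the tokenizer
        # nor special_tokens_dict receives a pad_token.
        return special_tokens_dict
    if tokenizer.pad_token is None:
        # Add a pad token only if one is missing.
        special_tokens_dict["pad_token"] = default_pad_token
        tokenizer.pad_token = default_pad_token
    return special_tokens_dict
```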

Discussion Slack Thread

Related issue number

#1336

How to verify the PR

Padding-free tuning of granite-13b-base-v2 (tokenizer: GPTNeoXTokenizerFast). PVC path: ibm-granite-pvc/granite-13b-base-v2/step_300000_ckpt

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass


github-actions bot commented Oct 8, 2024

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.


@Abhishek-TAMU changed the title from "Refactor: Removal of pad_token when padding_free is set" to "refactor: Removal of pad_token when padding_free is set" on Oct 8, 2024
@Ssukriti (Collaborator) left a comment


I think it's worth adding a really simple unit test.

In the test you can first initialize tokenizer.pad_token = None.

At the end of training, since train() now returns num_added_tokens, you can check that it is 0 if padding free and 1 if not padding free,

or you can check the tokenizer artifacts after training (a rough test sketch follows below).
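
A rough sketch of what such a test could look like, assuming train() returns num_added_tokens and accepts a padding_free flag as discussed in this thread; the tokenizer-loading helper and argument names are illustrative, not the repository's actual test API:

```python
# Illustrative test sketch; helper names and the train() signature are assumptions
# based on this discussion, not the repository's actual interfaces.
def test_no_pad_token_added_when_padding_free(tmp_path):
    tokenizer = load_test_tokenizer()  # hypothetical helper returning a tokenizer
    tokenizer.pad_token = None         # start without a pad token

    num_added_tokens = train(
        tokenizer=tokenizer,
        output_dir=str(tmp_path),
        padding_free=True,
    )

    # With padding free, no pad token should be added and the tokenizer
    # artifacts should be unchanged.
    assert num_added_tokens == 0
    assert tokenizer.pad_token is None


def test_pad_token_added_without_padding_free(tmp_path):
    tokenizer = load_test_tokenizer()
    tokenizer.pad_token = None

    num_added_tokens = train(
        tokenizer=tokenizer,
        output_dir=str(tmp_path),
        padding_free=False,
    )

    # Without padding free, exactly one pad token should be added.
    assert num_added_tokens == 1
    assert tokenizer.pad_token is not None
```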

```diff
@@ -288,7 +287,9 @@ def train(
                 "PAD token set to default, to make it different from eos token"
             )
             if tokenizer.eos_token != configs.DEFAULT_PAD_TOKEN:
-                tokenizer.pad_token = configs.DEFAULT_PAD_TOKEN
+                tokenizer.pad_token = (
+                    configs.DEFAULT_PAD_TOKEN if not padding_free else None
+                )
```
Suggested change
configs.DEFAULT_PAD_TOKEN if not padding_free else None
configs.DEFAULT_PAD_TOKEN if not padding_free else None

I don't think we need the else condition to explicitly reset it to None. If padding_free, we'll leave the pad token as it was; we just won't reset it to a unique token. Basically, we don't want to add a new pad token or change an existing pad token when padding_free is set.
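
One way to read this suggestion (a hedged illustration only, not the exact suggested code) is to guard the assignment instead of resetting pad_token to None in an else branch, using the same names as the diff above:

```python
# Illustration of the reviewer's point, not the actual suggested change:
# only assign a pad token when padding is used; otherwise leave it untouched.
if not padding_free and tokenizer.eos_token != configs.DEFAULT_PAD_TOKEN:
    tokenizer.pad_token = configs.DEFAULT_PAD_TOKEN
# If padding_free, tokenizer.pad_token keeps whatever value it already had.
```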

@Ssukriti (Collaborator)

@fabianlim does this change look good to you functionality-wise?

@fabianlim (Collaborator)

@Ssukriti personally I'm not sure it's a good idea to remove the padding token. Padding tokens have been so entrenched in HF legacy code that it is hard to predict what could go wrong.

For example, I could use padding free but provide a chat_template that is written with a pad_token. Removing it from the tokenizer could produce unexpected results.

What is the strong motivation for simply removing the pad token from the tokenizer, even if padding free is enabled? If there is no performance impact, why take unnecessary risk?

@fabianlim (Collaborator) left a comment

I left a comment, please take a look. I'm not sure what the motivation behind this change is.

@Ssukriti (Collaborator) commented Oct 10, 2024

Thank you @fabianlim.

  1. We aren't removing the pad token if it is already set in the tokenizer and padding free is used. We just won't additionally add it if it is not set in the tokenizer and 'padding free' is passed, as we want to avoid adding new tokens that tuning does not need. We couldn't come up with a use case for adding a pad token when it is not set (with padding free):
    if tokenizer.pad_token == None, we won't add it and will leave it as None.
    if tokenizer.pad_token != None, it will remain set and we will not modify it.

The reason being, if we don't add tokens, the additional post-processing step needed for inference on vLLM can be skipped. This will simplify the pipeline a bit.

Not a dealbreaker though, as the post-processing code is well tested. If you feel we will add templates in the future that will need it, we can wait to merge this PR until it is clearer.

@kmehant (Collaborator) commented Oct 10, 2024

I would second what @fabianlim said. If the post-processing step is already there and is not a huge overhead, it is better not to remove the pad token case by case and to keep it there for all use cases, since removing it might only complicate future feature additions like chat templates and would need rethinking for every use case we need to support.
