Add tokenizer config #189

maxreciprocate · 2023-01-13T23:18:45Z

This PR closes #177 (Add tokenizer truncation side option).

https://wandb.ai/sorry/trlx/reports/Difference-truncation_side-left-right--VmlldzozMzMwNTk5
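
For readers unfamiliar with the option, here is a minimal sketch of what a left versus right truncation side means for an over-long sequence. The `truncate` helper is hypothetical, written only for illustration. It is not code from this PR; it mirrors the usual tokenizer semantics, where "right" drops tokens from the end and "left" drops them from the beginning.

```python
from typing import List


def truncate(token_ids: List[int], max_length: int, truncation_side: str = "right") -> List[int]:
    """Hypothetical helper: keep at most max_length tokens, dropping from one side."""
    if truncation_side not in ("left", "right"):
        raise ValueError(f"truncation_side must be 'left' or 'right', got {truncation_side!r}")
    if len(token_ids) <= max_length:
        # Nothing to drop; return a copy so the caller's list is not aliased.
        return list(token_ids)
    if truncation_side == "right":
        # Keep the beginning of the sequence, drop the tail.
        return list(token_ids[:max_length])
    # Keep the end of the sequence, drop the head.
    return list(token_ids[len(token_ids) - max_length:])
```

With a prompt followed by a continuation, left truncation keeps the most recent tokens, while right truncation keeps the start of the prompt.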

jon-tow

Looks good to me! Great work as usual 🫡

jon-tow · 2023-01-14T00:06:38Z

trlx/data/configs.py

+    :type truncation_side: str
+    """
+
+    tokenizer_path: str


(Thinking out loud on usability) I feel like we could make this optional and default to using the model_path to load the tokenizer via AutoTokenizer.from_pretrained in the base trainer.

Also, the tokenizer_ prefix seems redundant, since the field already lives in a TokenizerConfig.
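
A rough sketch of the fallback suggested above, under stated assumptions: `tokenizer_path` is made optional (the diff declares it as a required `str`), the "right" default for `truncation_side` is assumed, and `resolve_tokenizer_path` is a hypothetical helper name, not something this PR defines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenizerConfig:
    # Assumption: optional here; the diff under review declares `tokenizer_path: str`.
    tokenizer_path: Optional[str] = None
    # Assumption: "right" as the default truncation side.
    truncation_side: str = "right"


def resolve_tokenizer_path(config: TokenizerConfig, model_path: str) -> str:
    """Hypothetical helper: fall back to the model path when no tokenizer path is set."""
    if config.tokenizer_path is not None:
        return config.tokenizer_path
    return model_path
```

The base trainer could then pass the resolved path to AutoTokenizer.from_pretrained, so configs that use the model's own tokenizer would not need to repeat the path.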

maxreciprocate added 4 commits January 14, 2023 01:13

feat(configs): add tokenizer options

9d8c101

chore(base_trainer): 1gpus -> 1gpu name correction

ee8747e

fix(tests): update config reference

227cacd

fix(base_trainer): string cast num_processes

7f02ee1

jon-tow approved these changes Jan 14, 2023

jon-tow merged commit c25f598 into main Jan 14, 2023


Mistobaan Jan 16, 2023
