
fix: use apply_chat_template to find turn boundaries and allow tool_calling field #2179

Merged: @winglian merged 17 commits into main from fix/chat_template_tokenizer_mask on Dec 17, 2024

Conversation

@NanoCode012 (Collaborator) commented Dec 12, 2024

Description

Our prior turn-matching code assumed that content encodes to the same token ids whether or not special tokens precede it (which holds for, e.g., the Gemma tokenizer).

However, this is not the case with the Mistral tokenizer.

mistral:

  • Frame all your answers -> 18392, 1312, 1342, 11962, 1158..
  • <|im_start|>system\nFrame all your answers -> 32768, 7342, 781, 4890, 1312, 1342, 11962..

Note that the first content token differs (18392 standalone vs 4890 after the prefix), so the bare content's ids never appear as a contiguous slice of the prefixed ids, and substring matching on token ids fails.
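To see the failure mode in code, here is a minimal sketch (the model id is an assumption; any Mistral tokenizer exhibits the behavior):

```python
from transformers import AutoTokenizer

# Assumption: the v0.2 instruct tokenizer; any Mistral tokenizer shows the effect.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

content = "Frame all your answers"

# Token ids for the bare content...
bare_ids = tok.encode(content, add_special_tokens=False)

# ...and for the same content with a template prefix in front of it.
prefixed_ids = tok.encode("<|im_start|>system\n" + content, add_special_tokens=False)

# With a tokenizer like Gemma's, bare_ids appears as a contiguous slice of
# prefixed_ids; with Mistral's it does not, so searching for a turn's token
# ids inside the full prompt's token ids fails.
print(bare_ids)
print(prefixed_ids)
```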

Now, we use a dummy message with apply_chat_template to figure out where each turn's content begins and ends. This also reduces our dependence on content always being present, which it is not in some tool_calling cases.
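A minimal sketch of the idea (not the PR's actual implementation; the sentinel string and helper name are invented): template the conversation twice, once as-is and once with the target turn's content replaced by a sentinel, then diff the two token sequences to locate the turn's content.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

SENTINEL = "[[dummy]]"  # hypothetical; any string unlikely to occur in real data works


def content_token_span(messages, turn_idx):
    """Return (start, end) indices of messages[turn_idx]['content'] within
    the token ids of the fully templated conversation."""
    real = tok.apply_chat_template(messages, tokenize=True)

    swapped = [dict(m) for m in messages]
    swapped[turn_idx]["content"] = SENTINEL
    dummy = tok.apply_chat_template(swapped, tokenize=True)

    # The two tokenizations agree up to the start of the turn's content...
    start = 0
    while start < min(len(real), len(dummy)) and real[start] == dummy[start]:
        start += 1

    # ...and agree again after it, so count the matching suffix.
    tail = 0
    while (
        tail < min(len(real), len(dummy)) - start
        and real[len(real) - 1 - tail] == dummy[len(dummy) - 1 - tail]
    ):
        tail += 1

    return start, len(real) - tail  # content is real[start : len(real) - tail]
```

Everything outside the returned span is template scaffolding for that turn, which is exactly what the masking logic needs to distinguish.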

This PR also supersedes #2115 and should work with tool_calling datasets. One caveat: it initially could not detect the tool turn's EOT, which was acceptable since we don't want to train on tool outputs, but this has since been fixed.

TODO:

  • Add a test for mistral
  • Add a test for tool_calling
  • Fix tool_calling EOT handling
  • Discussion: set train_on_eos: turn by default

Breaking changes:

  • field_messages now defaults to messages. The docs already said so, but the code didn't reflect it.
  • train_on_eos now defaults to turn.
  • roles_to_train now defaults to ["assistant"].
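For illustration, a dataset entry that relies on the new defaults might look like the following (the path and chat_template value are placeholders, not part of this PR):

```yaml
chat_template: chatml
datasets:
  - path: ./data/conversations.jsonl  # placeholder path
    type: chat_template
    # The lines below now match the defaults and can be omitted:
    field_messages: messages
    roles_to_train: ["assistant"]
    train_on_eos: turn
```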

Motivation and Context

How has this been tested?

Ran with these chat templates:

  • alpaca ✅ (if the system turn is skipped)
  • mistral_v1 ✅ (if the exception check is removed)
  • mistral_v2v3 ✅ (if the exception check is removed)
  • chatml ✅
  • gemma ✅ (if the exception check is removed)
  • cohere ✅
  • llama3 ✅
  • llama3_2_vision ✅

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

@NanoCode012 (Collaborator, Author) commented:

This issue is reproducible via a unit test.

Before the change, the test fails:

[screenshot: failing test output]

After the change, the test passes.

@NanoCode012 (Collaborator, Author) commented:

Updated the unit test to check across the various tokenizers.

There are still cases where Mistral fails (due to the test itself, not the source). I will rewrite the test to replace its calls to tokenizer.encode with tokenizer.apply_chat_template so that it conforms to our source as well.
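A rough sketch of that direction (the test name, model list, and structure here are hypothetical): derive the expected token span from apply_chat_template itself, so the test makes the same templating assumptions as the source.

```python
import pytest
from transformers import AutoTokenizer


# Hypothetical test sketch; the parametrized model list is a placeholder.
@pytest.mark.parametrize(
    "model_id",
    [
        "mistralai/Mistral-7B-Instruct-v0.2",
        "NousResearch/Meta-Llama-3-8B-Instruct",
    ],
)
def test_turn_content_located_via_chat_template(model_id):
    tok = AutoTokenizer.from_pretrained(model_id)
    answer = "Frame all your answers as riddles."
    convo = [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": answer},
    ]
    real = tok.apply_chat_template(convo, tokenize=True)
    swapped = [convo[0], {"role": "assistant", "content": "[[dummy]]"}]
    dummy = tok.apply_chat_template(swapped, tokenize=True)

    # First index where the two tokenizations diverge marks the content start.
    start = next(i for i, (a, b) in enumerate(zip(real, dummy)) if a != b)
    # The length of the matching suffix marks the content end.
    tail = 0
    while (
        tail < len(real) - start
        and tail < len(dummy) - start
        and real[len(real) - 1 - tail] == dummy[len(dummy) - 1 - tail]
    ):
        tail += 1

    # The located span must decode back to (at least contain) the content,
    # regardless of how the tokenizer merges surrounding template tokens.
    assert answer in tok.decode(real[start : len(real) - tail])
```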

@winglian force-pushed the fix/chat_template_tokenizer_mask branch from a63f47f to 0c9237f on December 17, 2024.
@winglian merged commit 10cfecf into main on Dec 17, 2024, with 10 of 11 checks passed, and deleted the fix/chat_template_tokenizer_mask branch.