
fix: use apply_chat_template to find turn boundaries and allow tool_calling field #2179

Merged: @winglian merged 17 commits into main from fix/chat_template_tokenizer_mask on Dec 17, 2024

Conversation

@NanoCode012 (Collaborator) commented Dec 12, 2024

Description

Our prior turn-matching code assumed that content encodes to the same token ids whether or not special tokens precede it (which holds for, e.g., the Gemma tokenizer).

However, this is not the case with the Mistral tokenizer.

mistral:

  • Frame all your answers -> 18392, 1312, 1342, 11962, 1158..
  • <|im_start|>system\nFrame all your answers -> 32768, 7342, 781, 4890, 1312, 1342, 11962..

Note that the first content token differs (18392 standalone vs 4890 after the prefix), so the bare content's ids never appear as a contiguous slice of the prefixed ids, and substring matching on token ids fails.
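To see the failure mode in code, here is a minimal sketch (the model id is an assumption; any Mistral tokenizer exhibits the behavior):

```python
from transformers import AutoTokenizer

# Assumption: the v0.2 instruct tokenizer; any Mistral tokenizer shows the effect.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

content = "Frame all your answers"

# Token ids for the bare content...
bare_ids = tok.encode(content, add_special_tokens=False)

# ...and for the same content with a template prefix in front of it.
prefixed_ids = tok.encode("<|im_start|>system\n" + content, add_special_tokens=False)

# With a tokenizer like Gemma's, bare_ids appears as a contiguous slice of
# prefixed_ids; with Mistral's it does not, so searching for a turn's token
# ids inside the full prompt's token ids fails.
print(bare_ids)
print(prefixed_ids)
```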

Now, we use a dummy message with apply_chat_template to figure out where each turn's content begins and ends. This also reduces our dependence on content always being present, which it is not in some tool_calling cases.
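A minimal sketch of the idea (not the PR's actual implementation; the sentinel string and helper name are invented): template the conversation twice, once as-is and once with the target turn's content replaced by a sentinel, then diff the two token sequences to locate the turn's content.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

SENTINEL = "[[dummy]]"  # hypothetical; any string unlikely to occur in real data works


def content_token_span(messages, turn_idx):
    """Return (start, end) indices of messages[turn_idx]['content'] within
    the token ids of the fully templated conversation."""
    real = tok.apply_chat_template(messages, tokenize=True)

    swapped = [dict(m) for m in messages]
    swapped[turn_idx]["content"] = SENTINEL
    dummy = tok.apply_chat_template(swapped, tokenize=True)

    # The two tokenizations agree up to the start of the turn's content...
    start = 0
    while start < min(len(real), len(dummy)) and real[start] == dummy[start]:
        start += 1

    # ...and agree again after it, so count the matching suffix.
    tail = 0
    while (
        tail < min(len(real), len(dummy)) - start
        and real[len(real) - 1 - tail] == dummy[len(dummy) - 1 - tail]
    ):
        tail += 1

    return start, len(real) - tail  # content is real[start : len(real) - tail]
```

Everything outside the returned span is template scaffolding for that turn, which is exactly what the masking logic needs to distinguish.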

This PR also supersedes #2115 and should work with tool_calling datasets. One caveat: it initially could not detect the tool turn's EOT, which was acceptable since we don't want to train on tool outputs, but this has since been fixed.

TODO:

  • Add a test for mistral
  • Add a test for tool_calling
  • Fix tool_calling EOT handling
  • Discussion: set train_on_eos: turn by default

Breaking changes:

  • field_messages now defaults to messages. The docs already said so, but the code didn't reflect it.
  • train_on_eos now defaults to turn.
  • roles_to_train now defaults to ["assistant"].
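For illustration, a dataset entry that relies on the new defaults might look like the following (the path and chat_template value are placeholders, not part of this PR):

```yaml
chat_template: chatml
datasets:
  - path: ./data/conversations.jsonl  # placeholder path
    type: chat_template
    # The lines below now match the defaults and can be omitted:
    field_messages: messages
    roles_to_train: ["assistant"]
    train_on_eos: turn
```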

Motivation and Context

How has this been tested?

Ran with these chat templates:

  • alpaca ✅ (if the system turn is skipped)
  • mistral_v1 ✅ (if the exception check is removed)
  • mistral_v2v3 ✅ (if the exception check is removed)
  • chatml ✅
  • gemma ✅ (if the exception check is removed)
  • cohere ✅
  • llama3 ✅
  • llama3_2_vision ✅

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

@NanoCode012 (Collaborator, Author) commented:

This issue is reproducible via a unit test.

Before the change, the test fails:

[screenshot: failing test output]

After the change, the test passes.

@NanoCode012 (Collaborator, Author) commented:

Updated the unit test to check across the various tokenizers.

There are still cases where Mistral fails (due to the test itself, not the source). I will rewrite the test to replace its calls to tokenizer.encode with tokenizer.apply_chat_template so that it conforms to our source as well.
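A rough sketch of that direction (the test name, model list, and structure here are hypothetical): derive the expected token span from apply_chat_template itself, so the test makes the same templating assumptions as the source.

```python
import pytest
from transformers import AutoTokenizer


# Hypothetical test sketch; the parametrized model list is a placeholder.
@pytest.mark.parametrize(
    "model_id",
    [
        "mistralai/Mistral-7B-Instruct-v0.2",
        "NousResearch/Meta-Llama-3-8B-Instruct",
    ],
)
def test_turn_content_located_via_chat_template(model_id):
    tok = AutoTokenizer.from_pretrained(model_id)
    answer = "Frame all your answers as riddles."
    convo = [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": answer},
    ]
    real = tok.apply_chat_template(convo, tokenize=True)
    swapped = [convo[0], {"role": "assistant", "content": "[[dummy]]"}]
    dummy = tok.apply_chat_template(swapped, tokenize=True)

    # First index where the two tokenizations diverge marks the content start.
    start = next(i for i, (a, b) in enumerate(zip(real, dummy)) if a != b)
    # The length of the matching suffix marks the content end.
    tail = 0
    while (
        tail < len(real) - start
        and tail < len(dummy) - start
        and real[len(real) - 1 - tail] == dummy[len(dummy) - 1 - tail]
    ):
        tail += 1

    # The located span must decode back to (at least contain) the content,
    # regardless of how the tokenizer merges surrounding template tokens.
    assert answer in tok.decode(real[start : len(real) - tail])
```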

@winglian force-pushed the fix/chat_template_tokenizer_mask branch from a63f47f to 0c9237f on December 17, 2024.
@winglian merged commit 10cfecf into main on Dec 17, 2024, with 10 of 11 checks passed, and deleted the fix/chat_template_tokenizer_mask branch.