Fix Tokenization + misc fixes #354
Conversation
hynky1999 commented on Oct 9, 2024
- Fixes the info `metric_aggregated` default constructor. This was causing an error when one used multiple different metrics for one task (e.g. PMI and token norm).
- Fixes pair tokenization:
  - With pairwise tokenization, the continuation no longer has special tokens added, and the logic is simplified.
  - Non-pairwise tokenization now uses the last context token as the start of the continuation in case that token gets merged.
- Adds pairwise tokenization to VLLM.
- Changes multiprocessing to multiprocess, which allows lambda functions and partial applications in task configs (previously this would fail when the number of dataset-loading processes was > 1 and multiprocessing was used).
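The pairwise fix described above can be sketched with a toy tokenizer (the names and the whitespace tokenizer here are illustrative assumptions, not lighteval's actual API): encode context and continuation together, split at the context boundary, and fall back to the last token when the boundary token merges.

```python
def toy_encode(text, add_special_tokens=False):
    """Stand-in tokenizer: whitespace split, optional BOS marker.
    A real subword tokenizer can merge tokens across the
    context/continuation boundary, which is the bug being fixed."""
    tokens = text.split()
    if add_special_tokens:
        tokens = ["<bos>"] + tokens
    return tokens

def pairwise_tokenize(context, continuation):
    # Encode the full sequence once, so special tokens appear only
    # at the very start, never inside the continuation.
    whole = toy_encode(context + continuation, add_special_tokens=True)
    # Encode the context alone to locate the split point.
    ctx = toy_encode(context, add_special_tokens=True)
    # Continuation = whatever the joint encoding produced past the context.
    cont = whole[len(ctx):]
    if not cont:
        # Boundary token merged into the context encoding: treat the
        # last token of the joint encoding as the continuation.
        ctx, cont = whole[:-1], whole[-1:]
    return ctx, cont
```

With this split, tokens that span the boundary are counted exactly once, and the continuation never carries its own BOS token.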
…r non-pairwise, add pairwise to vllm, use multiprocess for dataset loading
src/lighteval/models/model_config.py
Outdated
@@ -226,6 +226,7 @@ class VLLMModelConfig:
     multichoice_continuations_start_space: bool = (
         True  # whether to add a space at the start of each continuation in multichoice generation
     )
+    pair_wise_tokenization: bool = False
`pairwise` is one single word -> `pairwise_tokenization`
Renamed
Looking nicer, but I think we need to double-check that it will work.
self.tok_encode(context, add_special_tokens=self.add_special_tokens),
self.tok_encode(continuation, add_special_tokens=self.add_special_tokens),
Mhh, why do we want the special tokens for the continuation too?
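To illustrate the concern (a toy sketch with an assumed whitespace tokenizer, not lighteval's code): if the continuation is encoded with special tokens too, a stray BOS marker lands in the middle of the sequence the model actually scores.

```python
def toy_encode(text, add_special_tokens=False):
    """Stand-in tokenizer: whitespace split, optional BOS marker."""
    tokens = text.split()
    if add_special_tokens:
        tokens = ["<bos>"] + tokens
    return tokens

ctx = toy_encode("The sky is", add_special_tokens=True)

# Encoding the continuation with special tokens injects <bos> mid-sequence.
bad_cont = toy_encode("blue", add_special_tokens=True)
assert ctx + bad_cont == ["<bos>", "The", "sky", "is", "<bos>", "blue"]

# Without special tokens the concatenation is well-formed.
good_cont = toy_encode("blue", add_special_tokens=False)
assert ctx + good_cont == ["<bos>", "The", "sky", "is", "blue"]
```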
Thanks for the catch! I rewrote it slightly based on the discussion on Slack.