
Fix Tokenization + misc fixes #354

Merged
merged 5 commits into main on Oct 10, 2024

Conversation

hynky1999 (Collaborator)

  • Fixes the info metric_aggregated default constructor. This was causing an error when multiple different metrics were used for one task (e.g. PMI and token norm).
  • Fixes pairwise tokenization (see the sketch after this list):
  1. With pairwise tokenization, special tokens are no longer added to the continuation, and the logic is simplified.
  2. Non-pairwise tokenization now uses the last context token as the start of the continuation in case that token gets merged during tokenization.
  • Adds pairwise tokenization to VLLM.
  • Changes multiprocessing to multiprocess, which allows lambda functions and partial applications in task configs (see the example below; previously this would fail when more than one dataset-loading process was used).
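
A minimal sketch of the two tokenization strategies, assuming a HuggingFace tokenizer; the function names are illustrative, not lighteval's actual API:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_pairwise(context: str, continuation: str):
    # Tokenize the two parts separately; special tokens go on the context only.
    context_ids = tokenizer.encode(context, add_special_tokens=True)
    continuation_ids = tokenizer.encode(continuation, add_special_tokens=False)
    return context_ids, continuation_ids

def tokenize_non_pairwise(context: str, continuation: str):
    # Tokenize the concatenation, then split it at the context boundary.
    context_ids = tokenizer.encode(context, add_special_tokens=True)
    whole_ids = tokenizer.encode(context + continuation, add_special_tokens=True)
    split = len(context_ids)
    if whole_ids[:split] != context_ids:
        # The boundary tokens merged, so the joint encoding no longer starts
        # with the context's tokens: count the last context token as the
        # start of the continuation instead.
        split -= 1
    return whole_ids[:split], whole_ids[split:]

And a small demonstration of why multiprocessing was swapped for multiprocess: the stdlib module serializes worker functions with pickle, which rejects lambdas, while multiprocess uses dill and accepts them.

import multiprocess as mp  # drop-in replacement for multiprocessing, pickles with dill

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        # This line raises a PicklingError under stdlib multiprocessing,
        # but works here because dill can serialize lambdas.
        print(pool.map(lambda x: x * 2, [1, 2, 3]))  # [2, 4, 6]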

…r non-pairwise, add pairwise to vllm, use multiprocess for dataset loading
@hynky1999 hynky1999 changed the title from "Fix pair_wise tokenization" to "Fix Tokenization + misc fixes" on Oct 9, 2024
@hynky1999 hynky1999 requested a review from NathanHB October 9, 2024 20:15
@@ -226,6 +226,7 @@ class VLLMModelConfig:
    multichoice_continuations_start_space: bool = (
        True  # whether to add a space at the start of each continuation in multichoice generation
    )
    pair_wise_tokenization: bool = False
Member

pairwise is one single word -> pairwise_tokenization

Collaborator Author

Renamed

@clefourrier clefourrier (Member) left a comment

Looking nicer but I think we need to double check that it will work

Comment on lines 187 to 188
self.tok_encode(context, add_special_tokens=self.add_special_tokens),
self.tok_encode(continuation, add_special_tokens=self.add_special_tokens),
Member

mhh why do we want the special tokens for the continuation too?

Collaborator Author

Thanks for the catch, I rewrote it slightly based on the discussion on Slack
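
A plausible shape of the rewrite (an assumption, since the Slack discussion isn't shown here): special tokens stay configurable for the context, but are never added to the continuation, which is a mid-sequence span.

self.tok_encode(context, add_special_tokens=self.add_special_tokens),
self.tok_encode(continuation, add_special_tokens=False),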

@hynky1999 hynky1999 requested a review from NathanHB October 10, 2024 12:42
@hynky1999 hynky1999 merged commit 1dfd77d into main Oct 10, 2024
2 checks passed