
Fix Tokenization + misc fixes #354

Merged
merged 5 commits into main on Oct 10, 2024

Conversation

hynky1999 (Collaborator)

  • Fixes the info metric_aggregated default constructor. This was causing an error when multiple different metrics were used for one task (e.g. PMI and token norm).
  • Fixes pairwise tokenization (see the sketch after this list):
  1. With pairwise tokenization, special tokens are no longer added to the continuation, and the logic is simplified.
  2. Non-pairwise tokenization now uses the last context token as the start of the continuation in case that token gets merged during tokenization.
  • Adds pairwise tokenization to VLLM.
  • Changes multiprocessing to multiprocess, which allows lambda functions and partial applications in task configs (see the example below; previously this would fail when more than one dataset-loading process was used).
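
A minimal sketch of the two tokenization strategies, assuming a HuggingFace tokenizer; the function names are illustrative, not lighteval's actual API:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_pairwise(context: str, continuation: str):
    # Tokenize the two parts separately; special tokens go on the context only.
    context_ids = tokenizer.encode(context, add_special_tokens=True)
    continuation_ids = tokenizer.encode(continuation, add_special_tokens=False)
    return context_ids, continuation_ids

def tokenize_non_pairwise(context: str, continuation: str):
    # Tokenize the concatenation, then split it at the context boundary.
    context_ids = tokenizer.encode(context, add_special_tokens=True)
    whole_ids = tokenizer.encode(context + continuation, add_special_tokens=True)
    split = len(context_ids)
    if whole_ids[:split] != context_ids:
        # The boundary tokens merged, so the joint encoding no longer starts
        # with the context's tokens: count the last context token as the
        # start of the continuation instead.
        split -= 1
    return whole_ids[:split], whole_ids[split:]

And a small demonstration of why multiprocessing was swapped for multiprocess: the stdlib module serializes worker functions with pickle, which rejects lambdas, while multiprocess uses dill and accepts them.

import multiprocess as mp  # drop-in replacement for multiprocessing, pickles with dill

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        # This line raises a PicklingError under stdlib multiprocessing,
        # but works here because dill can serialize lambdas.
        print(pool.map(lambda x: x * 2, [1, 2, 3]))  # [2, 4, 6]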

…r non-pairwise, add pairwise to vllm, use multiprocess for dataset loading
@hynky1999 hynky1999 changed the title from "Fix pair_wise tokenization" to "Fix Tokenization + misc fixes" on Oct 9, 2024
@hynky1999 hynky1999 requested a review from NathanHB October 9, 2024 20:15
@@ -226,6 +226,7 @@ class VLLMModelConfig:
    multichoice_continuations_start_space: bool = (
        True  # whether to add a space at the start of each continuation in multichoice generation
    )
    pair_wise_tokenization: bool = False
Member

pairwise is one single word -> pairwise_tokenization

Collaborator Author

Renamed

@clefourrier clefourrier (Member) left a comment

Looking nicer but I think we need to double check that it will work

Comment on lines 187 to 188
self.tok_encode(context, add_special_tokens=self.add_special_tokens),
self.tok_encode(continuation, add_special_tokens=self.add_special_tokens),
Member

mhh why do we want the special tokens for the continuation too?

Collaborator Author

Thanks for the catch, I rewrote it slightly based on the discussion on Slack
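
A plausible shape of the rewrite (an assumption, since the Slack discussion isn't shown here): special tokens stay configurable for the context, but are never added to the continuation, which is a mid-sequence span.

self.tok_encode(context, add_special_tokens=self.add_special_tokens),
self.tok_encode(continuation, add_special_tokens=False),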

@hynky1999 hynky1999 requested a review from NathanHB October 10, 2024 12:42
@hynky1999 hynky1999 merged commit 1dfd77d into main Oct 10, 2024
2 checks passed