With the change in PR #5207, tested on Qwen1.5-0.5B with the AdvertiseGen dataset, generation time decreased to 47 seconds, which is faster than the solution in #5206.
Your current environment
Same as issue #5206.
🐛 Describe the bug
The root cause of the issue reported in #5206 is that the tokenizer setter of `LLM` overrides the cached tokenizer initialized by `llm_engine`.
Thus, each time `len(tokenizer)` is called, the raw `__len__` is invoked rather than the `CachedTokenizer`'s. To fix this issue, the `get_cached_tokenizer` adapter should be applied in the tokenizer setter of `LLM`.