Fast tokenizer #32

Open
paulcx opened this issue Jun 19, 2024 · 2 comments
Labels
bug Something isn't working

Comments


paulcx commented Jun 19, 2024

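The current tokenizers differ from the earlier ones (ids 3-13 are missing from the vocab, and many added_tokens have been introduced). Is there a particular reason for this?

For example: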
https://huggingface.co/01-ai/Yi-1.5-34B-Chat/blob/main/tokenizer.json
https://huggingface.co/01-ai/Yi-1.5-34B-32K/blob/main/tokenizer.json

Would it be possible to add the missing tokens back to the vocab?
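
For anyone who wants to reproduce the observation, here is a minimal sketch of how the gap could be checked on a downloaded tokenizer.json. It assumes the usual Hugging Face `tokenizers` JSON layout for a BPE model (a top-level `added_tokens` list with `id` and `content` fields, and a `model.vocab` mapping of token to id). The helper names `vocab_id_report` and `load_report` are hypothetical, and the toy data below is made up purely for illustration; it is not taken from the Yi files.

```python
import json


def vocab_id_report(tokenizer_json: dict) -> dict:
    """Report which ids are absent from model.vocab and whether added_tokens covers them.

    Assumes a BPE-style layout where model.vocab maps token string -> id.
    """
    vocab = tokenizer_json["model"]["vocab"]
    added = {t["id"]: t["content"] for t in tokenizer_json.get("added_tokens", [])}
    vocab_ids = set(vocab.values())
    max_id = max(vocab_ids | set(added))
    # Every id in [0, max_id] that has no entry in model.vocab.
    missing = [i for i in range(max_id + 1) if i not in vocab_ids]
    return {
        "missing_from_vocab": missing,
        "covered_by_added_tokens": {i: added[i] for i in missing if i in added},
        "unassigned": [i for i in missing if i not in added],
    }


def load_report(path: str) -> dict:
    """Hypothetical convenience wrapper: read a local tokenizer.json and report on it."""
    with open(path, encoding="utf-8") as f:
        return vocab_id_report(json.load(f))


# Toy data (illustrative only): id 3 lives only in added_tokens, id 4 is nowhere.
toy = {
    "added_tokens": [
        {"id": 0, "content": "<unk>"},
        {"id": 3, "content": "<|x|>"},
    ],
    "model": {"vocab": {"<unk>": 0, "a": 1, "b": 2, "c": 5}},
}
report = vocab_id_report(toy)
print(report)
```

Running `load_report` on both the old and the new tokenizer.json should show whether ids 3-13 are really gone or have only moved into `added_tokens`.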


nuoma commented Jul 2, 2024

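Hello. We found some problems with the fast tokenizer. For example, the 32K base model could not output spaces, whereas the slow tokenizer did not have this problem, so we updated tokenizer.json.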


paulcx commented Jul 2, 2024

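> Hello. We found some problems with the fast tokenizer. For example, the 32K base model could not output spaces, whereas the slow tokenizer did not have this problem, so we updated tokenizer.json.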

Could you give an example? In my tests, both the fast and the slow tokenizer output the space (token_id) correctly.
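
A possible shape for such a test, as a rough sketch rather than the maintainers' actual reproduction: encode the same strings with both tokenizers, decode again, and compare. The function `compare_tokenizers` and the two stub classes are hypothetical and exist only so the snippet runs without downloading a model. With `transformers`, the two real objects would presumably come from `AutoTokenizer.from_pretrained(..., use_fast=False)` and `use_fast=True`.

```python
def compare_tokenizers(slow, fast, samples):
    """Encode and decode each sample with both tokenizers and record any differences.

    `slow` and `fast` only need `encode(text) -> list[int]` and `decode(ids) -> str`.
    """
    rows = []
    for text in samples:
        slow_ids = slow.encode(text)
        fast_ids = fast.encode(text)
        slow_text = slow.decode(slow_ids)
        fast_text = fast.decode(fast_ids)
        rows.append({
            "text": text,
            "same_ids": slow_ids == fast_ids,
            "slow_roundtrip": slow_text,
            "fast_roundtrip": fast_text,
            "roundtrip_match": slow_text == fast_text,
        })
    return rows


class CharTokenizer:
    """Stub tokenizer (illustrative only): one id per character."""

    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)


class SpaceDroppingTokenizer(CharTokenizer):
    """Stub that mimics the reported symptom by losing spaces on decode."""

    def decode(self, ids):
        return "".join(chr(i) for i in ids if i != ord(" "))


rows = compare_tokenizers(CharTokenizer(), SpaceDroppingTokenizer(), ["a b", "ab"])
for row in rows:
    print(row)
```

With real tokenizers, special tokens such as BOS may appear in the decoded text, so it may be cleaner to encode with `add_special_tokens=False` or decode with `skip_special_tokens=True` before comparing.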

Haijian06 added the bug (Something isn't working) label on Aug 2, 2024