Tokenizer issues #36

Open
paulcx opened this issue Jul 17, 2024 · 2 comments
Labels
question Further information is requested

Comments

paulcx commented Jul 17, 2024

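  1. We know the vocabulary size of Yi-34B (including Yi-1.5) is 64000, so why does the tokenizer contain 3 extra tokens, for an actual size of 64003?
  2. Yi-1.5 uses the new ChatML format as its chat template, which includes the assistant role, but the vocabulary has no token for assistant (it does have one for user), so it gets split into two tokens (ass + istant).

Other problems, such as use_fast producing different output and add bos being enabled by default in the tokenizer config, have also been reported in other issues.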
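Both observations can be checked directly on a loaded tokenizer. The following is a minimal sketch, not the maintainers' method: `vocab_report` is a hypothetical helper name, and it assumes the object passed in is a Hugging Face Transformers tokenizer, whose `vocab_size` covers only the base vocabulary while `len(tokenizer)` also counts added tokens. Whether that explains the 64000 vs. 64003 difference for Yi is not confirmed in this thread.

```python
def vocab_report(tokenizer, probes=("user", "assistant")):
    """Summarize vocabulary sizes and how each probe string is split.

    `tokenizer` is assumed to expose the Transformers tokenizer interface:
    `vocab_size`, `__len__`, `get_added_vocab()` and `tokenize()`.
    """
    added = tokenizer.get_added_vocab()  # mapping: added token -> id
    return {
        "base_vocab_size": tokenizer.vocab_size,  # base vocabulary only
        "total_size": len(tokenizer),             # base vocabulary plus added tokens
        "added_tokens": sorted(added, key=added.get),  # ordered by token id
        "pieces": {p: tokenizer.tokenize(p) for p in probes},
    }
```

With Transformers installed, this could be called as `vocab_report(AutoTokenizer.from_pretrained(model_path))`, where `model_path` is a placeholder for a local Yi checkpoint directory. A role name that appears under `pieces` with more than one entry is not a single token in the vocabulary.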

Haijian06 added the question (Further information is requested) label Aug 2, 2024

Haijian06 (Collaborator) commented

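@paulcx Thanks for the question. This problem currently occurs when running inference with Transformers. For now, you can try the stable release Transformers 4.42.4, then download tokenizer_config.json and change add_prefix_space to false.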
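The config change suggested above can be scripted rather than edited by hand. This is a minimal sketch using only the standard library; `disable_prefix_space` is a hypothetical helper name, and it assumes tokenizer_config.json has already been downloaded to a local path.

```python
import json
from pathlib import Path


def disable_prefix_space(config_path):
    """Set add_prefix_space to false in a local tokenizer_config.json.

    Returns the previous value (None if the key was absent) so the
    change can be reverted. All other keys are left untouched.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("add_prefix_space")
    config["add_prefix_space"] = False
    # ensure_ascii=False keeps any non-ASCII token strings readable in the file.
    path.write_text(
        json.dumps(config, ensure_ascii=False, indent=2) + "\n",
        encoding="utf-8",
    )
    return previous
```

For example, `disable_prefix_space("tokenizer_config.json")` rewrites the file in place; the tokenizer then needs to be reloaded from that directory for the change to take effect.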

paulcx (Author) commented Aug 2, 2024

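> @paulcx Thanks for the question. This problem currently occurs when running inference with Transformers. For now, you can try the stable release Transformers 4.42.4, then download tokenizer_config.json and change add_prefix_space to false.

Could you answer questions 1 and 2?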
