llama : fix tokenizer #2315
Conversation
Adding @howard0su's draft PR and prefix matching.
Now we see some resemblance to the Meta-Tokenizer, I think. Only problem: how to integrate this into the `llama.cpp` kernel.
Hm, do I understand correctly that even the simplest prompt of … is tokenized incorrectly? If this is the case, this should become very high priority.
Waiting for the fallout ...
@howard0su: thanks for your work on the tokenizer! I took the liberty of merging the relevant parts for the work on #2310. Would you like to check if this is as you envisioned?
@ggerganov, @howard0su, and probably others: I need guidance on a couple of points:
I think we don't make any specific guarantees about thread-safety, but generally it would be good to assume that each instance of …
This changeset is misleading. There are two words, and each has a different explanation.

For “Hello”: for a reason I don't understand, sentencepiece (the tokenizer used by llama) adds a space at the beginning of every string that is tokenized. That wasn't happening in the way the test case was called (which I don't know if it is the way it is called in “production”). The space needs to be added in order to match what llama does. Note that it only changes the very first token of the very first word, so it isn't going to be a big deal for normal LLM use cases.

For “world”: the capitalization got changed from “World” to “world”, so it is expected that the token number is different. The previous one has the correct token id as far as I can tell.

I still need to check more of the inputs, but it's not as dire as what this change makes it look like.
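To make the leading-space point concrete, here is a minimal sketch of how a caller could reproduce SentencePiece's behaviour through the plain C API; the helper name and the space-prepending strategy are illustrative assumptions, not code from this PR:

```cpp
#include <string>
#include <vector>
#include "llama.h"

// Hypothetical helper: prepend a space so the result matches what the
// reference SentencePiece tokenizer produces for the same raw text.
static std::vector<llama_token> tokenize_like_sentencepiece(llama_context * ctx, const std::string & raw) {
    const std::string text = " " + raw;                 // SentencePiece-style leading whitespace
    std::vector<llama_token> tokens(text.size() + 1);   // upper bound on the number of tokens
    const int n = llama_tokenize(ctx, text.c_str(), tokens.data(), (int) tokens.size(), /*add_bos=*/true);
    tokens.resize(n > 0 ? n : 0);
    return tokens;
}
```

With something like this, tokenizing “Hello World” with and without the prepended space should only differ in the very first piece, which is exactly what the comment above describes.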
I see. So you think the change of token texts in … is the way to go? Does something like … ?
All the tests are looking good to me now. Thanks for looking into it!
Is there a way to fix this without changing the … ?
PR #1931 may contain fixes and additions that should be included here before this is merged to the gguf branch.
I think this should be merged to the gguf branch, since the gpt2 tokenizer that was added there may have functions that could be reused by the llama tokenizer. There is also a unicode implementation that could be reused. We could even make a unified tokenizer library supporting both gpt2 and llama. In the future we would probably need to add the replit sentencepiece tokenizer and others as needed.
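As a rough illustration of what such a unified tokenizer library could look like, here is a hypothetical interface sketch; the names and structure are my own assumption, not the gguf-branch code:

```cpp
#include <cstdint>
#include <string>
#include <vector>

using token_id = int32_t;

// Common interface that both the gpt2 (BPE) and llama (SentencePiece)
// tokenizers could implement; new tokenizers (e.g. replit) would slot in later.
struct tokenizer {
    virtual ~tokenizer() = default;
    virtual std::vector<token_id> encode(const std::string & text) const = 0;
    virtual std::string           decode(const std::vector<token_id> & ids) const = 0;
};

struct bpe_tokenizer : tokenizer {          // gpt2-style byte-pair encoding
    std::vector<token_id> encode(const std::string & text) const override { /* ... */ return {}; }
    std::string decode(const std::vector<token_id> & ids) const override { /* ... */ return {}; }
};

struct spm_tokenizer : tokenizer {          // llama-style SentencePiece
    std::vector<token_id> encode(const std::string & text) const override { /* ... */ return {}; }
    std::string decode(const std::vector<token_id> & ids) const override { /* ... */ return {}; }
};
```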
PR #2053 could be related too (#2420 mentions …).
The tokenizers can be distinguished by their tokenizer_config.json (e.g. https://huggingface.co/BAAI/Aquila-7B/blob/main/tokenizer_config.json). Xgen uses tiktoken according to the model card.
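For what it's worth, a conversion-time check could look roughly like this; I'm assuming the usual `tokenizer_class` field of Hugging Face `tokenizer_config.json` and the nlohmann/json library, neither of which is prescribed by this PR:

```cpp
#include <fstream>
#include <string>
#include <nlohmann/json.hpp>

// Read tokenizer_config.json and return the declared tokenizer class,
// e.g. "GPT2Tokenizer" (BPE) or "LlamaTokenizer" (SentencePiece).
static std::string detect_tokenizer_class(const std::string & path) {
    std::ifstream f(path);
    const nlohmann::json j = nlohmann::json::parse(f);
    return j.value("tokenizer_class", std::string("unknown"));
}
```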
OK, if I understand you correctly, I should extend …
OK. I'll look into it (although I'm not fully convinced: I once learned to fix bugs first). |
This is the gpt2 tokenizer in gguf (it should probably have a better structure and be unified with the llama tokenizer): … And here is the gpt-neox example using it: …
Yes, let's look into merging this in the gguf branch.
#2553 unicode fixes
Merged via …
This PR addresses @vjeux's comment. The proposed changes are necessary to see reasonable results for the attached test cases.
To further support `is_unknown`, `is_control`, `is_byte`, `is_unused`, and more special cases, it seems reasonable (or necessary?) to extend the binary vocabulary format.
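Purely as an illustration of what such an extension could carry, here is a hypothetical per-token layout; the field names and enum values are my own sketch, not a proposal for the actual file format:

```cpp
#include <cstdint>
#include <string>

// Hypothetical token classification stored alongside each vocabulary entry.
enum class token_type : uint8_t {
    NORMAL  = 0,
    UNKNOWN = 1,   // <unk>
    CONTROL = 2,   // <s>, </s>, ...
    BYTE    = 3,   // <0x00> ... <0xFF> byte-fallback pieces
    UNUSED  = 4,
};

struct vocab_entry {
    std::string text;   // token piece
    float       score;  // SentencePiece score
    token_type  type;   // the newly added field
};

static bool is_control(const vocab_entry & e) { return e.type == token_type::CONTROL; }
static bool is_byte   (const vocab_entry & e) { return e.type == token_type::BYTE;    }
```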