llama : fix tokenizer #2315

Closed · goerch wants to merge 27 commits

Conversation

@goerch (Collaborator) commented Jul 21, 2023

This PR addresses @vjeux's comment. The proposed changes are necessary to see reasonable results for the attached test cases.

To further support `is_unknown`, `is_control`, `is_byte`, and `is_unused`, as well as more special cases, it seems reasonable (or necessary?) to extend the binary vocabulary format.

@goerch goerch changed the title Fix #2023 Fix part of #2023 Jul 21, 2023
@goerch goerch changed the title Fix part of #2023 Fix parts of #2023 Jul 21, 2023
@ggerganov ggerganov added the help wanted label Jul 22, 2023
@goerch goerch changed the title Fix parts of #2023 [WIP] Fix parts of #2023 Jul 22, 2023
goerch added 2 commits July 22, 2023 18:37
Adding @howard0su's draft PR and prefix matching.
Now we see some resemblance to the Meta tokenizer, I think. The only problem: how to integrate this into the `llama.cpp` kernel.
@ggerganov (Owner) commented

Hm, do I understand correctly that even the simplest prompt of Hello world does not currently tokenize correctly on master?

If this is the case, this should become very high priority.

Waiting for the fallout ...
@goerch goerch changed the title [WIP] Fix parts of #2023 [WIP] Fix parts of #2023 and #2310 Jul 23, 2023
@goerch (Collaborator, Author) commented Jul 23, 2023

@howard0su: thanks for your work on the tokenizer! I took the liberty of merging the relevant parts for the work on #2310. Would you like to check whether this is as you envisioned?

@goerch (Collaborator, Author) commented Jul 23, 2023

@ggerganov, @howard0su, and probably others: I need guidance on a couple of points:

  • What should we do about buffer reservation for tokenization? I estimate we need twice the string size in the worst case.
  • What do we do with `llama_token_to_str` returning a `std::string` due to whitespace unescaping?
  • How should the unescaped tokens be displayed?

@slaren (Collaborator) commented Jul 23, 2023

> What should we do about buffer reservation for tokenization? I estimate we need twice the string size in the worst case.

`llama_tokenize` returns the negated number of tokens, so that could be used to resize the buffer as needed; but allocating twice the string size would be an inconsequential amount of memory anyway, so I don't think it really matters.
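Concretely, the retry pattern could look like this minimal sketch, assuming the `llama.h` C API of the time, where a negative return value carries the required token count:

```cpp
#include <algorithm>
#include <string>
#include <vector>

#include "llama.h"

static std::vector<llama_token> tokenize(llama_context * ctx, const std::string & text, bool add_bos) {
    // Generous first guess: twice the string length, as discussed above.
    std::vector<llama_token> tokens(2 * text.size() + 1);
    int n = llama_tokenize(ctx, text.c_str(), tokens.data(), (int) tokens.size(), add_bos);
    if (n < 0) {
        // Buffer was too small: -n is the required size, so resize and retry.
        tokens.resize(-n);
        n = llama_tokenize(ctx, text.c_str(), tokens.data(), (int) tokens.size(), add_bos);
    }
    tokens.resize(std::max(n, 0));
    return tokens;
}
```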

> What do we do with `llama_token_to_str` returning a `std::string` due to whitespace unescaping?

I think we don't make any specific guarantees about thread safety, but generally it would be good to assume that each instance of `llama_model`, or any of the other objects, should be usable concurrently with respect to the others. So I think that only leaves two options:

  • Allocate a buffer in `llama_model` and return that
  • Take a buffer from the user (a sketch of this option follows below)
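For the second option, here is a minimal sketch of what a caller-supplied buffer could look like, with a toy unescape step standing in for the real vocabulary lookup. The function name and the negative-return convention are illustrative assumptions, mirroring `llama_tokenize` above:

```cpp
#include <cstring>
#include <string>

// Hypothetical caller-buffer variant: returns the bytes written, or the
// negated required size when `length` is too small. The stored piece uses
// U+2581 (0xE2 0x96 0x81 in UTF-8) as the whitespace marker; we map it
// back to a plain space here.
static int token_piece_to_buf(const std::string & piece, char * buf, int length) {
    std::string text = piece;
    for (size_t pos = 0; (pos = text.find("\xE2\x96\x81", pos)) != std::string::npos; ) {
        text.replace(pos, 3, " "); // unescape the whitespace marker
        pos += 1;
    }
    if ((int) text.size() > length) {
        return -(int) text.size(); // caller should resize and call again
    }
    std::memcpy(buf, text.data(), text.size());
    return (int) text.size();
}
```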

@vjeux commented Jul 24, 2023

> Hm, do I understand correctly that even the simplest prompt of Hello world does not currently tokenize correctly on master?

This changeset is misleading. There are two words, and each has a different explanation.

For “Hello”: for a reason I don’t understand, sentencepiece (the tokenizer used by llama) adds a space at the beginning of every string it tokenizes. But that wasn’t done in the way the test case was called (which I don’t know is the way it is called in “production”). This space needs to be added in order to match what llama does. Note that it only changes the very first token of the very first word, so it isn’t a big deal for normal LLM use cases.

For “world”: the capitalization was changed from “World” to “world”, so it is expected that the token number is different. The previous one had the correct token id as far as I can tell.

I still need to check more of the previous inputs, but it’s not as dire as this change makes it look.
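A small probe of the leading-space behaviour described above; this sketch assumes the stock sentencepiece C++ library and a LLaMA `tokenizer.model` in the working directory:

```cpp
#include <iostream>
#include <string>
#include <vector>

#include <sentencepiece_processor.h>

int main() {
    sentencepiece::SentencePieceProcessor sp;
    if (!sp.Load("tokenizer.model").ok()) {
        std::cerr << "failed to load tokenizer.model\n";
        return 1;
    }
    // With the model's default add_dummy_prefix normalization, sentencepiece
    // inserts a space before the input, so the first piece of "Hello world"
    // comes back as "▁Hello" rather than "Hello".
    std::vector<std::string> pieces;
    sp.Encode("Hello world", &pieces);
    for (const auto & p : pieces) {
        std::cout << "'" << p << "' "; // expected: '▁Hello' '▁world'
    }
    std::cout << "\n";
    return 0;
}
```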

@goerch (Collaborator, Author) commented Jul 24, 2023

> I still need to check more of the previous inputs, but it’s not as dire as this change makes it look.

I see. So you think the change of token texts in `convert.py`

```python
text: bytes
if tokenizer.is_unknown(i):
    # unknown tokens render as U+2047 ("⁇")
    text = " \u2047 ".encode("utf-8")
elif tokenizer.is_control(i):
    # control tokens get no text representation
    text = b""
elif tokenizer.is_byte(i):
    # byte tokens look like "<0x0A>": six characters with the hex value inside
    piece = tokenizer.id_to_piece(i)
    if len(piece) != 6:
        raise Exception(f"Invalid token: {piece}")
    byte_value = int(piece[3:-1], 16)
    text = struct.pack("B", byte_value)
else:
    # unescape the sentencepiece whitespace marker U+2581 ("▁")
    text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
```

is the way to go? Does something like `test-tokenizer-1` from this PR work for you? Does `size(vocab.id_to_token) == size(vocab.token_to_id)` (= 32000 for llama) hold for you?

@goerch goerch changed the title [WIP] Fix parts of #2023 and #2310 Fix parts of #2023 and #2310 Jul 24, 2023
@vjeux commented Jul 24, 2023

All the tests are looking good to me now. Thanks for looking into it!

@ggerganov (Owner) commented

Is there a way to fix this without changing the `convert.py` script? Otherwise, all ggml models would become obsolete after this change.

@ggerganov ggerganov mentioned this pull request Jul 26, 2023
@ggerganov ggerganov changed the title Fix parts of #2023 and #2310 llama : fix tokenizer Jul 26, 2023
@klosax (Contributor) commented Aug 7, 2023

PR #1931 may contain fixes and additions that should be included here before merging into the gguf branch.

@klosax (Contributor) commented Aug 7, 2023

I think this should be merged into the gguf branch, since the gpt2 tokenizer that was added there may have functions that could be reused by the llama tokenizer. There is also a unicode implementation that could be reused. We could even make a unified tokenizer library supporting both gpt2 and llama (one possible shape is sketched below). In the future we would probably need to add the replit sentencepiece tokenizer and others as needed.
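Purely as an illustration, such a unified library could start from a shared interface like the following sketch; the names are hypothetical and nothing like this exists in the PR:

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// A shared interface that gpt2-BPE and llama-sentencepiece backends could
// implement, selected per model at load time.
struct tokenizer {
    virtual ~tokenizer() = default;
    virtual std::vector<int> encode(const std::string & text) const = 0;
    virtual std::string decode(const std::vector<int> & ids) const = 0;
};

// A factory keyed on the tokenizer family recorded in the model file would
// let llama.cpp, gptneox-main.cpp, and future loaders share one code path.
std::unique_ptr<tokenizer> make_tokenizer(const std::string & family) {
    // if (family == "gpt2")  return std::make_unique<gpt2_bpe_tokenizer>(/*...*/);
    // if (family == "llama") return std::make_unique<llama_spm_tokenizer>(/*...*/);
    throw std::runtime_error("unknown tokenizer family: " + family);
}
```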

@goerch (Collaborator, Author) commented Aug 7, 2023

> PR #1931 may contain fixes and additions that should be included here before merging into the gguf branch.

PR #2053 could be related too (#2420 mentions Aquila and xgen). My biggest concern currently is that the only way I know to distinguish tokenizers is by the size of the vocabulary. Do you know of any better way?

@klosax (Contributor) commented Aug 7, 2023

The tokenizers can be distinguished by `tokenizer_class` in `tokenizer_config.json`:

https://huggingface.co/BAAI/Aquila-7B/blob/main/tokenizer_config.json
https://huggingface.co/Salesforce/xgen-7b-8k-base/blob/main/tokenizer_config.json

Xgen uses tiktoken, according to the model card.
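For illustration, a converter could read that field with a few lines like the following sketch (assuming the nlohmann/json single-header library; the field name comes from the files linked above):

```cpp
#include <fstream>
#include <iostream>

#include <nlohmann/json.hpp>

int main() {
    std::ifstream f("tokenizer_config.json");
    const nlohmann::json config = nlohmann::json::parse(f);
    // The class name tells BPE-style tokenizers apart from sentencepiece-style
    // ones without guessing from the vocabulary size.
    std::cout << config.value("tokenizer_class", "unknown") << "\n";
    return 0;
}
```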

@goerch (Collaborator, Author) commented Aug 7, 2023

> The tokenizers can be distinguished by `tokenizer_class` in `tokenizer_config.json`.

OK, if I understand you correctly, I should extend `convert.py` to transfer the `tokenizer_class` into the ggml format? I could try that, but it might duplicate effort with the gguf branch. I can only speculate about which way forward (this PR or a corresponding one against gguf) gets better test coverage.

@goerch (Collaborator, Author) commented Aug 7, 2023

> I think this should be merged into the gguf branch, since the gpt2 tokenizer that was added there may have functions that could be reused by the llama tokenizer. There is also a unicode implementation that could be reused. We could even make a unified tokenizer library supporting both gpt2 and llama. In the future we would probably need to add the replit sentencepiece tokenizer and others as needed.

OK. I'll look into it (although I'm not fully convinced: I once learned to fix bugs first).

@klosax (Contributor) commented Aug 7, 2023

This is the gpt2 tokenizer in gguf (it should probably be given a better structure and unified with the llama tokenizer):
https://github.com/ggerganov/llama.cpp/blob/gguf/cmpnct_gpt2bpe.hpp

And here is the gpt-neox example using it:
https://github.com/ggerganov/llama.cpp/blob/gguf/gptneox-main.cpp

@goerch goerch mentioned this pull request Aug 8, 2023
@ggerganov (Owner) commented

> I think this should be merged into the gguf branch

Yes, let's look into merging this into the gguf branch.

@klosax (Contributor) commented Aug 8, 2023

#2553 unicode fixes

@goerch (Collaborator, Author) commented Aug 21, 2023

Merged via gguf branch.

@goerch goerch closed this Aug 21, 2023
@goerch goerch deleted the fix-2023 branch September 4, 2023 05:15