llama : fix tokenizer #2315

Closed · goerch wants to merge 27 commits

Conversation

@goerch (Collaborator) commented Jul 21, 2023

This PR addresses @vjeux's comment. The proposed changes are necessary to see reasonable results for the attached test cases.

To further support `is_unknown`, `is_control`, `is_byte`, and `is_unused`, as well as more special cases, it seems reasonable (or necessary?) to extend the binary vocabulary format.

@goerch goerch changed the title Fix #2023 Fix part of #2023 Jul 21, 2023
@goerch goerch changed the title Fix part of #2023 Fix parts of #2023 Jul 21, 2023
@ggerganov ggerganov added the help wanted label Jul 22, 2023
@goerch goerch changed the title Fix parts of #2023 [WIP] Fix parts of #2023 Jul 22, 2023
goerch added 2 commits July 22, 2023 18:37
Adding @howard0su's draft PR and prefix matching.
Now we see some resemblance to the Meta tokenizer, I think. The only problem: how to integrate this into the `llama.cpp` kernel.
@ggerganov (Owner) commented

Hm, do I understand correctly that even the simplest prompt of Hello world does not currently tokenize correctly on master?

If this is the case, this should become very high priority.

Waiting for the fallout ...
@goerch goerch changed the title [WIP] Fix parts of #2023 [WIP] Fix parts of #2023 and #2310 Jul 23, 2023
@goerch (Collaborator, Author) commented Jul 23, 2023

@howard0su: thanks for your work on the tokenizer! I took the liberty of merging the relevant parts for the work on #2310. Would you like to check whether this is as you envisioned?

@goerch (Collaborator, Author) commented Jul 23, 2023

@ggerganov, @howard0su, and probably others: I need guidance on a couple of points:

  • What should we do about buffer reservation for tokenization? I estimate we need twice the string size in the worst case.
  • What do we do with `llama_token_to_str` returning a `std::string` due to whitespace unescaping?
  • How should the unescaped tokens be displayed?

@slaren (Collaborator) commented Jul 23, 2023

> What should we do about buffer reservation for tokenization? I estimate we need twice the string size in the worst case.

`llama_tokenize` returns the negated number of tokens, so that could be used to resize the buffer as needed; but allocating twice the string size would be an inconsequential amount of memory anyway, so I don't think it really matters.
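Concretely, the retry pattern could look like this minimal sketch, assuming the `llama.h` C API of the time, where a negative return value carries the required token count:

```cpp
#include <algorithm>
#include <string>
#include <vector>

#include "llama.h"

static std::vector<llama_token> tokenize(llama_context * ctx, const std::string & text, bool add_bos) {
    // Generous first guess: twice the string length, as discussed above.
    std::vector<llama_token> tokens(2 * text.size() + 1);
    int n = llama_tokenize(ctx, text.c_str(), tokens.data(), (int) tokens.size(), add_bos);
    if (n < 0) {
        // Buffer was too small: -n is the required size, so resize and retry.
        tokens.resize(-n);
        n = llama_tokenize(ctx, text.c_str(), tokens.data(), (int) tokens.size(), add_bos);
    }
    tokens.resize(std::max(n, 0));
    return tokens;
}
```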

> What do we do with `llama_token_to_str` returning a `std::string` due to whitespace unescaping?

I think we don't make any specific guarantees about thread safety, but generally it would be good to assume that each instance of `llama_model`, or any of the other objects, should be usable concurrently with respect to the others. So I think that only leaves two options:

  • Allocate a buffer in `llama_model` and return that
  • Take a buffer from the user (a sketch of this option follows below)
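For the second option, here is a minimal sketch of what a caller-supplied buffer could look like, with a toy unescape step standing in for the real vocabulary lookup. The function name and the negative-return convention are illustrative assumptions, mirroring `llama_tokenize` above:

```cpp
#include <cstring>
#include <string>

// Hypothetical caller-buffer variant: returns the bytes written, or the
// negated required size when `length` is too small. The stored piece uses
// U+2581 (0xE2 0x96 0x81 in UTF-8) as the whitespace marker; we map it
// back to a plain space here.
static int token_piece_to_buf(const std::string & piece, char * buf, int length) {
    std::string text = piece;
    for (size_t pos = 0; (pos = text.find("\xE2\x96\x81", pos)) != std::string::npos; ) {
        text.replace(pos, 3, " "); // unescape the whitespace marker
        pos += 1;
    }
    if ((int) text.size() > length) {
        return -(int) text.size(); // caller should resize and call again
    }
    std::memcpy(buf, text.data(), text.size());
    return (int) text.size();
}
```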

@vjeux commented Jul 24, 2023

> Hm, do I understand correctly that even the simplest prompt of Hello world does not currently tokenize correctly on master?

This changeset is misleading. There are two words, and each has a different explanation.

For “Hello”: for a reason I don’t understand, sentencepiece (the tokenizer used by llama) adds a space at the beginning of every string it tokenizes. But that wasn’t done in the way the test case was called (which I don’t know is the way it is called in “production”). This space needs to be added in order to match what llama does. Note that it only changes the very first token of the very first word, so it isn’t a big deal for normal LLM use cases.

For “world”: the capitalization was changed from “World” to “world”, so it is expected that the token number is different. The previous one had the correct token id as far as I can tell.

I still need to check more of the previous inputs, but it’s not as dire as this change makes it look.
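A small probe of the leading-space behaviour described above; this sketch assumes the stock sentencepiece C++ library and a LLaMA `tokenizer.model` in the working directory:

```cpp
#include <iostream>
#include <string>
#include <vector>

#include <sentencepiece_processor.h>

int main() {
    sentencepiece::SentencePieceProcessor sp;
    if (!sp.Load("tokenizer.model").ok()) {
        std::cerr << "failed to load tokenizer.model\n";
        return 1;
    }
    // With the model's default add_dummy_prefix normalization, sentencepiece
    // inserts a space before the input, so the first piece of "Hello world"
    // comes back as "▁Hello" rather than "Hello".
    std::vector<std::string> pieces;
    sp.Encode("Hello world", &pieces);
    for (const auto & p : pieces) {
        std::cout << "'" << p << "' "; // expected: '▁Hello' '▁world'
    }
    std::cout << "\n";
    return 0;
}
```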

@goerch (Collaborator, Author) commented Jul 24, 2023

> I still need to check more of the previous inputs, but it’s not as dire as this change makes it look.

I see. So you think the change of token texts in `convert.py`

```python
text: bytes
if tokenizer.is_unknown(i):
    # unknown tokens render as U+2047 ("⁇")
    text = " \u2047 ".encode("utf-8")
elif tokenizer.is_control(i):
    # control tokens get no text representation
    text = b""
elif tokenizer.is_byte(i):
    # byte tokens look like "<0x0A>": six characters with the hex value inside
    piece = tokenizer.id_to_piece(i)
    if len(piece) != 6:
        raise Exception(f"Invalid token: {piece}")
    byte_value = int(piece[3:-1], 16)
    text = struct.pack("B", byte_value)
else:
    # unescape the sentencepiece whitespace marker U+2581 ("▁")
    text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
```

is the way to go? Does something like `test-tokenizer-1` from this PR work for you? Does `size(vocab.id_to_token) == size(vocab.token_to_id)` (= 32000 for llama) hold for you?

@goerch goerch changed the title [WIP] Fix parts of #2023 and #2310 Fix parts of #2023 and #2310 Jul 24, 2023
@vjeux commented Jul 24, 2023

All the tests are looking good to me now. Thanks for looking into it!

@ggerganov (Owner) commented

Is there a way to fix this without changing the `convert.py` script? Otherwise, all ggml models would become obsolete after this change.

@ggerganov ggerganov mentioned this pull request Jul 26, 2023
@ggerganov ggerganov changed the title Fix parts of #2023 and #2310 llama : fix tokenizer Jul 26, 2023
@klosax (Contributor) commented Aug 7, 2023

PR #1931 may contain fixes and additions that should be included here before merging into the gguf branch.

@klosax (Contributor) commented Aug 7, 2023

I think this should be merged into the gguf branch, since the gpt2 tokenizer that was added there may have functions that could be reused by the llama tokenizer. There is also a unicode implementation that could be reused. We could even make a unified tokenizer library supporting both gpt2 and llama (one possible shape is sketched below). In the future we would probably need to add the replit sentencepiece tokenizer and others as needed.
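Purely as an illustration, such a unified library could start from a shared interface like the following sketch; the names are hypothetical and nothing like this exists in the PR:

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// A shared interface that gpt2-BPE and llama-sentencepiece backends could
// implement, selected per model at load time.
struct tokenizer {
    virtual ~tokenizer() = default;
    virtual std::vector<int> encode(const std::string & text) const = 0;
    virtual std::string decode(const std::vector<int> & ids) const = 0;
};

// A factory keyed on the tokenizer family recorded in the model file would
// let llama.cpp, gptneox-main.cpp, and future loaders share one code path.
std::unique_ptr<tokenizer> make_tokenizer(const std::string & family) {
    // if (family == "gpt2")  return std::make_unique<gpt2_bpe_tokenizer>(/*...*/);
    // if (family == "llama") return std::make_unique<llama_spm_tokenizer>(/*...*/);
    throw std::runtime_error("unknown tokenizer family: " + family);
}
```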

@goerch (Collaborator, Author) commented Aug 7, 2023

> PR #1931 may contain fixes and additions that should be included here before merging into the gguf branch.

PR #2053 could be related too (#2420 mentions Aquila and xgen). My biggest concern currently is that the only way I know to distinguish tokenizers is by the size of the vocabulary. Do you know of any better way?

@klosax (Contributor) commented Aug 7, 2023

The tokenizers can be distinguished by `tokenizer_class` in `tokenizer_config.json`:

https://huggingface.co/BAAI/Aquila-7B/blob/main/tokenizer_config.json
https://huggingface.co/Salesforce/xgen-7b-8k-base/blob/main/tokenizer_config.json

Xgen uses tiktoken, according to the model card.
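For illustration, a converter could read that field with a few lines like the following sketch (assuming the nlohmann/json single-header library; the field name comes from the files linked above):

```cpp
#include <fstream>
#include <iostream>

#include <nlohmann/json.hpp>

int main() {
    std::ifstream f("tokenizer_config.json");
    const nlohmann::json config = nlohmann::json::parse(f);
    // The class name tells BPE-style tokenizers apart from sentencepiece-style
    // ones without guessing from the vocabulary size.
    std::cout << config.value("tokenizer_class", "unknown") << "\n";
    return 0;
}
```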

@goerch (Collaborator, Author) commented Aug 7, 2023

> The tokenizers can be distinguished by `tokenizer_class` in `tokenizer_config.json`.

OK, if I understand you correctly, I should extend `convert.py` to transfer the `tokenizer_class` into the ggml format? I could try that, but it might duplicate effort with the gguf branch. I can only speculate about which way forward (this PR or a corresponding one against gguf) gets better test coverage.

@goerch (Collaborator, Author) commented Aug 7, 2023

> I think this should be merged into the gguf branch, since the gpt2 tokenizer that was added there may have functions that could be reused by the llama tokenizer. There is also a unicode implementation that could be reused. We could even make a unified tokenizer library supporting both gpt2 and llama. In the future we would probably need to add the replit sentencepiece tokenizer and others as needed.

OK. I'll look into it (although I'm not fully convinced: I once learned to fix bugs first).

@klosax (Contributor) commented Aug 7, 2023

This is the gpt2 tokenizer in gguf (it should probably be given a better structure and unified with the llama tokenizer):
https://github.com/ggerganov/llama.cpp/blob/gguf/cmpnct_gpt2bpe.hpp

And here is the gpt-neox example using it:
https://github.com/ggerganov/llama.cpp/blob/gguf/gptneox-main.cpp

@goerch goerch mentioned this pull request Aug 8, 2023
@ggerganov (Owner) commented

> I think this should be merged into the gguf branch

Yes, let's look into merging this into the gguf branch.

@klosax (Contributor) commented Aug 8, 2023

#2553 unicode fixes

@goerch (Collaborator, Author) commented Aug 21, 2023

Merged via gguf branch.

@goerch goerch closed this Aug 21, 2023
@goerch goerch deleted the fix-2023 branch September 4, 2023 05:15