Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

llama : more tokenizer fixes #2810

Merged
merged 13 commits into from
Aug 27, 2023
Merged

llama : more tokenizer fixes #2810

merged 13 commits into from
Aug 27, 2023

Conversation

ggerganov
Copy link
Owner

@ggerganov ggerganov commented Aug 26, 2023

TLDR: after this PR, devs should no longer add in the user code a leading space to the text they need to tokenize

In other words, if your code uses llama.cpp library for tokenization and is doing this:

     s.insert(0, 1, ' '); // add a space to match OG LLaMA tokenizer
     p = ::llama_tokenize(ctx, s, add_bos); 

It should now just do this:

     p = ::llama_tokenize(ctx, s, add_bos); 

note: by "user" below I mean developers. I.e. users of llama.cpp library

This PR should not change the tokenization results from master - it just improves the tokenizer API and fixes the decoding logic and the reason for the leading space in the detokenized results. It adds a few more tokenizer tests and a helper Python script to compare the behaviour of the 2 implementations. Also confirmed that wiki.test.raw tokenizes to 335688 tokens in exactly the same way as the OG Python tokenizer and fixed a minor bug in perplexity tool.

The main mistake we have been doing is to leave to the user to add the leading whitespace before calling llama_tokenize(). For example:

if (llama_vocab_type(ctx) == LLAMA_VOCAB_TYPE_SPM) {
// Add a space in front of the first character to match OG llama tokenizer behavior
params.prompt.insert(0, 1, ' ');
}

{
s.insert(0, 1, ' '); // add a space if it's the first
p = ::llama_tokenize(ctx, s, add_bos);

{
auto s = json_prompt.template get<std::string>();
s.insert(0, 1, ' '); // always add a first space
prompt_tokens = ::llama_tokenize(ctx, s, add_bos);
}

This leading space is indeed needed since without it, we do not get the same results as the OG Python tokenizer. It does not matter if the provided text has one or more on none leading spaces - we always must prefix it with a leading space. Therefore, this has to be part of the internal implementation to avoid the user (developer) forgetting to do it:

llama.cpp/llama.cpp

Lines 3391 to 3398 in e4324cb

case LLAMA_VOCAB_TYPE_SPM:
{
// without adding this leading whitespace, we do not get the same results as the original tokenizer
raw_text = " " + raw_text;
llm_tokenizer_spm tokenizer(vocab);
llama_escape_whitespace(raw_text);
tokenizer.tokenize(raw_text, output);

For example, we have technically always had a mistake in the perplexity tool because we haven't added the whitespace there:

const bool is_spm = llama_vocab_type(ctx) == LLAMA_VOCAB_TYPE_SPM;
const bool add_bos = is_spm;
fprintf(stderr, "%s: tokenizing the input ..\n", __func__);
auto tokens = ::llama_tokenize(ctx, params.prompt, add_bos);
const int n_chunk_max = tokens.size() / params.n_ctx;

This shouldn't lead to any dramatic changes in the results, but it's better to have it the right way. With the current PR I verified that the produced tokens by llama.cpp match exactly the ones from Python for wiki.text.raw:

# tokenize with Python
python3 tests/test-tokenizer-0-llama.py ../llama2/ --fname-tok ./build/wikitext-2-raw/wiki.test.raw

# tokenize with llama.cpp
make -j && ./bin/test-tokenizer-falcon.0 ../models/ggml-vocab-llama.gguf ./wikitext-2-raw/wiki.test.raw

# diff
diff wikitext-2-raw/wiki.test.raw.tok wikitext-2-raw/wiki.test.raw.tokcpp 

I've also update the C-style API with better name of the de-tokenizer function: llama_token_to_str -> llama_token_to_piece:

llama.cpp/llama.h

Lines 384 to 393 in e4324cb

// Token Id -> Piece.
// Uses the vocabulary in the provided context.
// Does not write null terminator to the buffer.
// User code is responsible to remove the leading whitespace of the first non-BOS token when decoding multiple tokens.
LLAMA_API int llama_token_to_piece(
const struct llama_context * ctx,
llama_token token,
char * buf,
int length);

The common lib now provides a llama_detokenize() function that correctly detokenizes a vector of tokens into text, without adding a leading space at the start:

llama.cpp/common/common.h

Lines 119 to 137 in e4324cb

// tokenizes a string into a vector of tokens
// should work similar to Python's `tokenizer.encode`
std::vector<llama_token> llama_tokenize(
struct llama_context * ctx,
const std::string & text,
bool add_bos);
// tokenizes a token into a piece
// should work similar to Python's `tokenizer.id_to_piece`
std::string llama_token_to_piece(
const struct llama_context * ctx,
llama_token token);
// detokenizes a vector of tokens into a string
// should work similar to Python's `tokenizer.decode`
// removes the leading space from the first non-BOS token
std::string llama_detokenize(
llama_context * ctx,
const std::vector<llama_token> & tokens);

The logic is that we have to drop the leading space of the first non-BOS token when detokenizing:

llama.cpp/common/common.cpp

Lines 750 to 768 in e4324cb

std::string llama_detokenize(llama_context * ctx, const std::vector<llama_token> & tokens) {
const llama_token bos_id = llama_token_bos(ctx);
std::string piece;
std::string result;
for (size_t i = 0; i < tokens.size(); ++i) {
piece = llama_token_to_piece(ctx, tokens[i]);
// remove the leading space of the first non-BOS token
if (((tokens[0] == bos_id && i == 1) || (tokens[0] != bos_id && i == 0)) && piece[0] == ' ') {
piece = piece.substr(1);
}
result += piece;
}
return result;
}

This function should behave exactly the same as the Python's tokenizer.decode().

Notice that in main we still get a leading space inserted when we print out the input prompt. This is because we decode the generated text token by token - i.e. piece by piece. So we don't have the information which token is first. We can add a state to main.cpp to track this, but I think it is not worth it.

Another small bug fix is that we now correctly tokenize the empty string '' to [1] with bos == true and to [] with bos == false

Please give this branch a thorough test to make sure I haven't made some mistake and lets hope we have finally resolved all tokenizer related issues :)


TLDR: after this PR, devs should no longer add in the user code a leading space to the text they need to tokenize as we have been doing up until now

llama.cpp Outdated Show resolved Hide resolved
@ggerganov ggerganov marked this pull request as ready for review August 26, 2023 16:39
Copy link
Collaborator

@SlyEcho SlyEcho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be okay with the server, at least.

@ghost
Copy link

ghost commented Aug 26, 2023

interactive in main looks good, no issues in my tests.

@ggerganov
Copy link
Owner Author

@SlyEcho

I'm looking into more details in this code:

else
{
if (first)
{
first = false;
}
prompt_tokens.push_back(p.template get<llama_token>());
}
}
}

I think it should be:

         { 
             if (first) 
             { 
                 prompt_tokens.push_back(bos_token);
                 first = false; 
             } 
             prompt_tokens.push_back(p.template get<llama_token>()); 
         } 

Not sure how important is this, but mentioning just in case. You will need to propagate bos_token from somewhere though

@klosax
Copy link
Contributor

klosax commented Aug 26, 2023

LF token is not correct, should not be tokenized using prepended space.

llm_load_print_meta: general.name   = llama2-7b-hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 29871 '▁'
llm_load_tensors: ggml ctx size =    0.09 MB

Master:

llm_load_print_meta: general.name   = llama2-7b-hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB

@ghost
Copy link

ghost commented Aug 26, 2023

LF token is not correct, should not be tokenized using prepended space.

I missed this, but yeah there's a difference for LF token in main between PR and master.

@ggerganov
Copy link
Owner Author

LF token is not correct, should not be tokenized using prepended space.

Ah good catch. The actual problem is that it should not be tokenized, but we should use piece_to_id function

@klosax
Copy link
Contributor

klosax commented Aug 26, 2023

Now LF should work in both llama and falcon.

@ghost
Copy link

ghost commented Aug 26, 2023

Now LF should work in both llama and falcon.

llama2 now correctly showing LF token = 13 '<0x0A>' for main

Copy link
Contributor

@klosax klosax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to work good.

@ggerganov
Copy link
Owner Author

ggerganov commented Aug 27, 2023

I have verified that this branch compared to master produces:

  • identical Hellaswag results for F16 Falcon 7B and F16 LLaMA-v2 7B
  • identical perplexity for Falcon 7B Q4_0
  • nearly identical perplexity for LLaMA-v2 7B Q4_0 (this is expected due to the new correct tokenization of wiki text):
    • master Q4_0: ppl 5.9693
    • PR Q4_0: ppl 5.9694

@ikawrakow
Copy link
Contributor

So, I approved the changes, but the PR basically goes back to auto-adding white space at the beginning strings to be tokenized, which changes HellaSwag scores. I.e., it kind of reverts @klosax's fix in #2806. With PR #2805 merged, this no longer leads to the dramatic difference we saw for HellaSwag after the GGUF merge, but still. I guess, the values in #2321 will need to be updated (for instance, we now have a HellaSwag score of 76.75 after 400 tasks for LLaMA-v2-7B instead of the 77.5 shown in the table in #2321).

@ggerganov
Copy link
Owner Author

ggerganov commented Aug 27, 2023

Yes, after digging into this yesterday, I believe that the proposed implementation in this PR is the correct one and the old values have to be updated. The auto-prefix with whitespace inside the llama_tokenize() implementation is required in order to match tokenizer.encode() behaviour in all cases and strings. So before the PR, we were mostly doing this manually in user-code, but not always. Now we should never need to prefix with whitespace in user-code (unless one wants to concatenate tokenization results, but in this case one must take into account the specific whitespace behaviour of the tokenizer and know how to perform it correctly -- in general, we should avoid concatenation of tokens as we currently do so).

It's still possible that I am misunderstanding some of the tokenizer logic, but atm I'm fairly confident that the proposal is better than what we have on master and my unit tests confirm this.

Edit: Regarding the concatenation comment, here is an example to try to make it clear:

Running this Python code first with LLaMA's SPM tokenizer and then with Falcon's BPE tokenizer:

print(tokenizer.encode('hello'))         # x, beginning
print(tokenizer.encode('world'))         # y, ending
print(tokenizer.encode(' world'))        # z, ending with whitespace prefix
print(tokenizer.encode('hello world'))   # r, concat

# results with SPM
[22172]             # x
[3186]              # y
[29871, 3186]       # z
[22172, 3186]       # r = [ x, y ]

# results with BPE
[30835]             # x
[9091]              # y
[1079]              # z
[30835, 1079]       # r = [ x, z ]

@ggerganov
Copy link
Owner Author

And one more observation regarding the built-in Falcon BPE tokenization in llama.cpp:

We currently know that it does not handle Unicode strings correctly. However, running the unit test, we can see that doing a roundtrip of (tokenize + detokenize) we always get the "source" string:

$ ./bin/test-tokenizer-0-falcon ../models/falcon-7b/ggml-model-f16.gguf 

src: ' '
res: ' '
tok: 204 

src: '  '
res: '  '
tok: 258 

src: '   '
res: '   '
tok: 466 

src: '    Hello'
res: '    Hello'
tok: 466 23090 

src: '    Hello
    Hello'
res: '    Hello
    Hello'
tok: 466 23090 742 23090 

src: '   Hello'
res: '   Hello'
tok: 258 23090 

src: '  Hello'
res: '  Hello'
tok: 204 23090 

src: ' Hello'
res: ' Hello'
tok: 23090 

src: ' Hello World'
res: ' Hello World'
tok: 23090 2889 

src: ' Hello World!'
res: ' Hello World!'
tok: 23090 2889 12 

src: ' Hello world'
res: ' Hello world'
tok: 23090 1079 

src: ' Hello, world!'
res: ' Hello, world!'
tok: 23090 23 1079 12 

src: ' this is 🦙.cpp'
res: ' this is 🦙.cpp'
tok: 414 304 204 182 237 111 231 25 29247 
main : failed test:    ' this is 🦙.cpp'
main : detokenized to: ' this is 🦙.cpp' instead of ' this is 🦙.cpp'
main : expected tokens:    414,    304,   3346,    111,    231,     25,  29247, 
main : got tokens:         414,    304,    204,    182,    237,    111,    231,     25,  29247, 

src: 'Hello'
res: 'Hello'
tok: 9856 

src: 'Hello World'
res: 'Hello World'
tok: 9856 2889 

src: 'Hello world'
res: 'Hello world'
tok: 9856 1079 

src: 'Hello, world!'
res: 'Hello, world!'
tok: 9856 23 1079 12 

src: 'w048 7tuijk dsdfhu'
res: 'w048 7tuijk dsdfhu'
tok: 98 55866 204 34 16682 7149 36190 6869 11481 

src: 'нещо на Български'
res: 'нещо на Български'
tok: 150 133 6207 151 215 150 134 204 150 133 6279 204 150 223 151 216 15401 150 123 6279 9310 9649 16153 7795 
main : failed test:    'нещо на Български'
main : detokenized to: 'нещо на Български' instead of 'нещо на Български'
main : expected tokens:    150,    133,   6207,    151,    215,    150,    134,   5052,    133,   6279,   5052,    223,    151,    216,  49679,    123,  53110,  47043,   7795, 
main : got tokens:         150,    133,   6207,    151,    215,    150,    134,    204,    150,    133,   6279,    204,    150,    223,    151,    216,  15401,    150,    123,   6279,   9310,   9649,  16153,   7795, 

src: 'កាន់តែពិសេសអាចខលចេញ'
res: 'កាន់តែពិសេសអាចខលចេញ'
tok: 167 236 206 167 236 126 167 236 225 167 237 217 167 236 221 167 237 208 167 236 228 167 236 127 167 236 237 167 237 207 167 236 237 167 236 107 167 236 126 167 236 211 167 236 207 167 236 233 167 236 211 167 237 207 167 236 215 
main : failed test:    'កាន់តែពិសេសអាចខលចេញ'
main : detokenized to: 'កាន់តែពិសេសអាចខលចេញ' instead of 'កាន់តែពិសេសអាចខលចេញ'
main : expected tokens:  38154,    206,  38154,    126,  38154,    225,    167,    237,    217,  38154,    221,    167,    237,    208,  38154,    228,  38154,    127,  38154,    237,    167,    237,    207,  38154,    237,  38154,    107,  38154,    126,  38154,    211,  38154,    207,  38154,    233,  38154,    211,    167,    237,    207,  38154,    215, 
main : got tokens:         167,    236,    206,    167,    236,    126,    167,    236,    225,    167,    237,    217,    167,    236,    221,    167,    237,    208,    167,    236,    228,    167,    236,    127,    167,    236,    237,    167,    237,    207,    167,    236,    237,    167,    236,    107,    167,    236,    126,    167,    236,    211,    167,    236,    207,    167,    236,    233,    167,    236,    211,    167,    237,    207,    167,    236,    215, 

src: '🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)'
res: '🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)'
tok: 182 237 232 206 204 19 11003 20 204 182 237 230 126 168 206 219 182 237 218 116 13392 204 19 51831 732 63209 1741 7955 522 20 204 57196 204 19 7927 53360 325 504 701 946 10930 20 
main : failed test:    '🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)'
main : detokenized to: '🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)' instead of '🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)'
main : expected tokens:   2571,    232,    206,    204,     19,  11003,     20,   8196,    126,    283,    219,  48778,    116,  13392,    204,     19,  51831,    732,  63209,   1741,   7955,    522,     20,  22438,    211,    204,     19,   7927,  53360,    325,    504,    701,    946,  10930,     20, 
main : got tokens:         182,    237,    232,    206,    204,     19,  11003,     20,    204,    182,    237,    230,    126,    168,    206,    219,    182,    237,    218,    116,  13392,    204,     19,  51831,    732,  63209,   1741,   7955,    522,     20,    204,  57196,    204,     19,   7927,  53360,    325,    504,    701,    946,  10930,     20, 

However, the obtained string is sometimes constructed from a different set of tokens that does not match the correct set that we obtain with the OG tokenizer. This happens when unicode characters are involved.

All of this is expected and indicates that for now, projects should avoid llama.cpp's built-in BPE tokenizer if tokenization accuracy is important. But for quick tests, the generated texts should look OK

@ggerganov ggerganov merged commit edd4c14 into master Aug 27, 2023
@ggerganov ggerganov deleted the fix-tokenizer branch August 27, 2023 11:19
@ikawrakow
Copy link
Contributor

Btw., did anyone except me notice that tokenization is now almost 6 times slower than it used to be?

After merging this PR, it now takes 3.2 seconds to tokenize Wikitext. It used to be ~540 ms up to bae5c5f, became 3.1 seconds after merging #2806, and now 3.2 seconds after this PR.

@ghost
Copy link

ghost commented Aug 27, 2023

I don't have an accurate measure for timing for tokenizing input in perplexity, but counting in my head took 23 seconds for both master & bae5c5f

If there's slowness, then I'd notice on Android unless my device is unaffected.

akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
* tests : write a Python tokenizer test (wip)

* llama : prefix input text for tokenization with whitespace

* llama : distinguish pieces from decoded text + fix detokenization

* common : add comments

* examples : no longer manually add leading space when tokenizing

* tests : use Python to generate tokenizer tests for C++

* tests : add option to tokenize text files

ggml-ci

* tests : add test-tokenizer-1.py

* llama.cpp : fix LF token

* hellaswag : move the concat space for clarity

* tests : add falcon tests (py + cpp, currently do not pass Unicode)

ggml-ci

* common : temporary separate llama_detokenize calls for SPM and BPE

---------

Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
@OthmanProgramming

This comment was marked as spam.

@OthmanProgramming

This comment was marked as spam.

@shibe2
Copy link
Contributor

shibe2 commented Sep 8, 2023

So how do I tokenize partial text for Lllama 2 now? I need tokenize(x+y)==tokenize(x)+tokenize(y) (given that ending of x and beginning of y don't form a token).

@goerch
Copy link
Collaborator

goerch commented Sep 9, 2023

So how do I tokenize partial text for Lllama 2 now? I need tokenize(x+y)==tokenize(x)+tokenize(y) (given that ending of x and beginning of y don't form a token).

Interesting. This is an invariant I haven't looked into yet. For now:

        for(auto i = 0; i < 1024; ++ i) {
            auto str = random_string(64);
            for(auto j = 0; j < str.length(); ++ j) {
                auto ids1 = spp.EncodeAsIds(str.substr(0, j));
                auto ids2 = spp.EncodeAsIds(str.substr(j));
                auto str1 = spp.DecodeIds(ids1);
                auto str2 = spp.DecodeIds(ids2);
                if (str1 + str2 != str)
                    std::cout << "1: >" << str1 << "< >" << str2 << "< >" << str << "<" << std::endl;
            }
        }
        for(auto i = 0; i < 1024; ++ i) {
            auto str = random_string(64);
            auto ids = spp.EncodeAsIds(str);
            for(auto j = 0; j < ids.size(); ++j) {
                auto str1 = spp.DecodeIds(std::vector(ids.begin(), ids.begin() + j));
                auto str2 = spp.DecodeIds(std::vector(ids.begin() + j, ids.end()));
                if (str1 + str2 != str)
                    std::cout << "2: >" << str1 << "< >" << str2 << "< >" << str << "<" << std::endl;
            }
        }

Invariant 1 seems to hold for Meta's tokenizer, invariant 2 breaks as soon as spaces are involved.

@goerch
Copy link
Collaborator

goerch commented Sep 9, 2023

If my quick check

        for(auto i = 0; i < 1024; ++ i) {
            auto str = random_string(64);
            auto ids = spp.EncodeAsIds(str);
            for(auto j = 0; j < str.length(); ++ j) {
                auto ids1 = spp.EncodeAsIds(str.substr(0, j));
                auto ids2 = spp.EncodeAsIds(str.substr(j));
                auto check = ids1;
                check.insert(check.end(), ids2.begin(), ids2.end());
                if (check != ids) {
                    auto str1 = spp.DecodeIds(ids1);
                    auto str2 = spp.DecodeIds(ids2);
                    std::cout << "3: >" << str1 << "< >" << str2 << "< >" << str << "<" << std::endl;
                }
            }
        }

is correct, this invariant also doesn't hold for Meta's tokenizer in the presence of spaces.

@cztomsik cztomsik mentioned this pull request Sep 13, 2023
34 tasks
@shibe2
Copy link
Contributor

shibe2 commented Sep 15, 2023

So how do I tokenize partial text for Lllama 2 now?

I have found a way. I prefix the string with LF, then remove tokens up to and including the first llama_token_nl. I added a parameter to server /tokenize, which triggers this behavior (only effective for LLAMA_VOCAB_TYPE_SPM).

By the way, server documentation as of now says that this should be the default behavior:

Note that the special BOS token is not added in front of the text and also a space character is not inserted automatically as it is for /completion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants