tokenizer : special token handling #3538
llama.cpp
// are special tokens.
// From testing, this appears to correlate 1:1 with special tokens.
//
for (const auto & t: vocab.token_to_id)
Feel like we have a redundancy here. We have token_data.type which is supposed to tell us if a token is special or not. In which cases would this not work?
I guess we can have this piece of code here as a sanity check that the special tokens that we have read from the vocabulary are indeed special, but ideally AFAIU we shouldn't need this code in order to function correctly, correct?
Yes. My approach is meant to be temporary ( or a fallback solution ), until I know for sure special tokens will always be marked as such in the vocab.
@goerch wasn't certain about token types in bpe PR, so I opted for finding special tokens manually for now.
Shouldn't we get rid of #3502 first? This looks pretty basic (and terrible :) to me.
Shouldn't we get rid of #3502 first? This looks pretty basic (and terrible :) to me.
I'm open for critique, but you have to clarify what you mean by "this" and "terrible" :) otherwise I'm not sure how to proceed here
Yes. My approach is meant to be temporary ( or a fallback solution ), until I know for sure special tokens will always be marked as such in the vocab.
Ok I see. We can probably combine both the tokens marked as special and those that are unmatchable by other tokens. And probably print a warning if there is a mismatch between the 2 sets.
Let's see if @goerch has anything else to add and then I would like to make a pass over the implementation to update the coding style to match better. After that we can merge
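The combination suggested above (take the union of the tokens marked as special and the tokens that are unmatchable, and warn when the two sets disagree) could be sketched roughly as follows. This is only an illustration: `SpecialSets` and `merge_special` are hypothetical names, and the two input sets are assumed to have been computed elsewhere.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical result type: the combined set plus the ids on which the two sources disagree.
struct SpecialSets {
    std::set<int>    merged;
    std::vector<int> mismatched;
};

// marked      : ids whose vocab type says "not normal" (assumed input)
// unmatchable : ids found by the split-based fallback check (assumed input)
static SpecialSets merge_special(const std::set<int> & marked, const std::set<int> & unmatchable) {
    SpecialSets r;
    // std::set iterates in sorted order, which the set algorithms require
    std::set_union(marked.begin(), marked.end(),
                   unmatchable.begin(), unmatchable.end(),
                   std::inserter(r.merged, r.merged.end()));
    std::set_symmetric_difference(marked.begin(), marked.end(),
                                  unmatchable.begin(), unmatchable.end(),
                                  std::back_inserter(r.mismatched));
    if (!r.mismatched.empty()) {
        fprintf(stderr, "warning: %zu token(s) are special according to only one source\n",
                r.mismatched.size());
    }
    return r;
}
```

With this shape the warning stays purely diagnostic: tokenization would use `merged` either way.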
Just to be sure about our nomenclature, do you mean special tokens == added tokens? I'm mostly leaning towards the nomenclature used in HF's tokenizer.json, which contains added_tokens with the is_special attribute. From @jploski I learned that these added tokens are probably very similar to sentence_piece USER_DEFINED tokens.
And one more question: do we agree that vocab.json, merges.txt, added_tokens.json and special_tokens_map.json are probably older HF (or GPT2) serialization formats and we should find all this information in tokenizer.json for newer models (paging @apage43 too because he knows a lot about this stuff)? W.r.t. these serialization formats we also seem to have some reports indicating that we don't support both of them equally well.
Edit: just adding a hunch: is it possible that tokenizer.json is the serialization format invented for the fast HF tokenizers?
do you mean special tokens == added tokens?
Oh my, you better grab a chair because you are in for an adventure with this question :)
A really condensed version of what I learned so far:
I couldn't find any clear definition or explanation, but from what I found, there seems to be only one way all the pieces of this puzzle fall together to form a clear picture.
A vocab naturally forms a tree-like structure ( several trees actually, there is no common root ).
The tokenizer does its thing by grabbing small portions of the input text and trying to steal neighboring portions to form a valid token, rinse and repeat, until nothing can be borrowed from either left or right to merge with and form a valid token.
So basically, the tokenizer climbs that token tree from single bytes to full tokens, stepping only on valid branches ( tokens ).
Applying this idea backwards, by checking the entire vocab token by token, trying to split each token's text representation in two and checking whether both halves are still valid tokens, you can clearly see that some of the tokens in the vocab are not part of that tree family and will never be matched by the tokenizer.
This happens to match perfectly with normal tokens being marked as LLAMA_TOKEN_TYPE_NORMAL in the vocab, and tokens outside the token tree family being marked with some type other than LLAMA_TOKEN_TYPE_NORMAL.
Therefore, I'm using the term special token to refer to tokens which are "hidden" from the user and un-matchable by the tokenizer.
From a practical point of view, that subset of "special" tokens contains exactly what one would expect: <s>, </s>, <unk> and, in the case of mistral openorca, <|im_start|>, <|im_end|>, AKA tokens which are used to control the flow of inference. ( EDIT: plus byte tokens <0xXX> )
And so from the point of view of actual inference, the tokenizer has to be extended the same way, independently of the actual token type, because only the normal vs. not normal distinction matters at that point.
Any additional distinction between USER_DEFINED, CONTROL etc. can and has to be done the same way, with a tokenizer "pre-matcher", at which point per-token-type decisions or actions are trivial to implement if needed.
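The "apply the idea backwards" check described above can be illustrated with a small sketch. It is not the code from this PR: `Vocab`, `is_reachable` and `find_unmatchable` are hypothetical names, the vocab is reduced to a plain text-to-id map, and splits are tried at every byte offset (an assumption that ignores UTF-8 boundaries and merge ranks).

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the real vocab: token text -> token id.
using Vocab = std::unordered_map<std::string, int>;

// A token counts as reachable if some split of its text yields two pieces
// that are both tokens themselves. Single-byte tokens are treated as leaves.
static bool is_reachable(const Vocab & vocab, const std::string & text) {
    if (text.size() <= 1) {
        return true;
    }
    for (size_t i = 1; i < text.size(); ++i) {
        if (vocab.count(text.substr(0, i)) && vocab.count(text.substr(i))) {
            return true;
        }
    }
    return false;
}

// Collect the ids of tokens that no sequence of merges could ever produce.
static std::vector<int> find_unmatchable(const Vocab & vocab) {
    std::vector<int> ids;
    for (const auto & entry : vocab) {
        if (!is_reachable(vocab, entry.first)) {
            ids.push_back(entry.second);
        }
    }
    std::sort(ids.begin(), ids.end()); // map iteration order is unspecified
    return ids;
}
```

In a toy vocab containing "a", "b", "ab" and "<s>", only "<s>" is reported, since none of its splits ("<" + "s>", "<s" + ">") consists of two known tokens. Byte tokens such as <0xXX> would be reported the same way, which agrees with the EDIT note above.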
Thanks for taking the time! My mental image is slightly different: as I understand it, the initial training of the tokenizer leads to an initial vocabulary, and later model refinement tries to extend this basic vocabulary with added tokens without retraining the tokenizer on the original dataset. These added tokens should be intercepted before the original tokenizer interpretation. In my mental image, special tokens somehow describe the effect these added tokens have on detokenization, e.g. whether they should be displayed or not. But I might be completely off here.
Edit:
<s>,</s>,<unk> and in case of mistral openorca <|im_start|>,<|im_end|>, AKA tokens which are used to control the flow of inference.
Here I think of <s>,</s>,<unk> as original CONTROL tokens and <|im_start|>,<|im_end|> as USER_DEFINED tokens which shouldn't be visible after detokenization.
I simply do not know if there is any significance to added "normal" tokens ( added in fine tuning etc )
From what I understand, in tokenizer specifically ( ignoring detokenizer ), normal tokens added the way you say, would simply be tokenized one step further into a longer token instead of being represented by a sequence of smaller tokens.
If such added tokens are not integrated with the tree structure of the vocab, the user would not be able to use them.
If, however, those added tokens are properly coherent with the vocab, the tokenizer would match them already, as it is.
Which to me sort of seems to defeat the purpose of adding "normal" tokens this way.
I might be wrong about this part, but that still means a "tokenizer preprocessor" of some sort is needed, and this PR fits that role
I simply do not know if there is any significance to added "normal" tokens ( added in fine tuning etc )
I believe they have to be processed before the core tokenizer routine in something like prefix_match (the only exception being the already defined core tokenizer CONTROL tokens). mpt added tokens for example (from tokenizer.json):
{
"id": 0,
"content": "<|endoftext|>", <-- core, i.e CONTROL
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 1,
"content": "<|padding|>", <-- unsure about this because it is a valid GPT2 token AFAIU
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 50254,
"content": " ", <-- extension, i.e USER_DEFINED
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
[...]
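Intercepting added tokens before the core tokenizer runs, as discussed in this thread, amounts to cutting the input into fragments first. The sketch below is a minimal illustration under stated assumptions: `Fragment` and `partition` are hypothetical names, the list of special token strings is given, and the longest special token wins when several match at the same position. The raw-text fragments would then go to the normal tokenizer, while special fragments map directly to their ids.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical fragment type: either exactly one special token or a run of raw text.
struct Fragment {
    bool        is_special;
    std::string text;
};

// Split `text` so that every occurrence of a special token becomes its own fragment.
static std::vector<Fragment> partition(const std::string & text,
                                       const std::vector<std::string> & specials) {
    std::vector<Fragment> out;
    std::string pending; // raw text collected since the last special token
    size_t pos = 0;
    while (pos < text.size()) {
        size_t best = 0; // length of the longest special token starting at pos
        for (const auto & s : specials) {
            if (s.size() > best && text.compare(pos, s.size(), s) == 0) {
                best = s.size();
            }
        }
        if (best == 0) {
            pending += text[pos++];
            continue;
        }
        if (!pending.empty()) {
            out.push_back({false, pending});
            pending.clear();
        }
        out.push_back({true, text.substr(pos, best)});
        pos += best;
    }
    if (!pending.empty()) {
        out.push_back({false, pending});
    }
    return out;
}
```

The linear scan over `specials` at every position is chosen for brevity; a prefix tree or a string-offset based matcher would avoid the repeated comparisons and copies.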
I'm gonna be back at it today or tomorrow, I went on a slight tangent of my hardware cleanup because I wanted to compare this approach with reference python code, and free colab was oom-ing on me, so I'm finishing rearranging and reinstalling my servers right now, for more RAM. I already rewrote matching to not use string copy, just didn't push it yet. I just need to determine how the hardcoded space prefix should work with "segmented" input text, because I don't think it's supposed to be there if the input text starts with a special token. ( #3503 (comment) ) |
yes for anything that was trained with HF wrt added tokens vs special tokens, using the terms as they are used in the
This indicates what I feared: if we stay with Edit on second thought : regarding my question "can a |
No, in sentencepiece the "user defined symbols" are also matched as whole before the main BPE loop begins, using normalizer::PrefixMatcher: |
Appears to be working without issues, though in interactive mode in
@staviq This is ready for review, correct? |
Think this is ready to merge. Would be useful to post a command used for testing so we can run some tests
I'll re-check everything after work I definitely left special tokens enabled in There was an issue with swift build because I think default argument values don't work correctly there ( CI failed for swift ) Otherwise if @goerch doesn't have anything to add or change, this should be pretty much ready |
I cleaned up I also made If CI passes, there is only one thing left which I'm not sure if it's correct, and that's the hard-coded space prefix in Edit: Special tokens aren't meant to be exposed to the user anyway, so tokenization with special tokens should only be done intentionally, which to me seems like it implies wanting full control over tokenizer behaviour, it would let devs decide for themselves if they want space prefix or not when processing prompt template etc ? This would pretty much be the extent of this modification:
index 1f82dcb..b571d4e 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -6727,7 +6727,7 @@ static std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab &
// by modifying llm_tokenizer_x to operate with string offsets like pre-tokenizer
// and passing 'add space prefix' as bool argument
//
- auto raw_text = " " + fragment.raw_text.substr(fragment.offset, fragment.length);
+ auto raw_text = (special?"":" ") + fragment.raw_text.substr(fragment.offset, fragment.length);
#ifdef PRETOKENIZERDEBUG
fprintf(stderr,"TT: (%ld %ld %ld) '%s'\n", raw_text.length(), fragment.offset, fragment.length, raw_text.c_str());
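The effect of the one-line change in the diff above can be shown in isolation. This is a standalone sketch, not the llama.cpp function itself: `fragment_text` is a hypothetical helper, and `special` is assumed to mean "the caller asked for special tokens to be parsed".

```cpp
#include <cassert>
#include <string>

// Build the text handed to the tokenizer for one raw-text fragment.
// The space prefix is only added when special token parsing is off.
static std::string fragment_text(const std::string & raw, size_t offset, size_t length, bool special) {
    const std::string prefix = special ? "" : " ";
    return prefix + raw.substr(offset, length);
}
```

For the input "hello world" with offset 0 and length 5, the helper yields " hello" when `special` is false and "hello" when it is true.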
I played a bit with hardcoded space prefix always vs only for user input itself, and I'm getting subjectively much much better results, if the hardcoded space is only added to the user input itself and not the "prompt template" ( prefix and suffix ). Default tokenizer behavior occasionally leads to things like this, where the model gets lost and breaks the prompt format: With the modification proposed above, It pretty much never breaks the prompt format, and also stops unnecessarily outputting empty lines or breaking the answer into paragraphs: |
Are there models that can output special tokens besides EOS?
Lots of recent mistral based models using ChatML, because they are using
@staviq Which such model can I download to play with? Using ChatML for input does not imply ability of producing same special tokens in output.
@staviq Thanks for the insightful info, that explains a lot. So we've been using it wrong all this time and few people noticed. @ggerganov Well, the last time I wrote C++ code was in 2000 when I worked on the Unreal Engine, so I don't feel qualified to patch the repetition penalty. Do you want me to open an issue for that or is anyone already up for the task? @staviq (again :)) Is that a bug or a feature when the ChatML-formatted models use both? Are |
Uhmm, actually I can't reproduce that after special token handling merge, so I'm not sure now if I'm misremembering this or it was something specific I tested 3 weeks ago that I'm doing different now Assume I was wrong about this.
Just FYI: There's now an issue for it on the Transformers repo: huggingface/transformers#26902 Hopefully this repetition penalty issue will be fixed for all inference software soon...
Interestingly enough, "<|im_end|>" isn't just an added or special token, it's also the EOS token in Mistral-OpenOrca... So a multi-turn chat contains multiple EOS tokens, which by itself feels like an abuse of the concept of "EOS". But it doesn't change anything with respect to the problem of imposing an unwanted repetition penalty on such tokens. Note that this same problem also exists in chat models that use normal tokens for separating turns (the good old "reverse prompt" parameter in llama.cpp's terms, such as "### Assistant:" - I suppose those would get repetition-penalized as well). |
@jploski Yes, this has been a problem all this time. Glad it's getting the necessary attention now. I used to think that the EOS token should never be part of the prompt, instead it should only be output by the model to end inference. The inference backend would catch it, stop inferencing, and remove it before sending the output to the frontend. That way it would never show up in the next prompt, and usually the Now that the ChatML format is getting traction and might turn into a standard prompt template, the issue has become more apparent. But yes, the Same with the LLM outputting a character's name as a prefix for its message when chatting. After a while, it would fail to do so and break the chat. The solution to that was to have the frontend manipulate the prompt and add the character name itself, so the model only completes after that and never has to write the name by itself. |
I've been running into this issue in the templates. I've run several tests and it's usually always the same outcome, regardless of the model. I know this PR is adding support for special tokens, but how were the special tokens handled internally before hand? I haven't had the opportunity to dig into the code yet. |
I experimented with Mistral-7B-OpenOrca. It indeed outputs im_end. |
Yep, all ChatML models made by axolotl use <|im_end|> to separate turns, as well as for the EOS token. I haven't noticed an issue with 13~ turns when inferencing Hermes 2, but I think it's mostly due to the long responses it gives in my turns, which makes the start/end/assistant/user tokens that show up each turn less disastrous. I think this effect will be most prominent in short-token turns, i.e. roleplaying or a model that is trained to answer in as few words as possible, but no doubt this issue has been affecting all multi-turn models in some way, whether subtly or not, since at least the first sharegpt/vicuna models came around, without anyone being aware until very recently, probably because EOS tokens appearing throughout all turns in ChatML exacerbate the problem. Also tagging huggingface/text-generation-inference#1170 here so TGI devs can see the full context between hf tgi and here. Someone should probably loop in VLLM peeps if we want to get this covered across the board, I think they only have presence penalty but I assume the same issue happens with that?
Exactly. I'm one of those chatters/roleplayers who have been experiencing these issues (responses getting longer, less coherent, until eventually derailing) since the beginning (Alpaca/Vicuna). Used to think it's just a limit of the technology and worked around it by setting various permutations of the EOS token as stopping strings. So glad this is finally getting the attention it deserves and hopefully will be fixed globally. That should make our local/open LLMs much more intelligent and useful once they work as intended and no longer hallucinate/derail that much. Especially the smaller models that tend to require closer adherence to the trained/tuned prompt format than bigger models which tend to cope better (but even they should profit). |
* 'master' of github.com:ggerganov/llama.cpp:
  fix embeddings when using CUDA (ggerganov#3657)
  llama : avoid fprintf in favor of LLAMA_LOG (ggerganov#3538)
  readme : update hot-topics & models, detail windows release in usage (ggerganov#3615)
  CLBlast: Fix temporary buffer size for f16 conversion (wsize)
  train-text-from-scratch : fix assert failure in ggml-alloc (ggerganov#3618)
  editorconfig : remove trailing spaces
  server : documentation of JSON return value of /completion endpoint (ggerganov#3632)
  save-load-state : fix example + add ci test (ggerganov#3655)
  readme : add Aquila2 links (ggerganov#3610)
  tokenizer : special token handling (ggerganov#3538)
  k-quants : fix quantization ranges (ggerganov#3646)
  llava : fix tokenization to not add bos between image embeddings and user prompt (ggerganov#3645)
  MPT : support GQA for replit-code-v1.5 (ggerganov#3627)
  Honor -ngl option for Cuda offloading in llava (ggerganov#3621)
I doubt it will have that much effect because if repetition penalty messing up generation was the issue, we should also be seeing it for newline tokens (much more frequent than EOS / next turn tokens). The derailment is more likely due to paying equal attention to their own (garbage) outputs as anything else and eventually sampling more and more out of training distribution as a result. So yes, I'd say it is an inherent limitation of the technology. Hallucinations are also out-of-distribution sampling (to prevent them you'd either have to specifically train on such garbage samples or maybe detect occurrences, maybe by scoring semantic proximity of multiple outputs of the same prompt; e.g. if the model is happy to generate "yes" and "no" to the same question with same probability, it's safe to infer it does not know the answer). As for the repetition penalty, it is not really all that much needed if you sample from the entire distribution (high top_p) rather than greedily. The reason why repetitions happen with greedy sampling: imagine you have a biased coin (your LLM) and with each prediction you only pick the most probable outcome. With such a sampling method you'd only ever get heads or tails, but never a realistic mix of them generated by such a coin. (Note: top_k can also be very greedy if all the top k tokens are synonyms. This description also generalizes to multi-token generation, imagine throwing n different biased coins; then you get the GPT-style sentence-level loops). Finally, the repetition seems to also have something to do with the amplitude swings in attention scores in the positional embeddings (both original GPT and RoPE), based on anecdotal evidence that in a model with different positional embeddings (RetNet + XPos), repetitions also look different (individual token stutter - but followed by auto-recovery, no endless self-reinforcing looping).
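The biased-coin argument above can be made concrete with a toy experiment. This is only an illustration of the analogy, not of any sampler in llama.cpp; both function names are hypothetical.

```cpp
#include <cassert>
#include <random>

// Greedy decoding: always pick the more probable side, so all flips agree.
static int count_heads_greedy(double p_heads, int n) {
    return p_heads >= 0.5 ? n : 0;
}

// Sampling from the full distribution: heads appear roughly p_heads * n times.
static int count_heads_sampled(double p_heads, int n, unsigned seed) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution coin(p_heads);
    int heads = 0;
    for (int i = 0; i < n; ++i) {
        heads += coin(rng) ? 1 : 0;
    }
    return heads;
}
```

For a coin with a 0.6 chance of heads, 1000 greedy picks give 1000 heads, while 1000 sampled flips give a count near 600, the "realistic mix" that greedy selection can never produce.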
Very insightful post! Just to add my observations:
I think that's also visible. I used examples split into multiple paragraphs, and the splitting became inconsistent, probably because the newline became penalized. Same with other characters like dashes, quotes, asterisks - when used a lot, the model would suddenly switch, using different unicode symbols for dashes, quotes, or even switch from I think we need a better solution than simple (stupid) repetition penalty. There's also relevant discussion here: huggingface/transformers#26902 |
// TODO: It is unclear (to me) at this point, whether special tokens are guaranteed to be of a deterministic type,
// and will always be correctly labeled in 'added_tokens.json' etc.
I might have worded that code comment a bit wrong, I didn't mean added_tokens.json in particular, but rather source model metadata and the way it's processed by import scripts in general.
BPE models are still problematic: at least a couple of the ones I tested this with had no token types imported/defined at all. So my suspicions were correct, and at this point in time that TODO is still reasonably valid.
Though the method used in this PR to detect special tokens, was/is meant to only be used as a fallback, eventually.
@staviq Are we sure this was the right call? We should look into what HF transformers does. Command: Before this PR (with a leading space, GitHub doesn't show it):
After this PR (no leading space):
The problem was that making special tokens work, allowed proper prompt formats to also work, except when the space prefix was added to So it was a tradeoff between making prompt formats work better and making There was some talk about splitting/changing Feel free to revert it, if it's causing problems. I can't think of a way to determine whether
If a space is desirable but missing, the user can add it. If a space is added automatically but is undesirable, the user is out of luck. |
@staviq As for llama, this is what HF transformers does (I assume we should match the legacy=False behavior):
"""
Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the
first token is special.
"""
if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
    tokens = tokens[1:]
And an interesting note about add_dummy_prefix (changed in this PR):
"""
Returns a tokenized string.
We de-activated the `add_dummy_prefix` option, thus the sentencepiece internals will always strip any
SPIECE_UNDERLINE. For example: `self.sp_model.encode(f"{SPIECE_UNDERLINE}Hey", out_type = str)` will give
`['H', 'e', 'y']` instead of `['▁He', 'y']`. Thus we always encode `f"{unk_token}text"` and strip the
`unk_token`. Here is an example with `unk_token = "<unk>"` and `unk_token_length = 4`.
`self.tokenizer.sp_model.encode("<unk> Hey", out_type = str)[4:]`.
"""
I think it is important to fix this regression for models that do not use special tokens in the prompt (including all of the ones that I use).
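For comparison, the rule from the quoted Python snippet (drop the leading prefix token when the token right after it is special) can be written in C++ along these lines. This is a sketch for discussion only: `strip_prefix_before_special` is a hypothetical helper operating on token strings, not on llama.cpp token ids.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// U+2581 ("▁"), the sentencepiece whitespace marker, spelled out as UTF-8 bytes.
static const std::string SPIECE_UNDERLINE = "\xE2\x96\x81";

// Remove the dummy prefix token if it is immediately followed by a special token.
static std::vector<std::string> strip_prefix_before_special(
        std::vector<std::string> tokens,
        const std::unordered_set<std::string> & special) {
    if (tokens.size() > 1 && tokens[0] == SPIECE_UNDERLINE && special.count(tokens[1])) {
        tokens.erase(tokens.begin());
    }
    return tokens;
}
```

A prompt that starts with ordinary text keeps its prefix, so models that never see special tokens in the prompt would tokenize as they did before.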
If you don't want to add space after special tokens, only add it to a fragment if it precedes all special tokens except BOS, i.e. on the first iteration over |
Hey 🤗 The |
Is there any documentation about that? |
Special token handling, based on #1931, #3475
Works, but it's definitely not ready, just posting for feedback. Has some testing code meant to be removed before undrafting. Not optimized at all, just completed to the point of working.
main for prompt format arguments