Unicode support #11
Comments
I tried to work out how to implement Unicode support and I am not getting far. From everything I can see it seems to work, but the output contains random characters. Here is a prompt in text format for easier copy/paste
The tokenization above seems correct, judging by the tokens I dumped from the parsing code
And the output I get is
So it is outputting some characters but some �
|
I found a range of unprintable tokens, from ID 131 to 258. If I remove those from the vocab, a prompt seems to generate in Japanese, but I don't know Japanese!
Response
Google translate
Is it possible? |
Fixes ggerganov#11. This fixes a Japanese prompt I was attempting to run, e.g.: `./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 -n 512 -p $'人生の意味は'`
Output before change: `人生の意���、フロントカードに���いてる。 2019年3月 © All Rights Reserved. [end of text]`
So it is outputting some characters, but some are �.
Output after change: `人生の意は、一人が一人ということであります。は安部が立していたので、去からは一人の人にれるのはにとどまったのですが、そう`
The Japanese text you quote here is fairly ungrammatical, in a way that suggests (on top of some other issues that I figure are simply due to LLaMA not having learned the language very well) that some words are simply missing. Where were the unprintable tokens that you removed from this? |
I removed "", "�", "��" from the token dictionary, not from a sentence; that's not how it works. A large chunk of the "token dictionary" in the model points to the unprintable character �. I remove those tokens from the dictionary of tokens the program is using. I suspect the model may have learned some corrupted text during training, so when it sees Japanese characters it confuses them with garbled text it has come across, which makes unprintable characters a likely candidate for the next word. Just my hypothesis. Here is the pull request with the code change I made to make this work. |
For anyone interested here is the chunk in the 13B model file. Not sure if all models contain the same token grammars
Many token IDs point to |
Nice find! Given the constantly changing encoding history of CJK (Chinese, Japanese, Korean), there is a good chance that the training data contained wrongly encoded non-ASCII text. Simply removing those tokens is fine. |
I'm not sure you have applied the code change. I cannot try your prompt since it's an image; would you mind pasting it? I think you have to check out my repo, because my code is not merged here yet. https://github.com/beiller/llama.cpp Try cloning and building in a different folder. It also includes the repeat penalty change. Again, my approach is not about removing the characters; it's a code change that will output something very different (and more accurate) |
Here are some more examples:
Outputs:
|
Thank you. I built your repo and tested again. The unprintable characters are gone, but the meaning of the generated text is gone too, as shown below. |
There is another bug: the prompt is truncated if it is Chinese, as in #11 (comment) |
I just tried Chinese as well, and yes, it's truncated. It's possible that it doesn't understand other languages. It seems to be missing some Chinese character tokens. Further up the code chain, in the model conversion code, I see the following. Before I write more: @ggerganov, thank you so much for putting this all together. I wonder if some tokens are getting lost. But maybe not, since there are 32000 tokens (and that appears to be how Google's tokenizer works). I will try to research and see if some tokens are "lost in translation"!
@ggerganov you are too hard on yourself. How can you be wrong when so many tokens are present :P |
I found the problem via some scripting. The tokenizer works differently than we are using it. Also, token
So it seems we need to use this tokenizer in the C++ code; the current method of tokenizing is not correct. I can attempt it, but it will require adding sentencepiece. The crux of the issue, if I can try to explain it, is that the C++ code tries to find the best-matching single token in the input text. But as we see here, this character actually needs to be encoded as multiple tokens! Strange. I think the tokens can be removed from the model files in the conversion script, and we should just use the sentencepiece C++ code. Thoughts? |
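I dumped the tokenizer.model file to text with:

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file='tokenizer.model')
    vocab_list = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]
    with open('vocab.txt', 'w', encoding='utf-8') as f:
        f.write('\n'.join(vocab_list))  # assumed body; the original snippet was cut off here

I did not find some characters, such as '篇', '雨', '许'. |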
@wizd see my comment. It seems to be more complex: the character will "translate" to multiple tokens, but I believe it can actually support Chinese with some big refactors :P I may have time to make it work, we will see! Edit: I think we just found out what that big chunk of unprintable characters is for :) [29871, 234, 178, 138] translates to: But in actuality it should be:
|
trying to understand it... |
seems we should use this library to tokenize: https://github.com/google/sentencepiece |
@wizd yes, that is correct. The code also assumes a mapping of 1 "word" to 1 token, which isn't the case. Also, a "word" here is not a word; it's more like a word piece. |
Yep, and we need to make the code UTF-8 aware: https://github.com/facebookresearch/llama/blob/main/FAQ.md#4-other-languages |
The code has no problem with UTF-8 so far. I am working on a very hacky solution right now :) |
I actually got it working in a very hacky way. Example:
Output:
Output (admittedly cherry-picked; sometimes it contains half English):
|
wow, you are so cool! @beiller |
Another interesting outcome, it actually can output emojis now!
Sadly the joke was not funny or even a joke. |
This output shows some misunderstanding; maybe there is still an encoding issue? |
@beiller |
@beiller key thing to be aware of: tokenizer works on bytes, not on characters. so:
|
Some research: I used sentencepiece to tokenize an input and dumped it. I got this: piece: ▁ main: prompt: '篇幅已经' "篇幅" is not found because it is not stored as such in the vocab table, but as <0xE7>, <0xAF> ... etc. |
With sentencepiece, which is full of magic numbers, I can get the right result:
main: prompt: '篇幅已经'
main: number of tokens in prompt = 10
1 -> '< s>'
29871 -> '▁'
234 -> '<0xE7>'
178 -> '<0xAF>'
138 -> '<0x87>'
232 -> '<0xE5>'
188 -> '<0xB9>'
136 -> '<0x85>'
31290 -> '已'
31412 -> '经'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
< s>▁<0xE7><0xAF><0x87><0xE5><0xB9><0x85>已经<0xE6><0x8A><0xB5>达了50万,我们再期望完成到全部五十平民的目标<0xE8><0xAE><0xA9>最多能安放自<0xE5><0xB7><0xB1>去生活。<0x0A>如果你有人看<0xE9><0x80><0x99>段情景不好▁就可以关注在线<0xE6><0x92><0xAD>客(部分)▁这里为我们一起同行,因此地方▁全是在他实现<0xE5><0xA5><0xB9>的目标。参与<0xE7><0xAF><0x87><0xE5><0x8D><0xB3>将从未来开始开展!< /s> [end of text] |
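The dump above suggests that the byte-fallback pieces sit at a fixed offset in the vocabulary: `<0xE7>` is 234, `<0xAF>` is 178 and `<0x87>` is 138, so the ID appears to be the byte value plus 3. Here is a minimal Python sketch of that fallback, assuming the offset holds for every byte. `TOY_VOCAB` and `encode_with_byte_fallback` are hypothetical names for illustration, not part of llama.cpp or sentencepiece, and the sketch ignores the multi-character piece merging that real sentencepiece performs.

```python
# Offset inferred from the dump above: '<0xE7>' (0xE7 == 231) has ID 234.
BYTE_TOKEN_OFFSET = 3

# Hypothetical toy vocabulary holding only the whole-character pieces seen above.
TOY_VOCAB = {"▁": 29871, "已": 31290, "经": 31412}


def encode_with_byte_fallback(text, vocab=TOY_VOCAB):
    """Map each character to its vocab ID, or to one byte token per UTF-8 byte."""
    ids = []
    for ch in text:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            # The character has no piece of its own: emit its UTF-8 bytes as tokens.
            ids.extend(b + BYTE_TOKEN_OFFSET for b in ch.encode("utf-8"))
    return ids


print(encode_with_byte_fallback("▁篇幅已经"))
```

Apart from the leading `< s>` token (ID 1), the printed IDs should line up with the dump: three byte tokens each for '篇' and '幅', then single tokens for '已' and '经'.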
@wizzard0 the tokenizer works on sentence pieces. The tokenizer used is here: https://github.com/google/sentencepiece But the code here is not using it; it is using a reverse-engineered method, which actually works great most of the time. The problem is that the code assumes a mapping of 1 integer to 1 "string". To fix this we need to either reverse engineer sentencepiece or include it in the repo. It's a small codebase; I have a branch that can compile it and it's working well. Reverse engineering it will need protobuf, or reverse engineering protobuf, because the token model is stored in |
@beiller llama uses sentencepiece in BPE mode. It does not care about characters.
So you might need multiple tokens to build 1 printable character, and sometimes they won’t even add up to valid UTF-8. The same as with GPT3/ChatGPT. But that is not a problem. Just forget the characters and work with the bytes directly. |
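To illustrate "work with the bytes directly" on the output side, here is a hedged Python sketch that buffers byte tokens and only emits text once a complete UTF-8 character has arrived. It uses the standard library's incremental decoder; `stream_decode` is a hypothetical helper, and the byte-plus-3 ID offset is an assumption inferred from the token dump earlier in the thread, not something taken from llama.cpp.

```python
import codecs

BYTE_TOKEN_OFFSET = 3  # assumption inferred from the dump earlier in the thread


def stream_decode(byte_token_ids):
    """Yield text only once enough byte tokens have arrived to form a character."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for token_id in byte_token_ids:
        raw = bytes([token_id - BYTE_TOKEN_OFFSET])
        # Returns '' while a multi-byte sequence is still incomplete.
        yield decoder.decode(raw)
    # Flush: an unfinished trailing sequence is replaced with U+FFFD here.
    yield decoder.decode(b"", final=True)


chunks = list(stream_decode([234, 178, 138]))
print(chunks)
```

Printing each token as soon as it is sampled would show � for every partial byte; holding bytes back until the sequence completes avoids that without changing what the model generates.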
@wizzard0 yes it falls back to byte encoding and I understand all of that. But in the code / model / dictionary we are using here, there is no 0x85 mapping to tokenID 136 (136 being what the model expects as an input). All the mappings are higher up the thread and they all map to 0xEFBFBD
There's no way for us to map the Unicode hex digits (0x85) to the proper ID (136) without sentencepiece. @wizd I believe even you had to use sentencepiece to "preprocess" your prompt in order to get the mapping 0x85 -> 136 correct? Edit maybe there's a hacky way to find |
meh, i guess it’s simpler to code than to explain >_< it’s dead simple. just please please forget characters, words, regexes etc. the model works with raw bytes. e.g. the dictionary cannot be encoded as json because tokens won't be valid utf8 strings. have to go to sleep rn, maybe will come back tomorrow and write this |
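One plausible explanation for every high-byte token showing up as 0xEFBFBD, not confirmed anywhere in this thread, is that the vocabulary strings went through a lossy UTF-8 decode at some point. A lone byte at or above 0x80 is never valid UTF-8 by itself, so such a decode replaces it with U+FFFD, and the original byte value is lost. A small Python sketch of that effect:

```python
# A lone byte >= 0x80 is not valid UTF-8 on its own, so a lossy decode
# replaces it with U+FFFD, whose UTF-8 encoding is EF BF BD.
lone = bytes([0x85])
as_text = lone.decode("utf-8", errors="replace")
print(as_text.encode("utf-8").hex())

# Every lone byte from 0x80 to 0xFF collapses to the same replacement
# character, so the byte value cannot be recovered from the string.
distinct = {bytes([b]).decode("utf-8", errors="replace") for b in range(0x80, 0x100)}
print(len(distinct))
```

If that is what happened, the byte values would have to be recovered from the token IDs rather than from the stored strings.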
I have a branch to include sentencepiece #66 |
Made some improvements in #73! |
I think UTF-8 encoding is fixed in #87 |
Thank you for creating such a great inference engine, which has a 10x speedup.
Please add Unicode support so that other languages display properly.