
The new tokenizer no longer encodes spaces properly #2721

Closed
4 tasks done
jxy opened this issue Aug 22, 2023 · 1 comment · Fixed by #2724
Labels
high priority (Very important issue)

Comments

@jxy
Contributor

jxy commented Aug 22, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Reference output from `llama.tokenizer` (the original Python tokenizer):

Python 3.11.4 (main, Jul  5 2023, 13:45:01) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from llama.tokenizer import Tokenizer
>>> tokenizer = Tokenizer('tokenizer.model')
>>> tokenizer.encode('Hello', bos=False, eos=False)
[15043]
>>> tokenizer.encode(' Hello', bos=False, eos=False)
[29871, 15043]
>>> tokenizer.encode('  Hello', bos=False, eos=False)
[259, 15043]
>>> tokenizer.encode('   Hello', bos=False, eos=False)
[1678, 15043]
>>> tokenizer.encode('    Hello', bos=False, eos=False)
[268, 15043]
>>> tokenizer.encode('    Hello\n    Hello', bos=False, eos=False)
[268, 15043, 13, 1678, 15043]

Previous version (the server from PR #2306, before the GGUF merge):

$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"Hello"}' 
{"tokens":[15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":" Hello"}'
{"tokens":[29871,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"  Hello"}'
{"tokens":[259,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"   Hello"}'
{"tokens":[1678,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"    Hello"}'
{"tokens":[268,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"    Hello\n    Hello"}'
{"tokens":[268,15043,13,1678,15043]}

Current Behavior

$ curl http://localhost:28888/tokenize --header "Content-Type: application/json" --data '{"content":"Hello"}'
{"tokens":[15043]}
$ curl http://localhost:28888/tokenize --header "Content-Type: application/json" --data '{"content":" Hello"}'
{"tokens":[29871,15043]}
$ curl http://localhost:28888/tokenize --header "Content-Type: application/json" --data '{"content":"  Hello"}'
{"tokens":[29871,15043]}
$ curl http://localhost:28888/tokenize --header "Content-Type: application/json" --data '{"content":"   Hello"}'
{"tokens":[29871,15043]}
$ curl http://localhost:28888/tokenize --header "Content-Type: application/json" --data '{"content":"    Hello"}'
{"tokens":[29871,15043]}
$ curl http://localhost:28888/tokenize --header "Content-Type: application/json" --data '{"content":"    Hello\n    Hello"}'
{"tokens":[29871,15043,13,15043]}
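The regression is easiest to state as a lost round-trip property. Using an id→text mapping derived from the transcripts above (illustrative only, not the llama.cpp API), and mimicking SentencePiece's removal of the dummy-prefix space on decode, the old output reproduces the input while the current output collapses the indentation:

```python
# id -> piece text, derived from the transcripts above (illustrative only).
# 15043 carries the dummy-prefix space ("▁Hello"); each space id carries one
# space per "▁" in its piece.
id_to_text = {15043: " Hello", 29871: " ", 259: "  ",
              1678: "   ", 268: "    ", 13: "\n"}

def detok(ids):
    """Join piece texts, then drop the dummy-prefix space like SentencePiece."""
    s = "".join(id_to_text[i] for i in ids)
    return s[1:] if s.startswith(" ") else s

previous = detok([268, 15043, 13, 1678, 15043])  # old /tokenize output
current = detok([29871, 15043, 13, 15043])       # current /tokenize output
# previous == "    Hello\n    Hello"  (round-trips the input)
# current  == " Hello\n Hello"        (runs of spaces collapsed)
```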
@klosax
Contributor

klosax commented Aug 22, 2023

Tagging @goerch for this.

@ggerganov added the high priority label Aug 22, 2023
goerch added a commit to goerch/llama.cpp that referenced this issue Aug 22, 2023
@klosax linked a pull request Aug 22, 2023 that will close this issue
ggerganov pushed a commit that referenced this issue Sep 13, 2023
* Fix for #2721

* Reenable tokenizer test for LLaMa

* Add `console.cpp` dependency

* Fix dependency to `common`

* Fixing wrong fix.

* Make console usage platform specific

Work on compiler warnings.

* Adapting makefile

* Remove trailing whitespace

* Adapting the other parts of the makefile

* Fix typo.
slaren pushed a commit that referenced this issue Sep 16, 2023
…izer-1 (#3170)

* Fix for #2721

* Reenable tokenizer test for LLaMa

* Add `console.cpp` dependency

* Fix dependency to `common`

* Fixing wrong fix.

* Make console usage platform specific

Work on compiler warnings.

* Adapting makefile

* Remove trailing whitespace

* Adapting the other parts of the makefile

* Fix typo.

* Fixing the last deviations from sentencepiece indicated by test-tokenizer-1

* Simplify logic

* Add missing change...

* Fix ugly compiler warning

* llama_tokenize should accept strings containing NUL now

* Adding huichen's test case
pkrmf pushed two commits to morlockstudios-com/llama.cpp that referenced this issue Sep 26, 2023
3 participants