
Reenable tokenizer test for LLaMa #3096

Merged
12 commits merged into ggerganov:master on Sep 13, 2023

Conversation

goerch (Collaborator) commented on Sep 9, 2023

With these changes I see the following output for test-tokenizer-1:

main : error: token 3 detokenizes to ><(1) but tokenization of this detokenizes to ><(0)
main : error: codepoint 0 detokenizes to ><(0) instead of ><(1)
main : error: codepoint 9601 detokenizes to > <(1) instead of >▁<(3)

compared to the original tokenizer output:

9601 >▁< > <

I believe the problems with token 3 and codepoint 0 are due to the C interface. I also think we still have problems with sentencepiece bytes above 127 (compare the decoding with the ground truth), which I'm unsure how to solve.
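For context, the failing checks are round-trip properties: every token should detokenize to a piece whose re-tokenization detokenizes back to the same piece, and every codepoint's UTF-8 encoding should survive a tokenize/detokenize cycle. Below is a minimal sketch of these two checks; the byte-level `tokenize`/`detokenize` are toy stand-ins (not the llama.cpp API or the actual test harness), included only so the sketch compiles and runs.

```cpp
// Minimal, self-contained sketch of the round-trip property that
// test-tokenizer-1 exercises. The byte-level tokenize()/detokenize() below
// are toy stand-ins (NOT the llama.cpp API); the real test drives the
// library through its C interface instead.
#include <cstdio>
#include <string>
#include <vector>

// Toy stand-in: one token per byte.
static std::vector<int> tokenize(const std::string & text) {
    std::vector<int> toks;
    for (unsigned char c : text) {
        toks.push_back(c);
    }
    return toks;
}

static std::string detokenize(const std::vector<int> & tokens) {
    std::string out;
    for (int t : tokens) {
        out.push_back(static_cast<char>(t));
    }
    return out;
}

// Property 1: detokenizing a token and re-tokenizing the resulting piece
// must detokenize back to the same piece.
static bool check_token(int token) {
    const std::string piece = detokenize({token});
    const std::string again = detokenize(tokenize(piece));
    if (again != piece) {
        fprintf(stderr, "error: token %d detokenizes to >%s<(%zu) but tokenization of this detokenizes to >%s<(%zu)\n",
                token, piece.c_str(), piece.size(), again.c_str(), again.size());
        return false;
    }
    return true;
}

// Property 2: a codepoint's UTF-8 encoding must survive a
// tokenize/detokenize cycle.
static bool check_codepoint(int cp, const std::string & utf8) {
    const std::string out = detokenize(tokenize(utf8));
    if (out != utf8) {
        fprintf(stderr, "error: codepoint %d detokenizes to >%s<(%zu) instead of >%s<(%zu)\n",
                cp, out.c_str(), out.size(), utf8.c_str(), utf8.size());
        return false;
    }
    return true;
}

int main() {
    bool ok = check_token('a');                   // passes trivially with the toy tokenizer
    ok &= check_codepoint(9601, "\xE2\x96\x81");  // U+2581 '▁', the sentencepiece space marker
    return ok ? 0 : 1;
}
```

With a real sentencepiece tokenizer, bytes above 127 are where byte-fallback pieces and multi-byte UTF-8 sequences can diverge, which matches the failure mode described above.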

ggerganov merged commit 71ca2fa into ggerganov:master on Sep 13, 2023
pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request on Sep 26, 2023

Reenable tokenizer test for LLaMa (ggerganov#3096)

* Fix for ggerganov#2721

* Reenable tokenizer test for LLaMa

* Add `console.cpp` dependency

* Fix dependency to `common`

* Fixing wrong fix.

* Make console usage platform specific

Work on compiler warnings.

* Adapting makefile

* Remove trailing whitespace

* Adapting the other parts of the makefile

* Fix typo.