Reenable tokenizer test for LLaMa #3096

goerch · 2023-09-09T10:33:04Z

With these changes I see the following output for `test-tokenizer-1`:

main : error: token 3 detokenizes to ><(1) but tokenization of this detokenizes to ><(0)
main : error: codepoint 0 detokenizes to ><(0) instead of ><(1)
main : error: codepoint 9601 detokenizes to > <(1) instead of >▁<(3)

compared to the original tokenizer output:

9601 >▁< > <
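The number in parentheses in the log lines above looks like the byte length of the detokenized string (this is an assumption read off the values shown, not something the test output states). A minimal Python sketch, independent of llama.cpp, that checks the byte lengths involved; the helper name `utf8_len` is made up for illustration:

```python
def utf8_len(codepoint: int) -> int:
    """Return the number of bytes UTF-8 needs for a single codepoint."""
    return len(chr(codepoint).encode("utf-8"))

# Codepoint 9601 is U+2581 (LOWER ONE EIGHTH BLOCK), the character
# sentencepiece uses as its whitespace marker.
meta = 9601
print(hex(meta), utf8_len(meta))  # 0x2581 3

# A plain space and the NUL codepoint each occupy a single byte.
print(utf8_len(ord(" ")), utf8_len(0))  # 1 1
```

Under that reading, `>▁<(3)` is the three-byte marker coming back unchanged, while `> <(1)` is the same position detokenized to a one-byte space.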

I believe the problems with token 3 and codepoint 0 are caused by the C interface. I also think we still have problems with sentencepiece bytes above 127 (compare the decoding with the ground truth), and I'm unsure how to solve them.
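One plausible way a C interface could produce the `(1)` versus `(0)` mismatch, assuming the string passes through a NUL-terminated `char *` somewhere along the way: a single NUL byte is indistinguishable from an empty string. The sketch below only illustrates that general C-string behaviour with Python's `ctypes`; `through_c_string` is a hypothetical helper and does not call any llama.cpp function.

```python
import ctypes

def through_c_string(data: bytes) -> bytes:
    """Round-trip bytes through a NUL-terminated C buffer.

    create_string_buffer copies the bytes and appends a terminating NUL;
    .value reads back only up to the first NUL, like strlen-based C code.
    """
    buf = ctypes.create_string_buffer(data)
    return buf.value

# A one-byte string holding NUL comes back with length 0.
print(len(b"\x00"), len(through_c_string(b"\x00")))  # 1 0

# Bytes without an embedded NUL survive unchanged.
marker = "\u2581".encode("utf-8")
print(through_c_string(marker) == marker)  # True
```

An interface that returns an explicit length alongside the buffer would not lose the NUL byte; whether that is the right fix here is not settled by this PR.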

Work on compiler warnings.

…ov#3096)

* Fix for ggerganov#2721
* Reenable tokenizer test for LLaMa
* Add `console.cpp` dependency
* Fix dependency to `common`
* Fixing wrong fix.
* Make console usage platform specific

Work on compiler warnings.

* Adapting makefile
* Remove trailing whitespace
* Adapting the other parts of the makefile
* Fix typo.

goerch added 12 commits August 22, 2023 21:37

Fix for ggerganov#2721

3d59f50

Merge branch 'master' of https://github.com/goerch/llama.cpp

84220df

Merge branch 'ggerganov:master' into master

9a953a4

Reenable tokenizer test for LLaMa

89a7277

Add console.cpp dependency

52c9ecf

Fix dependency to common

4ee2152

Fixing wrong fix.

e903d5f

Make console usage platform specific

96533e0

Work on compiler warnings.

Adapting makefile

28b7494

Remove trailing whitespace

516a0d5

Adapting the other parts of the makefile

75a20d5

Fix typo.

16bf5f2

ggerganov approved these changes Sep 13, 2023

ggerganov merged commit 71ca2fa into ggerganov:master Sep 13, 2023

This was referenced Sep 14, 2023

GGUF #2398

Merged

Fixing the last deviations from sentencepiece indicated by test-tokenizer-1 #3170

Merged
