Unexpected behavior in CharTokenizer when setting split_with_space=True and non_lang_syms is not None #2707

mlxu995 · 2025-03-07T03:41:20Z

An unexpected empty character appears in the output when using CharTokenizer with split_with_space=True to process text containing non-language symbols.
Input text: "你好问问 <NIHAO_WENWEN>" (characters separated by spaces, <NIHAO_WENWEN> is a non-language symbol)
Expected output: ['你', '好', '问', '问', '<NIHAO_WENWEN>']
Actual output: ['你', '好', '问', '问', '', '<NIHAO_WENWEN>'] (contains an unexpected empty character)

I would like to change `parts = [w for w in parts if len(w.strip()) > 0]` to `parts = [w.strip() for w in parts if len(w.strip()) > 0]` at line 42 of char_tokenizer.py.

This change strips leading and trailing spaces from every part, not only the one in this example, and I am not sure whether that has side effects.
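For reference, here is a minimal standalone sketch that reproduces the behavior described above. It is not the actual WeNet code: the regex, the `tokenize` helper, and the `strip_parts` flag are hypothetical stand-ins. It assumes that the tokenizer splits the line on a non-language-symbol pattern and then calls `part.split(" ")` on the remaining text, and that the input has its characters separated by single spaces.

```python
import re

# Hypothetical pattern for non-language symbols such as <NIHAO_WENWEN>.
# The capturing group makes re.split keep the matched symbols in the result.
NON_LANG_SYMS_PATTERN = re.compile(r"(<[^<>]+>)")


def tokenize(line, strip_parts):
    """Split a line into tokens, optionally stripping each part first."""
    parts = NON_LANG_SYMS_PATTERN.split(line.strip())
    if strip_parts:
        # Proposed change: strip whitespace from every kept part.
        parts = [w.strip() for w in parts if len(w.strip()) > 0]
    else:
        # Current behavior: keep the parts as they are, including the
        # trailing space that sits just before the symbol.
        parts = [w for w in parts if len(w.strip()) > 0]

    tokens = []
    for part in parts:
        if NON_LANG_SYMS_PATTERN.fullmatch(part):
            tokens.append(part)
        else:
            # A trailing space makes split(" ") emit an empty string.
            tokens.extend(part.split(" "))
    return tokens


# Assumed input: characters separated by single spaces.
text = "你 好 问 问 <NIHAO_WENWEN>"
before = tokenize(text, strip_parts=False)
after = tokenize(text, strip_parts=True)
print(before)
print(after)
```

Under these assumptions, the empty token comes from the space left between the last character and the symbol. Stripping each part removes it, and it would also drop any leading or trailing space that some other code path might want to keep.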


mlxu995 · 2025-03-07T03:49:52Z

@Mddct Zhou, could you take a look at this?

Mddct · 2025-03-07T14:31:52Z

Please open a PR to fix it, and while you are at it, test whether it still behaves correctly when not splitting.

mlxu995 · 2025-03-10T03:58:02Z

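> Please open a PR to fix it, and while you are at it, test whether it still behaves correctly when not splitting.

It behaves as expected. Also, without split, English is likewise not supported.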

mlxu995 linked a pull request Mar 10, 2025 that will close this issue

[fix] minor fix char_tokenizer.py #2709

Open
