An unexpected empty string appears in the output when CharTokenizer is used with split_with_space=True to process text containing non-language symbols.
Input text: "你 好 问 问 <NIHAO_WENWEN>" (characters separated by spaces, <NIHAO_WENWEN> is a non-language symbol)
Expected output: ['你', '好', '问', '问', '<NIHAO_WENWEN>']
Actual output: ['你', '好', '问', '问', '', '<NIHAO_WENWEN>'] (contains an unexpected empty string)
I would like to change
parts = [w for w in parts if len(w.strip()) > 0]
to
parts = [w.strip() for w in parts if len(w.strip()) > 0]
in char_tokenizer.py, line 42.
This modification changes any part that has leading or trailing spaces, and I am not sure whether it has side effects.
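To make the report easier to reproduce, here is a minimal standalone sketch of the behavior. It is not the library's code: the regex NON_LANG_SYMS_PATTERN, the function text2tokens_sketch, and its strip_parts flag are hypothetical stand-ins, and the sketch assumes that the tokenizer first splits the line around non-language symbols and then splits each remaining part on single spaces.

```python
import re

# Hypothetical pattern: treats anything in angle brackets as a non-language symbol.
# The real tokenizer's pattern may differ.
NON_LANG_SYMS_PATTERN = re.compile(r"(<[^<>]+>)")


def text2tokens_sketch(line, strip_parts):
    """Split around non-language symbols, then split each other part on spaces.

    strip_parts=False mimics the current filter; strip_parts=True mimics the
    proposed change. Both branches are assumptions about the real code path.
    """
    parts = NON_LANG_SYMS_PATTERN.split(line.strip())
    if strip_parts:
        # Proposed: drop blank parts and strip the ones that are kept.
        parts = [w.strip() for w in parts if len(w.strip()) > 0]
    else:
        # Current: drop blank parts but keep surrounding spaces.
        parts = [w for w in parts if len(w.strip()) > 0]

    tokens = []
    for part in parts:
        if NON_LANG_SYMS_PATTERN.fullmatch(part):
            tokens.append(part)  # keep the symbol as a single token
        else:
            tokens.extend(part.split(" "))  # a trailing space yields ''
    return tokens


text = "你 好 问 问 <NIHAO_WENWEN>"
before = text2tokens_sketch(text, strip_parts=False)
after = text2tokens_sketch(text, strip_parts=True)
print(before)
print(after)
```

In this sketch the part before the symbol is "你 好 问 问 " with a trailing space, so splitting it on " " produces the extra empty string; stripping each part first removes it. One thing worth checking in the real code is the path where split_with_space is False: if spaces inside a part are meaningful there, stripping would also drop a space that sits directly next to a non-language symbol.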