Roberta Tokenizer #2

dehghanm · 2022-09-03T19:46:31Z

Hi

I want to use the RoBERTa tokenizer. The following example shows how I load it and tokenize a Persian sentence:

from transformers import AutoTokenizer
model_name = "HooshvareLab/roberta-fa-zwnj-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
string = "این یک سند است"
tokenized_string = tokenizer.tokenize(string)
print(tokenized_string)

The code above prints:
['Ø§ÛĮÙĨ', 'ĠÛĮÚ©', 'ĠØ³ÙĨØ¯', 'ĠØ§Ø³Øª']
However, it should be:
["این", "یک", "سند" , "است"]
How can I solve this issue?
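
A possible explanation, offered as a sketch rather than a confirmed diagnosis: RoBERTa-style tokenizers typically use byte-level BPE, where every UTF-8 byte of the input is displayed as one printable stand-in character and `Ġ` marks a leading space. Under that assumption the printed tokens are not corrupted; they are the byte-level form of the Persian words. The block below rebuilds the GPT-2-style byte-to-character table without needing `transformers` and maps the tokens from this issue back to text. The helper names `bytes_to_unicode`, `BYTE_DECODER`, and `token_to_text` are illustrative only and are not part of this model's API.

```python
def bytes_to_unicode():
    """Rebuild the byte -> printable-character table used by GPT-2-style byte-level BPE."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100))  # U+00AE .. U+00FF
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    # Every remaining byte (controls, space, 0x7F-0xA0, 0xAD) is shifted to 256 and above.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: stand-in character -> original byte value.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def token_to_text(token):
    """Map one byte-level token back to text (hypothetical helper for illustration)."""
    raw = bytes(BYTE_DECODER[ch] for ch in token)
    # A single token can end in the middle of a multi-byte character,
    # so undecodable bytes are replaced instead of raising an error.
    return raw.decode("utf-8", errors="replace")


# Tokens copied from the output reported in this issue.
tokens = ['Ø§ÛĮÙĨ', 'ĠÛĮÚ©', 'ĠØ³ÙĨØ¯', 'ĠØ§Ø³Øª']

# strip() drops the leading space that the 'Ġ' marker stands for.
readable = [token_to_text(t).strip() for t in tokens]
print(readable)
```

For the four tokens above this recovers the four words of the input sentence. With the library itself, calling `tokenizer.convert_tokens_to_string([token])` on each token, or `tokenizer.decode(...)` on the token ids, should give the same readable text without a hand-built table.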
