load tokenizer question #325
If I want to save the tokenizer as vocab.json and merges.txt, I should use:
rather than
but if I use
to load the tokenizer, I have to use
to manually add the special tokens, right?
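To make the distinction concrete, here is a minimal sketch of the two save paths with a ByteLevelBPETokenizer; the corpus path, vocabulary size, and special-token list are placeholders, not values from this thread:

from tokenizers import ByteLevelBPETokenizer

# Hypothetical training setup
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=30000,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# Writes vocab.json and merges.txt (the BPE model only)
tokenizer.save_model("my-tokenizer-dir")

# Writes a single JSON containing the full pipeline
# (model, pre-tokenizer, post-processor, added special tokens, ...)
tokenizer.save("my-tokenizer.json")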
Yes, that's right. Can you elaborate on the features from ByteLevelBPETokenizer that you need?
I can chip in and say that the differences with ByteLevelBPETokenizer are:
@n1t0 and @kkpsiren, is there a way to change a saved Tokenizer (I'm using the ByteLevelBPETokenizer)?
But I still get:
It seems like I should not have to set all these properties, and that they should carry over when I train, save, and load the tokenizer.
I am using transformers 2.9.0 and tokenizers 0.8.1 and attempting to train a custom tokenizer.
Is there an easy way around this?
Using the latest version of transformers, you can do:
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(...) # Any Tokenizer built with this library
tokenizer.save("my-tokenizer.json")
transformers_tokenizer = PreTrainedTokenizerFast(fast_tokenizer_file="my-tokenizer.json")
@jstremme Using RobertaTokenizer.from_pretrained() fixed it for me.
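A sketch of that workaround, assuming the directory below contains the vocab.json and merges.txt produced during training: RobertaTokenizer.from_pretrained can read them from a local path and already has Roberta's special tokens configured.

from transformers import RobertaTokenizer

# "my-tokenizer-dir" is assumed to hold vocab.json and merges.txt
tokenizer = RobertaTokenizer.from_pretrained("my-tokenizer-dir")
print(tokenizer.tokenize("Hello world"))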
I tried the following code:
and tried to apply the tokenizer with:
but an error happened:
Oh yes sorry, I think it should be:
tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path)
instead of:
tokenizer = PreTrainedTokenizerFast(fast_tokenizer_file=tokenizer_path)
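For completeness, a small usage sketch of the corrected keyword with a recent transformers release; the file path and sample sentence are placeholders:

from transformers import PreTrainedTokenizerFast

# my-tokenizer.json is the single-file serialization written by tokenizer.save(...)
tokenizer = PreTrainedTokenizerFast(tokenizer_file="my-tokenizer.json")
encoded = tokenizer("Hello world")
print(encoded["input_ids"])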
It does not work, I have the following error while doing it:
data did not match any variant of untagged enum ModelWrapper at line 59249 column 3
This means your tokenizer file is invalid. Please don't squat old issues, create new ones instead; it's unlikely that anything in this thread is relevant for current versions. Thank you.
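One way to check this, with a placeholder path: load the file with the tokenizers library directly. Tokenizer.from_file only accepts the full single-file serialization, so a bare vocab.json or a transformers config file will trigger the same ModelWrapper error.

from tokenizers import Tokenizer

# Raises the "untagged enum ModelWrapper" error if the JSON
# is not a complete tokenizer serialization
tokenizer = Tokenizer.from_file("my-tokenizer.json")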
Hello~ I have a question about loading a tokenizer in tokenizers v0.8.0.
I have trained a BPE tokenizer with the following script:
Then I can load the tokenizer with:
It works great.
But if I want to load the tokenizer with ByteLevelBPETokenizer, which works in v0.7.0:
It doesn't work...
The reason I want to use ByteLevelBPETokenizer to load the tokenizer is to use some features of the ByteLevelBPETokenizer class.
Is there any way to load the customized tokenizer with ByteLevelBPETokenizer in v0.8.0?
Thx a lot!
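Not from the thread itself, but a commonly suggested pattern for this situation (paths and special tokens are assumptions): rebuild the ByteLevelBPETokenizer from the vocab.json and merges.txt written by save_model, then re-add the special tokens by hand.

from tokenizers import ByteLevelBPETokenizer

# Reconstruct the wrapper from the model files; special tokens
# are not stored there, so they are added back manually
tokenizer = ByteLevelBPETokenizer("my-tokenizer-dir/vocab.json",
                                  "my-tokenizer-dir/merges.txt")
tokenizer.add_special_tokens(["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

print(tokenizer.encode("Hello world").tokens)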