load tokenizer question #325

Closed

carter54 opened this issue Jul 1, 2020 · 12 comments
@carter54

carter54 commented Jul 1, 2020

Hello! I have a question about loading a tokenizer in tokenizers v0.8.0.

I trained a BPE tokenizer with the following script:

from tokenizers import ByteLevelBPETokenizer

paths = train_file_path_list  # list of paths to training files
tokenizer = ByteLevelBPETokenizer()
special_tokens = ['\n', '\r', '\t']

# Customize training
tokenizer.train(files=paths, vocab_size=30000, min_frequency=2, special_tokens=special_tokens)

# Save files to disk
tokenizer.save("bpe.json", pretty=True)

Then I can load the tokenizer by:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("bpe.json")

It works great.

But if I try to load the tokenizer with ByteLevelBPETokenizer, which worked in v0.7.0:

from tokenizers.implementations import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer("tokenizer_file_path")  
# in v0.7.0, training generated a vocab file and a merges file; in v0.8.0 it generates a single file with all the information included

It doesn't work...

The reason I want to load the tokenizer with ByteLevelBPETokenizer is to use some features of the ByteLevelBPETokenizer class.

Is there any way to load the customized tokenizer with ByteLevelBPETokenizer in v0.8.0?

Thanks a lot!

@carter54
Author

carter54 commented Jul 1, 2020

I see...

If I want to save the tokenizer as vocab.json and merges.txt, I should use:

tokenizer.save_model()

rather than

tokenizer.save()
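
For reference, here is a minimal sketch of the difference between the two calls (the file and directory names are just examples):

```python
from tokenizers import ByteLevelBPETokenizer

paths = ["train.txt"]  # illustrative; use your own training files
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=30000, min_frequency=2, special_tokens=['\n', '\r', '\t'])

# save_model() writes only the model files, vocab.json and merges.txt, into an existing directory
tokenizer.save_model("bpe_model")

# save() serializes the whole tokenizer (model, pre-tokenizer, special tokens, ...) into one JSON file
tokenizer.save("bpe.json", pretty=True)
```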

@carter54 carter54 closed this as completed Jul 1, 2020
@carter54
Author

carter54 commented Jul 1, 2020

but if I use

tokenizer = ByteLevelBPETokenizer(
    "/vocab.json",
    "/merges.txt",
    add_prefix_space=True,
)

to load the tokenizer, I have to use

tokenizer.add_special_tokens(special_token_list)

to manually add the special tokens, right?
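
If that's right, the full loading step would look roughly like this (assuming the same special tokens used during training above):

```python
from tokenizers import ByteLevelBPETokenizer

# Re-create the tokenizer from the two model files written by save_model()
tokenizer = ByteLevelBPETokenizer(
    "vocab.json",
    "merges.txt",
    add_prefix_space=True,
)

# The special tokens are not stored in vocab.json/merges.txt, so they have to be re-added
tokenizer.add_special_tokens(['\n', '\r', '\t'])
```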

@carter54 carter54 reopened this Jul 1, 2020
@n1t0
Member

n1t0 commented Jul 10, 2020

Yes that's right. Can you elaborate on the features from ByteLevelBPETokenizer that you want to use?

@kkpsiren

I can chip in and say that one difference between PreTrainedTokenizer and Tokenizer is that Tokenizer lacks self.tokenizer.pad_token_id, which makes DataCollatorForLanguageModeling raise errors during training with Trainer.

@jstremme

jstremme commented Jul 22, 2020

@n1t0 and @kkpsiren, is there a way to change a saved Tokenizer (I'm using the ByteLevelBPETokenizer) into a PreTrainedTokenizer in order to get these attributes without having to set them? I was getting the error you mentioned from DataCollatorForLanguageModeling, so I did the following:

tokenizer = Tokenizer.from_file(model_args.tokenizer_name)
        
tokenizer.bos_token="<s>"
tokenizer.eos_token="</s>"
tokenizer.sep_token="</s>"
tokenizer.cls_token="<s>"
tokenizer.unk_token="<unk>"
tokenizer.pad_token="<pad>"
tokenizer.mask_token="<mask>"
        
tokenizer.bos_token_id=0
tokenizer.eos_token_id=2
tokenizer.sep_token_id=2
tokenizer.cls_token_id=0
tokenizer.unk_token_id=3
tokenizer.pad_token_id=1
tokenizer.mask_token_id=4
        
tokenizer._bos_token="<s>"
tokenizer._eos_token="</s>"
tokenizer._sep_token="</s>"
tokenizer._cls_token="<s>"
tokenizer._unk_token="<unk>"
tokenizer._pad_token="<pad>"
tokenizer._mask_token="<mask>"
        
tokenizer._bos_token_id=0
tokenizer._eos_token_id=2
tokenizer._sep_token_id=2
tokenizer._cls_token_id=0
tokenizer._unk_token_id=3
tokenizer._pad_token_id=1
tokenizer._mask_token_id=4

But I still get: AttributeError: 'tokenizers.Tokenizer' object has no attribute 'get_special_tokens_mask'.

It seems like I should not have to set all these properties and that when I train, save, and load the ByteLevelBPETokenizer everything should be there.

I am using transformers 2.9.0 and tokenizers 0.8.1 and attempting to train a custom ByteLevelBPETokenizer then pretrain a Reformer model using https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py.

@aqibsaeed

> I can chip in and say that one difference between PreTrainedTokenizer and Tokenizer is that Tokenizer lacks self.tokenizer.pad_token_id, which makes DataCollatorForLanguageModeling raise errors during training with Trainer.

Is there an easy way around this?

@n1t0
Member

n1t0 commented Oct 20, 2020

Using the latest version of transformers, you can load a tokenizer saved from this library:

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(...) # Any Tokenizer built with this library
tokenizer.save("my-tokenizer.json)

transformers_tokenizer = PreTrainedTokenizerFast(fast_tokenizer_file="my-tokenizer.json")

@n1t0 n1t0 closed this as completed Oct 20, 2020
@rishabhjoshi

> @n1t0 and @kkpsiren, is there a way to change a saved Tokenizer (I'm using the ByteLevelBPETokenizer) into a PreTrainedTokenizer in order to get these attributes without having to set them? [...] It seems like I should not have to set all these properties and that when I train, save, and load the ByteLevelBPETokenizer everything should be there.

@jstremme Using RobertaTokenizer.from_pretrained() fixed it for me.
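
A rough sketch of that workaround, assuming the vocab/merges files were written with save_model() into a directory (the path here is just an example):

```python
from transformers import RobertaTokenizer

# RobertaTokenizer.from_pretrained() can load a directory containing vocab.json and merges.txt,
# and it defines the usual special-token attributes out of the box (assuming the special tokens
# such as <pad> exist in the trained vocabulary)
tokenizer = RobertaTokenizer.from_pretrained("path/to/tokenizer_dir")
print(tokenizer.pad_token_id)
```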

@carter54
Author

carter54 commented Nov 3, 2020

@n1t0

Using the latest version of transformers, you can load a tokenizer saved from this library:

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(...) # Any Tokenizer built with this library
tokenizer.save("my-tokenizer.json)

transformers_tokenizer = PreTrainedTokenizerFast(fast_tokenizer_file="my-tokenizer.json")

I tried the following code:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

trainer = trainers.BpeTrainer(vocab_size=30000, min_frequency=2)
tokenizer.train(trainer, data_paths)

# Save files to disk
tokenizer.save(tokenizer_path, pretty=True)

and tried to apply the tokenizer with:

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(fast_tokenizer_file= tokenizer_path)

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)

but this error happened:

  File "/extend-diskB/hr_data/anaconda3/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 98, in __init__
    "Couldn't instantiate the backend tokenizer from one of: "
ValueError: Couldn't instantiate the backend tokenizer from one of: (1) a `tokenizers` library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

@n1t0
Member

n1t0 commented Nov 3, 2020

Oh yes sorry, I think it should be

tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path)

Instead of

tokenizer = PreTrainedTokenizerFast(fast_tokenizer_file=tokenizer_path)
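
For completeness, a small sketch of the corrected round-trip (the special-token strings here are illustrative; they should match whatever the tokenizer was trained with and be present in its vocabulary):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)

print(tokenizer.encode("I can feel the magic, can you?"))
print(tokenizer.pad_token_id)  # defined now, so DataCollatorForLanguageModeling no longer complains
```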

@Arnab9Codes

```python
transformers_tokenizer = PreTrainedTokenizerFast(fast_tokenizer_file="my-tokenizer.json")
```

It does not work; I get the following error while doing it: data did not match any variant of untagged enum ModelWrapper at line 59249 column 3

@Narsil
Collaborator

Narsil commented May 1, 2023

This means your tokenizer file is invalid.

Please don't squat old issues; create new ones instead. It's unlikely that anything in this thread is relevant for the current tokenizers version.

Thank you.
