load tokenizer question #325

Closed

carter54 opened this issue Jul 1, 2020 · 12 comments
@carter54

carter54 commented Jul 1, 2020

Hello! I have a question about loading a tokenizer in tokenizers v0.8.0.

I trained a BPE tokenizer with the following script:

from tokenizers import ByteLevelBPETokenizer

paths = train_file_path_list  # list of paths to training files
tokenizer = ByteLevelBPETokenizer()
special_tokens = ['\n', '\r', '\t']

# Customize training
tokenizer.train(files=paths, vocab_size=30000, min_frequency=2, special_tokens=special_tokens)

# Save files to disk
tokenizer.save("bpe.json", pretty=True)

Then I can load the tokenizer by:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("bpe.json")

It works great.

But if I try to load the tokenizer with ByteLevelBPETokenizer, which worked in v0.7.0:

from tokenizers.implementations import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer("tokenizer_file_path")  
# in v0.7.0, training generated a vocab file and a merges file; in v0.8.0 it generates a single file with all the information included

It doesn't work...

The reason I want to load the tokenizer with ByteLevelBPETokenizer is to use some features of the ByteLevelBPETokenizer class.

Is there any way to load the customized tokenizer with ByteLevelBPETokenizer in v0.8.0?

Thanks a lot!

@carter54
Author

carter54 commented Jul 1, 2020

I see...

If I want to save the tokenizer as vocab.json and merges.txt, I should use:

tokenizer.save_model()

rather than

tokenizer.save()
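
For reference, here is a minimal sketch of the difference between the two calls (the file and directory names are just examples):

```python
from tokenizers import ByteLevelBPETokenizer

paths = ["train.txt"]  # illustrative; use your own training files
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=30000, min_frequency=2, special_tokens=['\n', '\r', '\t'])

# save_model() writes only the model files, vocab.json and merges.txt, into an existing directory
tokenizer.save_model("bpe_model")

# save() serializes the whole tokenizer (model, pre-tokenizer, special tokens, ...) into one JSON file
tokenizer.save("bpe.json", pretty=True)
```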

@carter54 carter54 closed this as completed Jul 1, 2020
@carter54
Author

carter54 commented Jul 1, 2020

but if I use

tokenizer = ByteLevelBPETokenizer(
    "/vocab.json",
    "/merges.txt",
    add_prefix_space=True,
)

to load the tokenizer, I have to use

tokenizer.add_special_tokens(special_token_list)

to manually add the special tokens, right?
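
If that's right, the full loading step would look roughly like this (assuming the same special tokens used during training above):

```python
from tokenizers import ByteLevelBPETokenizer

# Re-create the tokenizer from the two model files written by save_model()
tokenizer = ByteLevelBPETokenizer(
    "vocab.json",
    "merges.txt",
    add_prefix_space=True,
)

# The special tokens are not stored in vocab.json/merges.txt, so they have to be re-added
tokenizer.add_special_tokens(['\n', '\r', '\t'])
```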

@carter54 carter54 reopened this Jul 1, 2020
@n1t0
Member

n1t0 commented Jul 10, 2020

Yes that's right. Can you elaborate on the features from ByteLevelBPETokenizer that you want to use?

@kkpsiren

I can chip in and say that one difference between PreTrainedTokenizer and Tokenizer is that Tokenizer lacks self.tokenizer.pad_token_id, which makes DataCollatorForLanguageModeling raise errors during training with Trainer.

@jstremme

jstremme commented Jul 22, 2020

@n1t0 and @kkpsiren, is there a way to change a saved Tokenizer (I'm using the ByteLevelBPETokenizer) into a PreTrainedTokenizer in order to get these attributes without having to set them? I was getting the error you mentioned from DataCollatorForLanguageModeling, so I did the following:

tokenizer = Tokenizer.from_file(model_args.tokenizer_name)
        
tokenizer.bos_token="<s>"
tokenizer.eos_token="</s>"
tokenizer.sep_token="</s>"
tokenizer.cls_token="<s>"
tokenizer.unk_token="<unk>"
tokenizer.pad_token="<pad>"
tokenizer.mask_token="<mask>"
        
tokenizer.bos_token_id=0
tokenizer.eos_token_id=2
tokenizer.sep_token_id=2
tokenizer.cls_token_id=0
tokenizer.unk_token_id=3
tokenizer.pad_token_id=1
tokenizer.mask_token_id=4
        
tokenizer._bos_token="<s>"
tokenizer._eos_token="</s>"
tokenizer._sep_token="</s>"
tokenizer._cls_token="<s>"
tokenizer._unk_token="<unk>"
tokenizer._pad_token="<pad>"
tokenizer._mask_token="<mask>"
        
tokenizer._bos_token_id=0
tokenizer._eos_token_id=2
tokenizer._sep_token_id=2
tokenizer._cls_token_id=0
tokenizer._unk_token_id=3
tokenizer._pad_token_id=1
tokenizer._mask_token_id=4

But I still get: AttributeError: 'tokenizers.Tokenizer' object has no attribute 'get_special_tokens_mask'.

It seems like I should not have to set all these properties and that when I train, save, and load the ByteLevelBPETokenizer everything should be there.

I am using transformers 2.9.0 and tokenizers 0.8.1 and attempting to train a custom ByteLevelBPETokenizer then pretrain a Reformer model using https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py.

@aqibsaeed

> I can chip in and say that one difference between PreTrainedTokenizer and Tokenizer is that Tokenizer lacks self.tokenizer.pad_token_id, which makes DataCollatorForLanguageModeling raise errors during training with Trainer.

Is there an easy way around this?

@n1t0
Member

n1t0 commented Oct 20, 2020

Using the latest version of transformers, you can load a tokenizer saved from this library:

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(...) # Any Tokenizer built with this library
tokenizer.save("my-tokenizer.json)

transformers_tokenizer = PreTrainedTokenizerFast(fast_tokenizer_file="my-tokenizer.json")

@n1t0 n1t0 closed this as completed Oct 20, 2020
@rishabhjoshi

> @n1t0 and @kkpsiren, is there a way to change a saved Tokenizer (I'm using the ByteLevelBPETokenizer) into a PreTrainedTokenizer in order to get these attributes without having to set them? [...] It seems like I should not have to set all these properties and that when I train, save, and load the ByteLevelBPETokenizer everything should be there.

@jstremme Using RobertaTokenizer.from_pretrained() fixed it for me.
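
A rough sketch of that workaround, assuming the vocab/merges files were written with save_model() into a directory (the path here is just an example):

```python
from transformers import RobertaTokenizer

# RobertaTokenizer.from_pretrained() can load a directory containing vocab.json and merges.txt,
# and it defines the usual special-token attributes out of the box (assuming the special tokens
# such as <pad> exist in the trained vocabulary)
tokenizer = RobertaTokenizer.from_pretrained("path/to/tokenizer_dir")
print(tokenizer.pad_token_id)
```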

@carter54
Author

carter54 commented Nov 3, 2020

@n1t0

Using the latest version of transformers, you can load a tokenizer saved from this library:

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(...) # Any Tokenizer built with this library
tokenizer.save("my-tokenizer.json)

transformers_tokenizer = PreTrainedTokenizerFast(fast_tokenizer_file="my-tokenizer.json")

I tried the following code:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

trainer = trainers.BpeTrainer(vocab_size=30000, min_frequency=2)
tokenizer.train(trainer, data_paths)

# Save files to disk
tokenizer.save(tokenizer_path, pretty=True)

and tried to apply the tokenizer with:

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(fast_tokenizer_file= tokenizer_path)

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)

but this error happened:

  File "/extend-diskB/hr_data/anaconda3/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 98, in __init__
    "Couldn't instantiate the backend tokenizer from one of: "
ValueError: Couldn't instantiate the backend tokenizer from one of: (1) a `tokenizers` library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

@n1t0
Member

n1t0 commented Nov 3, 2020

Oh yes sorry, I think it should be

tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path)

Instead of

tokenizer = PreTrainedTokenizerFast(fast_tokenizer_file=tokenizer_path)
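
For completeness, a small sketch of the corrected round-trip (the special-token strings here are illustrative; they should match whatever the tokenizer was trained with and be present in its vocabulary):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)

print(tokenizer.encode("I can feel the magic, can you?"))
print(tokenizer.pad_token_id)  # defined now, so DataCollatorForLanguageModeling no longer complains
```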

@Arnab9Codes

```python
transformers_tokenizer = PreTrainedTokenizerFast(fast_tokenizer_file="my-tokenizer.json")
```

It does not work; I get the following error while doing it: data did not match any variant of untagged enum ModelWrapper at line 59249 column 3

@Narsil
Collaborator

Narsil commented May 1, 2023

This means your tokenizer file is invalid.

Please don't squat old issues; create new ones instead. It's unlikely that anything in this thread is relevant for the current tokenizers version.

Thank you.
