Tokenization issue with pretrained model #2

Closed
pruksmhc opened this issue Aug 23, 2021 · 4 comments

Comments

@pruksmhc

I am trying to continue pretraining BART from the Hugging Face checkpoint with the command below, and there seems to be a mismatch in the number of arguments passed to _tokenize.

The command is below:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model facebook/bart-large --tokenizer_name_or_path facebook/bart-large --langs en --mono_src examples/data/train.en --batch_size 8

The error is:
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['en']
Shuffling corpus!
Traceback (most recent call last):
File "pretrain_nmt.py", line 628, in <module>
run_demo()
File "pretrain_nmt.py", line 625, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/root/yanmtt/pretrain_nmt.py", line 221, in model_create_load_run_save
for input_ids, input_masks, decoder_input_ids, labels in generate_batches_monolingual_masked_or_bilingual(tok, args, rank, files, train_files, ctr): #Batches are generated from here. The argument (0.30, 0.40) is a range which indicates the percentage of the source sentence to be masked in case we want masking during training just like we did during BART pretraining. The argument 3.5 is the lambda to the poisson length sampler which indicates the average length of a word sequence that will be masked. Since this is pretraining we do not do any evaluations even if we train on parallel corpora.
File "/root/yanmtt/common_utils.py", line 482, in generate_batches_monolingual_masked
iids = tok(lang + " " + masked_sentence + " ", add_special_tokens=False, return_tensors="pt").input_ids
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils_base.py", line 2377, in __call__
**kwargs,
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils_base.py", line 2447, in encode_plus
**kwargs,
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 441, in _encode_plus
first_ids = get_input_ids(text)
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 410, in get_input_ids
tokens = self.tokenize(text, **kwargs)
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 342, in tokenize
tokenized_text = split_on_tokens(no_split_token, text)
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 336, in split_on_tokens
for token in tokenized_text
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 336, in <listcomp>
for token in tokenized_text
TypeError: _tokenize() takes 2 positional arguments but 5 were given

On further inspection, it seems that a commit from a few days ago changed this line so that the call passes 4 arguments: https://github.com/prajdabre/yanmtt/blob/main/transformers/src/transformers/tokenization_utils.py#L319

However, the _tokenize function of the BART tokenizer (which, I believe, is inherited all the way from the GPT2 tokenizer) takes fewer arguments:
https://github.com/prajdabre/yanmtt/blob/main/transformers/src/transformers/models/gpt2/tokenization_gpt2.py#L241
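The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the actual yanmtt or transformers code: the class names and the three extra argument values are hypothetical and only mirror the shape of the problem (a base class that passes four arguments to a `_tokenize` that accepts one).

```python
class BaseTokenizer:
    """Hypothetical stand-in for the shared tokenizer base class."""

    def tokenize(self, text):
        # The base class forwards three extra (hypothetical) arguments
        # to every subclass, whether or not the subclass accepts them.
        return self._tokenize(text, False, 0.1, -1)


class Gpt2StyleTokenizer(BaseTokenizer):
    """Hypothetical subclass that still has the old one-argument signature."""

    def _tokenize(self, text):
        return text.split()


def reproduce():
    """Return the TypeError message raised by the mismatched call."""
    try:
        Gpt2StyleTokenizer().tokenize("en hello world")
    except TypeError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(reproduce())
```

Counting `self`, the method takes 2 positional arguments and receives 5, which matches the numbers in the traceback.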

@prajdabre
Owner

prajdabre commented Aug 23, 2021

Hi,

That's because I have not yet made the necessary modifications to the "generate_batches_monolingual_masked" method to handle the official BART tokenizers. Can you give me a few hours? I'll code it up and push my changes.

@prajdabre
Owner

prajdabre commented Aug 23, 2021

Hi again,

Pull the latest version of the code and try the same command again. It should work. Lemme know if it doesn't.

BTW, the default batch size is measured in tokens, not lines, so please change it to something like 2048 or pass the flag
--batch_size_indicates_lines.
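To illustrate why a batch size of 8 behaves so differently under the two interpretations, here is a rough sketch. It is not the batching code in yanmtt: the function names are hypothetical, and tokens are approximated by whitespace splitting rather than by the real tokenizer.

```python
def batch_by_lines(sentences, batch_size):
    """Group sentences so each batch holds at most `batch_size` lines."""
    return [sentences[i:i + batch_size]
            for i in range(0, len(sentences), batch_size)]


def batch_by_tokens(sentences, batch_size):
    """Group sentences so each batch holds at most `batch_size` tokens.

    Tokens are approximated by whitespace splitting (an assumption).
    A sentence longer than the budget still gets a batch of its own.
    """
    batches, current, current_tokens = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        # Flush the running batch if this sentence would exceed the budget.
        if current and current_tokens + n_tokens > batch_size:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog today"] * 4
    print(len(batch_by_lines(corpus, 8)))   # one batch holding all 4 lines
    print(len(batch_by_tokens(corpus, 8)))  # four batches, one line each
```

With 10-word sentences, a budget of 8 tokens leaves every sentence in its own batch, whereas a budget of 8 lines fits the whole toy corpus in one batch.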

@pruksmhc
Author

pruksmhc commented Aug 23, 2021

Hm, I'm still getting the tokenization error. Is it because I'm trying to train a BART model (using BartTokenizer) rather than MBart? I see that in the latest commit, only bart50/tokenization_mbart50.py has been modified for the _tokenize function. Since the tokenize APIs for MBart and BART seem to differ slightly, perhaps it makes sense to add an if-else condition in tokenization_utils? Another option would be to introduce SentencePiece into BART tokenization as well, although it seems that SentencePiece isn't used in the HF version of the RoBERTa/BART tokenizer either.
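One way the if-else idea could look is a check on the signature of `_tokenize` before calling it. This is only a sketch under assumptions: `call_tokenize`, both tokenizer classes, and the `sample`/`alpha` parameters are hypothetical names for illustration, and this is not necessarily how the repository handles it.

```python
import inspect

_POSITIONAL_KINDS = (
    inspect.Parameter.POSITIONAL_ONLY,
    inspect.Parameter.POSITIONAL_OR_KEYWORD,
)


def call_tokenize(tokenizer, text, *extra_args):
    """Pass the extra arguments only if `_tokenize` can accept them."""
    # For a bound method, the signature already excludes `self`.
    params = list(inspect.signature(tokenizer._tokenize).parameters.values())
    accepts_varargs = any(
        p.kind is inspect.Parameter.VAR_POSITIONAL for p in params
    )
    positional_slots = sum(1 for p in params if p.kind in _POSITIONAL_KINDS)
    if accepts_varargs or positional_slots >= 1 + len(extra_args):
        return tokenizer._tokenize(text, *extra_args)
    # Fall back to the plain one-argument call.
    return tokenizer._tokenize(text)


class PlainTokenizer:
    """Hypothetical tokenizer with the original one-argument signature."""

    def _tokenize(self, text):
        return text.split()


class ExtendedTokenizer:
    """Hypothetical tokenizer whose `_tokenize` accepts extra options."""

    def _tokenize(self, text, sample=False, alpha=0.1):
        return text.split(), sample, alpha


if __name__ == "__main__":
    print(call_tokenize(PlainTokenizer(), "en hello", True, 0.5))
    print(call_tokenize(ExtendedTokenizer(), "en hello", True, 0.5))
```

The alternative is to widen each subclass signature so that it accepts and ignores the extra arguments, which avoids introspection at the cost of touching every tokenizer.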

@prajdabre
Owner

Hi,

I previously thought that only the masking code was at fault, but it turned out that the GPT2 tokenizer does not accept the additional arguments that my modifications to the tokenization_utils methods now pass by default.

I've addressed and tested it this time.

Try the command:

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model facebook/bart-large --tokenizer_name_or_path facebook/bart-large --langs en --mono_src examples/data/train.en --batch_size 512 --shard_files

or

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model facebook/bart-large --tokenizer_name_or_path facebook/bart-large --langs en --mono_src examples/data/train.en --batch_size 8 --batch_size_indicates_lines --shard_files

Note that --shard_files is needed if you are running the code for the first time on unsharded data. (I know this just creates a duplicate file with the suffix 0 but I chose not to handle this case separately to keep my code simpler.)
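The remark about the duplicate file can be pictured with a small sketch. It assumes one shard per process and a round-robin split, neither of which is confirmed by this thread; `shard_lines` is a hypothetical helper that works on in-memory lines rather than files.

```python
def shard_lines(lines, num_shards):
    """Split `lines` round-robin into `num_shards` lists keyed by shard index."""
    shards = {index: [] for index in range(num_shards)}
    for position, line in enumerate(lines):
        shards[position % num_shards].append(line)
    return shards


if __name__ == "__main__":
    corpus = ["line a", "line b", "line c"]
    # With a single shard, shard 0 is simply a copy of the whole corpus.
    print(shard_lines(corpus, 1))
    print(shard_lines(corpus, 2))
```

With a single shard, the only output is shard 0 and it contains every line, which corresponds to the duplicate with suffix 0 mentioned above.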
