Getting error when pretraining with a new language (Sanskrit) #34
Hi, I think you are not using the version of transformers that I have provided with the toolkit, or your sentencepiece version is not the one in the requirements.txt file. Kindly uninstall any existing version of transformers with `pip uninstall transformers`, and then install the version I have provided in the transformers folder with `cd transformers && python setup.py install`.

Also, your command needs some fixing:

`python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs XX --mono_src examples/data/train.sa --batch_size 8 --batch_size_indicates_lines --shard_files --model_path <local path like /home/raj/model_folder/model>`

Here XX should be one of the 11 language tokens that the model supports. Currently, I have not yet included a method to specify new languages, so the way to bypass this is to use one of the tokens as, bn, gu, hi, kn, ml, mr, or, pa, ta, te. Typically, choose a token you don't plan to use in any fine-tuning experiments.
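The reinstall steps above can be sketched as a small script. The `transformers/` folder name follows the comment; the guards are my own additions so the sketch stays runnable even outside the repository:

```shell
#!/usr/bin/env sh
set -e

# 1) Remove any pre-existing transformers install so the toolkit's
#    bundled fork takes precedence. '|| true' keeps this idempotent.
pip uninstall -y transformers 2>/dev/null || true

# 2) Install the fork shipped in the toolkit's transformers/ folder,
#    if that folder is present in the current directory.
if [ -d transformers ]; then
    (cd transformers && python setup.py install)
fi

# 3) Confirm which versions ended up installed (compare against
#    requirements.txt before retrying pretraining).
pip list 2>/dev/null | grep -Ei 'transformers|sentencepiece' || true
```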
Hi, thanks for your reply. I am still getting the error when I use that command. The relevant part of the output is:

Exception:
-- Process 0 terminated with the following error:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
But when I point --model_path at a blank folder, the code runs.
Hi, the error made me realize that there was a tiny bug. I'm surprised that it actually worked; it should have thrown an error.

Also, the way you specify --model_path should be /home/aniruddha/IndicBART.ckpt/model. It should actually be path + "/" + prefix, where path = /home/aniruddha/IndicBART.ckpt and prefix = model. That's something I should clarify even better in the documentation. Please pull the latest code after 15 minutes.
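The path convention described above (a save directory joined with a file-name prefix, to which the toolkit appends suffixes such as .pure_model) can be sketched as a tiny helper. The function name is my own, not part of the toolkit:

```python
import os

def make_model_path(save_dir: str, prefix: str = "model") -> str:
    """Compose the value to pass to --model_path.

    The last path component is treated as a file-name prefix; the
    toolkit appends suffixes (e.g. ".pure_model") to it when saving.
    """
    return os.path.join(save_dir, prefix)

print(make_model_path("/home/aniruddha/IndicBART.ckpt"))
# /home/aniruddha/IndicBART.ckpt/model
```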
Hi,
Model path is where the model is saved; pretrained model is where the params are loaded from.
So we should not give any existing model path, right? Rather, I am giving a new path where the newly pretrained model will be saved. Am I right? Please confirm. In --model_path ai4bhart/IndicBART, ai4bhart/IndicBART is a new directory.

Since we are using args.use_official_pretrained, we don't need to give any existing model path, because in your code model_path is used to store the model, config, and tokenizer. Am I right?
Both paths are needed: one is for loading, one is for saving. If you don't use a pretrained model, then just use --model_path. If you don't specify --model_path, the model will be saved under the default value for that argument (please check the code). model_path should be a local path. I think there is some confusion:

1. ai4bhart/IndicBART is not a local path; it is an identifier for Hugging Face.
2. Since it is a pretrained model, it should be passed to --pretrained_model.
3. Since this is an official model on the Hugging Face hub, you need to specify an additional flag: --use_official_pretrained.

In my fixed version of the code, if --use_official_pretrained is used, then the config and model are loaded from --pretrained_model and the tokenizer is loaded from --tokenizer_name_or_path. Your use case is simple: fine-tune IndicBART on your own monolingual data, so the following command is sufficient:

`python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs hi --mono_src ../data/hi/hi.txt.00 --batch_size 8 --batch_size_indicates_lines --shard_files --model_path /tmp/model --port 8080`

- --pretrained_model ai4bharat/IndicBART, because you want to load the official IndicBART model from the HF hub. If you had instead downloaded the IndicBART model from https://github.com/AI4Bharat/indic-bart, you would have to first download the model checkpoint and tokenizer locally and then pass their paths to --pretrained_model and --tokenizer_name_or_path.
- --use_official_pretrained, because you are loading the official IndicBART model from the HF hub.
- --model_path /tmp/model, because you want to save your model in the /tmp folder. Model files will have several suffixes depending on their use; you will only be looking at the file model.pure_model.
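The load/save distinction above can be illustrated with a minimal argparse sketch. The flag names match the thread; the defaults and help comments are my own wording, not the toolkit's actual argument definitions:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Illustrative flag subset")
    # Where the trained checkpoint will be SAVED (must be a local path).
    parser.add_argument("--model_path", default="/tmp/model")
    # Where pretrained weights are LOADED from (local path or HF hub id).
    parser.add_argument("--pretrained_model", default="")
    parser.add_argument("--tokenizer_name_or_path", default="")
    # Set when --pretrained_model is an official model on the HF hub.
    parser.add_argument("--use_official_pretrained", action="store_true")
    return parser

args = build_parser().parse_args([
    "--use_official_pretrained",
    "--pretrained_model", "ai4bharat/IndicBART",
    "--tokenizer_name_or_path", "ai4bharat/IndicBART",
    "--model_path", "/tmp/model",
])
print(args.use_official_pretrained, args.model_path)
```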
Hi, thank you for your reply. Yes, model_path should be a local path. I had actually created it as ai4bhart/IndicBART, like a Hugging Face model name, and I have verified that the model is saved to this path. Thank you.
Hi, I am noticing one point: your code only works when I use the .hi extension; otherwise it throws an error. For example, when I pass train.kn it errors, but when I rename the file to train.hi it works.
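Relatedly, the earlier advice to reuse one of the 11 supported language tokens for a new language such as Sanskrit can be sketched like this. The token list comes from the maintainer's comment; the helper itself is illustrative, not toolkit code:

```python
SUPPORTED_TOKENS = ["as", "bn", "gu", "hi", "kn", "ml",
                    "mr", "or", "pa", "ta", "te"]

def token_for_new_language(preferred, reserved_for_finetuning=()):
    """Pick a supported token to stand in for an unsupported language.

    IndicBART only knows the 11 tokens above, so a new language such
    as 'sa' must borrow a token you do not plan to fine-tune with.
    """
    if preferred in SUPPORTED_TOKENS:
        return preferred
    for tok in SUPPORTED_TOKENS:
        if tok not in reserved_for_finetuning:
            return tok
    raise ValueError("all supported tokens are reserved for fine-tuning")

print(token_for_new_language("sa", reserved_for_finetuning={"hi", "bn"}))
# as
```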
We are trying to pretrain a model initialized with IndicBART. We use the command below:

`python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs sa --mono_src examples/data/train.sa --batch_size 8 --batch_size_indicates_lines --shard_files --model_path ai4bharat/IndicBART`

We get the error below:
Calling AlbertTokenizer.from_pretrained() with the path to a single file or url is deprecated
Traceback (most recent call last):
  File "pretrain_nmt.py", line 968, in <module>
    run_demo()
  File "pretrain_nmt.py", line 965, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/aniruddha/machine_translation/yanmtt/pretrain_nmt.py", line 85, in model_create_load_run_save
    tok = AlbertTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)
  File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1789, in from_pretrained
    resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
  File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1860, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/models/albert/tokenization_albert.py", line 153, in __init__
    self.sp_model.Load(vocab_file)
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
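The traceback bottoms out in sentencepiece's LoadFromFile, which usually means the tokenizer path did not resolve to an actual SentencePiece .model file (possibly because a locally created directory named like a hub identifier was picked up instead of the hub model). A small, illustrative pre-flight check, not part of the toolkit, could surface this earlier:

```python
import os

def has_sentencepiece_model(tokenizer_path: str) -> bool:
    """True if tokenizer_path plausibly resolves to a SentencePiece model.

    A local directory must contain a .model file (e.g. spiece.model);
    a non-directory path is accepted only if it is an existing file,
    and anything else is left for the library's own hub resolution.
    """
    if os.path.isdir(tokenizer_path):
        return any(f.endswith(".model") for f in os.listdir(tokenizer_path))
    return os.path.isfile(tokenizer_path)
```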