
BART for Pre-Training #6743

Closed
swashiro opened this issue Aug 26, 2020 · 19 comments · Fixed by #18297
Labels: Ex: LM (Pretraining) · Good Second Issue

Comments

@swashiro commented Aug 26, 2020

❓ Questions & Help

How can I run BART pre-training?
I have data for pre-training (masked LM).

@patrickvonplaten (Contributor)

This should help: #5096 (comment)

@patrickvonplaten (Contributor)

@sshleifer - I think this is the 3rd issue about BART pre-training, so maybe it would be a good idea to release a small notebook at some point.

@sshleifer (Contributor) commented Aug 26, 2020

@patil-suraj, you took a stab at this at some point? This may have been optimistic :(

sshleifer added the Ex: LM (Pretraining) label on Aug 26, 2020
@patil-suraj (Contributor) commented Aug 26, 2020

Yes, I was trying to port the fairseq dataset here, and the same for T5. I'll try to focus more on it when I'm done with my current PRs. We should start with a notebook, as Patrick said, and then try to include it in examples/.

@swashiro (Author)

@patrickvonplaten Does that mean I can train with the masked input, the original input (as labels), and the decoder input?

@patrickvonplaten (Contributor)

yes, this should be possible
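
For concreteness, a minimal sketch of that setup (a toy example, not an official script): the corrupted text is the encoder input, the original text is passed as labels, and BartForConditionalGeneration derives the decoder input internally by shifting the labels to the right.

```python
# Hedged sketch (not an official example): one denoising training step for BART.
# The corrupted sentence is the encoder input; the original sentence is the label;
# BartForConditionalGeneration builds decoder_input_ids by shifting the labels right.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

original = "My friends are good, but they eat too many carbs."
corrupted = "My friends are <mask>, but they eat too many carbs."  # toy text-infilling corruption

inputs = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids

outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
outputs.loss.backward()  # denoising (masked LM) loss on reconstructing the original text
```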

stale bot commented Nov 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Nov 1, 2020
stale bot closed this as completed on Nov 9, 2020
@dhruvramani

@patil-suraj any news on the pretraining script for Bart?

@prajdabre

If anyone wants to train their own MBART model, feel free to use this:
https://github.com/prajdabre/yanmtt

Contributions are welcome!

@thomas-li-sjtu

@patil-suraj excuse me, is there any news on the pretraining script for Bart? Thanks.

@prajdabre

@thomas-li-sjtu you can try my toolkit if you like. It's based on transformers and allows for Bart/mbart pretraining. https://github.com/prajdabre/yanmtt

@thomas-li-sjtu

> @thomas-li-sjtu you can try my toolkit if you like. It's based on transformers and allows for Bart/mbart pretraining. https://github.com/prajdabre/yanmtt

Hi there, here is my problem: I hope to pretrain a BART model on my own dataset and fine-tune it for another task (not NMT). I noticed that your toolkit is designed for NMT, so maybe it is not the one I need. Anyway, thanks for your reply!

@prajdabre

@thomas-li-sjtu ok I understand. It's not just designed for NMT (despite its name). I've used it for summarisation and general NLG without problems. Good luck with your search.

@thomas-li-sjtu

> @thomas-li-sjtu ok I understand. It's not just designed for NMT (despite its name). I've used it for summarisation and general NLG without problems. Good luck with your search.

Wow that is awesome. I will try it for my task!

@prajdabre

@thomas-li-sjtu cool. Feel free to raise issues as it helps me add new functionality that may be of use to people. If you want to know how to use it for summarisation (or generic nlg) then look here: https://github.com/AI4Bharat/indic-bart

@patil-suraj (Contributor)

Sorry to only come back to this issue now. If anyone is interested in adding this example script in Transformers, I would be more than happy to help :)

For BART pre-training we need the text-infilling + sentence-permutation data collator which you could find here https://github.com/morganmcg1/rotobart/blob/main/data_collator.py#L223

With this collator you could then modify and use run_summarization.py script here https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization.

Let me know if anyone is interested. :) cc @patrickvonplaten
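
For a concrete picture of what that collator produces, here is a toy, string-level illustration of the two noise functions named above (text infilling and sentence permutation). The linked rotobart collator implements the real thing on token ids with Poisson-length spans, so treat this purely as a sketch of the idea.

```python
# Toy illustration of BART's pre-training noise (not the rotobart collator itself):
# sentence permutation shuffles the sentence order, text infilling replaces a span
# of words with a single <mask>. The real collator works on token ids.
import random
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
mask = tokenizer.mask_token  # "<mask>"

text = "The chef cooked dinner. The guests arrived late. Everyone enjoyed the meal."

# Sentence permutation: shuffle the order of sentences.
sentences = [s.strip() for s in text.split(".") if s.strip()]
random.shuffle(sentences)

# Text infilling: replace a short span of words in each sentence with one <mask>.
def infill(sentence, span_len=2):
    words = sentence.split()
    start = random.randrange(max(1, len(words) - span_len))
    return " ".join(words[:start] + [mask] + words[start + span_len:])

corrupted = ". ".join(infill(s) for s in sentences) + "."
print("encoder input:", corrupted)  # noised text fed to the encoder
print("labels       :", text)       # original text the decoder must reconstruct
```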

patrickvonplaten added the Good Second Issue label and removed the wontfix label on Feb 21, 2022
@Eurus-W commented Mar 6, 2022

> Sorry to only come back to this issue now. If anyone is interested in adding this example script in Transformers, I would be more than happy to help :)
>
> For BART pre-training we need the text-infilling + sentence-permutation data collator which you could find here https://github.com/morganmcg1/rotobart/blob/main/data_collator.py#L223
>
> With this collator you could then modify and use run_summarization.py script here https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization.
>
> Let me know if anyone is interested. :) cc @patrickvonplaten

I think a BART pre-training script would be very useful for my work and for many others. It would be generous of you to add this example script to Transformers!

@Eurus-W commented Mar 6, 2022

> Sorry to only come back to this issue now. If anyone is interested in adding this example script in Transformers, I would be more than happy to help :)
>
> For BART pre-training we need the text-infilling + sentence-permutation data collator which you could find here https://github.com/morganmcg1/rotobart/blob/main/data_collator.py#L223
>
> With this collator you could then modify and use run_summarization.py script here https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization.
>
> Let me know if anyone is interested. :) cc @patrickvonplaten

Thanks for your reply; I think your method is absolutely feasible. But when I tried it, I ran into some errors that I can't fix. Could you please give me some help?
Here are my changes to run_summarization.py (tag 4.11.0):

  1. Imported the necessary packages used in https://github.com/morganmcg1/rotobart/blob/main/data_collator.py#L223
  2. Added the full code of DataCollatorForDenoisingTasks and made it inherit from DataCollatorForSeq2Seq, i.e. class DataCollatorForDenoisingTasks(DataCollatorForSeq2Seq):
  3. Switched to the new collator: data_collator = DataCollatorForSeq2Seq(......) -> data_collator = DataCollatorForDenoisingTasks(.......)

Running the changed script, I get the errors below.

Traceback (most recent call last):
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3457, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
runfile('/data/whq/tmp/SBartTry/fineBartPretrain.py', args=['--model_name_or_path', 'facebook/bart-base', '--do_train', '--do_eval', '--train_file', '/data/whq/tmp/SBartTry/tryData/clickbait_train.csv', '--validation_file', '/data/whq/tmp/SBartTry/tryData/clickbait_valid.csv', '--source_prefix', '', '--num_train_epochs=3', '--output_dir', '/data/whq/tmp/SBartTry/fineBartPretrain/clickbait', '--overwrite_output_dir', '--per_device_train_batch_size=16', '--per_device_eval_batch_size=16', '--predict_with_generate'], wdir='/data/whq/tmp/SBartTry')
File "/home/whq/.pycharm_helpers/pydev/_pydev_bundle/pydev_umd.py", line 198, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "/home/whq/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/data/whq/tmp/SBartTry/fineBartPretrain.py", line 823, in
main()
File "/data/whq/tmp/SBartTry/fineBartPretrain.py", line 745, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/transformers/trainer.py", line 1325, in train
tr_loss_step = self.training_step(model, inputs)
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/transformers/trainer.py", line 1884, in training_step
loss = self.compute_loss(model, inputs)
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/transformers/trainer.py", line 1916, in compute_loss
outputs = model(**inputs)
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
raise exception
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/transformers/models/bart/modeling_bart.py", line 1336, in forward
return_dict=return_dict,
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/transformers/models/bart/modeling_bart.py", line 1200, in forward
return_dict=return_dict,
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/whq/anaconda3/envs/pytorchenv/lib/python3.7/site-packages/transformers/models/bart/modeling_bart.py", line 769, in forward
input_shape = input_ids.size()
TypeError: 'int' object is not callable

Waiting for your generous reply! @patil-suraj

@OllieBroadhurst (Contributor)

@Eurus-W make sure you convert the numpy arrays in the batch returned by data_collator() into tensors.
batch["input_ids"] = torch.LongTensor(batch["input_ids"]), for example.
