Repository with recipes on how to pretrain a model from scratch on my own data #2814

Closed
ksopyla opened this issue Feb 11, 2020 · 22 comments

@ksopyla

ksopyla commented Feb 11, 2020

🚀 Feature request

It would be very useful to have documentation on how to train different models, not necessarily with transformers itself, but also with external libraries (like the original BERT implementation, fairseq, etc.).

Maybe another repository with READMEs or docs containing recipes from those who have already pretrained their models, so the procedure can be reproduced for other languages or domains.
There are many external resources (blog posts, arXiv articles), but they lack details and are very often not reproducible.

Motivation

Have a proven recipe for training a model. Make it easy for others to train a custom model. The community would then easily train language- or domain-specific models.
More models available in the transformers library.

There are many issues related to this.

@julien-c
Member

Hi @ksopyla that's a great – but very broad – question.

We just wrote a blogpost that might be helpful: https://huggingface.co/blog/how-to-train

The post itself is on GitHub so feel free to improve/edit it too.
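
For the tokenizer step the post walks through, a rough sketch with the tokenizers library is shown below; the file paths, vocabulary size, and output directory are placeholders, and the post itself is the authoritative version.

# Rough sketch of the byte-level BPE tokenizer training step described in the
# blog post; paths, vocab size, and output directory are illustrative placeholders.
from tokenizers import ByteLevelBPETokenizer

paths = ["./data/train.txt"]  # one or more plain-text files

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt (the saving method differs in older tokenizers versions).
tokenizer.save_model("./my-tokenizer")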

@ksopyla
Author

ksopyla commented Feb 15, 2020

Thank you @julien-c. It will help with adding new models to the transformers model repository :)

@ddofer

ddofer commented Feb 19, 2020

Hi,
the blog post is nice, but it is NOT an end-to-end solution. I've been trying to learn how to use the Hugging Face "ecosystem" to build an LM from scratch on a novel dataset, and the blog post is not enough. Adding a Jupyter notebook to the blog post would make it very easy for users to learn how to run things end to end, as opposed to "put in a Dataset type here" and "then run one of the scripts". :)

@julien-c
Member

@ddofer You are right, this is in the process of being addressed at huggingface/blog#3

Feel free to help :)

@yuanbit

yuanbit commented Feb 24, 2020

@julien-c Is it possible to do another example using BERT to pretrain the LM instead of RoBERTa? I followed the steps, but it doesn't seem to work when I change the model_type to bert.

@ghost

ghost commented Feb 26, 2020

I am a new contributor and thought this might be a reasonable issue to start with.

I'm happy to add an additional example of using bert rather than roberta to pretrain the LM.

Please let me know if this would be helpful and/or if starting elsewhere would be better

@BramVanroy
Collaborator

BramVanroy commented Feb 26, 2020

I am a new contributor and thought this might be a reasonable issue to start with.

I'm happy to add an additional example of using bert rather than roberta to pretrain the LM.

Please let me know if this would be helpful and/or if starting elsewhere would be better

Great that you want to contribute; any help is welcome! Fine-tuning and pretraining BERT already seem to be covered in run_language_modeling.py, though, so your contribution should differ significantly from that functionality. Perhaps it could be written in a more educational rather than production-ready way? That would definitely be useful: explaining all the concepts from scratch and such. (Not an easy task, though.)
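
As a rough illustration of what that script covers for the BERT case, masked-language-model pretraining from scratch with the library's own building blocks might look like the sketch below. Paths, block size, and hyperparameters are placeholders, and a reasonably recent transformers version (one that ships Trainer and TrainingArguments) is assumed.

# Sketch only: BERT MLM pretraining from scratch. Assumes a WordPiece tokenizer
# was trained and saved beforehand; the blog post's ByteLevelBPETokenizer is a
# RoBERTa-style tokenizer, which is likely why simply switching model_type to
# bert on top of that tokenizer does not work out of the box.
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("./my-bert-tokenizer")

config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="./data/train.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="./my-bert", num_train_epochs=1, per_device_train_batch_size=16)
Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()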

@julien-c
Member

First version of a notebook is up over at https://github.com/huggingface/blog/tree/master/notebooks
(thanks @aditya-malte for the help)

@ghost

ghost commented Mar 4, 2020

I am a new contributor and thought this might be a reasonable issue to start with.
I'm happy to add an additional example of using bert rather than roberta to pretrain the LM.
Please let me know if this would be helpful and/or if starting elsewhere would be better

Great that you want to contribute; any help is welcome! Fine-tuning and pretraining BERT already seem to be covered in run_language_modeling.py, though, so your contribution should differ significantly from that functionality. Perhaps it could be written in a more educational rather than production-ready way? That would definitely be useful: explaining all the concepts from scratch and such. (Not an easy task, though.)

I'll give it a shot :)

@aditya-malte

Hey @laurenmoos,
A general community request is to work on a Keras-like wrapper for Transformers. It would be great if you could do that; something along these lines:

model = Roberta()
model.pretrain(lm_data)
model.finetune(final_data)
model.predict(XYZ)
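
A thin wrapper along those lines could be built on top of existing transformers pieces. The class below is purely hypothetical (it mirrors the pseudocode above and does not exist in the library) and only makes the proposed interface concrete; paths and hyperparameters are placeholders.

# Hypothetical sketch: a Roberta wrapper that delegates to existing building blocks.
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
    pipeline,
)

class Roberta:
    def __init__(self, tokenizer_dir):
        # Fresh, untrained RoBERTa MLM built around an already-trained tokenizer.
        self.tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_dir)
        self.model = RobertaForMaskedLM(RobertaConfig(vocab_size=self.tokenizer.vocab_size))

    def pretrain(self, lm_data, output_dir="./roberta-pretrained", epochs=1):
        # Masked-language-model pretraining; lm_data is a path to a line-by-line text file.
        dataset = LineByLineTextDataset(tokenizer=self.tokenizer, file_path=lm_data, block_size=128)
        collator = DataCollatorForLanguageModeling(tokenizer=self.tokenizer, mlm=True, mlm_probability=0.15)
        args = TrainingArguments(output_dir=output_dir, num_train_epochs=epochs)
        Trainer(model=self.model, args=args, data_collator=collator, train_dataset=dataset).train()

    def predict(self, masked_text):
        # Fill-mask inference on a string containing the tokenizer's mask token.
        fill = pipeline("fill-mask", model=self.model, tokenizer=self.tokenizer)
        return fill(masked_text)

A finetune method would look much the same, swapping the MLM head for a task-specific one (e.g. RobertaForSequenceClassification) before handing things to Trainer.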

@ghost

ghost commented Mar 5, 2020

@aditya-malte I'd love to!

I will work on that and evaluate the request for additional documentation afterwards. Is there an issue to jump on?

@aditya-malte

aditya-malte commented Mar 5, 2020

Let me know if you’re interested. I’d be excited to collaborate!

@ghost

ghost commented Mar 5, 2020

@aditya-malte yes!

@san7988

san7988 commented Apr 10, 2020

Hi,

Did we make any progress on the feature discussed above? A Keras-like wrapper sounds awesome for Transformers. I would like to contribute to the development.

@dashayushman

First version of a notebook is up over at https://github.com/huggingface/blog/tree/master/notebooks
(thanks @aditya-malte for the help)

@julien-c Thanks for this. I have a question regarding the special_tokens_map.json file. When I just use the vocab.json and merges.txt from the tokenizer, run_language_modeling.py shows the following info message:

05/01/2020 17:44:01 - INFO - transformers.tokenization_utils -   Didn't find file /<path-to-my-output-dir>/special_tokens_map.json. We won't load it.

In the tutorial this has not been mentioned. Should we create this mapping file too?

@aditya-malte

aditya-malte commented May 1, 2020

Hi @dashayushman,
The message you’ve shown is not an error or a warning as such, just an INFO message.
As far as I remember, the BPE model should work just fine with the vocab and merges files. You can ignore the message.
Thanks
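
For anyone who does want the file to exist, one option (a sketch; paths are placeholders) is to wrap the trained BPE files in a transformers tokenizer and call save_pretrained, which also writes special_tokens_map.json and tokenizer_config.json next to vocab.json and merges.txt:

# Sketch: re-save the trained BPE files through a transformers tokenizer so that
# special_tokens_map.json and tokenizer_config.json are written as well.
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast(
    vocab_file="./my-tokenizer/vocab.json",
    merges_file="./my-tokenizer/merges.txt",
)
tokenizer.save_pretrained("./my-tokenizer")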

@008karan

@julien-c @aditya-malte
From the blog post:

If your dataset is very large, you can opt to load and tokenize examples on the fly, rather than as a preprocessing step.

How can I do that? Also, how can I save the tokenized data?
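
One way to do the on-the-fly part is a dataset that keeps the raw lines in memory and only tokenizes them when an example is requested; the sketch below assumes a line-by-line text file, a recent transformers version, and a PyTorch Trainer setup, and the class name and paths are made up for illustration.

# Sketch: lazy, per-example tokenization instead of a pre-tokenization pass.
# The resulting items work with DataCollatorForLanguageModeling and Trainer.
import torch
from torch.utils.data import Dataset
from transformers import RobertaTokenizerFast

class LazyLineByLineDataset(Dataset):
    def __init__(self, file_path, tokenizer, block_size=128):
        self.tokenizer = tokenizer
        self.block_size = block_size
        with open(file_path, encoding="utf-8") as f:
            self.lines = [line.strip() for line in f if line.strip()]

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        # Tokenization happens here, per example, rather than up front.
        encoding = self.tokenizer(self.lines[idx], truncation=True, max_length=self.block_size)
        return {"input_ids": torch.tensor(encoding["input_ids"], dtype=torch.long)}

tokenizer = RobertaTokenizerFast.from_pretrained("./my-tokenizer")
train_dataset = LazyLineByLineDataset("./data/train.txt", tokenizer)

As for saving the tokenized data, one option is to run the tokenizer once over the corpus, store the resulting input_ids (for example with torch.save), and load that file in a dataset instead of re-tokenizing on every run.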

@stale

stale bot commented Jul 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jul 25, 2020
@kevin-yauris

kevin-yauris commented Aug 1, 2020

Hi @BramVanroy @julien-c
Continuing #1999, it seems run_language_modeling.py is PyTorch-only, and fine-tuning a masked language model with TensorFlow doesn't have an example script yet. Is there any plan for a TensorFlow version of the script, or guidance on how to modify the current run_language_modeling.py so it can be used with TensorFlow too? Thank you.

stale bot removed the wontfix label Aug 1, 2020
@Novaal

Novaal commented Sep 23, 2020

I would also like to see an example of how to train a language model (like BERT) from scratch with TensorFlow on my own dataset, so I can fine-tune it later on a specific task.

@julien-c
Member

julien-c commented Oct 1, 2020

I would also like to see an example of how to train a language model (like BERT) from scratch with TensorFlow on my own dataset, so I can fine-tune it later on a specific task.

ping @jplu ;)

@stale

stale bot commented Dec 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Dec 4, 2020
stale bot closed this as completed Dec 12, 2020