Repository with recipes on how to pretrain a model from scratch on my own data #2814

Closed
ksopyla opened this issue Feb 11, 2020 · 22 comments

@ksopyla

ksopyla commented Feb 11, 2020

🚀 Feature request

It would be very useful to have documentation on how to train different models, not necessarily with transformers itself, but also with external libraries (like the original BERT implementation, fairseq, etc.).

Maybe another repository with READMEs or docs containing recipes from those who have already pretrained their models, so the procedure can be reproduced for other languages or domains.
There are many external resources (blog posts, arXiv articles), but they lack details and are very often not reproducible.

Motivation

Have a proven recipe for training a model. Make it easy for others to train a custom model. The community would then easily train language- or domain-specific models.
More models available in the transformers library.

There are many issues related to this.

@julien-c
Member

Hi @ksopyla that's a great – but very broad – question.

We just wrote a blogpost that might be helpful: https://huggingface.co/blog/how-to-train

The post itself is on GitHub so feel free to improve/edit it too.
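
For the tokenizer step the post walks through, a rough sketch with the tokenizers library is shown below; the file paths, vocabulary size, and output directory are placeholders, and the post itself is the authoritative version.

# Rough sketch of the byte-level BPE tokenizer training step described in the
# blog post; paths, vocab size, and output directory are illustrative placeholders.
from tokenizers import ByteLevelBPETokenizer

paths = ["./data/train.txt"]  # one or more plain-text files

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt (the saving method differs in older tokenizers versions).
tokenizer.save_model("./my-tokenizer")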

@ksopyla
Author

ksopyla commented Feb 15, 2020

Thank you @julien-c. It will help with adding new models to the transformers model repository :)

@ddofer

ddofer commented Feb 19, 2020

Hi,
the blog post is nice, but it is NOT an end-to-end solution. I've been trying to learn how to use the Hugging Face "ecosystem" to build an LM from scratch on a novel dataset, and the blog post is not enough. Adding a Jupyter notebook to the blog post would make it very easy for users to learn how to run things end to end, as opposed to "put in a Dataset type here" and "then run one of the scripts". :)

@julien-c
Member

@ddofer You are right, this is in the process of being addressed at huggingface/blog#3

Feel free to help :)

@yuanbit

yuanbit commented Feb 24, 2020

@julien-c Is it possible to do another example using BERT to pretrain the LM instead of RoBERTa? I followed the steps, but it doesn't seem to work when I change the model_type to bert.

@ghost

ghost commented Feb 26, 2020

I am a new contributor and thought this might be a reasonable issue to start with.

I'm happy to add an additional example of using bert rather than roberta to pretrain the LM.

Please let me know if this would be helpful and/or if starting elsewhere would be better

@BramVanroy
Collaborator

BramVanroy commented Feb 26, 2020

I am a new contributor and thought this might be a reasonable issue to start with.

I'm happy to add an additional example of using bert rather than roberta to pretrain the LM.

Please let me know if this would be helpful and/or if starting elsewhere would be better

Great that you want to contribute; any help is welcome! Fine-tuning and pretraining BERT already seem to be covered in run_language_modeling.py, though, so your contribution should differ significantly from that functionality. Perhaps it could be written in a more educational rather than production-ready way? That would definitely be useful: explaining all the concepts from scratch and such. (Not an easy task, though.)
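
As a rough illustration of what that script covers for the BERT case, masked-language-model pretraining from scratch with the library's own building blocks might look like the sketch below. Paths, block size, and hyperparameters are placeholders, and a reasonably recent transformers version (one that ships Trainer and TrainingArguments) is assumed.

# Sketch only: BERT MLM pretraining from scratch. Assumes a WordPiece tokenizer
# was trained and saved beforehand; the blog post's ByteLevelBPETokenizer is a
# RoBERTa-style tokenizer, which is likely why simply switching model_type to
# bert on top of that tokenizer does not work out of the box.
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("./my-bert-tokenizer")

config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="./data/train.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="./my-bert", num_train_epochs=1, per_device_train_batch_size=16)
Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()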

@julien-c
Member

First version of a notebook is up over at https://github.com/huggingface/blog/tree/master/notebooks
(thanks @aditya-malte for the help)

@ghost

ghost commented Mar 4, 2020

I am a new contributor and thought this might be a reasonable issue to start with.
I'm happy to add an additional example of using bert rather than roberta to pretrain the LM.
Please let me know if this would be helpful and/or if starting elsewhere would be better

Great that you want to contribute; any help is welcome! Fine-tuning and pretraining BERT already seem to be covered in run_language_modeling.py, though, so your contribution should differ significantly from that functionality. Perhaps it could be written in a more educational rather than production-ready way? That would definitely be useful: explaining all the concepts from scratch and such. (Not an easy task, though.)

I'll give it a shot :)

@aditya-malte

Hey @laurenmoos,
A general community request is to work on a Keras-like wrapper for Transformers. It would be great if you could do that; something along these lines:

model = Roberta()
model.pretrain(lm_data)
model.finetune(final_data)
model.predict(XYZ)
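
A thin wrapper along those lines could be built on top of existing transformers pieces. The class below is purely hypothetical (it mirrors the pseudocode above and does not exist in the library) and only makes the proposed interface concrete; paths and hyperparameters are placeholders.

# Hypothetical sketch: a Roberta wrapper that delegates to existing building blocks.
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
    pipeline,
)

class Roberta:
    def __init__(self, tokenizer_dir):
        # Fresh, untrained RoBERTa MLM built around an already-trained tokenizer.
        self.tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_dir)
        self.model = RobertaForMaskedLM(RobertaConfig(vocab_size=self.tokenizer.vocab_size))

    def pretrain(self, lm_data, output_dir="./roberta-pretrained", epochs=1):
        # Masked-language-model pretraining; lm_data is a path to a line-by-line text file.
        dataset = LineByLineTextDataset(tokenizer=self.tokenizer, file_path=lm_data, block_size=128)
        collator = DataCollatorForLanguageModeling(tokenizer=self.tokenizer, mlm=True, mlm_probability=0.15)
        args = TrainingArguments(output_dir=output_dir, num_train_epochs=epochs)
        Trainer(model=self.model, args=args, data_collator=collator, train_dataset=dataset).train()

    def predict(self, masked_text):
        # Fill-mask inference on a string containing the tokenizer's mask token.
        fill = pipeline("fill-mask", model=self.model, tokenizer=self.tokenizer)
        return fill(masked_text)

A finetune method would look much the same, swapping the MLM head for a task-specific one (e.g. RobertaForSequenceClassification) before handing things to Trainer.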

@ghost

ghost commented Mar 5, 2020

@aditya-malte I'd love to!

I will work on that and evaluate the request for additional documentation afterwards. Is there an issue to jump on?

@aditya-malte

aditya-malte commented Mar 5, 2020

Let me know if you’re interested. I’d be excited to collaborate!

@ghost

ghost commented Mar 5, 2020

@aditya-malte yes!

@san7988

san7988 commented Apr 10, 2020

Hi,

Did we make any progress on the feature discussed above? A Keras-like wrapper sounds awesome for Transformers. I would like to contribute to the development.

@dashayushman

First version of a notebook is up over at https://github.com/huggingface/blog/tree/master/notebooks
(thanks @aditya-malte for the help)

@julien-c Thanks for this. I have a question regarding the special_tokens_map.json file. When I just use the vocab.json and merges.txt from the tokenizer, run_language_modeling.py shows the following info message:

05/01/2020 17:44:01 - INFO - transformers.tokenization_utils -   Didn't find file /<path-to-my-output-dir>/special_tokens_map.json. We won't load it.

In the tutorial this has not been mentioned. Should we create this mapping file too?

@aditya-malte

aditya-malte commented May 1, 2020

Hi @dashayushman,
The message you’ve shown is not an error or a warning as such, just an INFO message.
As far as I remember, the BPE model should work just fine with the vocab and merges files. You can ignore the message.
Thanks
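
For anyone who does want the file to exist, one option (a sketch; paths are placeholders) is to wrap the trained BPE files in a transformers tokenizer and call save_pretrained, which also writes special_tokens_map.json and tokenizer_config.json next to vocab.json and merges.txt:

# Sketch: re-save the trained BPE files through a transformers tokenizer so that
# special_tokens_map.json and tokenizer_config.json are written as well.
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast(
    vocab_file="./my-tokenizer/vocab.json",
    merges_file="./my-tokenizer/merges.txt",
)
tokenizer.save_pretrained("./my-tokenizer")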

@008karan

@julien-c @aditya-malte
From the blog post:

If your dataset is very large, you can opt to load and tokenize examples on the fly, rather than as a preprocessing step.

How can I do that? Also, how can I save the tokenized data?
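
One way to do the on-the-fly part is a dataset that keeps the raw lines in memory and only tokenizes them when an example is requested; the sketch below assumes a line-by-line text file, a recent transformers version, and a PyTorch Trainer setup, and the class name and paths are made up for illustration.

# Sketch: lazy, per-example tokenization instead of a pre-tokenization pass.
# The resulting items work with DataCollatorForLanguageModeling and Trainer.
import torch
from torch.utils.data import Dataset
from transformers import RobertaTokenizerFast

class LazyLineByLineDataset(Dataset):
    def __init__(self, file_path, tokenizer, block_size=128):
        self.tokenizer = tokenizer
        self.block_size = block_size
        with open(file_path, encoding="utf-8") as f:
            self.lines = [line.strip() for line in f if line.strip()]

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        # Tokenization happens here, per example, rather than up front.
        encoding = self.tokenizer(self.lines[idx], truncation=True, max_length=self.block_size)
        return {"input_ids": torch.tensor(encoding["input_ids"], dtype=torch.long)}

tokenizer = RobertaTokenizerFast.from_pretrained("./my-tokenizer")
train_dataset = LazyLineByLineDataset("./data/train.txt", tokenizer)

As for saving the tokenized data, one option is to run the tokenizer once over the corpus, store the resulting input_ids (for example with torch.save), and load that file in a dataset instead of re-tokenizing on every run.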

@stale

stale bot commented Jul 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jul 25, 2020
@kevin-yauris

kevin-yauris commented Aug 1, 2020

Hi @BramVanroy @julien-c
Continuing #1999, it seems run_language_modeling.py is PyTorch-only, and fine-tuning a masked language model with TensorFlow doesn't have an example script yet. Is there any plan for a TensorFlow version of the script, or guidance on how to modify the current run_language_modeling.py so it can be used with TensorFlow too? Thank you.

stale bot removed the wontfix label Aug 1, 2020
@Novaal

Novaal commented Sep 23, 2020

I would also like to see an example of how to train a language model (like BERT) from scratch with TensorFlow on my own dataset, so I can fine-tune it later on a specific task.

@julien-c
Member

julien-c commented Oct 1, 2020

I would also like to see an example of how to train a language model (like BERT) from scratch with TensorFlow on my own dataset, so I can fine-tune it later on a specific task.

ping @jplu ;)

@stale

stale bot commented Dec 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Dec 4, 2020
stale bot closed this as completed Dec 12, 2020