
Add example for fine tuning BERT language model #124

Merged: 4 commits into master, Jan 7, 2019
Conversation

@ghost commented Dec 18, 2018

We are currently working on fine-tuning the language model on a new target corpus. This should improve the model if the language style in your target corpus differs significantly from the one initially used for training BERT (Wiki + BookCorpus) but the corpus is still too small for training BERT from scratch. In our case, we apply this to a rather technical English corpus.

The sample script loads a pre-trained BERT model and fine-tunes it as a language model (masked tokens & nextSentence) on your target corpus. The samples from the target corpus can either be fed to the model directly from memory or read from disk one by one.

Training the language model from scratch, without loading a pre-trained BERT model, is also not very difficult to do from here. In contrast to the original TF repo, you can do the training on multiple GPUs instead of a TPU.

We thought this might also be helpful for others.
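
To make the described training step concrete, here is a minimal sketch using the BertForPreTraining head from this repo (it returns the combined masked-LM + next-sentence loss when both label tensors are passed). The batch tensors below are hypothetical placeholders standing in for what the corpus sampling produces:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")
model.train()

# Hypothetical batch (batch size 8, sequence length 128); in the example script
# these tensors come from masking and sentence-pair sampling on the target corpus.
input_ids = torch.randint(0, len(tokenizer.vocab), (8, 128))
segment_ids = torch.zeros_like(input_ids)                   # sentence A/B ids
input_mask = torch.ones_like(input_ids)                     # attention mask
lm_label_ids = torch.full((8, 128), -1, dtype=torch.long)   # -1 = position not masked
lm_label_ids[:, 5] = torch.randint(0, len(tokenizer.vocab), (8,))  # pretend position 5 was masked
is_next = torch.zeros(8, dtype=torch.long)                  # 0 = real next sentence, 1 = random

loss = model(input_ids, segment_ids, input_mask, lm_label_ids, is_next)  # masked-LM + nextSentence loss
loss.backward()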

Adds an example for loading a pre-trained BERT model and fine-tuning it as a language model (masked tokens & nextSentence) on your target corpus.
@thomwolf (Member) commented:
This looks like a great addition!

Is it a full re-implementation of the pre-training script?

@ghost (Author) commented Dec 18, 2018

The implementation uses the same sampling parameters and logic, but it's not a one-to-one re-implementation of the original pre-training script.

Main differences:

  • In the original repo, they first create a training set of TFRecords from the raw corpus (create_pretraining_data.py) and then perform model training using run_pretraining.py. We decided against this two-step procedure and convert from raw text to samples "on the fly" (more similar to the repo from codertimo; see the sketch after this comment). With this we can actually generate new samples every epoch.
  • We currently feed in pairs of lines (= sentences) as one sample, while the original repo fills 90% of samples up with more sentences until max_seq_length is reached (for our use case this did not make sense).

Main similarities:

  • All sampling / masking probabilities and parameters
  • Format of raw corpus (one sentence per line & empty line as doc delimiter)
  • Sampling strategy: Random nextSentence must be from another document
  • The data reader from codertimo is similar to our code, but it didn't fully match the original sampling method.

Happy to clarify further details!
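
To make the corpus format and the on-the-fly pairing concrete, here is a rough sketch (not the PR's exact code) of reading such a corpus and drawing one sentence pair; the random "next sentence" is always taken from another document:

import random

def read_corpus(path):
    """One sentence per line; an empty line marks a document boundary."""
    docs, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                current.append(line)
            elif current:
                docs.append(current)
                current = []
    if current:
        docs.append(current)
    return docs

def sample_pair(docs, doc_id, line_id):
    """Return (sentence_a, sentence_b, is_next_label); label 0 = true next sentence.
    line_id is assumed to have a successor within the same document."""
    sent_a = docs[doc_id][line_id]
    if random.random() > 0.5:
        return sent_a, docs[doc_id][line_id + 1], 0
    # the random "next" sentence must come from a different document
    other = random.choice([i for i in range(len(docs)) if i != doc_id])
    return sent_a, random.choice(docs[other]), 1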

@davidefiocco (Contributor) commented Dec 18, 2018

Hi @deepset-ai, this is great! Just a suggestion: if this makes it into the repo, it would be good to also document this functionality in the README as part of this pull request.

@ghost (Author) commented Dec 19, 2018

Just added some basic documentation to the README. Happy to include more if @thomwolf thinks that makes sense.

@thomwolf (Member) commented Dec 19, 2018

Yes, I was going to ask you to add some information to the README; it's great. The more, the better. If you could also add instructions on how to download a dataset for training, as in the other examples, that would be perfect. If your dataset is private, do you have another dataset in mind that would let users try your script easily? If not, that's OK, don't worry.

Another thing: the fp16 logic has now been switched to NVIDIA's apex module, and we have gotten rid of the optimize_on_cpu option (see the relevant PR for more details). You can see the changes in the current examples like run_squad.py; it's actually a lot simpler since we don't have to manage parameter copies in the example, and it's also faster. Do you think you could adapt the fp16 parts of your script similarly?
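
For reference, roughly what that adaptation could look like; the apex import paths and optimizer arguments below are assumptions based on the examples of that period (run_squad.py after #116), not the exact code:

def build_optimizer(model, optimizer_grouped_parameters, args, num_train_steps):
    """fp16 path replaces the old optimize_on_cpu parameter copies (sketch)."""
    if args.fp16:
        model.half()  # fp16 model weights; FP16_Optimizer keeps fp32 master copies
        from apex.optimizers import FP16_Optimizer, FusedAdam  # assumed import paths
        optimizer = FusedAdam(optimizer_grouped_parameters,
                              lr=args.learning_rate,
                              bias_correction=False,
                              max_grad_norm=1.0)
        if args.loss_scale == 0:
            optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
        else:
            optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale)
    else:
        from pytorch_pretrained_bert.optimization import BertAdam
        optimizer = BertAdam(optimizer_grouped_parameters,
                             lr=args.learning_rate,
                             warmup=args.warmup_proportion,
                             t_total=num_train_steps)
    return optimizer

# In the training loop the backward call changes too:
#     optimizer.backward(loss) if args.fp16 else loss.backward()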

logger = logging.getLogger(__name__)


class BERTDataset(Dataset):
Member:

I like this class. I think we should actually create a data.py module in the main package that would gather a few utilities to make working with BERT easier, so they can be imported from the package instead of copied from script to script. I'm thinking about this dataset class, but also utilities like convert_example_to_features and maybe even your random_word function.

Maybe we should add some abstract classes/low-level functions from which the data manipulation logic of the other examples (run_classifier, run_squad and extract_feature) could also be built.

What do you think? I haven't looked at the details yet, so maybe it doesn't make sense. If you don't have time to look at it, I will work through it when I start working on the next release.

Contributor:

I also think this would add value, since there are probably quite a few things you could share between the examples. In addition, this module would be helpful for people who develop new, more specific down-stream tasks. Unfortunately, I probably won't have time to work on this in the next weeks. It would be great if you could take over when working on the next release.
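
As one concrete example of what such a shared module could contain, here is a sketch of the standard BERT masking recipe (select 15% of tokens; of those, replace 80% with [MASK], 10% with a random token, and leave 10% unchanged). This is illustrative, not necessarily the exact random_word from this PR:

import random

def random_word(tokens, tokenizer):
    """Mask tokens in place and return (tokens, label ids), with -1 for unmasked positions."""
    output_label = []
    for i, token in enumerate(tokens):
        if random.random() < 0.15:
            prob = random.random()
            if prob < 0.8:
                tokens[i] = "[MASK]"
            elif prob < 0.9:
                tokens[i] = random.choice(list(tokenizer.vocab.keys()))
            # else: keep the original token (10% of the selected positions)
            output_label.append(tokenizer.vocab.get(token, tokenizer.vocab["[UNK]"]))
        else:
            output_label.append(-1)
    return tokens, output_label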

while item > doc_end:
doc_id += 1
doc_start = doc_end + 1
doc_end += len(self.all_docs[doc_id]) - 1
Member:

Is there a specific reason you iterate every time over the dataset rather than constructing an index->doc mapping when you read the file?

Contributor:

You are totally right. This is a left-over from another approach. Creating an initial mapping makes way more sense. I have now added a mapping for index -> {doc_id, line}.
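
A small sketch of that mapping, built once at load time (hypothetical names; all_docs is assumed to be a list of documents, each a list of lines):

def build_sample_index(all_docs):
    """Flat sample index -> {doc_id, line}; skip each document's last line,
    since it has no following sentence to pair with."""
    sample_to_doc = []
    for doc_id, doc in enumerate(all_docs):
        for line in range(len(doc) - 1):
            sample_to_doc.append({"doc_id": doc_id, "line": line})
    return sample_to_doc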

try:
output_label.append(tokenizer.vocab[token])
except KeyError:
# For unknown words (should not occur with BPE vocab)
Member:

Should we log this e.g. with a warning?

Contributor:

Done
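
A sketch of what the logged fallback could look like (assuming the module-level logger shown earlier and an [UNK] fallback; not necessarily the exact resolution):

try:
    output_label.append(tokenizer.vocab[token])
except KeyError:
    # For unknown words (should not occur with BPE vocab): fall back to [UNK] and warn
    output_label.append(tokenizer.vocab["[UNK]"])
    logger.warning("Cannot find token '%s' in vocab. Using [UNK] instead.", token)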

type=int,
default=1,
help="Number of updates steps to accumualte before performing a backward/update pass.")
parser.add_argument('--optimize_on_cpu',
Member:

We can remove this now (see #116)

Contributor:

Done

for n, param in model.named_parameters()]
else:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
Member:

This part has also changed; see the new names in the current examples.

Contributor:

Done. I have tried to replicate the apex usage from the other examples. Since I don't have much experience with apex yet, you might want to check briefly whether there's something I missed.

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
{'params': [p for n, p in param_optimizer if n not in no_decay], 'weight_decay_rate': 0.01},
Member:

And this was wrong and is now fixed (should be something like p for name, p in param_optimizer if any(n in no_decay for n in name))

Contributor:

Done
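
A sketch of the corrected grouping, mirroring the pattern in the other examples of that period (the exact key name and no_decay list may differ):

def grouped_parameters(model):
    """No weight decay for parameters whose names contain 'bias', 'gamma' or 'beta'."""
    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'gamma', 'beta']
    return [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.0},
    ]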

@Rocketknight1 (Member) commented:
This is something I'd been working on as well, congrats on a nice implementation!

One question, though: I noticed you stripped out the code for evaluating on a test set, but when fine-tuning the LM on a smaller corpus, would it be worth keeping that in? Overfitting is much more of a risk in a smaller corpus.

… line in doc' mapping. add warning for unknown word.
@ghost changed the title from "Add example for fine tuning BERT language model (#1)" to "Add example for fine tuning BERT language model" on Dec 20, 2018
@tholor (Contributor) commented Dec 20, 2018


@Rocketknight1, you are right that we will probably need some better evaluation here. Currently, though, I have the feeling that evaluation on down-stream tasks is more meaningful (see also Jacob Devlin's comment here). In addition, some better monitoring of the loss during and after training would be nice.

Do you already have something in place and would like to contribute on this? Otherwise, I will try to find some time during the upcoming holidays to add this.

@julien-c force-pushed the master branch 3 times, most recently from 4a8c950 to 8da280e on December 20, 2018 21:33
@Rocketknight1 (Member) commented:

I don't have any evaluation code either, unfortunately! It might be easier to just evaluate on the final classification task, so it's not really urgent. I'll experiment with LM fine-tuning when I'm back at work in January. If I get good benefits on classification tasks, I'll see what effect early stopping based on validation loss has, and if that turns out to be useful too, I can submit a PR for it.
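
A generic sketch of that early-stopping idea (not part of this PR): track the pre-training loss on a held-out split and stop once it stops improving. train_epoch and evaluate are hypothetical callables supplied by the training script.

def train_with_early_stopping(train_epoch, evaluate, max_epochs=20, patience=2):
    """train_epoch() runs one pass over the corpus; evaluate() returns the held-out loss."""
    best_loss, bad_epochs = float("inf"), 0
    for _ in range(max_epochs):
        train_epoch()
        val_loss = evaluate()
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_loss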

@kaushaltrivedi commented:
Have you thought about extending the vocabulary after fine-tuning on a custom dataset? This could be useful if the custom dataset has specific terms related to that domain.

@tholor (Contributor) commented Jan 2, 2019


Adjusting the vocabulary before fine-tuning could be interesting, but you would need some smart approach to exchange "less important" tokens from the original byte-pair vocab with "important" ones from your custom corpus (while keeping the pre-trained embeddings for the rest of the vocab meaningful).
We are not working on this at the moment. Looking forward to a PR if you have time to work on it.

@kaushaltrivedi commented:

Yes, I am working on it. The idea is to add more items to the pretrained vocabulary. I will also adjust the model layers bert.embeddings.word_embeddings.weight and cls.predictions.decoder.weight with mean weights, and update cls.predictions.bias with the mean bias for the additional vocabulary words.

Will send out a PR once I test it.
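
For illustration, a rough sketch of that resize (not @kaushaltrivedi's actual code): it assumes a BertForPreTraining model, appends mean-initialised rows for the new vocabulary items, and re-ties the decoder weight to the enlarged input embeddings.

import torch

def extend_bert_vocab(model, num_new_tokens):
    """Append mean-initialised embedding rows and bias entries for extra vocab items."""
    with torch.no_grad():
        emb = model.bert.embeddings.word_embeddings.weight              # shape (V, H)
        mean_row = emb.mean(dim=0, keepdim=True)
        new_emb = torch.nn.Parameter(torch.cat([emb, mean_row.repeat(num_new_tokens, 1)], dim=0))
        model.bert.embeddings.word_embeddings.weight = new_emb
        # the decoder weight is tied to the input embeddings, so point it at the new tensor
        model.cls.predictions.decoder.weight = new_emb
        bias = model.cls.predictions.bias                               # shape (V,)
        new_bias = torch.cat([bias, torch.full((num_new_tokens,), bias.mean().item())])
        model.cls.predictions.bias = torch.nn.Parameter(new_bias)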

@thomwolf (Member) commented Jan 7, 2019

Ok this looks very good, I am merging, thanks a lot @tholor!

@thomwolf thomwolf merged commit c18bdb4 into huggingface:master Jan 7, 2019
qwang70 pushed a commit to DRL36/pytorch-pretrained-BERT that referenced this pull request Mar 2, 2019
Add example for fine tuning BERT language model
ocavue pushed a commit to ocavue/transformers that referenced this pull request Sep 13, 2023