
Add example for fine tuning BERT language model #124

Merged: 4 commits into master, Jan 7, 2019
Conversation

@ghost commented Dec 18, 2018

We are currently working on fine-tuning the language model on a new target corpus. This should improve the model if the language style in your target corpus differs significantly from the one initially used for training BERT (Wiki + BookCorpus) but the corpus is still too small for training BERT from scratch. In our case, we apply this to a rather technical English corpus.

The sample script loads a pre-trained BERT model and fine-tunes it as a language model (masked tokens & nextSentence) on your target corpus. The samples from the target corpus can either be fed to the model directly from memory or read from disk one by one.

Training the language model from scratch, without loading a pre-trained BERT model, is also not very difficult to do from here. In contrast to the original TF repo, you can do the training on multiple GPUs instead of a TPU.

We thought this might also be helpful for others.
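
To make the described training step concrete, here is a minimal sketch using the BertForPreTraining head from this repo (it returns the combined masked-LM + next-sentence loss when both label tensors are passed). The batch tensors below are hypothetical placeholders standing in for what the corpus sampling produces:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")
model.train()

# Hypothetical batch (batch size 8, sequence length 128); in the example script
# these tensors come from masking and sentence-pair sampling on the target corpus.
input_ids = torch.randint(0, len(tokenizer.vocab), (8, 128))
segment_ids = torch.zeros_like(input_ids)                   # sentence A/B ids
input_mask = torch.ones_like(input_ids)                     # attention mask
lm_label_ids = torch.full((8, 128), -1, dtype=torch.long)   # -1 = position not masked
lm_label_ids[:, 5] = torch.randint(0, len(tokenizer.vocab), (8,))  # pretend position 5 was masked
is_next = torch.zeros(8, dtype=torch.long)                  # 0 = real next sentence, 1 = random

loss = model(input_ids, segment_ids, input_mask, lm_label_ids, is_next)  # masked-LM + nextSentence loss
loss.backward()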

Adds an example for loading a pre-trained BERT model and fine-tuning it as a language model (masked tokens & nextSentence) on your target corpus.
@thomwolf (Member) commented:
This looks like a great addition!

Is it a full re-implementation of the pre-training script?

@ghost (Author) commented Dec 18, 2018

The implementation uses the same sampling parameters and logic, but it's not a one-to-one re-implementation of the original pre-training script.

Main differences:

  • In the original repo, they first create a training set of TFRecords from the raw corpus (create_pretraining_data.py) and then perform model training using run_pretraining.py. We decided against this two-step procedure and convert from raw text to samples "on the fly" (more similar to the repo from codertimo; see the sketch after this comment). With this we can actually generate new samples every epoch.
  • We currently feed in pairs of lines (= sentences) as one sample, while the original repo fills 90% of samples up with more sentences until max_seq_length is reached (for our use case this did not make sense).

Main similarities:

  • All sampling / masking probabilities and parameters
  • Format of raw corpus (one sentence per line & empty line as doc delimiter)
  • Sampling strategy: Random nextSentence must be from another document
  • The data reader from codertimo is similar to our code, but it didn't fully match the original sampling method.

Happy to clarify further details!
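
To make the corpus format and the on-the-fly pairing concrete, here is a rough sketch (not the PR's exact code) of reading such a corpus and drawing one sentence pair; the random "next sentence" is always taken from another document:

import random

def read_corpus(path):
    """One sentence per line; an empty line marks a document boundary."""
    docs, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                current.append(line)
            elif current:
                docs.append(current)
                current = []
    if current:
        docs.append(current)
    return docs

def sample_pair(docs, doc_id, line_id):
    """Return (sentence_a, sentence_b, is_next_label); label 0 = true next sentence.
    line_id is assumed to have a successor within the same document."""
    sent_a = docs[doc_id][line_id]
    if random.random() > 0.5:
        return sent_a, docs[doc_id][line_id + 1], 0
    # the random "next" sentence must come from a different document
    other = random.choice([i for i in range(len(docs)) if i != doc_id])
    return sent_a, random.choice(docs[other]), 1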

@davidefiocco (Contributor) commented Dec 18, 2018

Hi @deepset-ai, this is great! Just a suggestion: if this makes it into the repo, it would be good to also document this functionality in the README as part of this pull request.

@ghost (Author) commented Dec 19, 2018

Just added some basic documentation to the README. Happy to include more if @thomwolf thinks that makes sense.

@thomwolf (Member) commented Dec 19, 2018

Yes, I was going to ask you to add some information to the README; it's great. The more, the better. If you could also add instructions on how to download a dataset for training, as in the other examples, that would be perfect. If your dataset is private, do you have another dataset in mind that would let users try your script easily? If not, that's OK, don't worry.

Another thing: the fp16 logic has now been switched to NVIDIA's apex module, and we have gotten rid of the optimize_on_cpu option (see the relevant PR for more details). You can see the changes in the current examples like run_squad.py; it's actually a lot simpler since we don't have to manage parameter copies in the example, and it's also faster. Do you think you could adapt the fp16 parts of your script similarly?
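
For reference, roughly what that adaptation could look like; the apex import paths and optimizer arguments below are assumptions based on the examples of that period (run_squad.py after #116), not the exact code:

def build_optimizer(model, optimizer_grouped_parameters, args, num_train_steps):
    """fp16 path replaces the old optimize_on_cpu parameter copies (sketch)."""
    if args.fp16:
        model.half()  # fp16 model weights; FP16_Optimizer keeps fp32 master copies
        from apex.optimizers import FP16_Optimizer, FusedAdam  # assumed import paths
        optimizer = FusedAdam(optimizer_grouped_parameters,
                              lr=args.learning_rate,
                              bias_correction=False,
                              max_grad_norm=1.0)
        if args.loss_scale == 0:
            optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
        else:
            optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale)
    else:
        from pytorch_pretrained_bert.optimization import BertAdam
        optimizer = BertAdam(optimizer_grouped_parameters,
                             lr=args.learning_rate,
                             warmup=args.warmup_proportion,
                             t_total=num_train_steps)
    return optimizer

# In the training loop the backward call changes too:
#     optimizer.backward(loss) if args.fp16 else loss.backward()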

logger = logging.getLogger(__name__)


class BERTDataset(Dataset):
Member:

I like this class. I think we should actually create a data.py module in the main package that would gather a few utilities to make working with BERT easier, so they can be imported from the package instead of copied from script to script. I'm thinking about this dataset class, but also utilities like convert_example_to_features and maybe even your random_word function.

Maybe we should add some abstract classes/low-level functions from which the data manipulation logic of the other examples (run_classifier, run_squad and extract_feature) could also be built.

What do you think? I haven't looked at the details yet, so maybe it doesn't make sense. If you don't have time to look at it, I will work through it when I start working on the next release.

Contributor:

I also think this would add value, since there are probably quite a few things you could share between the examples. In addition, this module would be helpful for people who develop new, more specific down-stream tasks. Unfortunately, I probably won't have time to work on this in the next weeks. It would be great if you could take over when working on the next release.
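
As one concrete example of what such a shared module could contain, here is a sketch of the standard BERT masking recipe (select 15% of tokens; of those, replace 80% with [MASK], 10% with a random token, and leave 10% unchanged). This is illustrative, not necessarily the exact random_word from this PR:

import random

def random_word(tokens, tokenizer):
    """Mask tokens in place and return (tokens, label ids), with -1 for unmasked positions."""
    output_label = []
    for i, token in enumerate(tokens):
        if random.random() < 0.15:
            prob = random.random()
            if prob < 0.8:
                tokens[i] = "[MASK]"
            elif prob < 0.9:
                tokens[i] = random.choice(list(tokenizer.vocab.keys()))
            # else: keep the original token (10% of the selected positions)
            output_label.append(tokenizer.vocab.get(token, tokenizer.vocab["[UNK]"]))
        else:
            output_label.append(-1)
    return tokens, output_label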

while item > doc_end:
doc_id += 1
doc_start = doc_end + 1
doc_end += len(self.all_docs[doc_id]) - 1
Member:

Is there a specific reason you iterate every time over the dataset rather than constructing an index->doc mapping when you read the file?

Contributor:

You are totally right. This is a left-over from another approach. Creating an initial mapping makes way more sense. I have now added a mapping for index -> {doc_id, line}.
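
A small sketch of that mapping, built once at load time (hypothetical names; all_docs is assumed to be a list of documents, each a list of lines):

def build_sample_index(all_docs):
    """Flat sample index -> {doc_id, line}; skip each document's last line,
    since it has no following sentence to pair with."""
    sample_to_doc = []
    for doc_id, doc in enumerate(all_docs):
        for line in range(len(doc) - 1):
            sample_to_doc.append({"doc_id": doc_id, "line": line})
    return sample_to_doc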

try:
output_label.append(tokenizer.vocab[token])
except KeyError:
# For unknown words (should not occur with BPE vocab)
Member:

Should we log this e.g. with a warning?

Contributor:

Done
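
A sketch of what the logged fallback could look like (assuming the module-level logger shown earlier and an [UNK] fallback; not necessarily the exact resolution):

try:
    output_label.append(tokenizer.vocab[token])
except KeyError:
    # For unknown words (should not occur with BPE vocab): fall back to [UNK] and warn
    output_label.append(tokenizer.vocab["[UNK]"])
    logger.warning("Cannot find token '%s' in vocab. Using [UNK] instead.", token)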

type=int,
default=1,
help="Number of updates steps to accumualte before performing a backward/update pass.")
parser.add_argument('--optimize_on_cpu',
Member:

We can remove this now (see #116)

Contributor:

Done

for n, param in model.named_parameters()]
else:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
Member:

This part has also changed; see the new names in the current examples.

Contributor:

Done. I have tried to replicate the apex usage from the other examples. Since I don't have much experience with apex yet, you might want to check briefly whether there's something I missed.

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
{'params': [p for n, p in param_optimizer if n not in no_decay], 'weight_decay_rate': 0.01},
Member:

And this was wrong and is now fixed (should be something like p for name, p in param_optimizer if any(n in no_decay for n in name))

Contributor:

Done
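
A sketch of the corrected grouping, mirroring the pattern in the other examples of that period (the exact key name and no_decay list may differ):

def grouped_parameters(model):
    """No weight decay for parameters whose names contain 'bias', 'gamma' or 'beta'."""
    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'gamma', 'beta']
    return [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.0},
    ]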

@Rocketknight1 (Member) commented:
This is something I'd been working on as well, congrats on a nice implementation!

One question, though: I noticed you stripped out the code for evaluating on a test set, but when fine-tuning the LM on a smaller corpus, would it be worth keeping that in? Overfitting is much more of a risk in a smaller corpus.

… line in doc' mapping. add warning for unknown word.
@ghost changed the title from "Add example for fine tuning BERT language model (#1)" to "Add example for fine tuning BERT language model" on Dec 20, 2018
@tholor (Contributor) commented Dec 20, 2018


@Rocketknight1, you are right that we will probably need some better evaluation here. Currently, though, I have the feeling that evaluation on down-stream tasks is more meaningful (see also Jacob Devlin's comment here). In addition, some better monitoring of the loss during and after training would be nice.

Do you already have something in place and would like to contribute on this? Otherwise, I will try to find some time during the upcoming holidays to add this.

@julien-c force-pushed the master branch 3 times, most recently from 4a8c950 to 8da280e on December 20, 2018 21:33
@Rocketknight1 (Member) commented:

I don't have any evaluation code either, unfortunately! It might be easier to just evaluate on the final classification task, so it's not really urgent. I'll experiment with LM fine-tuning when I'm back at work in January. If I get good benefits on classification tasks, I'll see what effect early stopping based on validation loss has, and if that turns out to be useful too, I can submit a PR for it.
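
A generic sketch of that early-stopping idea (not part of this PR): track the pre-training loss on a held-out split and stop once it stops improving. train_epoch and evaluate are hypothetical callables supplied by the training script.

def train_with_early_stopping(train_epoch, evaluate, max_epochs=20, patience=2):
    """train_epoch() runs one pass over the corpus; evaluate() returns the held-out loss."""
    best_loss, bad_epochs = float("inf"), 0
    for _ in range(max_epochs):
        train_epoch()
        val_loss = evaluate()
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_loss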

@kaushaltrivedi commented:
Have you thought about extending the vocabulary after fine-tuning on a custom dataset? This could be useful if the custom dataset has specific terms related to that domain.

@tholor (Contributor) commented Jan 2, 2019


Adjusting the vocabulary before fine-tuning could be interesting, but you would need some smart approach to exchange "less important" tokens from the original byte-pair vocab with "important" ones from your custom corpus (while keeping the pre-trained embeddings for the rest of the vocab meaningful).
We are not working on this at the moment. Looking forward to a PR if you have time to work on it.

@kaushaltrivedi commented:

Yes, I am working on it. The idea is to add more items to the pretrained vocabulary. I will also adjust the model layers bert.embeddings.word_embeddings.weight and cls.predictions.decoder.weight with mean weights, and update cls.predictions.bias with the mean bias for the additional vocabulary words.

Will send out a PR once I test it.
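
For illustration, a rough sketch of that resize (not @kaushaltrivedi's actual code): it assumes a BertForPreTraining model, appends mean-initialised rows for the new vocabulary items, and re-ties the decoder weight to the enlarged input embeddings.

import torch

def extend_bert_vocab(model, num_new_tokens):
    """Append mean-initialised embedding rows and bias entries for extra vocab items."""
    with torch.no_grad():
        emb = model.bert.embeddings.word_embeddings.weight              # shape (V, H)
        mean_row = emb.mean(dim=0, keepdim=True)
        new_emb = torch.nn.Parameter(torch.cat([emb, mean_row.repeat(num_new_tokens, 1)], dim=0))
        model.bert.embeddings.word_embeddings.weight = new_emb
        # the decoder weight is tied to the input embeddings, so point it at the new tensor
        model.cls.predictions.decoder.weight = new_emb
        bias = model.cls.predictions.bias                               # shape (V,)
        new_bias = torch.cat([bias, torch.full((num_new_tokens,), bias.mean().item())])
        model.cls.predictions.bias = torch.nn.Parameter(new_bias)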

@thomwolf (Member) commented Jan 7, 2019

Ok this looks very good, I am merging, thanks a lot @tholor!

@thomwolf thomwolf merged commit c18bdb4 into huggingface:master Jan 7, 2019
qwang70 pushed a commit to DRL36/pytorch-pretrained-BERT that referenced this pull request Mar 2, 2019
Add example for fine tuning BERT language model
ocavue pushed a commit to ocavue/transformers that referenced this pull request Sep 13, 2023