Add example for fine tuning BERT language model #124
Conversation
Adds an example for loading a pre-trained BERT model and fine-tuning it as a language model (masked tokens & nextSentence) on your target corpus.
This looks like a great addition! Is it a full re-implementation of the pre-training script?
The implementation uses the same sampling parameters and logic, but it's not a one-to-one re-implementation of the original pre-training script. Main differences:
Main similarities:
Happy to clarify further details!
Hi @deepset-ai, this is great. Just a suggestion: if this makes it into the repo, it would be good to also include something about this functionality in the README as part of this pull request.
Just added some basic documentation to the README. Happy to include more if @thomwolf thinks that makes sense.
Yes, I was going to ask you to add some information to the README; it's great. The more, the better. If you can also add instructions on how to download a dataset for training, as in the other examples, that would be perfect. If your dataset is private, do you have another dataset in mind that would let users try your script easily? If not, that's okay, don't worry. Another thing is that the
logger = logging.getLogger(__name__)

class BERTDataset(Dataset):
I like this class. I think we should actually create a data.py module in the main package that would gather a few utilities for working more easily with BERT, which could be imported from the package instead of copied from script to script. I'm thinking about this dataset class, but also utilities like convert_example_to_features and maybe even your random_word function.
Maybe we should add some abstract classes/low-level functions from which the data manipulation logic of the other examples (run_classifier, run_squad and extract_feature) could also be built.
What do you think? I haven't looked at the details yet, so maybe it doesn't make sense. If you don't have time to look at that, I will work it through when I start working on the next release.
I also think this would add value, since there are probably quite a few things that could be shared between the examples. In addition, this module would be helpful for people who develop new, more specific down-stream tasks. Unfortunately, I probably won't have time to work on this in the next few weeks. It would be great if you could take it over when working on the next release.
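For reference, here is a minimal sketch of what such a shared random_word utility could look like, following the masking scheme from the BERT paper (80% [MASK], 10% random token, 10% unchanged). The function name matches the one discussed above, but the body is illustrative rather than the exact code from this PR:

```python
import random

def random_word(tokens, tokenizer):
    """Mask WordPiece tokens for the masked-LM objective (sketch, not the PR's exact code).

    Returns the (possibly modified) tokens and a label list that holds the original
    vocab id for selected positions and -1 everywhere else (ignored by the loss).
    """
    output_label = []
    for i, token in enumerate(tokens):
        if random.random() < 0.15:            # select 15% of tokens for prediction
            prob = random.random()
            if prob < 0.8:                     # 80%: replace with [MASK]
                tokens[i] = "[MASK]"
            elif prob < 0.9:                   # 10%: replace with a random vocab token
                tokens[i] = random.choice(list(tokenizer.vocab.keys()))
            # remaining 10%: keep the original token
            output_label.append(tokenizer.vocab.get(token, tokenizer.vocab["[UNK]"]))
        else:
            output_label.append(-1)            # not selected: no prediction target
    return tokens, output_label
```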
examples/run_lm_finetuning.py
while item > doc_end:
    doc_id += 1
    doc_start = doc_end + 1
    doc_end += len(self.all_docs[doc_id]) - 1
Is there a specific reason you iterate over the dataset every time rather than constructing an index->doc mapping when you read the file?
You are totally right. This is a left-over from another approach. Creating an initial mapping makes way more sense. I have now added a mapping for index -> {doc_id, line}.
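A minimal sketch of how such an index -> (doc_id, line) mapping could be built once while reading the corpus file; the variable names (corpus_path, sample_to_doc) are illustrative, not necessarily those used in the PR:

```python
corpus_path = "corpus.txt"   # assumed: one sentence per line, blank lines between documents

sample_to_doc = []           # flat sample index -> (doc_id, line_in_doc)
all_docs = []
doc = []
with open(corpus_path, encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:                         # empty line marks the end of a document
            if doc:
                all_docs.append(doc)
                doc = []
        else:
            sample_to_doc.append((len(all_docs), len(doc)))
            doc.append(line)
    if doc:                                  # last document (no trailing empty line)
        all_docs.append(doc)

# __getitem__ can then do a direct lookup instead of scanning documents:
# doc_id, line_in_doc = sample_to_doc[item]
```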
try:
    output_label.append(tokenizer.vocab[token])
except KeyError:
    # For unknown words (should not occur with BPE vocab)
Should we log this e.g. with a warning?
Done
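For illustration, the fixed branch presumably looks something like the following fragment (in the context of the masking loop above); the fallback to [UNK] and the exact message are assumptions, not a quote of the merged code:

```python
try:
    output_label.append(tokenizer.vocab[token])
except KeyError:
    # For unknown words (should not occur with BPE vocab): fall back to [UNK] and warn.
    output_label.append(tokenizer.vocab["[UNK]"])
    logger.warning("Cannot find token '{}' in vocab. Using [UNK] instead.".format(token))
```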
examples/run_lm_finetuning.py
type=int,
default=1,
help="Number of update steps to accumulate before performing a backward/update pass.")
parser.add_argument('--optimize_on_cpu',
We can remove this now (see #116)
Done
examples/run_lm_finetuning.py
for n, param in model.named_parameters()]
else:
    param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
This part has also changed; see the new names in the current examples.
Done. I have tried to replicate the apex usage from the other examples. Since I don't have much experience with apex yet, you might want to check briefly whether there's something I missed.
examples/run_lm_finetuning.py
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if n not in no_decay], 'weight_decay_rate': 0.01},
And this was wrong and is now fixed (should be something like p for name, p in param_optimizer if any(n in no_decay for n in name)).
Done
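For context, a sketch of the corrected grouping: parameter names are dotted paths such as bert.encoder.layer.0.output.dense.bias, so the no-decay check has to look for substrings rather than exact matches. The no_decay entries and the weight_decay_rate key here follow the other examples of that era; treat the exact names as assumptions rather than the merged code:

```python
from pytorch_pretrained_bert import BertForPreTraining

model = BertForPreTraining.from_pretrained('bert-base-uncased')

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    # weight decay for everything except biases and LayerNorm parameters
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.01},
    # no weight decay for biases and LayerNorm parameters
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0},
]
# optimizer_grouped_parameters would then be passed to the optimizer (e.g. BertAdam).
```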
This is something I'd been working on as well, congrats on a nice implementation! One question, though: I noticed you stripped out the code for evaluating on a test set, but when fine-tuning the LM on a smaller corpus, would it be worth keeping that in? Overfitting is much more of a risk in a smaller corpus.
… line in doc' mapping. add warning for unknown word.
@Rocketknight1, you are right that we will probably need some better evaluation here. Currently, though, I have the feeling that evaluation on down-stream tasks is more meaningful (see also Jacob Devlin's comment here). In addition, some better monitoring of the loss during and after training would be nice. Do you already have something in place and would you like to contribute this? Otherwise, I will try to find some time during the upcoming holidays to add it.
(force-pushed from 4a8c950 to 8da280e)
I don't have any evaluation code either, unfortunately! It might be easier to just evaluate on the final classification task, so it's not really urgent. I'll experiment with LM fine-tuning when I'm back at work in January. If I get good benefits on classification tasks, I'll see what effect early stopping based on validation loss has, and if that turns out to be useful too, I can submit a PR for it?
Have you thought about extending the vocabulary when fine-tuning on a custom dataset? This could be useful if the custom dataset has specific terms related to that domain.
Adjusting the vocabulary before fine-tuning could be interesting, but you would need some smart approach to exchange "less important" tokens from the original byte pair vocab with "important" ones from your custom corpus (while keeping the pre-trained embeddings for the rest of the vocab meaningful).
Yes, I am working on it. The idea is to add more items to the pre-trained vocabulary and to adjust the corresponding model tensors: bert.embeddings.word_embeddings.weight and cls.predictions.decoder.weight get rows initialized with the mean of the existing weights, and cls.predictions.bias gets the mean bias for the additional vocabulary words. I will send out a PR once I test it.
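A rough sketch of the idea described above, using mean-initialized rows; the new tokens and the resize logic are purely illustrative and not taken from the commenter's PR (the tokenizer's vocab would also need the new tokens appended, which is not shown):

```python
import torch
from pytorch_pretrained_bert import BertForPreTraining

model = BertForPreTraining.from_pretrained('bert-base-uncased')
new_tokens = ['domainterm1', 'domainterm2']   # hypothetical domain-specific tokens
n_new = len(new_tokens)

old_emb = model.bert.embeddings.word_embeddings.weight.data   # [vocab_size, hidden]
old_bias = model.cls.predictions.bias.data                    # [vocab_size]
mean_emb = old_emb.mean(dim=0, keepdim=True)                  # [1, hidden]
mean_bias = old_bias.mean().item()

# Grow the embedding matrix by n_new mean-initialized rows.
new_emb = torch.cat([old_emb, mean_emb.repeat(n_new, 1)], dim=0)
model.bert.embeddings.word_embeddings = torch.nn.Embedding.from_pretrained(new_emb, freeze=False)

# Keep the LM decoder tied to the (now larger) embedding matrix, as in the original model,
# and extend the output bias with the mean bias for the new vocabulary entries.
model.cls.predictions.decoder = torch.nn.Linear(new_emb.size(1), new_emb.size(0), bias=False)
model.cls.predictions.decoder.weight = model.bert.embeddings.word_embeddings.weight
model.cls.predictions.bias = torch.nn.Parameter(
    torch.cat([old_bias, torch.full((n_new,), mean_bias)]))
```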
OK, this looks very good, I am merging. Thanks a lot @tholor!
Add example for fine tuning BERT language model
We are currently working on fine-tuning the language model on a new target corpus. This should improve the model if the language style in your target corpus differs significantly from the one initially used for training BERT (Wikipedia + BookCorpus) but your corpus is still too small for training BERT from scratch. In our case, we apply this to a rather technical English corpus.
The sample script loads a pre-trained BERT model and fine-tunes it as a language model (masked tokens & nextSentence) on your target corpus. The samples from the target corpus can either be fed to the model directly from memory or read from disk one by one.
Training the language model from scratch, without loading a pre-trained BERT model, is also not very difficult to do from here. In contrast to the original TF repo, you can do the training on multiple GPUs instead of a TPU.
We thought this might also be helpful for others.
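To make the training objective concrete, here is a minimal sketch of the step the script is built around: load a pre-trained BertForPreTraining model and compute the joint masked-LM + next-sentence loss on a batch. The tensors below are dummy placeholders, not the script's actual data pipeline:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')   # would build the real inputs
model = BertForPreTraining.from_pretrained('bert-base-uncased')
model.train()

# Dummy batch of two sequences of length 128 (placeholders for the masked /
# next-sentence samples produced from the target corpus).
input_ids = torch.zeros(2, 128, dtype=torch.long)
segment_ids = torch.zeros(2, 128, dtype=torch.long)                # sentence A/B ids
input_mask = torch.ones(2, 128, dtype=torch.long)
lm_label_ids = torch.full((2, 128), -1, dtype=torch.long)          # -1 = not masked, ignored by the loss
is_next = torch.tensor([0, 1])                                     # next-sentence labels

# With labels provided, the forward pass returns the combined masked-LM + next-sentence loss.
loss = model(input_ids, segment_ids, input_mask, lm_label_ids, is_next)
loss.backward()
```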