
add learning rate visualisation and manual parameter #161

Merged
merged 11 commits into master from features/learning_rate_param on Jul 16, 2023

Conversation

lfoppiano
Collaborator

@lfoppiano commented on May 8, 2023

In line with the use of incremental training, knowing the final learning rate of the "previous training" and being able to set it manually can be helpful.

This PR (updated list):

  • adds a new parameter --learning-rate to the *Tagging applications, to override the default learning rate value
  • removes all the hard-coded learning rates and sets the default values as discussed in #161 (comment):
    • transformers (2e-5) and
    • RNN (0.0001)
  • adds a callback that prints the learning rate at the end of each epoch (a sketch of such a callback is shown below)
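For illustration only, a minimal sketch of what such a per-epoch learning-rate logging callback could look like in Keras; the class name and the way the current value is read from the optimizer are assumptions, not the actual code in this PR:

```python
import tensorflow as tf

class LearningRateLogger(tf.keras.callbacks.Callback):
    """Hypothetical callback: report the optimizer's current learning rate after each epoch."""

    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.learning_rate
        # when a schedule is used, it must be evaluated at the current training step
        if isinstance(lr, tf.keras.optimizers.schedules.LearningRateSchedule):
            lr = lr(self.model.optimizer.iterations)
        print(f"epoch {epoch + 1}: learning_rate = {float(tf.keras.backend.get_value(lr)):.4e}")
```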

@lfoppiano requested a review from kermitt2 on May 8, 2023 at 23:49
@kermitt2
Owner

Hi Luca!

I realize that the default learning rate should depend on the architecture, and that was not well done: low learning rates like 0.0001 or lower are typical for BERT, but RNNs need a much higher value. So the older default (0.001) was too high for BERT I think, but the new default (0.0001) in this PR is now too low for RNNs.

It's definitely useful to add it as a command-line parameter, but I think we should set the default learning rate in the configure() functions of the different applications, depending on the selected architecture.
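For instance, a minimal sketch of what such a per-architecture default could look like in an application's configure() function; the constants and the architecture-detection test are illustrative assumptions, not the actual code:

```python
def configure(architecture, learning_rate=None):
    """Hypothetical excerpt: pick a default learning rate when none is given on the command line."""
    if learning_rate is None:
        # transformer fine-tuning typically uses 2e-5; RNN models need a much higher value
        learning_rate = 2e-5 if architecture.startswith("BERT") else 1e-3
    return learning_rate


print(configure("BERT_CRF"))     # 2e-05
print(configure("BidLSTM_CRF"))  # 0.001
```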

@kermitt2
Owner

Ok, I double-checked: for both sequence labeling and text classification, the learning rate for all transformer architectures is hard-coded at init_lr=2e-5 in the decay optimizer (this is the usual value). It's not using the config value.

Only the RNN models were using the config learning rate value, and the default (0.001) was set for them.

@kermitt2
Owner

So these were my assumptions when I added the decay optimizers:

  1. for transformers we always use 2e-5 as the learning rate, because everybody uses that value and we don't want to change it (I vaguely remember having tested 1e-5, but it was very slightly worse, and higher values are not recommended because they make the model more prone to "forgetting" some training examples).

  2. for RNN models, changing the learning rate is more common, so it uses the config value.

@lfoppiano
Collaborator Author

Thanks for the clarification.
I think having a configurable parameter could be useful, for example to lower the learning rate for incremental training.
I propose the following:

  • we fetch the value from the training config (which will then print the right value at startup)
  • I set the default value in the wrapper to None and reset the default in the constructor depending on whether it's a transformer or not (see the sketch after this comment)

We can also set the value in the application, but at least we don't risk running it with the wrong default value.

Let me know if this makes sense
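A rough sketch of the proposed wrapper-side behaviour, where None means "use the architecture default"; the names and constants are indicative only and the transformer detection in the real code may differ:

```python
class SequenceWrapper:
    """Hypothetical excerpt of the proposed wrapper constructor."""

    def __init__(self, architecture="BidLSTM_CRF", transformer_name=None, learning_rate=None):
        if learning_rate is None:
            # nothing passed from the application: default depends on whether a transformer is used
            learning_rate = 2e-5 if transformer_name is not None else 1e-3
        self.learning_rate = learning_rate  # stored in the training config and printed at startup


model = SequenceWrapper(architecture="BERT", transformer_name="bert-base-cased")
print(model.learning_rate)  # 2e-05
```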

@lfoppiano
Collaborator Author

lfoppiano commented Jun 15, 2023

I've fixed the default values (also in the classification trainer).

I've added a callback that prints the decayed LR at each epoch; however, I get the following output:

```
---
max_epoch: 60
early_stop: True
patience: 5
batch_size (training): 80
max_sequence_length: 30
model_name: grobid-date-BERT
learning_rate:  2e-05
use_ELMo:  False
---
[...]
__________________________________________________________________________________________________
Epoch 1/60
8/8 [==============================] - ETA: 0s - loss: 2.0593	f1 (micro): 47.24
8/8 [==============================] - 69s 8s/step - loss: 2.0593 - f1: 0.4724 - learning_rate: 3.8095e-06
Epoch 2/60
8/8 [==============================] - ETA: 0s - loss: 1.2964	f1 (micro): 82.82
8/8 [==============================] - 43s 4s/step - loss: 1.2964 - f1: 0.8282 - learning_rate: 7.6190e-06
Epoch 3/60
8/8 [==============================] - ETA: 0s - loss: 0.6858	f1 (micro): 87.61
8/8 [==============================] - 29s 4s/step - loss: 0.6858 - f1: 0.8761 - learning_rate: 1.1429e-05
Epoch 4/60
8/8 [==============================] - ETA: 0s - loss: 0.3628	f1 (micro): 92.73
8/8 [==============================] - 29s 4s/step - loss: 0.3628 - f1: 0.9273 - learning_rate: 1.5238e-05
Epoch 5/60
8/8 [==============================] - ETA: 0s - loss: 0.1840	f1 (micro): 94.89
8/8 [==============================] - 15s 2s/step - loss: 0.1840 - f1: 0.9489 - learning_rate: 1.9048e-05
Epoch 6/60
8/8 [==============================] - ETA: 0s - loss: 0.1167	f1 (micro): 94.61
8/8 [==============================] - 25s 3s/step - loss: 0.1167 - f1: 0.9461 - learning_rate: 1.9683e-05
Epoch 7/60
8/8 [==============================] - ETA: 0s - loss: 0.0769	f1 (micro): 94.89
8/8 [==============================] - 23s 3s/step - loss: 0.0769 - f1: 0.9489 - learning_rate: 1.9259e-05
Epoch 8/60
8/8 [==============================] - ETA: 0s - loss: 0.0656	f1 (micro): 95.50
8/8 [==============================] - 23s 3s/step - loss: 0.0656 - f1: 0.9550 - learning_rate: 1.8836e-05
Epoch 9/60
8/8 [==============================] - ETA: 0s - loss: 0.0562	f1 (micro): 95.50
8/8 [==============================] - 36s 5s/step - loss: 0.0562 - f1: 0.9550 - learning_rate: 1.8413e-05
Epoch 10/60
8/8 [==============================] - ETA: 0s - loss: 0.0514	f1 (micro): 96.10
8/8 [==============================] - 21s 3s/step - loss: 0.0514 - f1: 0.9610 - learning_rate: 1.7989e-05
Epoch 11/60
8/8 [==============================] - ETA: 0s - loss: 0.0424	f1 (micro): 96.70
8/8 [==============================] - 19s 2s/step - loss: 0.0424 - f1: 0.9670 - learning_rate: 1.7566e-05
Epoch 12/60
8/8 [==============================] - ETA: 0s - loss: 0.0348	f1 (micro): 96.70
8/8 [==============================] - 17s 2s/step - loss: 0.0348 - f1: 0.9670 - learning_rate: 1.7143e-05
Epoch 13/60
8/8 [==============================] - ETA: 0s - loss: 0.0348	f1 (micro): 96.70
8/8 [==============================] - 38s 5s/step - loss: 0.0348 - f1: 0.9670 - learning_rate: 1.6720e-05
Epoch 14/60
8/8 [==============================] - ETA: 0s - loss: 0.0294	f1 (micro): 96.70
8/8 [==============================] - 36s 5s/step - loss: 0.0294 - f1: 0.9670 - learning_rate: 1.6296e-05
Epoch 15/60
8/8 [==============================] - ETA: 0s - loss: 0.0244	f1 (micro): 96.10
8/8 [==============================] - 19s 2s/step - loss: 0.0244 - f1: 0.9610 - learning_rate: 1.5873e-05
Epoch 16/60
8/8 [==============================] - ETA: 0s - loss: 0.0251	f1 (micro): 96.10
8/8 [==============================] - 12s 1s/step - loss: 0.0251 - f1: 0.9610 - learning_rate: 1.5450e-05
training runtime: 457.32 seconds 
model config file saved
preprocessor saved
transformer config saved
transformer tokenizer saved
model saved
```

The initial learning rate is 2e-05 (0.00002), but 🤔 🤔 it seems that in the first epochs it floats up and down, and only starts decreasing after epoch 6. Is this normal? AFAIK it should only go down.
While trying to figure that out, I noticed that in the wrapper there is a parameter lr_decay=0.9, but I can't see it used anywhere. For example, for the transformers we have:

```python
# create_optimizer comes from the Hugging Face transformers library
optimizer, lr_schedule = create_optimizer(
    init_lr=self.training_config.learning_rate,
    num_train_steps=nb_train_steps,
    weight_decay_rate=0.01,
    num_warmup_steps=0.1 * nb_train_steps,
)
```

or, for non-transformers, with the Adam optimizer:

```python
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=self.training_config.learning_rate,
    decay_steps=nb_train_steps,
    decay_rate=0.1)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```

In this case, should I assume decay_rate = 1 - lr_decay?
What about the transformers?
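For reference, a small illustration of how tf.keras.optimizers.schedules.ExponentialDecay behaves: decay_rate is a multiplicative factor applied once every decay_steps steps, so it is not directly comparable to an lr_decay parameter unless the two are wired together explicitly. The numbers below are just an example:

```python
import tensorflow as tf

# learning_rate(step) = initial_learning_rate * decay_rate ** (step / decay_steps)
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,  # example RNN default
    decay_steps=480,             # e.g. 8 steps/epoch * 60 epochs, as in the log above
    decay_rate=0.1)

print(float(schedule(0)))    # 1e-3 at the start of training
print(float(schedule(480)))  # 1e-4 after decay_steps steps (multiplied once by decay_rate)
```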

@lfoppiano changed the title from "add learning rate as parameter" to "add learning rate visualisation and manual parameter" on Jun 15, 2023
@lfoppiano
Collaborator Author

If I remove the warmup steps, the learning rate no longer floats around.
Are we sure the warmup steps are necessary for fine-tuning? 🤔

@kermitt2
Owner

Yes, normally warm-up is important when fine-tuning with transformers and ELMo (if I remember well, warmup is more important than the learning rate decay!).

The create_optimizer method that manages the learning rate and warmup comes directly from the transformers library, and the up and down might be the expected behaviour. The warmup applies in the first epochs with a lower learning rate, to avoid sudden overfitting at the very beginning of training. So with warmup, the LR starts lower than init_lr and only reaches the init_lr value once the warmup phase is done.
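As an illustration of that behaviour, here is a simplified sketch (not the actual schedule built by transformers' create_optimizer, whose exact decay shape may differ): linear ramp-up during warmup, then linear decay towards zero.

```python
def illustrative_lr(step, init_lr=2e-5, num_train_steps=480, num_warmup_steps=48):
    """Simplified schedule: linear warmup to init_lr, then linear decay to zero."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    return init_lr * (num_train_steps - step) / (num_train_steps - num_warmup_steps)


# With 8 steps per epoch and num_warmup_steps = 0.1 * num_train_steps, as in the snippet above,
# the warmup window covers roughly the first 6 epochs, which is consistent with the ramp-up in the log.
for epoch in (1, 3, 6, 16):
    print(epoch, illustrative_lr(epoch * 8))
```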

Review comments on delft/sequenceLabelling/wrapper.py and delft/textClassification/wrapper.py (outdated, resolved)
@kermitt2 merged commit 2f8976c into master on Jul 16, 2023 (1 check passed)
@lfoppiano deleted the features/learning_rate_param branch on August 9, 2023 at 04:37