The more I pretrain (SSL), the worse fine-tuned model gets? #9175
-
Hi, I'm trying out SSL pretraining for ASR. I have about 50k hours of unlabeled data and 2k hours of transcribed data. With 100k hours of data, first of all I couldn't get the default dataloader to work because the CPU runs out of memory within one epoch. I then trained with only the contrastive loss, but after about 50k steps the SSL contrastive loss starts to stagnate. I took two checkpoints, one from around 40k steps and another from the end at 80k steps, and fine-tuned each with the labeled ASR data. The checkpoint from 80k steps converges much more slowly. It seems that the more I pretrain, the worse the pretrained model gets. Is this expected? What could I be doing wrong? This is not an isolated case: I've tried other combinations of parameters / models (e.g. Conformer, FastConformer) and smaller datasets for pretraining, and with less data the pretrained model also seems to come out better.
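For context, here is a minimal sketch of one way to carry the SSL encoder weights into the fine-tuning model; the `encoder.` key prefix, the checkpoint layout, and the helper name are assumptions about a typical Lightning-style checkpoint rather than the exact code I run:

```python
# Hedged sketch: copy only the encoder weights from an SSL checkpoint into the
# fine-tuning ASR model. Key prefix and checkpoint layout are assumptions.
import torch
from torch import nn


def init_encoder_from_ssl(asr_model: nn.Module, ssl_ckpt_path: str) -> None:
    ckpt = torch.load(ssl_ckpt_path, map_location="cpu")
    # Lightning checkpoints usually nest weights under "state_dict".
    state = ckpt.get("state_dict", ckpt)
    # Keep only the encoder; the SSL quantizer / contrastive head is not reused.
    encoder_state = {k: v for k, v in state.items() if k.startswith("encoder.")}
    missing, unexpected = asr_model.load_state_dict(encoder_state, strict=False)
    print(f"loaded {len(encoder_state)} encoder tensors, "
          f"{len(missing)} missing, {len(unexpected)} unexpected")
```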
-
Hello, could you try lowering your learning rate or using another optimizer like AdamW?
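For example, a minimal sketch with plain PyTorch (the concrete values here are illustrative placeholders, not tuned for your data):

```python
# Hedged sketch: AdamW with a lower peak LR, linear warmup and cosine decay.
# All hyperparameter values are illustrative placeholders.
import math

import torch


def build_optimizer(model: torch.nn.Module, total_steps: int, warmup_steps: int = 10_000):
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=1e-4, betas=(0.9, 0.98), weight_decay=1e-3
    )

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return max(0.05, 0.5 * (1.0 + math.cos(math.pi * progress)))  # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```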
-
@nithinraok can you comment?
-
Another problem I ran into: with the same data sampling and settings, the training phase runs fine, but validation fails at the contrastive loss calculation.
-
To leverage the benefits of SSL, you should use at least an XL-size model with >300M parameters.
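A quick way to check where a model sits relative to that bar (the `model` variable below is a placeholder for whatever you instantiate):

```python
# Hedged sketch: count trainable parameters of any torch.nn.Module,
# e.g. to check whether the encoder is above ~300M parameters.
import torch


def count_parameters(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Usage (hypothetical `model` variable):
# print(f"{count_parameters(model) / 1e6:.1f}M parameters")
```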
I am not sure which model you are using (Conformer or FastConformer) or the model size; these factors affect the training speed. @pzelasko, could you validate the lhotse arguments?

I can confirm that for non-causal models pretraining definitely helps. With causal models we haven't experimented much; I am currently training some very large FastConformer models, so I will only know once that training finishes. But based on my experiments, pretraining always helps with stable training, improved performance, and faster convergence.
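On the dataloader OOM: a minimal sketch of a memory-friendlier lhotse setup, assuming the problem is the full manifest being materialized in RAM; the manifest path and batching values are placeholders, and NeMo's lhotse integration should expose similar duration-based batching options through the training config:

```python
# Hedged sketch: stream the manifest lazily and batch by total duration with
# dynamic bucketing, instead of loading every cut into memory up front.
# Manifest path, max_duration and num_buckets are placeholders.
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_jsonl_lazy("cuts_unlabeled.jsonl.gz")  # read lazily, not into RAM
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=600.0,  # seconds of audio per batch
    num_buckets=30,
    shuffle=True,
)
```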