Reworked model #288
The reason I am not proposing to merge this right now is that I am trying a further cleanup/modification of this recipe, where we use a much simpler learning-rate schedule and initialization, and an optimizer that tries to keep the non-scalar parameters at a fixed RMS value (0.1). That way the relationship between the learning-rate schedule and the actual rate of parameter movement will become more intuitive.
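To make that idea concrete, here is a minimal, purely illustrative sketch (not the actual icefall optimizer; the function name and the post-step projection approach are my own assumptions) of what "keeping non-scalar parameters at a fixed RMS of 0.1" could look like:

```python
import torch

def constrain_param_rms(model: torch.nn.Module, target_rms: float = 0.1) -> None:
    """Rescale every non-scalar parameter so that its root-mean-square value
    equals target_rms. Intended to be called right after optimizer.step().
    Illustrative sketch only, not the icefall implementation."""
    with torch.no_grad():
        for p in model.parameters():
            if p.ndim > 0 and p.numel() > 1:  # leave scalar parameters alone
                rms = p.pow(2).mean().sqrt()
                if rms > 0:
                    p.mul_(target_rms / rms)
```

With a projection like this, the learning rate directly controls how far parameters move relative to a known, fixed scale, which is the intuition described above.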
Here are the WER results after the latest reworking. It is run with something like this:
... with only --world-size and --num-epochs being possibly changed; the learning-rate setup is designed so that you don't have to change it when you change the --world-size.
With libri-100:
More details of the libri-100 expts:
With full librispeech:
This was run here: /ceph-dan/icefall/egs/librispeech/ASR/rework2p_full
https://tensorboard.dev/experiment/UKI6z9BvT6iaUkXPxex1OA/
Note: I just uploaded the pretrained model, training logs, decoding logs, and decoding results for this PR to
Not really, I'm afraid. I got rid of the batchnorm and replaced it with a form of LayerNorm that has a large trainable epsilon; that's probably the most significant change. Also, I gave all parameters a trainable scalar scale (it's trained in log space). At some point I may push this into the optimizer, though.
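As a rough illustration of those two ideas (a LayerNorm-like module with a trainable epsilon, and a scalar scale trained in log space), here is a minimal PyTorch sketch; the class names and details are assumptions for illustration and do not necessarily match the icefall code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormWithTrainableEps(nn.Module):
    """LayerNorm-like normalization whose epsilon is a trainable parameter,
    stored in log space so it stays positive. Illustrative sketch only."""
    def __init__(self, initial_eps: float = 0.25):
        super().__init__()
        # log-space parameter; exp() of it gives the actual epsilon
        self.log_eps = nn.Parameter(torch.tensor(initial_eps).log())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # normalize by the RMS over the channel (last) dimension plus a learned eps
        scales = (x.pow(2).mean(dim=-1, keepdim=True) + self.log_eps.exp()).rsqrt()
        return x * scales

class ScaledLinear(nn.Linear):
    """Linear layer whose weight and bias get an extra scalar scale trained
    in log space. Illustrative sketch of the trainable-scale idea, not the
    icefall implementation."""
    def __init__(self, in_features: int, out_features: int, **kwargs):
        super().__init__(in_features, out_features, **kwargs)
        self.log_scale = nn.Parameter(torch.zeros(()))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.log_scale.exp()
        bias = self.bias * scale if self.bias is not None else None
        return F.linear(x, self.weight * scale, bias)
```

Training the epsilon and the scale in log space keeps both quantities positive without any explicit constraint, which is presumably why that parameterization is used.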
Thanks!
@danpovey In the reworked conformer, post-norm is used, and BasicNorm is always accompanied by an ActivationBalancer, which is meant to constrain the median of the output to be close to zero. However, in Conv2dSubsampling the order is BasicNorm -> ActivationBalancer, while in the conformer blocks it is ActivationBalancer -> BasicNorm.
conformer blocks:
The order of the ActivationBalancer versus BasicNorm makes little difference, because the BasicNorm in this case makes very little difference to the scale of the output of that component. In fact, I had forgotten that there was still a normalization component in there; I don't think it is necessary. I will remove it in a future version of the recipe.
Thanks. However, only the last of the blocks has a BasicNorm, so removing it means there are no norm layers anymore. Or should we just leave a BasicNorm as the post-norm?
I think there was always an extra normalization layer in the convolution module, even before my refactoring, although it may have been in the middle somewhere. |
I am putting this up so you can see the work I have been doing on the faster-to-train, better, reworked version of the conformer.
I don't have WERs for this specific version of the model yet; I'll put them up soon.