Reworked model #288

Merged
merged 199 commits into from
Apr 11, 2022

Conversation

danpovey
Collaborator

@danpovey danpovey commented Apr 2, 2022

I am putting this up so you can see the work I have been doing on the faster-to-train, better, reworked version of the conformer.
I don't have WERs for this specific version of the model yet, I'll put them up soon.

@danpovey
Collaborator Author

danpovey commented Apr 6, 2022

The reason I am not proposing to merge this right now is that I am trying a further cleanup/modification of this recipe, where we use a much simpler learning-rate schedule and initialization, and an optimizer that tries to keep the non-scalar parameters at a fixed rms value (0.1). That way the relationship between the learning-rate schedule and the actual rate of parameter movement becomes more intuitive.
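Roughly, the rms-limiting step works like the following sketch (illustrative only, not the actual optimizer code in this PR; limit_param_rms is a made-up helper name):

    import torch

    @torch.no_grad()
    def limit_param_rms(model: torch.nn.Module, target_rms: float = 0.1) -> None:
        # Scale down any non-scalar parameter whose RMS exceeds target_rms;
        # scalar parameters are left alone so they can learn overall magnitudes.
        for p in model.parameters():
            if p.numel() <= 1:
                continue
            rms = p.pow(2).mean().sqrt()
            if rms > target_rms:
                p.mul_(target_rms / rms)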

@danpovey
Collaborator Author

Here are WER results after the latest reworking. It was run with something like this:

python3 ./pruned_transducer_stateless2/train.py --exp-dir=pruned_transducer_stateless2/exp_1a --world-size 1 --num-epochs 30  --start-epoch 0 --full-libri 0 --max-duration 300 

... with only --world-size and --num-epochs being possibly changed; the learning-rate setup is designed so that you don't have to change it when you change the --world-size.

With libri-100,

    num-jobs    test_clean||test_other
                epoch=19,avg=8     epoch=29,avg=8     epoch=39,avg=10
    1           7.12||18.42
    2           7.05||18.77        6.82||18.14        6.81||17.66
    4           7.31||19.55        7.08||18.59        6.86||18.29

More details of libri-100 expts:
[note: due to a bug that should be fixed now, the learning rate plot cannot be seen.]
1 job: /ceph-dan/icefall/egs/librispeech/ASR/rework2p_refactor https://tensorboard.dev/experiment/AhnhooUBRPqTnaggoqo7lg/
2 jobs: /ceph-dan/icefall/egs/librispeech/ASR/rework2p_refactor_ws2 https://tensorboard.dev/experiment/dvOC9wsrSdWrAIdsebJILg/
4 jobs: /ceph-dan/icefall/egs/librispeech/ASR/rework2p_refactor_ws4 https://tensorboard.dev/experiment/a3T0TyC0R5aLj5bmFbRErA/

With full librispeech

    num-jobs    test_clean||test_other
                epoch=19,avg=8
    8           2.75||6.53

This was run here: /ceph-dan/icefall/egs/librispeech/ASR/rework2p_full https://tensorboard.dev/experiment/UKI6z9BvT6iaUkXPxex1OA/

@csukuangfj
Collaborator

Note: I just uploaded the pretrained model, training logs, decoding logs, and decoding results for this PR to
https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless2-2022-04-29

@desh2608
Collaborator

desh2608 commented May 16, 2022

@danpovey is there some documentation about the major changes between the original conformer and the "reworked" conformer?

Update (Sep 25, 2022): Details about the reworked conformer model are explained in this post based on this talk

@danpovey
Collaborator Author

danpovey commented May 17, 2022

Not really, I'm afraid. I got rid of the batchnorm and replaced it with a form of LayerNorm that has a large trainable epsilon; that's probably the most significant change. Also I gave all parameters a trainable scalar scale (it's trained in log space). At some point I may push this into the optimizer, though.
Also, the (non-scalar) parameters are limited in the optimizer to rms=0.1 (we scale the tensors down if they exceed this); we rely on the scalar scales to learn the magnitudes; and the learning-rate schedule is different. There is no warmup in the learning-rate schedule; instead, we have something called "model warmup", which is a value that you pass into the model and that varies from 0 to 1 over the first 3k batches; it controls a bypass on the conformer encoder layers.
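To make that a bit more concrete, here is a rough sketch of those two ideas (illustrative only; the real implementations in the recipe differ in detail, and the names below are made up):

    import torch
    import torch.nn as nn

    class BasicNormSketch(nn.Module):
        # A LayerNorm-like module with a large *trainable* epsilon: no mean
        # subtraction and no affine; eps is kept in log space so it stays positive.
        def __init__(self, eps: float = 0.25):
            super().__init__()
            self.log_eps = nn.Parameter(torch.tensor(float(eps)).log())

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Normalize by the RMS over the channel dim, damped by exp(log_eps).
            scales = (x.pow(2).mean(dim=-1, keepdim=True) + self.log_eps.exp()).rsqrt()
            return x * scales

    def warmup_bypass(layer_out: torch.Tensor, layer_in: torch.Tensor, warmup: float) -> torch.Tensor:
        # "Model warmup": as warmup ramps from 0 to 1 over the first ~3k batches,
        # the encoder layer output is blended with its input, so the layers start
        # out close to the identity and gradually take full effect.
        alpha = min(max(warmup, 0.0), 1.0)
        return alpha * layer_out + (1.0 - alpha) * layer_in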

@desh2608
Collaborator

Thanks!

@wangers

wangers commented Sep 25, 2022

@danpovey In the reworked conformer, post-norm is used, and a BasicNorm is always accompanied by an ActivationBalancer, which is meant to constrain the median of the output to be close to zero. However, in Conv2dSubsampling the order is BasicNorm -> ActivationBalancer, while in the conformer blocks it is ActivationBalancer -> BasicNorm.
Conv2dSubsampling:

        x = self.out_norm(x)
        x = self.out_balancer(x)

conformer blocks:

        src = self.norm_final(self.balancer(src))

Is there some guidance on how to design a BasicNorm/ActivationBalancer ordering in a stack such as tdnn-relu-basicnorm?

@danpovey
Collaborator Author

The order of the ActivationBalancer versus the BasicNorm makes little difference, because the BasicNorm in this case makes very little difference to the scale of the output of that component. In fact, I had forgotten that there was still a normalization component in there; I don't think it is necessary. I will remove it in a future version of the recipe.

@wangers

wangers commented Sep 25, 2022

Thanks. However, only the last of the blocks has a BasicNorm, so removing it would mean there are no norm layers left at all; or should just a single BasicNorm be kept as a post-norm?

@danpovey
Copy link
Collaborator Author

I think there was always an extra normalization layer in the convolution module, even before my refactoring, although it may have been somewhere in the middle.
Likely what happened is that, at the time, I wasn't able to remove that normalization without the model diverging; but since then we have made a lot of other changes that help stability, so I think this won't be an issue in the future.
