Reworked model #288

Merged
merged 199 commits into from
Apr 11, 2022

Conversation

danpovey
Collaborator

@danpovey danpovey commented Apr 2, 2022

I am putting this up so you can see the work I have been doing on the faster-to-train, better, reworked version of the conformer.
I don't have WERs for this specific version of the model yet, I'll put them up soon.

@danpovey
Collaborator Author

danpovey commented Apr 6, 2022

The reason I am not proposing to merge this right now is that I am trying a further cleanup/modification of this recipe, where we use a much simpler learning-rate schedule and initialization, and an optimizer that tries to keep the non-scalar parameters at a fixed rms value (0.1). That way the relationship between the learning-rate schedule and the actual rate of parameter movement becomes more intuitive.
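Roughly, the rms-limiting step works like the following sketch (illustrative only, not the actual optimizer code in this PR; limit_param_rms is a made-up helper name):

    import torch

    @torch.no_grad()
    def limit_param_rms(model: torch.nn.Module, target_rms: float = 0.1) -> None:
        # Scale down any non-scalar parameter whose RMS exceeds target_rms;
        # scalar parameters are left alone so they can learn overall magnitudes.
        for p in model.parameters():
            if p.numel() <= 1:
                continue
            rms = p.pow(2).mean().sqrt()
            if rms > target_rms:
                p.mul_(target_rms / rms)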

@danpovey
Collaborator Author

Here are WER results after the latest reworking. It was run with something like this:

python3 ./pruned_transducer_stateless2/train.py --exp-dir=pruned_transducer_stateless2/exp_1a --world-size 1 --num-epochs 30  --start-epoch 0 --full-libri 0 --max-duration 300 

... with only --world-size and --num-epochs being possibly changed; the learning-rate setup is designed so that you don't have to change it when you change the --world-size.

With libri-100,

    num-jobs    test_clean||test_other
                epoch=19,avg=8     epoch=29,avg=8     epoch=39,avg=10
    1           7.12||18.42
    2           7.05||18.77        6.82||18.14        6.81||17.66
    4           7.31||19.55        7.08||18.59        6.86||18.29

More details of libri-100 expts:
[note: due to a bug that should be fixed now, the learning rate plot cannot be seen.]
1 job: /ceph-dan/icefall/egs/librispeech/ASR/rework2p_refactor https://tensorboard.dev/experiment/AhnhooUBRPqTnaggoqo7lg/
2 jobs: /ceph-dan/icefall/egs/librispeech/ASR/rework2p_refactor_ws2 https://tensorboard.dev/experiment/dvOC9wsrSdWrAIdsebJILg/
4 jobs: /ceph-dan/icefall/egs/librispeech/ASR/rework2p_refactor_ws4 https://tensorboard.dev/experiment/a3T0TyC0R5aLj5bmFbRErA/

With full librispeech

    num-jobs    test_clean||test_other
                epoch=19,avg=8
    8           2.75||6.53

This was run here: /ceph-dan/icefall/egs/librispeech/ASR/rework2p_full https://tensorboard.dev/experiment/UKI6z9BvT6iaUkXPxex1OA/

@csukuangfj
Collaborator

Note: I just uploaded the pretrained model, training logs, decoding logs, and decoding results for this PR to
https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless2-2022-04-29

@desh2608
Collaborator

desh2608 commented May 16, 2022

@danpovey is there some documentation about the major changes between the original conformer and the "reworked" conformer?

Update (Sep 25, 2022): Details about the reworked conformer model are explained in this post based on this talk

@danpovey
Collaborator Author

danpovey commented May 17, 2022

Not really, I'm afraid. I got rid of the batchnorm and replaced it with a form of LayerNorm that has a large trainable epsilon; that's probably the most significant change. Also I gave all parameters a trainable scalar scale (it's trained in log space). At some point I may push this into the optimizer, though.
Also, the (non-scalar) parameters are limited in the optimizer to rms=0.1 (we scale the tensors down if they exceed this); we rely on the scalar scales to learn the magnitudes; and the learning-rate schedule is different. There is no warmup in the learning-rate schedule; instead, we have something called "model warmup", which is a value that you pass into the model and that varies from 0 to 1 over the first 3k batches; it controls a bypass on the conformer encoder layers.
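To make that a bit more concrete, here is a rough sketch of those two ideas (illustrative only; the real implementations in the recipe differ in detail, and the names below are made up):

    import torch
    import torch.nn as nn

    class BasicNormSketch(nn.Module):
        # A LayerNorm-like module with a large *trainable* epsilon: no mean
        # subtraction and no affine; eps is kept in log space so it stays positive.
        def __init__(self, eps: float = 0.25):
            super().__init__()
            self.log_eps = nn.Parameter(torch.tensor(float(eps)).log())

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Normalize by the RMS over the channel dim, damped by exp(log_eps).
            scales = (x.pow(2).mean(dim=-1, keepdim=True) + self.log_eps.exp()).rsqrt()
            return x * scales

    def warmup_bypass(layer_out: torch.Tensor, layer_in: torch.Tensor, warmup: float) -> torch.Tensor:
        # "Model warmup": as warmup ramps from 0 to 1 over the first ~3k batches,
        # the encoder layer output is blended with its input, so the layers start
        # out close to the identity and gradually take full effect.
        alpha = min(max(warmup, 0.0), 1.0)
        return alpha * layer_out + (1.0 - alpha) * layer_in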

@desh2608
Collaborator

Thanks!

@wangers

wangers commented Sep 25, 2022

@danpovey In the reworked conformer, post-norm is used, and a BasicNorm is always accompanied by an ActivationBalancer, which is meant to constrain the median of the output to be close to zero. However, in Conv2dSubsampling the order is BasicNorm -> ActivationBalancer, while in the conformer blocks it is ActivationBalancer -> BasicNorm.
Conv2dSubsampling:

        x = self.out_norm(x)
        x = self.out_balancer(x)

conformer blocks:

        src = self.norm_final(self.balancer(src))

Is there some guidance on how to design a BasicNorm/ActivationBalancer ordering in a stack such as tdnn-relu-basicnorm?

@danpovey
Collaborator Author

The order of the ActivationBalancer versus the BasicNorm makes little difference, because the BasicNorm in this case makes very little difference to the scale of the output of that component. In fact, I had forgotten that there was still a normalization component in there; I don't think it is necessary. I will remove it in a future version of the recipe.

@wangers

wangers commented Sep 25, 2022

Thanks. However, only the last of the blocks has a BasicNorm, so removing it would mean there are no norm layers left at all; or should just a single BasicNorm be kept as a post-norm?

@danpovey
Copy link
Collaborator Author

I think there was always an extra normalization layer in the convolution module, even before my refactoring, although it may have been somewhere in the middle.
Likely what happened is that, at the time, I wasn't able to remove that normalization without the model diverging; but since then we have made a lot of other changes that help stability, so I think this won't be an issue in the future.
