
[WIP] Pruned_transducer_stateless for WenetSpeech #274

Conversation

luomingshuang
Collaborator

This PR is for pruned_transducer_stateless on WenetSpeech. In this PR, I set three token_types for modeling: char, pinyin, and lazy_pinyin. I have also run the code normally on the 100-hour WenetSpeech data.

  • Prepare features and lexicon (still extracting features for the whole dataset).
  • Ensure that training and decoding run normally with the 100-hour data.
  • Train on the full data with the three token_types.
  • Modify the code and analyze the results.


device = torch.device("cpu")
if torch.cuda.is_available():
    device = torch.device("cuda", 5)
Collaborator

Please always use device 0.
You can use CUDA_VISIBLE_DEVICES to control which devices are available.
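For clarity, a minimal sketch of this suggestion (the command line and script path are illustrative, not taken from this PR):

```python
# Select the physical GPU outside the program, e.g.
#   CUDA_VISIBLE_DEVICES=5 python pruned_transducer_stateless/train.py ...
# Inside the code, always refer to device 0; it maps to whichever GPU
# CUDA_VISIBLE_DEVICES exposes first.
import torch

device = torch.device("cpu")
if torch.cuda.is_available():
    device = torch.device("cuda", 0)
```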

Collaborator Author

ok

@luomingshuang
Collaborator Author

I will re-organize the code based on PR #288.

@danpovey
Collaborator

I suggest just copying and modifying into a different directory, pruned_transducer_stateless2. If you want, you can delete the original directory (once this is tested and working).

@luomingshuang
Collaborator Author

OK, I will make a new directory to do this.

I suggest just copying and modifying into a different directory, pruned_transducer_stateless2. If you want, you can delete the original directory (once this is tested and working).

@luomingshuang
Collaborator Author

When I set max-duration to 150 for the char experiment (the number of char units is about 5000+), an error occurs, as shown in the screenshot below:
[screenshot of the error]

And when I set max-duration to 230 for the pinyin experiment (the number of pinyin units is about 400+), the error above does not appear.

@pkufool
Collaborator

pkufool commented Apr 13, 2022

If the vocab-size is only 400+, I think you can use max-duration=300 or 350.
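To make this concrete, a small sketch of the sampler setup (the value is illustrative; the call mirrors the one in asr_datamodule.py):

```python
from lhotse.dataset import DynamicBucketingSampler

# `cuts` is the CutSet prepared elsewhere in asr_datamodule.py.
# With the ~400-token pinyin vocab, larger batches fit in GPU memory, so
# max_duration (total seconds of audio per batch) can be raised to 300-350.
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=300,
    shuffle=False,
)
```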

@luomingshuang
Collaborator Author

Decoding with greedy_search on char takes a long time. Here are some timing records:
When using DynamicBucketingSampler for test_dataloader:

loading data time: 5.604447841644287
encoder time:  0.43555712699890137
decode ids time:  0.0836038589477539
ids2txt time:  0.0006973743438720703
decoding time: 0.5221517086029053
2022-04-13 20:37:54,292 INFO [decode.py:481] batch 0/?, cuts processed until now is 30
loading data time: 0.0037398338317871094
encoder time:  0.1393582820892334
decode ids time:  0.07378888130187988
ids2txt time:  0.0006480216979980469
decoding time: 0.21699810028076172
loading data time: 56.69001579284668
encoder time:  0.20103120803833008
decode ids time:  0.036595821380615234
ids2txt time:  0.0004909038543701172
decoding time: 0.24178481101989746
loading data time: 0.0047762393951416016
encoder time:  0.1527702808380127
decode ids time:  0.06283783912658691
ids2txt time:  0.0004899501800537109
decoding time: 0.21843838691711426
loading data time: 1.8183343410491943
encoder time:  0.15730977058410645
decode ids time:  0.09359860420227051
ids2txt time:  0.000522613525390625
decoding time: 0.2537844181060791
loading data time: 0.0042285919189453125
encoder time:  0.15011096000671387
decode ids time:  0.08310127258300781
ids2txt time:  0.0005218982696533203
decoding time: 0.23593735694885254
loading data time: 8.138154983520508
encoder time:  0.1609656810760498
decode ids time:  0.07918024063110352
ids2txt time:  0.00038909912109375
decoding time: 0.2427058219909668
loading data time: 8.696962356567383
encoder time:  0.15475201606750488
decode ids time:  0.0628821849822998
ids2txt time:  0.00037980079650878906
decoding time: 0.22055697441101074
loading data time: 0.003143310546875
encoder time:  0.15232491493225098
decode ids time:  0.07724738121032715
ids2txt time:  0.0004928112030029297
decoding time: 0.23208999633789062
loading data time: 11.180747032165527

When using BucketingSampler for test_dataloader:

2022-04-13 20:31:33,944 INFO [asr_datamodule.py:415] About to get TEST_MEETING cuts
loading data time: 43.95293688774109
encoder time:  0.4218623638153076
decode ids time:  0.03766632080078125
ids2txt time:  0.0007059574127197266
decoding time: 0.4627194404602051
2022-04-13 20:32:19,103 INFO [decode.py:481] batch 0/?, cuts processed until now is 101
loading data time: 33.056549072265625
encoder time:  0.13587069511413574
decode ids time:  0.027216196060180664
ids2txt time:  0.0005970001220703125
decoding time: 0.16941094398498535
loading data time: 0.0024077892303466797
encoder time:  0.13356399536132812
decode ids time:  0.044828176498413086
ids2txt time:  0.0008339881896972656
decoding time: 0.18121767044067383
loading data time: 14.005029201507568
encoder time:  0.1317129135131836
decode ids time:  0.05815482139587402
ids2txt time:  0.00047087669372558594
decoding time: 0.1924729347229004
loading data time: 0.0017595291137695312
encoder time:  0.1419663429260254
decode ids time:  0.09049439430236816
ids2txt time:  0.0005125999450683594
decoding time: 0.23549747467041016
loading data time: 82.38904309272766
encoder time:  0.1491081714630127
decode ids time:  0.03131890296936035
ids2txt time:  0.0005147457122802734
decoding time: 0.18295669555664062
loading data time: 0.007472991943359375
encoder time:  0.13613653182983398
decode ids time:  0.07791972160339355
ids2txt time:  0.00048828125
decoding time: 0.2160487174987793
loading data time: 5.937158584594727
encoder time:  0.14725565910339355
decode ids time:  0.07642960548400879
ids2txt time:  0.0005576610565185547
decoding time: 0.22643184661865234
loading data time: 0.0073528289794921875
encoder time:  0.1385974884033203
decode ids time:  0.0738532543182373
ids2txt time:  0.0004916191101074219
decoding time: 0.21514129638671875

The loading data time refers to the time spent loading each batch. We can see that loading data takes a long time. How can I improve it? Can somebody give me some suggestions? @pzelasko

@pzelasko
Collaborator

You can always try increasing the number of dataloader workers. If you're using on-the-fly features, consider precomputing them. If none of the above helps, is it possible that you have a very slow disk? You can make a copy of your test data to switch to sequential I/O reads, which will be much faster at the cost of extra storage; there is a tutorial here: https://github.com/lhotse-speech/lhotse/blob/master/examples/02-webdataset-integration.ipynb (you will need to install a specific version, webdataset==0.1.103; I intend to support these things natively, without external dependencies, in the future).
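For reference, a rough sketch of the export/read-back step from that tutorial (the paths are made up, and the exact lhotse argument names may differ between versions):

```python
from glob import glob

from lhotse import CutSet
from lhotse.dataset.webdataset import export_to_webdataset

# Hypothetical path to the precomputed test cuts.
cuts = CutSet.from_file("data/fbank/cuts_TEST_MEETING.jsonl.gz")

# Pack the cuts (with their features) into sharded tar files so that decoding
# reads them with sequential I/O instead of many small random reads.
export_to_webdataset(
    cuts,
    output_path="data/webdataset/test-meeting-%06d.tar",
    shard_size=300,
)

# At decoding time, load the shards back as a CutSet.
cuts_seq = CutSet.from_webdataset(
    sorted(glob("data/webdataset/test-meeting-*.tar"))
)
```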

@luomingshuang
Collaborator Author

Thanks. I will try your suggestions and update the progress in real time.

@luomingshuang
Collaborator Author

In my experiments, I set on_the_fly_feats to False. I tried changing num_workers to explore its influence on decoding speed. But the results show that increasing num_workers does not seem to speed up decoding.

When using num-workers=1 for greedy_search decoding:

loading data time: 2.166102170944214
encoder time:  0.8633573055267334
decode ids time:  0.18378877639770508
ids2txt time:  0.000675201416015625
decoding time: 1.0512535572052002
2022-04-14 12:32:04,054 INFO [decode.py:481] batch 0/?, cuts processed until now is 15
loading data time: 4.632451295852661
encoder time:  0.11210012435913086
decode ids time:  0.16776299476623535
ids2txt time:  0.0003237724304199219
decoding time: 0.281538724899292
loading data time: 40.657896280288696
encoder time:  0.14378142356872559
decode ids time:  0.025411367416381836
ids2txt time:  0.0003261566162109375
decoding time: 0.17122912406921387

When using num-workers=2 for greedy_search decoding:

loading data time: 2.559886932373047
encoder time:  0.987330436706543
decode ids time:  0.16489362716674805
ids2txt time:  0.0006937980651855469
decoding time: 1.1545872688293457
2022-04-14 12:21:58,191 INFO [decode.py:481] batch 0/?, cuts processed until now is 15
loading data time: 8.31162405014038
encoder time:  0.11584949493408203
decode ids time:  0.1570277214050293
ids2txt time:  0.000347137451171875
decoding time: 0.2750985622406006
loading data time: 47.80948853492737
encoder time:  0.13053321838378906
decode ids time:  0.06158638000488281
ids2txt time:  0.00035572052001953125
decoding time: 0.19430136680603027
loading data time: 0.006636381149291992
encoder time:  0.0931394100189209
decode ids time:  0.08658957481384277
ids2txt time:  0.000339508056640625
decoding time: 0.18132495880126953
loading data time: 1.1959447860717773
encoder time:  0.12535548210144043
decode ids time:  0.11600446701049805
ids2txt time:  0.0004267692565917969
decoding time: 0.24329781532287598

When using num-workers=16 for greedy_search decoding:

loading data time: 2.922316789627075
encoder time:  0.7544920444488525
decode ids time:  0.08665108680725098
ids2txt time:  0.0006394386291503906
decoding time: 0.8435213565826416
2022-04-14 12:30:14,407 INFO [decode.py:481] batch 0/?, cuts processed until now is 15
loading data time: 8.929579019546509
encoder time:  0.12003612518310547
decode ids time:  0.15832257270812988
ids2txt time:  0.00033593177795410156
decoding time: 0.280200719833374
loading data time: 34.43920588493347
encoder time:  0.18253159523010254
decode ids time:  0.04781389236450195
ids2txt time:  0.00038051605224609375
decoding time: 0.23261594772338867
loading data time: 0.003057718276977539
encoder time:  0.1268634796142578
decode ids time:  0.13359546661376953
ids2txt time:  0.00042510032653808594
decoding time: 0.2625749111175537
loading data time: 0.0026183128356933594
encoder time:  0.08490252494812012
decode ids time:  0.19762063026428223
ids2txt time:  0.0003371238708496094
decoding time: 0.2843194007873535
loading data time: 0.013019323348999023
encoder time:  0.11155819892883301
decode ids time:  0.1806035041809082
ids2txt time:  0.00043272972106933594
decoding time: 0.3071293830871582

When using num-workers=32 for greedy_search decoding:

loading data time: 9.982702016830444
encoder time:  1.258249044418335
decode ids time:  0.2896277904510498
ids2txt time:  0.0006508827209472656
decoding time: 1.5503191947937012
2022-04-14 12:26:32,478 INFO [decode.py:481] batch 0/?, cuts processed until now is 15
loading data time: 17.355175018310547
encoder time:  0.15750622749328613
decode ids time:  0.19689679145812988
ids2txt time:  0.0003261566162109375
decoding time: 0.35667920112609863
loading data time: 87.19627594947815
encoder time:  0.16827678680419922
decode ids time:  0.0541844367980957
ids2txt time:  0.0003666877746582031
decoding time: 0.22466683387756348
loading data time: 0.012465953826904297
encoder time:  0.10545873641967773
decode ids time:  0.13120722770690918
ids2txt time:  0.0003581047058105469
decoding time: 0.23841333389282227
loading data time: 0.0027811527252197266
encoder time:  0.1522512435913086
decode ids time:  0.24107146263122559
ids2txt time:  0.0003333091735839844
decoding time: 0.3950479030609131

@luomingshuang
Collaborator Author

I am trying to use webdataset for this, but I ran into a bug: webdataset/webdataset#171. I hope the members of the webdataset group can help solve it. -_-

@pzelasko
Collaborator

For some reason they removed that class 3 weeks ago. You can pip install webdataset==0.1.103

@luomingshuang
Collaborator Author

At present, webdataset does seem to be effective at reducing the data loading time:

loading data time: 0.4134948253631592
2022-04-14 19:48:51,825 INFO [decode.py:476] batch 0/?, cuts processed until now is 14
loading data time: 0.03080129623413086
loading data time: 0.0007903575897216797
loading data time: 0.030358552932739258
loading data time: 0.0011277198791503906
loading data time: 0.02306675910949707
loading data time: 0.0010628700256347656
loading data time: 0.022269010543823242
loading data time: 0.0009980201721191406
loading data time: 0.021704912185668945
loading data time: 0.0008711814880371094
loading data time: 0.023546695709228516
loading data time: 0.0008645057678222656
loading data time: 0.024214982986450195
loading data time: 0.0008935928344726562
loading data time: 0.02700185775756836
loading data time: 0.0009036064147949219
loading data time: 0.02424311637878418
loading data time: 0.0008471012115478516
loading data time: 0.02619624137878418
loading data time: 0.0008234977722167969
loading data time: 0.024827003479003906
loading data time: 0.0009663105010986328
loading data time: 0.025295734405517578
loading data time: 0.0010123252868652344
loading data time: 0.024758577346801758
loading data time: 0.0009906291961669922
loading data time: 0.022017717361450195
loading data time: 0.0011665821075439453
loading data time: 0.022162437438964844

@danpovey
Collaborator

@luomingshuang where did you provide the num_workers arg? Can you show a code snippet?

test,
batch_size=None,
sampler=sampler,
num_workers=self.args.num_workers,
Collaborator Author

I apply args.num_workers to the test_dataloader here.

@@ -361,10 +363,15 @@ def test_dataloaders(self, cuts: CutSet) -> DataLoader:
sampler = DynamicBucketingSampler(
    cuts, max_duration=self.args.max_duration, shuffle=False
Collaborator

You must set rank=0, world_size=1 for the sampler, otherwise you might be dropping some data. There should be a big warning about this in your run logs.
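A minimal sketch of that change applied to the snippet above (rank and world_size are keyword arguments of lhotse's DynamicBucketingSampler):

```python
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=self.args.max_duration,
    shuffle=False,
    rank=0,        # decoding runs in a single process...
    world_size=1,  # ...so the sampler must not split the data across ranks
)
```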

Collaborator Author

Em... I checked my decoding logs; there is no warning. BTW, I will add these arguments.

Collaborator

Hmm that's possible if you used a single-GPU process to decode.

Collaborator Author

Hmm that's possible if you used a single-GPU process to decode.

Yes.
