Error when training ASR with Transducers on GPU #4863
Can you expand the entire stack frame and paste the full error here? It's a Numba error when calculating the loss; it probably has little to do with SpecAugment.
TypeError                                 Traceback (most recent call last)

42 frames
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py in _call_and_handle_interrupt(self, trainer_fn, *args, **kwargs)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py in _fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py in _run(self, model, ckpt_path)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py in _run_stage(self)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py in _run_train(self)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py in run(self, *args, **kwargs)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/fit_loop.py in advance(self)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py in run(self, *args, **kwargs)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py in advance(self, data_fetcher)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py in run(self, *args, **kwargs)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py in advance(self, batch, batch_idx)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/base.py in run(self, *args, **kwargs)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py in advance(self, batch, *args, **kwargs)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py in _run_optimization(self, split_batch, batch_idx, optimizer, opt_idx)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py in _optimizer_step(self, optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py in _call_lightning_module_hook(self, hook_name, pl_module, *args, **kwargs)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/lightning.py in optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, on_tpu, using_native_amp, using_lbfgs)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/optimizer.py in step(self, closure, **kwargs)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/strategies/strategy.py in optimizer_step(self, optimizer, opt_idx, closure, model, **kwargs)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/precision/precision_plugin.py in optimizer_step(self, model, optimizer, optimizer_idx, closure, **kwargs)
/usr/local/lib/python3.7/dist-packages/torch/optim/lr_scheduler.py in wrapper(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/torch/optim/optimizer.py in wrapper(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/nemo/core/optim/novograd.py in step(self, closure)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/precision/precision_plugin.py in _wrap_closure(self, model, optimizer, optimizer_idx, closure)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py in __call__(self, *args, **kwargs)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py in closure(self, *args, **kwargs)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py in _training_step(self, split_batch, batch_idx, opt_idx)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py in _call_strategy_hook(self, hook_name, *args, **kwargs)
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/strategies/strategy.py in training_step(self, *args, **kwargs)
/usr/local/lib/python3.7/dist-packages/nemo/utils/model_utils.py in wrap_training_step(wrapped, instance, args, kwargs)
/usr/local/lib/python3.7/dist-packages/nemo/collections/asr/models/rnnt_models.py in training_step(self, batch, batch_nb)
/usr/local/lib/python3.7/dist-packages/nemo/core/classes/common.py in __call__(self, wrapped, instance, args, kwargs)
/usr/local/lib/python3.7/dist-packages/nemo/collections/asr/models/rnnt_models.py in forward(self, input_signal, input_signal_length, processed_signal, processed_signal_length)
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
/usr/local/lib/python3.7/dist-packages/nemo/core/classes/common.py in __call__(self, wrapped, instance, args, kwargs)
/usr/local/lib/python3.7/dist-packages/nemo/collections/asr/modules/audio_preprocessing.py in forward(self, input_spec, length)
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
/usr/local/lib/python3.7/dist-packages/nemo/core/classes/common.py in __call__(self, wrapped, instance, args, kwargs)
/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/nemo/collections/asr/parts/numba/spec_augment/spec_aug_numba.py in forward(self, input_spec, length)
/usr/local/lib/python3.7/dist-packages/nemo/collections/asr/parts/numba/spec_augment/spec_aug_numba.py in launch_spec_augment_kernel(x, x_len, freq_starts, freq_lengths, time_starts, time_lengths, freq_masks, time_masks, mask_value)
/usr/local/lib/python3.7/dist-packages/numba/cuda/dispatcher.py in __getitem__(self, args)

TypeError: unhashable type: 'list'
Can you print out the system details: PyTorch, NeMo, and Numba versions? It seems to be the Numba kernel call for SpecAugment, but there shouldn't be a reason for it to crash like this.
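A quick way to gather those details from the same runtime (a sketch; every import is guarded so the snippet runs even where one of the packages is missing):

```python
def collect_versions(pkgs=("torch", "numba", "nemo")):
    """Return {package: version string or status}; guarded imports so
    missing packages don't crash the check."""
    versions = {}
    for pkg in pkgs:
        try:
            mod = __import__(pkg)
            versions[pkg] = getattr(mod, "__version__", "unknown")
        except ImportError:
            versions[pkg] = "not installed"
    return versions

print(collect_versions())
```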
This is the pip freeze I ran in Google Colab: absl-py==1.2.0
Hmm, I've never tried it with Numba 0.56. I'll try it locally later this month; I'm on vacation till the 19th.
Thank you so much for helping and looking into it even while you're on vacation. I'll try running it with earlier Numba versions and report back :)
Checked with numba==0.53.1: everything worked.
Interesting. Also, Numba just had a release, 0.56.2, which is said to include CUDA function caching bugfixes. Doubtful it's your case, but could you try it?
Ok, I'll take a look once I'm back from vacation.
This should now be fixed on main and in the next release (r1.12.0).
When training ASR with Transducers and leaving
freq_masks: 2
time_masks: 10
instead of zeroing them out, an error occurs:
TypeError                                 Traceback (most recent call last)
in <module>
      1 # Train the model
----> 2 trainer.fit(model)

42 frames
/usr/local/lib/python3.7/dist-packages/numba/cuda/dispatcher.py in __getitem__(self, args)
    567         if len(args) not in [2, 3, 4]:
    568             raise ValueError('must specify at least the griddim and blockdim')
--> 569         return self.configure(*args)
    570
    571     def forall(self, ntasks, tpb=0, stream=0, sharedmem=0):

TypeError: unhashable type: 'list'
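For context on the error itself: Numba's CUDA dispatcher takes the `kernel[griddim, blockdim]` subscript as a launch configuration, and plain Python lists are unhashable, so any internal step that hashes that configuration fails exactly like this. A GPU-free sketch with a hypothetical `FakeDispatcher` (not Numba's real class) reproduces the symptom:

```python
# GPU-free sketch of the failure mode: a hypothetical FakeDispatcher
# (not Numba's real class) that, like a kernel launch, needs its
# [griddim, blockdim] subscript to be hashable.
class FakeDispatcher:
    def __getitem__(self, args):
        if len(args) not in [2, 3, 4]:
            raise ValueError('must specify at least the griddim and blockdim')
        return hash(args)  # stand-in for configuring/caching the kernel

k = FakeDispatcher()
k[(1, 1, 1), (32, 1, 1)]       # tuples are hashable: fine
try:
    k[[1, 1, 1], (32, 1, 1)]   # a list in the config is unhashable
except TypeError as e:
    print(e)                   # prints: unhashable type: 'list'
```

Passing the grid/block dimensions as tuples instead of lists avoids the problem in this sketch; the actual fix landed in NeMo's main branch as noted above.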
It happens both in Google Colab and when training on a local machine.
To reproduce it, I skipped
config.model.spec_augment.freq_masks = 0
config.model.spec_augment.time_masks = 0
in the Colab tutorial ASR_with_Transducers.
If I zero out spec_augment, everything works. Could you check whether this is a possible bug or something I'm doing wrong?
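For reference, the workaround boils down to the two overrides below (a sketch; `config` here is a stand-in namespace, not the tutorial's actual OmegaConf object):

```python
# Stand-in for the tutorial's config object, holding the SpecAugment
# values from the report that trigger the crash (hypothetical namespace).
from types import SimpleNamespace

config = SimpleNamespace(model=SimpleNamespace(
    spec_augment=SimpleNamespace(freq_masks=2, time_masks=10)))

# Zeroing both mask counts disables SpecAugment, which skips the Numba
# kernel launch and avoids the TypeError on the affected versions.
config.model.spec_augment.freq_masks = 0
config.model.spec_augment.time_masks = 0
print(config.model.spec_augment)
```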