
[Question] How to solve Exception while using another wav file: RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (256, 256) at dimension 2 of input [1, 1, 2] ? #1457

Closed
sankulka opened this issue Nov 16, 2020 · 8 comments


sankulka commented Nov 16, 2020

Describe your question
I just started learning NeMo for ASR tasks, and I get an exception when I pass a different wav file to be converted to text. Could you please share what pre-processing has to be performed for any wav file/format other than the an4 dataset?

I am trying to send a wav file of < 20 sec duration to the QuartzNet model to get the text output. Here is the sample code:

files = ['my_sample.wav']
for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):
    print(f"Audio in {fname} was recognized as: {transcription}")

After this, I get the exception below.


RuntimeError Traceback (most recent call last)
in ()
1 files = ['my_sample.wav']
----> 2 for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):
3 print(f"Audio in {fname} was recognized as: {transcription}")

/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
24 def decorate_context(*args, **kwargs):
25 with self.__class__():
---> 26 return func(*args, **kwargs)
27 return cast(F, decorate_context)
28

/usr/local/lib/python3.6/dist-packages/nemo/collections/asr/models/ctc_models.py in transcribe(self, paths2audio_files, batch_size, logprobs)
158 for test_batch in temporary_datalayer:
159 logits, logits_len, greedy_predictions = self.forward(
--> 160 input_signal=test_batch[0].to(device), input_signal_length=test_batch[1].to(device)
161 )
162 if logprobs:

/usr/local/lib/python3.6/dist-packages/nemo/core/classes/common.py in __call__(self, wrapped, instance, args, kwargs)
509
510 # Call the method - this can be forward, or any other callable method
--> 511 outputs = wrapped(*args, **kwargs)
512
513 instance._attach_and_validate_output_types(output_types=output_types, out_objects=outputs)

/usr/local/lib/python3.6/dist-packages/nemo/collections/asr/models/ctc_models.py in forward(self, input_signal, input_signal_length, processed_signal, processed_signal_length)
394 if not has_processed_signal:
395 processed_signal, processed_signal_length = self.preprocessor(
--> 396 input_signal=input_signal, length=input_signal_length,
397 )
398

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/nemo/core/classes/common.py in __call__(self, wrapped, instance, args, kwargs)
509
510 # Call the method - this can be forward, or any other callable method
--> 511 outputs = wrapped(*args, **kwargs)
512
513 instance._attach_and_validate_output_types(output_types=output_types, out_objects=outputs)

/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
24 def decorate_context(*args, **kwargs):
25 with self.__class__():
---> 26 return func(*args, **kwargs)
27 return cast(F, decorate_context)
28

/usr/local/lib/python3.6/dist-packages/nemo/collections/asr/modules/audio_preprocessing.py in forward(self, input_signal, length)
77 @torch.no_grad()
78 def forward(self, input_signal, length):
---> 79 processed_signal, processed_length = self.get_features(input_signal, length)
80
81 return processed_signal, processed_length

/usr/local/lib/python3.6/dist-packages/nemo/collections/asr/modules/audio_preprocessing.py in get_features(self, input_signal, length)
247
248 def get_features(self, input_signal, length):
--> 249 return self.featurizer(input_signal, length)
250
251 @property

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
24 def decorate_context(*args, **kwargs):
25 with self.__class__():
---> 26 return func(*args, **kwargs)
27 return cast(F, decorate_context)
28

/usr/local/lib/python3.6/dist-packages/nemo/collections/asr/parts/features.py in forward(self, x, seq_len)
345 # disable autocast to get full range of stft values
346 with torch.cuda.amp.autocast(enabled=False):
--> 347 x = self.stft(x)
348
349 # torch returns real, imag; so convert to magnitude

/usr/local/lib/python3.6/dist-packages/nemo/collections/asr/parts/features.py in <lambda>(x)
273 win_length=self.win_length,
274 center=True,
--> 275 window=self.window.to(dtype=torch.float),
276 )
277

/usr/local/lib/python3.6/dist-packages/torch/functional.py in stft(input, n_fft, hop_length, win_length, window, center, pad_mode, normalized, onesided, return_complex)
511 extended_shape = [1] * (3 - signal_dim) + list(input.size())
512 pad = int(n_fft // 2)
--> 513 input = F.pad(input.view(extended_shape), (pad, pad), pad_mode)
514 input = input.view(input.shape[-signal_dim:])
515 return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore

/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in _pad(input, pad, mode, value)
3557 assert len(pad) == 2, '3D tensors expect 2 values for padding'
3558 if mode == 'reflect':
-> 3559 return torch._C._nn.reflection_pad1d(input, pad)
3560 elif mode == 'replicate':
3561 return torch._C._nn.replication_pad1d(input, pad)

RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (256, 256) at dimension 2 of input [1, 1, 2]

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider - AWS, Azure, GCP, Colab)]
    Colab
  • Method of NeMo install: [pip install or from source]. Please specify exact commands you used to install.
    import nemo
    import nemo.collections.asr as nemo_asr
  • If method of install is [Docker], provide docker pull & docker run commands used

Environment details

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version

Additional context

Add any other context about the problem here.
Example: GPU model

okuchaiev (Member) commented

@sankulka a few questions:

  1. Are you able to successfully execute this notebook: https://colab.research.google.com/github/NVIDIA/NeMo/blob/v1.0.0b2/tutorials/NeMo_voice_swap_app.ipynb
  2. If you replace your file with this one https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav , does the error go away?
  3. Do you know how many channels your file has and what its sample rate is? (It should be single-channel, 16 kHz; see the snippet below for a quick way to check.)
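
A quick way to check both (a minimal sketch using Python's standard wave module; 'my_sample.wav' is a placeholder for your file):

import wave

# Inspect the channel count and sample rate of a PCM wav file
with wave.open('my_sample.wav', 'rb') as wf:
    print("channels:", wf.getnchannels())     # should be 1 (mono) for QuartzNet
    print("sample rate:", wf.getframerate())  # should be 16000 Hz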

rbracco (Contributor) commented Nov 18, 2020

I had the same error. It was due to my microphone being stereo (2-channel) and 44.1 kHz instead of mono (1-channel) and 16 kHz as required.

You can check the sample rate and resample if needed using torchaudio:

import torchaudio

y, sr = torchaudio.load('my_sample.wav')
y = y.mean(dim=0, keepdim=True)  # if there are multiple channels, average them to a single channel
if sr != 16000:
    resampler = torchaudio.transforms.Resample(sr, 16000)
    y = resampler(y)
    sr = 16000
torchaudio.save('my_sample_resampled.wav', y, sr)

files = ['my_sample_resampled.wav']
for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):
    print(f"Audio in {fname} was recognized as: {transcription}")

sankulka (Author) commented

Thanks @rbracco. Yes, after changing the sample rate it worked well. Regards.


Gangwaradi commented

For me it's not working; it shows an error:

save() missing 2 required positional arguments: 'src' and 'sample_rate'

Please help me solve this problem.

rbracco (Contributor) commented Oct 13, 2021

Can you post your code? torchaudio.save() requires 3 arguments: a filepath, the audio tensor, and the audio's sample_rate. I've edited the code above to include all 3.

It seems like you are doing something like torchaudio.save('my_sample_resampled.wav'), but that's just the filepath; the correct call would be torchaudio.save('my_sample_resampled.wav', y, sr).
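
For reference, a minimal round-trip showing all three required arguments (file names are just placeholders):

import torchaudio

y, sr = torchaudio.load('my_sample.wav')           # y: [channels, time] tensor, sr: int
torchaudio.save('my_sample_resampled.wav', y, sr)  # path, audio tensor, and sample rate are all required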


Gangwaradi commented Oct 13, 2021

Hi rbracco,
Thanks for your reply. I solved this problem in a different way. Instead of changing my recorded file, I changed my code for recording the audio. Earlier I was using 2 channels while recording; when I changed it to 1, everything worked fine. Code to record single-channel audio is given below, copied from https://dsp.stackexchange.com/questions/13728/what-are-chunks-when-recording-a-voice-signal

One channel recording:

import pyaudio
import wave
import sys

CHUNK = 1024  # frames read from the stream per buffer
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "my_sample.wav"

p = pyaudio.PyAudio()

stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK)
print("start....")

frames = []

for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data)

print("done...")

stream.stop_stream()
stream.close()
p.terminate()

wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
wf.close()

For prediction you can simply use:

files = ['my_sample.wav']
for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):
    print(f"Audio in {fname} was recognized as: {transcription}")

If you have an audio file with two channels:

I have also found a different way to solve this problem. The error is caused by having two channels in the recording, so we can take just one channel, since both channels differ only minutely. I changed my code as below, which gives me accurate results.

import torchaudio
import torch

y, sr = torchaudio.load('my_sample.wav')
y = torch.reshape(y[0], (1, y[0].size(0)))
torchaudio.save('my_sample_resampled.wav',y ,sr)

files = ['my_sample_resampled.wav']
for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):
print(f"Audio in {fname} was recognized as: {transcription}")

Or you can average both channels and convert the result into one channel.

import torch
import torchaudio

y, sr = torchaudio.load('/content/output.wav')
y = y.mean(dim=0)  # if there are multiple channels, average them to a single channel
y = torch.reshape(y, (1, y.size(0)))
torchaudio.save('my_sample_resampled.wav', y, sr)

files = ['my_sample_resampled.wav']
for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):
print(f"Audio in {fname} was recognized as: {transcription}")

sheecegardezi commented

For me this error was generated because the wav file was stereo. I needed to convert the file to a mono channel:

from pydub import AudioSegment
file_path = "input_sound_file.wav"
sound = AudioSegment.from_wav(file_path)
sound = sound.set_channels(1)
sound.export(file_path, format="wav")
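
If the file's sample rate also differs from the 16 kHz the model expects, pydub can resample in the same pass (a small sketch along the same lines; the output filename is just an example):

from pydub import AudioSegment

sound = AudioSegment.from_wav("input_sound_file.wav")
sound = sound.set_channels(1)        # stereo -> mono
sound = sound.set_frame_rate(16000)  # resample to 16 kHz
sound.export("input_sound_file_16k_mono.wav", format="wav")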
