
Batch processing torchaudio-squim #3424

Open

bloodraven66 opened this issue Jun 8, 2023 · 6 comments

@bloodraven66
🚀 The feature

This is regarding the objective and subjective metrics available as part of torchaudio-squim (https://pytorch.org/audio/main/tutorials/squim_tutorial.html#sphx-glr-tutorials-squim-tutorial-py).

Currently, it works only at batch size = 1, i.e., the waveforms are expected to be of shape (1, N). Can we have batch-level processing?

Motivation, pitch

Researchers usually run these metrics over test sets and a range of model configurations. I'm also looking at using the subjective model with multiple non-matching references for a single audio clip. Batch processing would significantly speed this up.

Alternatives

No response

Additional context

No response

@nateanl (Member) commented Jun 8, 2023

Hi @bloodraven66, thanks for trying the SQUIM model for evaluation. Actually, you can: just pass a batched tensor to the model, and it will generate scores in batch as well.
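
The answer above can be sketched as follows. The helper name is illustrative; the only behavior assumed is the documented one, that the model returned by SQUIM_OBJECTIVE.get_model() accepts a (batch, num_frames) tensor and returns per-clip STOI, PESQ, and SI-SDR scores:

```python
# Minimal sketch: stack equal-length clips and score them in one forward pass.
import torch

def batched_squim_objective(model, waveforms):
    """Score a list of equal-length 1-D waveforms in a single batch.

    `model` is expected to behave like SQUIM_OBJECTIVE.get_model(): its
    forward takes a (batch, num_frames) tensor and returns a tuple
    (stoi, pesq, si_sdr), each of shape (batch,).
    """
    batch = torch.stack(waveforms)  # (N, num_frames)
    with torch.no_grad():
        return model(batch)

# Usage (downloads pretrained weights):
# from torchaudio.pipelines import SQUIM_OBJECTIVE
# stoi, pesq, si_sdr = batched_squim_objective(
#     SQUIM_OBJECTIVE.get_model(),
#     [torch.randn(16_000) for _ in range(4)],  # four 1-second clips at 16 kHz
# )
```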

@mthrok (Collaborator) commented Jun 9, 2023

I guess we can update the tutorial so that it contains links to the documentation, using :py:class:.

I had a bit of difficulty finding the answer to this, but the following seems to be the relevant documentation.

https://pytorch.org/audio/main/generated/torchaudio.prototype.SquimSubjective.html#forward

@jfsantos

Batch processing works, but if the sequences have different lengths and you pad them to a common length, the predictions change because masking is not supported. Are sequences of different lengths used during training? If so, is there any masking that could be introduced into the implementation for inference?
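
To make the caveat above concrete, here is a hypothetical zero-padding helper (the function name is made up). As the comment notes, padding changes the predicted scores because no masking is applied, so batched scores on padded audio should be treated as approximate:

```python
# Sketch of a padding workaround for mixed-length clips; scores on padded
# audio differ from per-clip scores because the model applies no masking.
import torch

def pad_to_batch(waveforms):
    """Zero-pad 1-D waveforms to the longest clip and stack into (N, max_len)."""
    max_len = max(w.shape[-1] for w in waveforms)
    padded = [
        torch.nn.functional.pad(w, (0, max_len - w.shape[-1]))  # pad at the end
        for w in waveforms
    ]
    return torch.stack(padded)
```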

@nateanl (Member) commented Jan 30, 2024

During training, all audio samples are truncated to 5 seconds. Masking is difficult to support for the Objective model since it uses DPRNN as its backbone; how to transpose the mask along with the RNN input would need to be considered.
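
Given that training clips were truncated to 5 seconds, one alternative to padding is to truncate every clip to the training length before batching. A minimal sketch, assuming 16 kHz audio and clips of at least 5 seconds (names are illustrative):

```python
# Sketch matching the training setup described above: truncate each clip to
# 5 seconds (the training length) before stacking into a batch.
import torch

SAMPLE_RATE = 16_000  # SQUIM models operate on 16 kHz audio
MAX_SECONDS = 5       # training clips were truncated to 5 s

def truncate_and_stack(waveforms, sample_rate=SAMPLE_RATE, max_seconds=MAX_SECONDS):
    max_len = sample_rate * max_seconds
    # Assumes every clip is at least `max_len` samples long; shorter clips
    # would still need padding to form a rectangular batch.
    clipped = [w[..., :max_len] for w in waveforms]
    return torch.stack(clipped)
```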

@DigitalPhoneme

Is there a way to fine-tune a SQUIM subjective model with my own data? What kind of data would I have to use, and how would I go about fine-tuning (at a high level)? Is there any documentation?

@nateanl (Member) commented Feb 29, 2024

@fullstackmedusa You need a dataset of paired waveforms and numerical labels (from 1 to 5), and another clean speech dataset as reference. You can find the details in https://arxiv.org/abs/2206.12285
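
At a high level, fine-tuning could be a standard regression loop against the 1-5 MOS labels. The sketch below is not an official recipe; the step function is made up, and the only assumption about SquimSubjective is its documented forward(waveform, reference) signature:

```python
# Hypothetical fine-tuning step: regress predicted MOS against human labels.
import torch

def mos_finetune_step(model, waveform, reference, mos_label, optimizer):
    """One step: predict MOS for (waveform, reference) pairs and fit the label.

    `waveform` and `reference` are (batch, num_frames) tensors; `mos_label`
    is a (batch,) tensor of human ratings in [1, 5].
    """
    optimizer.zero_grad()
    pred = model(waveform, reference)  # SquimSubjective.forward(waveform, reference)
    loss = torch.nn.functional.mse_loss(pred, mos_label)
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (downloads pretrained weights):
# from torchaudio.pipelines import SQUIM_SUBJECTIVE
# model = SQUIM_SUBJECTIVE.get_model()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = mos_finetune_step(model, noisy_batch, clean_reference_batch,
#                          mos_labels, optimizer)
```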
