
Batch processing torchaudio-squim #3424

Open

bloodraven66 opened this issue Jun 8, 2023 · 6 comments

@bloodraven66
🚀 The feature

This is regarding the objective and subjective metrics available as part of torchaudio-squim (https://pytorch.org/audio/main/tutorials/squim_tutorial.html#sphx-glr-tutorials-squim-tutorial-py).

Currently, it works only at batch size = 1, i.e., the waveforms are expected to be of shape (1, N). Can we have batch-level processing?

Motivation, pitch

Researchers usually run these metrics over test sets and a range of model configurations. I'm also looking at using the subjective model with multiple non-matching references for a single audio clip. Batch processing would significantly speed this up.

Alternatives

No response

Additional context

No response

@nateanl (Member) commented Jun 8, 2023

Hi @bloodraven66, thanks for trying the SQUIM model for evaluation. Actually, you can: just pass a batched tensor to the model, and it will generate scores in batch as well.
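
The answer above can be sketched as follows. The helper name is illustrative; the only behavior assumed is the documented one, that the model returned by SQUIM_OBJECTIVE.get_model() accepts a (batch, num_frames) tensor and returns per-clip STOI, PESQ, and SI-SDR scores:

```python
# Minimal sketch: stack equal-length clips and score them in one forward pass.
import torch

def batched_squim_objective(model, waveforms):
    """Score a list of equal-length 1-D waveforms in a single batch.

    `model` is expected to behave like SQUIM_OBJECTIVE.get_model(): its
    forward takes a (batch, num_frames) tensor and returns a tuple
    (stoi, pesq, si_sdr), each of shape (batch,).
    """
    batch = torch.stack(waveforms)  # (N, num_frames)
    with torch.no_grad():
        return model(batch)

# Usage (downloads pretrained weights):
# from torchaudio.pipelines import SQUIM_OBJECTIVE
# stoi, pesq, si_sdr = batched_squim_objective(
#     SQUIM_OBJECTIVE.get_model(),
#     [torch.randn(16_000) for _ in range(4)],  # four 1-second clips at 16 kHz
# )
```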

@mthrok (Collaborator) commented Jun 9, 2023

I guess we can update the tutorial so that it contains links to the documentation, using :py:class:.

I had a bit of difficulty finding the answer to this, but the following seems to be the relevant documentation.

https://pytorch.org/audio/main/generated/torchaudio.prototype.SquimSubjective.html#forward

@jfsantos

Batch processing works, but if the sequences have different lengths and you pad them to a common length, the predictions change because masking is not supported. Are sequences of different lengths used during training? If so, is there any masking that could be introduced into the implementation for inference?
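
To make the caveat above concrete, here is a hypothetical zero-padding helper (the function name is made up). As the comment notes, padding changes the predicted scores because no masking is applied, so batched scores on padded audio should be treated as approximate:

```python
# Sketch of a padding workaround for mixed-length clips; scores on padded
# audio differ from per-clip scores because the model applies no masking.
import torch

def pad_to_batch(waveforms):
    """Zero-pad 1-D waveforms to the longest clip and stack into (N, max_len)."""
    max_len = max(w.shape[-1] for w in waveforms)
    padded = [
        torch.nn.functional.pad(w, (0, max_len - w.shape[-1]))  # pad at the end
        for w in waveforms
    ]
    return torch.stack(padded)
```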

@nateanl (Member) commented Jan 30, 2024

During training, all audio samples are truncated to 5 seconds. Masking is difficult to support for the Objective model since it uses DPRNN as its backbone; how to transpose the mask along with the RNN input would need to be considered.
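
Given that training clips were truncated to 5 seconds, one alternative to padding is to truncate every clip to the training length before batching. A minimal sketch, assuming 16 kHz audio and clips of at least 5 seconds (names are illustrative):

```python
# Sketch matching the training setup described above: truncate each clip to
# 5 seconds (the training length) before stacking into a batch.
import torch

SAMPLE_RATE = 16_000  # SQUIM models operate on 16 kHz audio
MAX_SECONDS = 5       # training clips were truncated to 5 s

def truncate_and_stack(waveforms, sample_rate=SAMPLE_RATE, max_seconds=MAX_SECONDS):
    max_len = sample_rate * max_seconds
    # Assumes every clip is at least `max_len` samples long; shorter clips
    # would still need padding to form a rectangular batch.
    clipped = [w[..., :max_len] for w in waveforms]
    return torch.stack(clipped)
```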

@DigitalPhoneme

Is there a way to fine-tune a SQUIM subjective model with my own data? What kind of data would I have to use, and how would I go about fine-tuning (at a high level)? Is there any documentation?

@nateanl (Member) commented Feb 29, 2024

@fullstackmedusa You need a dataset of paired waveforms and numerical labels (from 1 to 5), and another clean speech dataset as reference. You can find the details in https://arxiv.org/abs/2206.12285
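
At a high level, fine-tuning could be a standard regression loop against the 1-5 MOS labels. The sketch below is not an official recipe; the step function is made up, and the only assumption about SquimSubjective is its documented forward(waveform, reference) signature:

```python
# Hypothetical fine-tuning step: regress predicted MOS against human labels.
import torch

def mos_finetune_step(model, waveform, reference, mos_label, optimizer):
    """One step: predict MOS for (waveform, reference) pairs and fit the label.

    `waveform` and `reference` are (batch, num_frames) tensors; `mos_label`
    is a (batch,) tensor of human ratings in [1, 5].
    """
    optimizer.zero_grad()
    pred = model(waveform, reference)  # SquimSubjective.forward(waveform, reference)
    loss = torch.nn.functional.mse_loss(pred, mos_label)
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (downloads pretrained weights):
# from torchaudio.pipelines import SQUIM_SUBJECTIVE
# model = SQUIM_SUBJECTIVE.get_model()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = mos_finetune_step(model, noisy_batch, clean_reference_batch,
#                          mos_labels, optimizer)
```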
