Align Larger Audio File #13

Open
rajibul97 opened this issue Nov 3, 2021 · 2 comments

Comments

@rajibul97

rajibul97 commented Nov 3, 2021

Hi @cschaefer26,
You have done a nice job, and I'm using your repo. However, when I align longer audio (> 1 minute) with its character (phone) sequence at inference time, the number of predicted values in the duration file (.npy) does not match the number of characters (phones) that I supplied with the audio file. What is the problem here? I want to use a pretrained model (trained on a Bangla dataset of [audio, phoneme sequence] pairs) for phoneme duration prediction, so accuracy is a major concern for me.

Note: for training, I used audio files 10-15 seconds long with their corresponding transcriptions (phoneme sequences). I also customized your code (preprocess.py and extract_durations.py) to run inference on a single audio file and its transcription.

@cschaefer26
Collaborator

Hi, did you make sure that all the audio files were preprocessed before training? The preprocessing builds up a phoneme set from the training data, so I suspect that you are applying the model to new files containing unknown phonemes, which get filtered out (that's just a guess).
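To illustrate the guess above, here is a minimal sketch of how a symbol set built only from training data can shorten a new sequence. It is not the repo's actual code: the function names `build_symbol_set` and `split_known_unknown` and the toy phoneme lists are hypothetical, chosen only for this example.

```python
def build_symbol_set(training_sequences):
    """Collect every symbol that appears in the training transcriptions."""
    return {symbol for seq in training_sequences for symbol in seq}


def split_known_unknown(sequence, symbols):
    """Return the symbols that survive filtering and the ones that are dropped."""
    known = [s for s in sequence if s in symbols]
    unknown = sorted({s for s in sequence if s not in symbols})
    return known, unknown


# Hypothetical toy data: "t" never occurs in the training transcriptions.
training_sequences = [["k", "a", "l"], ["b", "a", "n"]]
symbols = build_symbol_set(training_sequences)

new_sequence = ["k", "a", "t", "a", "b"]
known, unknown = split_known_unknown(new_sequence, symbols)

# If unknown symbols are silently dropped, the model only sees len(known)
# symbols, so it can only produce that many durations.
print(len(new_sequence), len(known), unknown)  # 5 4 ['t']
```

Running a check like this on a new transcription before inference would show whether any symbols fall outside the training set, which would explain a duration count smaller than the input length.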

rajibul97 reopened this Dec 5, 2021
@rajibul97
Author

Hi @cschaefer26, your guess is correct: I applied the model to new files containing unknown phonemes. Thanks for your reply. However, when I align audio that contains intermediate silences (which are inherent to the recording) with its phoneme sequence, the accuracy of the predicted phone durations is quite low, because the intermediate silences get merged into the durations of the phones. Any suggestions, please?
