Scope
This is minutes’ scope document. The goal is to cluster and recognize the different speakers in recorded audio conversations, produce transcriptions of those conversations, and label each phrase with a speaker and timestamps.
Given an audio recording (this is loosely defined for the time being) of a conversation with n speakers, identify which speakers spoke which phrases and produce a list of phrases from the conversation. Each phrase should have the following keys: `speaker`, `start_time`, `end_time`, and `body`.
```json
[
  {
    "speaker": 0,
    "start_time": 148172245,
    "end_time": 148172251,
    "body": "Hello everyone, thanks for coming in. We have a lot to get through today so let’s get started."
  },
  {
    "speaker": 1,
    "start_time": 148172251,
    "end_time": 148172253,
    "body": "Happy to be here."
  }
]
```
In general, phrases may overlap with one another in time, but no two phrases from the same speaker should overlap.
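The no-self-overlap invariant above can be checked mechanically. Below is a minimal sketch of such a validator; the function name `validate_phrases` is hypothetical and not part of any existing minutes code.

```python
from itertools import groupby

def validate_phrases(phrases):
    """Check the output invariant: phrases from the same speaker must not
    overlap in time (overlap between different speakers is allowed)."""
    by_speaker = lambda p: p["speaker"]
    for speaker, group in groupby(sorted(phrases, key=by_speaker), key=by_speaker):
        spoken = sorted(group, key=lambda p: p["start_time"])
        # Compare each phrase with the next one from the same speaker.
        for prev, cur in zip(spoken, spoken[1:]):
            if cur["start_time"] < prev["end_time"]:
                raise ValueError(
                    f"speaker {speaker}: phrase starting at {cur['start_time']} "
                    f"overlaps phrase ending at {prev['end_time']}"
                )

phrases = [
    {"speaker": 0, "start_time": 148172245, "end_time": 148172251,
     "body": "Hello everyone, thanks for coming in."},
    {"speaker": 1, "start_time": 148172251, "end_time": 148172253,
     "body": "Happy to be here."},
]
validate_phrases(phrases)  # no exception: neither speaker overlaps with themselves
```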
- A minimum of 90% accuracy on an out-of-sample cross-validation test for MVP.
- A minimum of 99% accuracy on an out-of-sample cross-validation test for production.
Voice-to-text is a largely solved problem. Google Speech and similar APIs provide near-perfect speech recognition and transcription; speech recognition is therefore not minutes’ goal, though it will likely leverage some of these tools.