Scope
This is minutes’ scope document.
The goal: cluster and recognize distinct speakers in recorded audio conversations, produce transcriptions of those conversations, and label individual phrases with speakers and timestamps.
Given an audio recording (loosely defined for the time being) of a conversation with `n` speakers, along with `n` individual voice samples labelled by name (one sample for each speaker `i` in `0...n-1`), produce a list of phrases from the conversation. Each phrase should have the following keys: `speaker`, `start_time`, `end_time`, and `body`.
```json
[
  {
    "speaker": "Alice",
    "start_time": 148172245,
    "end_time": 148172251,
    "body": "Hello everyone, thanks for coming in. We have a lot to get through today so let’s get started."
  },
  {
    "speaker": "Bob",
    "start_time": 148172251,
    "end_time": 148172253,
    "body": "Happy to be here."
  }
]
```
In general, phrases may overlap with one another in time, but no two phrases from the same speaker should overlap.
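The no-overlap constraint can be checked mechanically. Below is a minimal sketch of a validator for phrase lists in the format above; the function name `validate_phrases` is a hypothetical helper, not part of any existing minutes API.

```python
from collections import defaultdict


def validate_phrases(phrases):
    """Return True if no two phrases from the same speaker overlap in time.

    `phrases` is a list of dicts with the keys described above:
    speaker, start_time, end_time, body. Phrases from different
    speakers are allowed to overlap.
    """
    by_speaker = defaultdict(list)
    for p in phrases:
        by_speaker[p["speaker"]].append((p["start_time"], p["end_time"]))
    for intervals in by_speaker.values():
        intervals.sort()  # sort each speaker's intervals by start time
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
            if next_start < prev_end:  # adjacent intervals overlap
                return False
    return True
```

Touching end/start times (as in the Alice/Bob example, where Bob starts exactly when Alice ends) are treated as non-overlapping.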
Accuracy targets:
- A minimum of 90% accuracy on a cross-validation test for the MVP.
- A minimum of 99% accuracy on a cross-validation test for production.
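The document does not define the accuracy metric, so here is one plausible reading as a sketch: the fraction of phrases attributed to the correct speaker. This assumes the predicted and reference phrase lists are aligned one-to-one (same segmentation); a real evaluation would first align segments by time overlap.

```python
def speaker_accuracy(predicted, reference):
    """Fraction of aligned phrases whose predicted speaker label matches.

    `predicted` and `reference` are equal-length lists of phrase dicts
    in the format above. This is a hypothetical metric definition;
    the scope document leaves the exact metric unspecified.
    """
    if not reference:
        return 0.0
    correct = sum(
        p["speaker"] == r["speaker"] for p, r in zip(predicted, reference)
    )
    return correct / len(reference)
```

Under this definition, the MVP target is `speaker_accuracy(...) >= 0.90` and the production target is `>= 0.99` on held-out cross-validation folds.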
Speech transcription is a largely solved problem. Google Speech and similar APIs provide near-perfect speech recognition and transcription; speech recognition is not minutes’ goal, but the project will likely leverage some of these tools.