
Scope

This document defines the scope of minutes.

Vision

To cluster and recognize distinct speakers in recorded audio conversations, produce transcriptions of those conversations, and label individual phrases with speaker names and timestamps.

Minimum Viable Scope

Given an audio recording (the format is loosely defined for the time being) of a conversation with n speakers, along with n individual voice samples labelled by name (one sample for each speaker i in 0...n-1), produce a list of phrases from the conversation. Each phrase should have the following keys: speaker, start_time, end_time, and body.
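
A minimal sketch of that interface in Python; every name here is hypothetical and for illustration only, not a committed API:

from dataclasses import dataclass

@dataclass
class Phrase:
    speaker: str     # name taken from the labelled voice samples
    start_time: int  # Unix timestamp (seconds)
    end_time: int
    body: str        # transcribed text of the phrase

def transcribe(recording: bytes, voice_samples: dict[str, bytes]) -> list[Phrase]:
    """Hypothetical entry point: map a conversation recording plus
    n name-labelled voice samples to a speaker-attributed phrase list."""
    raise NotImplementedError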

Example Output

[
    {
        "speaker": "Alice",
        "start_time": 148172245,
        "end_time": 148172251,
        "body": "Hello everyone, thanks for coming in. We have a lot to get through today so let’s get started."
    },
    {
        "speaker": "Bob",
        "start_time": 148172251,
        "end_time": 148172253,
        "body": "Happy to be here."
    }
]

In general, phrases may overlap with one another in time, but no two phrases from the same speaker should overlap.
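
A small sketch of how that constraint could be checked, assuming the list-of-dicts shape shown in Example Output:

from itertools import groupby

def no_same_speaker_overlap(phrases: list[dict]) -> bool:
    """True if no two phrases attributed to the same speaker overlap in time.
    Phrases from different speakers are allowed to overlap."""
    ordered = sorted(phrases, key=lambda p: (p["speaker"], p["start_time"]))
    for _, group in groupby(ordered, key=lambda p: p["speaker"]):
        prev_end = None
        for phrase in group:
            if prev_end is not None and phrase["start_time"] < prev_end:
                return False
            prev_end = phrase["end_time"]
    return True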

Misclassification Rates

  • A minimum of 90% accuracy on a cross-validation test for the MVP.
  • A minimum of 99% accuracy on a cross-validation test for production.
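
As a sketch of how these rates might be measured; the feature extraction and classifier choice below are assumptions for illustration, not decisions of this document:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def speaker_cv_accuracy(features: np.ndarray, labels: np.ndarray, folds: int = 5) -> float:
    """Mean k-fold cross-validation accuracy for a speaker classifier.
    `features` could be e.g. per-utterance MFCC vectors and `labels`
    speaker names; both inputs and the SVC classifier are illustrative."""
    return float(cross_val_score(SVC(), features, labels, cv=folds).mean())

# Gate against the scope targets:
#   speaker_cv_accuracy(features, labels) >= 0.90  for the MVP
#   speaker_cv_accuracy(features, labels) >= 0.99  for production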

Anti-vision

Speech transcription is a largely solved problem: Google Speech and similar APIs provide near-perfect speech recognition and transcription. Speech recognition is therefore not minutes’ goal, but minutes will likely leverage some of these tools.
