Diglett


Diglett offers real-time speaker verification capabilities tailored for long conversations on the web.

  1. Get a speaker embedding: Each user records a 5-second audio sample and sends it to the server through a RESTful API, which returns a speaker embedding for that user.
  2. Stream audio and get speaker verification results: Open a WebSocket connection to the server and start streaming the conversation. The server returns speaker verification results in real time.

Features

  • Real-time speaker verification: Powered by FastAPI and WebSocket. Supports up to 2 speakers in a conversation.
  • Voice activity detection (VAD): Identifies segments of the audio stream where no one is speaking.
  • Sound level detection: Detects loudness in real time.
  • Stateless: Scales up and down effortlessly in cloud environments.

How does it work?

Phase 1: Get Voice Signature

  1. Upload a 5-second audio sample: Each user uploads a 5-second recording of their voice, along with their name, to the server through a RESTful API.
  2. Generate the voice signature: Upon receiving the audio, the server encodes it with an EncoderClassifier to obtain a speaker embedding, which serves as the voice signature (later used to identify the speaker).
  3. Calculate the average volume: The server also calculates the mean energy level as the average speaker volume.
  4. Return the voice signature and average volume: Finally, the server sends back an HTTP response containing the speaker's name, voice embedding, and average volume.
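
The average-volume step can be sketched in a few lines of numpy. This is a minimal illustration, not the repo's exact formula: it assumes float samples in [-1, 1] and reports RMS energy in dBFS, and `average_db` is a hypothetical name.

```python
import numpy as np

def average_db(samples: np.ndarray, eps: float = 1e-10) -> float:
    """Mean energy of a waveform in dBFS (0 dB = full scale).

    Assumes float samples in [-1, 1]; `eps` avoids log(0) on silence.
    """
    rms = np.sqrt(np.mean(np.square(samples.astype(np.float64))))
    return 20.0 * np.log10(rms + eps)
```

A full-scale sine wave, for example, comes out at roughly -3 dB, while digital silence yields a large negative value.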

Phase 2: Real-time Speaker Verification

  1. The client streams the conversation: The client starts recording the conversation and, in the background, sends each segment of it, along with both speakers' embeddings, to the server through a WebSocket connection.
  2. Get the target signature: Upon receiving each audio segment and the 2 speakers' embeddings, the server encodes the target segment with the EncoderClassifier to get a target embedding.
  3. Find the most similar speaker: The server compares the cosine similarity between the target embedding and each speaker's embedding to find the most similar one. If both similarity scores fall below a certain threshold, the segment is identified as sound from a third party (a third person, both people speaking at the same time, or silence).
  4. Find the average volume: The server also calculates the average volume of the audio segment.
  5. Send back the result: The server streams back the speaker verification result along with the average volume of the segment.
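
The decision rule in step 3 can be sketched with plain numpy. This is an illustrative sketch, not the server's code: `identify_speaker` is a hypothetical name and the 0.25 threshold is an assumed value, not the repo's tuned cutoff.

```python
import numpy as np

# Assumed threshold; the real cutoff is a server-side tuning choice.
THRESHOLD = 0.25

def identify_speaker(target, embeddings, names, threshold=THRESHOLD):
    """Return the enrolled speaker most similar to `target`, or None
    when both scores fall below the threshold (third party,
    overlapping speech, or silence)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = [cosine(target, e) for e in embeddings]
    best = int(np.argmax(scores))
    return names[best] if scores[best] >= threshold else None
```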

API Documentation

  • Get speaker embedding.

    • URL: http://SERVER_IP:PORT/embed
    • Request type: POST
    • Request body: a 5-second audio recording of a speaker
    {
      "file": [bytes],
    }
    
    • Output: JSON
    {
      "speaker_name": string,
      "speaker_embedding": float[],
      "avg_db": float,
    }
    
  • Speaker verification streaming (websocket)

    • URL: ws://SERVER_IP:PORT/stream
    • Input: Continuous audio stream with the following JSON data.
    {
      "audio_data": [base64 encoded bytes],
      "speaker_embedding": [speaker emb1, speaker emb2],
      "terminate_session": bool,
    }
    
    • Output: Continuous labeled stream.
    {
      "speaker": speaker_emb,
      "db": float,
    }
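
On the client side, each WebSocket frame is plain JSON with base64-encoded audio. A minimal sketch of building one frame, using the field names from the docs above (`make_stream_message` is a hypothetical helper, not part of the repo):

```python
import base64
import json

def make_stream_message(chunk: bytes, emb1, emb2, terminate=False) -> str:
    """Serialize one audio segment plus both enrolled embeddings
    into the JSON frame format shown above."""
    return json.dumps({
        "audio_data": base64.b64encode(chunk).decode("ascii"),
        "speaker_embedding": [emb1, emb2],
        "terminate_session": terminate,
    })
```

Setting terminate_session to true on the final frame presumably tells the server to end the session.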
    

Deploy with Docker

  • Build Docker image with the Dockerfile.
$ docker buildx build -t diglett .
  • Run Docker container.
$ docker run -d --restart unless-stopped --name diglett -p 3210:80 diglett:latest
  • Wait a few seconds for the application to download the machine learning model files inside the container. You can check the download progress with docker logs -f diglett.
  • For interactive API documentation, visit http://localhost:3210/docs/

Setup local development environment

  • Install dependency portaudio19
$ sudo apt-get install -y portaudio19-dev
  • Install poetry for Python package management.
$ pipx install poetry
  • Clone the repo and install dependencies with poetry.
# Clone and cd into the repo.
$ poetry install
  • Create your own .env to store sensitive information. (You can copy example.env and modify the content as needed.)
$ cp example.env .env
# Edit .env
  • Run the development server.
$ uvicorn diglett.main:app --reload
  • Now you can:
    • Run example Python client.
    $ python -m example.client
    
    • Get the speaker embedding with curl.
    $ curl -X POST -F "file=@/path/to/file.wav" SERVER_IP:PORT/embed
    
    • Run the tests.
    $ python -m pytest
