Diglett offers real-time speaker verification capabilities tailored for long conversations on the web.
- Get speaker embedding: Each user records a 5-second audio sample and sends it to the server through a RESTful API to get a speaker embedding.
- Start streaming audio and get speaker verification results: Open a websocket connection to the server and start streaming the conversation. The server returns speaker verification results in real time.
- Real-time speaker verification: Powered by FastAPI and websockets. Supports up to 2 speakers per conversation.
- Voice activity detection (VAD): Identifies segments of the audio stream where no one is speaking.
- Sound level detection: Detects loudness in real time.
- Stateless: Scales up and down effortlessly in cloud environments.
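VAD and sound-level detection of this kind typically reduce to per-frame energy measurements. A minimal pure-Python sketch (the silence threshold is a hypothetical example, not Diglett's actual setting; samples are assumed to be floats normalized to [-1, 1]):

```python
import math

def rms_db(samples):
    """Root-mean-square level of one audio frame, in dBFS (0 dB = full scale)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

def is_silence(samples, threshold_db=-40.0):
    """Treat frames quieter than a (hypothetical) threshold as silence."""
    return rms_db(samples) < threshold_db
```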
- Upload a 5-second audio sample: Each user uploads a 5-second recording of their voice, along with their name, to the server through a RESTful API.
- Generate voice signature: Upon receiving the 5-second audio, the server encodes it with an EncoderClassifier to obtain a speaker embedding as the voice signature (later used to identify the speaker).
- Calculate the average volume: The server also calculates the mean energy level as the speaker's average volume.
- Return voice signature and average volume: Finally, the server sends back an HTTP response containing the speaker's name, voice embedding, and average volume.
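The enrollment steps above can be sketched as follows. The `encode` callable is a stand-in for the real EncoderClassifier; `make_embed_response` is a hypothetical helper, not Diglett's actual function:

```python
import math

def make_embed_response(name, samples, encode):
    """Sketch of assembling the /embed response: a speaker embedding from a
    pluggable encoder, plus the mean energy level in dB as average volume."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    avg_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
    return {
        "speaker_name": name,
        "speaker_embedding": encode(samples),  # real server: EncoderClassifier
        "avg_db": avg_db,
    }
```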
- Client streams the conversation: The client starts recording the conversation and, in the background, sends each audio segment together with the two speakers' embeddings to the server over a websocket connection.
- Get the target signature: For each incoming audio segment and the two speakers' embeddings, the server encodes the segment with the EncoderClassifier to obtain a target embedding.
- Find the most similar speaker: The server compares the cosine similarity between the target embedding and each speaker's embedding to find the most similar one. If the similarity score for both embeddings falls below a certain threshold, the segment is labeled as sound from a third party (a third person, both people speaking at the same time, or silence).
- Find the average volume: The server also calculates the average volume of the audio segment.
- Send back the result: The server streams back the speaker verification result along with the segment's average volume.
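The matching step can be sketched with plain cosine similarity. The threshold value here is a hypothetical placeholder, and `identify` is an illustrative helper rather than Diglett's actual code:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def identify(target, enrolled, threshold=0.25):
    """Return the most similar enrolled speaker's name, or None when no
    score clears the threshold (third party, overlap, or silence)."""
    best_name, best_score = None, -1.0
    for name, emb in enrolled.items():
        score = cosine(target, emb)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```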
Get speaker embedding
- URL:
http://SERVER_IP:PORT/embed
- Request type:
POST
- Request body: a 5-second audio recording of a speaker
{ "file": [bytes] }
- Output: JSON
{ "speaker_name": string, "speaker_embedding": float[], "avg_db": float }
Speaker verification streaming (websocket)
- URL:
ws://SERVER_IP:PORT/stream
- Input: continuous audio stream with the following JSON data.
{ "audio_data": [base64-encoded bytes], "speaker_embedding": [speaker emb1, speaker emb2], "terminate_session": bool }
- Output: continuous labeled stream.
{ "speaker": speaker_emb, "db": float }
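One websocket frame for /stream might be assembled on the client like this. This is a sketch assuming the JSON layout above; `make_stream_message` is a hypothetical helper, and base64 encoding is used so the raw audio bytes survive JSON transport:

```python
import base64
import json

def make_stream_message(pcm_bytes, emb1, emb2, terminate=False):
    """Build one JSON frame for the /stream websocket from raw audio bytes
    and the two enrolled speakers' embeddings."""
    return json.dumps({
        "audio_data": base64.b64encode(pcm_bytes).decode("ascii"),
        "speaker_embedding": [emb1, emb2],
        "terminate_session": terminate,
    })
```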
- Build Docker image with the Dockerfile.
$ docker buildx build -t diglett .
- Run Docker container.
$ docker run -d --restart unless-stopped --name diglett -p 3210:80 diglett:latest
- Wait a few seconds for the application to download some machine learning model files in the container. You can check the download progress with docker logs -f diglett.
- For interactive API documentation, visit http://localhost:3210/docs/
- Install the portaudio19 dependency.
$ sudo apt-get install -y portaudio19-dev
- Install poetry for Python package management.
$ pipx install poetry
- Clone the repo and install dependencies with poetry.
# Clone and cd into the repo.
$ poetry install
- Create your own .env to store sensitive information. (You can copy example.env and modify the content as needed.)
$ cp example.env .env
# Edit .env
- Run the development server.
$ uvicorn diglett.main:app --reload
- Now you can do:
- Run example Python client.
$ python -m example.client
- Get the speaker embedding with curl.
$ curl -X POST -F "file=@/path/to/file.wav" SERVER_IP:PORT/embed
- Run the test.
$ python -m pytest