Real-Time AI Assistant with Voice and Vision

A real-time AI assistant powered by voice and vision capabilities. This project leverages state-of-the-art technologies including Deepgram, LiveKit Agents, OpenAI API, and Silero for seamless interaction and intelligent insights.


Features

  • Voice Recognition & Transcription: Utilizes Deepgram for accurate speech-to-text conversion in real-time.
  • Conversational AI: Powered by LiveKit Agents and OpenAI API to provide intelligent and context-aware responses.
  • Vision Capabilities: Analyzes images and live video frames in real time with vision-capable models accessed through the OpenAI API, enabling visual data analysis.
  • Real-Time Interaction: Combines voice and vision for a truly interactive and immersive experience.

Technologies Used

  • Deepgram: For real-time speech-to-text transcription.
  • LiveKit Agents: To manage AI agents for conversation and task automation.
  • OpenAI API: To understand natural language, generate intelligent responses, and analyze visual input with its vision-capable models.
  • Silero: For voice activity detection and text-to-speech, detecting when the user starts and stops speaking and voicing the assistant's replies.

How it works

This AI assistant combines voice recognition, natural language processing, and vision-based capabilities to provide a seamless real-time experience. Here's a detailed step-by-step breakdown:

Voice Input

The user speaks into the microphone, and the audio is processed in real time by Deepgram, which converts the speech into text with high accuracy and copes with different accents, noise levels, and languages (if configured). The transcribed text is then passed to the AI agent for further processing.

Example: the user says, "What's the capital of Japan?", and the transcribed text is "What's the capital of Japan?"
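
Below is a minimal sketch of how the speech-to-text stage might be configured with the Deepgram plugin for LiveKit Agents. It assumes the livekit-plugins-deepgram package and a DEEPGRAM_API_KEY environment variable; the model and language values are illustrative, not settings taken from this repository.

```python
# Sketch: configuring real-time speech-to-text with the Deepgram plugin.
# Assumes the livekit-plugins-deepgram package and a DEEPGRAM_API_KEY
# environment variable; option names can differ between plugin versions.
from livekit.plugins import deepgram

stt = deepgram.STT(
    model="nova-2-general",  # example Deepgram model; other models are available
    language="en-US",        # change per deployment if another language is needed
)
```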

AI Response

The transcribed text is sent to the OpenAI API for natural language understanding and response generation: the assistant works out the context and intent behind the query and generates a well-informed, conversational response. The response is then converted back into speech (text-to-speech) using Silero so the assistant can answer aloud.

Example: for the query "What's the capital of Japan?", the model responds "The capital of Japan is Tokyo.", and the assistant speaks that sentence back to the user.
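
The response step reduces to a single chat-completion call. The sketch below uses the openai Python package (1.x) directly and assumes an OPENAI_API_KEY environment variable; the model name "gpt-4o-mini" is a stand-in, not the repository's setting.

```python
# Sketch: generating the assistant's reply with the OpenAI API.
# Assumes the openai package (>=1.0) and an OPENAI_API_KEY environment
# variable; the model name is an example, not the repository's setting.
from openai import OpenAI

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful voice assistant."},
        {"role": "user", "content": "What's the capital of Japan?"},
    ],
)
print(reply.choices[0].message.content)  # e.g. "The capital of Japan is Tokyo."
# The text would then be handed to the text-to-speech stage to be spoken aloud.
```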

Visual Input

The system can also process visual data such as images or video frames in real time: visual inputs are sent to a vision-capable model through the OpenAI API for tasks like object detection, image description, or scene recognition, and the assistant turns the result into meaningful insights for the user. Vision analysis can run alongside the voice interaction for a richer experience.

Example: the user shares an image of a fruit basket, and the system replies, "I see apples, bananas, and oranges in the image."
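
A sketch of such a request is shown below. It assumes the openai package (1.x); the model name and image path are placeholders rather than values from this repository.

```python
# Sketch: describing an image with a vision-capable model via the OpenAI API.
# Assumes the openai package (>=1.0); the model name and file path are
# placeholders, not values taken from this repository.
import base64
from openai import OpenAI

client = OpenAI()
with open("fruit_basket.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see in this image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)  # e.g. "I see apples, bananas, and oranges."
```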

Real-Time Interaction

The true strength of this assistant lies in its ability to combine voice and vision seamlessly:

Users can ask questions that draw on both audio and visual inputs, and the assistant processes them together to produce contextual, intelligent responses.

Example scenario: the user says, "What's in this image?" while sharing an image of a street. The assistant answers with a vision analysis, "I detect cars, pedestrians, and a stop sign in the image.", and may follow up with, "Would you like me to calculate the pedestrian density?"

Backend Workflow

Behind the scenes, the workflow operates as follows (a wiring sketch appears after the list):

  • Voice Processing: Microphone input → Deepgram API → text output.
  • NLP Processing: Text input → OpenAI API → context-aware response.
  • Vision Processing: Image/video input → vision-capable model (via the OpenAI API) → object detection, scene analysis, or classification.
  • Output Generation: Text response → Silero text-to-speech → user-friendly audio feedback.
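
The sketch below shows one way these stages could be wired together with LiveKit Agents. It assumes the 0.x pipeline API (VoicePipelineAgent) together with the Deepgram, OpenAI, and Silero plugin packages; openai.TTS() stands in for the speech-output stage because the LiveKit Silero plugin provides voice activity detection, so treat this as an illustration rather than the repository's exact code.

```python
# Sketch: one possible wiring of the voice pipeline with LiveKit Agents.
# Assumes livekit-agents 0.x plus the deepgram, openai, and silero plugins;
# names and options may differ in other releases or in this repository.
from livekit.agents import JobContext, WorkerOptions, cli, llm
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: JobContext):
    await ctx.connect()
    participant = await ctx.wait_for_participant()

    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),           # Silero: detects when the user is speaking
        stt=deepgram.STT(),              # Deepgram: speech-to-text
        llm=openai.LLM(model="gpt-4o"),  # OpenAI: response generation (and vision)
        tts=openai.TTS(),                # spoken replies; swap in another TTS backend if preferred
        chat_ctx=llm.ChatContext().append(
            role="system",
            text="You are a helpful real-time voice and vision assistant.",
        ),
    )
    agent.start(ctx.room, participant)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```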

Multimodal Query Handling

The assistant is designed to handle multimodal queries where voice and vision inputs are integrated:

Users can interact naturally, combining spoken instructions with visual context.

This makes it ideal for use cases such as assisting visually impaired users with image descriptions, or analyzing camera and CCTV feeds and answering questions about live scenes.

Example: the user says, "Can you describe this area?" while pointing the camera at a park. The assistant answers aloud, "I see a park with trees, people walking, and a playground," and can highlight specific objects such as benches, paths, or signs if additional analysis is requested.
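
As a rough illustration of a multimodal turn, the sketch below grabs a single webcam frame with OpenCV and sends it to a vision-capable model alongside the transcribed question; opencv-python, the model name, and the describe_frame helper are all assumptions, not parts of this repository.

```python
# Sketch: answering a spoken question about the current camera view.
# Assumes opencv-python and the openai package; describe_frame() and the
# model name are illustrative, not part of this repository.
import base64

import cv2
from openai import OpenAI

client = OpenAI()


def describe_frame(question: str) -> str:
    cap = cv2.VideoCapture(0)              # open the default camera
    ok, frame = cap.read()                 # grab a single frame
    cap.release()
    if not ok:
        raise RuntimeError("could not read a frame from the camera")
    _, jpeg = cv2.imencode(".jpg", frame)  # encode the frame as JPEG bytes
    image_b64 = base64.b64encode(jpeg.tobytes()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


print(describe_frame("Can you describe this area?"))
```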

Security

  • Do not expose your .env file or API keys in public repositories.
  • Use environment variables for production deployment (see the loading sketch after this list).
  • Open-source models can also be used for deployment.
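
A minimal sketch of key loading is shown below; it assumes the python-dotenv package and a local .env file kept out of version control, and the variable names are the conventional ones rather than values confirmed by this repository.

```python
# Sketch: loading API keys from the environment instead of hard-coding them.
# Assumes the python-dotenv package and a .env file listed in .gitignore;
# the variable names are conventional, not confirmed by this repository.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env during local development; production uses real env vars

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
LIVEKIT_URL = os.environ["LIVEKIT_URL"]
```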

Installing dependencies

Creating a virtual environment is recommended; then update pip and install the required packages:

  • $ python3 -m venv .venv
  • $ source .venv/bin/activate
  • $ pip install -U pip
  • $ pip install -r requirements.txt

Keep requirements.txt in the root of the project directory so that the install command above can find it.
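
The exact package list comes from the repository's own requirements.txt; a typical set for this stack (illustrative only, not the project's actual file) might look like:

```
livekit-agents
livekit-plugins-deepgram
livekit-plugins-openai
livekit-plugins-silero
openai
python-dotenv
```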

Contributions are welcome! Feel free to submit a pull request or open an issue for bugs and feature requests.
