YouTube RAG System - Extract YouTube video transcripts, convert to OpenAI embeddings, and query content using semantic search + GPT. Supports individual videos and playlists. Built with SQLite, cosine similarity search, and contextual AI responses with timestamp references.

IramML/youtube-knowledge-base

YouTube RAG System

RAG (Retrieval-Augmented Generation) system that extracts YouTube video transcripts, converts them to embeddings, and enables intelligent queries about the content using LLMs.

Features

  • Automatic ingestion: Extracts transcripts from individual videos or complete playlists
  • OpenAI embeddings: Uses text-embedding-3-small to embed transcript chunks for semantic search
  • Local database: SQLite for efficient storage and portability
  • Smart search: Finds relevant content using cosine similarity
  • Contextual responses: Generates answers with GPT, grounded in the retrieved chunks
  • Temporal references: Includes timestamps and direct links to specific moments
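
As an illustration of the temporal references feature, here is a minimal sketch of how a chunk's start time could be turned into a readable timestamp and a direct link. The function names `timestamp_link` and `format_timestamp` are hypothetical and are not taken from this repository; the `?t=<seconds>` parameter is the standard way to start a youtu.be link at an offset.

```python
def timestamp_link(video_id: str, start_seconds: float) -> str:
    """Build a youtu.be URL that starts playback at the given offset.

    Hypothetical helper: the repository may build its links differently.
    """
    seconds = max(0, int(start_seconds))  # whole, non-negative seconds
    return f"https://youtu.be/{video_id}?t={seconds}"


def format_timestamp(start_seconds: float) -> str:
    """Format an offset as M:SS, or H:MM:SS for offsets of an hour or more."""
    seconds = max(0, int(start_seconds))
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"
```

A response could then cite a chunk as, for example, `format_timestamp(start)` followed by `timestamp_link(video_id, start)`.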

Installation

git clone <repository-url>
cd youtube-knowledge-base
pip install -r requirements.txt

Configuration

  1. Create your .env file:
cp .env.example .env
  2. Edit .env with your API keys:
YOUTUBE_API_KEY=your_youtube_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
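
To confirm the configuration before running anything, a small check like the one below can help. It is a sketch only: it assumes the values from .env are already present in the process environment (how the scripts load .env is not described here), and `missing_keys` is a hypothetical helper, not part of this repository. It also treats the placeholder values from the example above as missing.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("YOUTUBE_API_KEY", "OPENAI_API_KEY")

# Placeholder values from the example that should not count as real keys.
PLACEHOLDERS = {"your_youtube_api_key_here", "your_openai_api_key_here"}


def missing_keys(env=None):
    """Return the required keys that are unset, empty, or still placeholders.

    Hypothetical helper; `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    missing = []
    for key in REQUIRED_KEYS:
        value = env.get(key, "").strip()
        if not value or value in PLACEHOLDERS:
            missing.append(key)
    return missing
```

Calling `missing_keys()` with no arguments inspects the current environment; an empty list means both keys are set.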

Getting API Keys

YouTube API Key:

  1. Go to Google Cloud Console
  2. Create a project or select an existing one
  3. Enable YouTube Data API v3
  4. Create credentials (API Key)

OpenAI API Key:

  1. Go to OpenAI API
  2. Create new API key

Usage

1. Feed the database

Individual video:

python ingestion.py "https://youtu.be/aircAruvnKk"

Complete playlist:

python ingestion.py "https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi"

Options:

  • --db-path: Database path (default: youtube_embeddings.db)
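
Since ingestion accepts both video and playlist URLs, the input has to be classified at some point. The sketch below shows one way to do that with the standard library; `classify_url` is a hypothetical function, and the actual logic in ingestion.py may differ and may accept more URL shapes.

```python
from urllib.parse import parse_qs, urlparse


def classify_url(url: str):
    """Return ("video", id) or ("playlist", id) for a YouTube URL.

    Hypothetical helper covering only youtu.be/<id>, youtube.com/watch?v=<id>
    and youtube.com/playlist?list=<id>. Raises ValueError for anything else.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    query = parse_qs(parsed.query)

    if host == "youtu.be":
        video_id = parsed.path.lstrip("/")
        if video_id:
            return ("video", video_id)
    elif host in ("youtube.com", "m.youtube.com"):
        if parsed.path == "/playlist" and "list" in query:
            return ("playlist", query["list"][0])
        if parsed.path == "/watch" and "v" in query:
            return ("video", query["v"][0])

    raise ValueError(f"Not a recognized YouTube video or playlist URL: {url}")
```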

2. Query the system

Interactive mode:

python rag_query.py

Single query:

python rag_query.py --query "What is backpropagation in neural networks?"

Options:

  • --db-path: Database path
  • --top-k: Number of chunks to retrieve (default: 5)
  • --query: Query in non-interactive mode
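
For reference, the options above map naturally onto an argparse definition like the following. This is a sketch, not the actual contents of rag_query.py. The default for --db-path is an assumption here (the same value documented for ingestion.py), and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented rag_query.py options (sketch)."""
    parser = argparse.ArgumentParser(
        prog="rag_query.py",
        description="Query the transcript database.",
    )
    # Assumption: same default as the ingestion script.
    parser.add_argument("--db-path", default="youtube_embeddings.db",
                        help="Database path")
    parser.add_argument("--top-k", type=int, default=5,
                        help="Number of chunks to retrieve")
    # When --query is omitted, the script would fall back to interactive mode.
    parser.add_argument("--query", default=None,
                        help="Query in non-interactive mode")
    return parser
```

With this layout, `build_parser().parse_args(["--query", "What is a neural network?", "--top-k", "3"])` yields `top_k == 3`, and omitting --query leaves it as `None`.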

Example queries

What is a neural network?
How does gradient descent work?
Explain the chain rule in calculus
What are the main components of a transformer?
What's the difference between supervised and unsupervised learning?

Recommended test videos

Machine Learning:

  • 3Blue1Brown Neural Networks: https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
  • StatQuest: https://youtube.com/playlist?list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF

How it works

  1. Extraction: Gets automatic/manual transcripts from YouTube
  2. Chunking: Splits transcripts into ~500 character chunks
  3. Embeddings: Converts each chunk to a vector using OpenAI embeddings
  4. Storage: Saves in SQLite with metadata (timestamps, titles)
  5. Search: Finds similar chunks using cosine similarity
  6. Generation: Uses GPT to generate answers based on relevant context
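
The chunking, storage, and search steps above can be sketched with the standard library alone. Everything below is illustrative: the table schema, the function names, and the choice to store vectors as JSON text are assumptions, not the repository's actual implementation. Embeddings are passed in as plain lists of floats, so no OpenAI call is made here.

```python
import json
import math
import sqlite3


def chunk_text(text, max_chars=500):
    """Greedily pack whole words into chunks of at most max_chars characters.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def create_store(path=":memory:"):
    """Open a SQLite database with a hypothetical chunks table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "id INTEGER PRIMARY KEY, video_id TEXT, start_seconds REAL, "
        "text TEXT, embedding TEXT)"
    )
    return conn


def add_chunk(conn, video_id, start_seconds, text, embedding):
    """Store one chunk, serializing its embedding as JSON text."""
    conn.execute(
        "INSERT INTO chunks (video_id, start_seconds, text, embedding) "
        "VALUES (?, ?, ?, ?)",
        (video_id, start_seconds, text, json.dumps(embedding)),
    )


def search(conn, query_embedding, top_k=5):
    """Score every stored chunk against the query; return the top_k best.

    Each result is a (score, video_id, start_seconds, text) tuple.
    """
    rows = conn.execute(
        "SELECT video_id, start_seconds, text, embedding FROM chunks"
    ).fetchall()
    scored = [
        (cosine_similarity(query_embedding, json.loads(emb)), vid, start, text)
        for vid, start, text, emb in rows
    ]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:top_k]
```

This brute-force scan compares the query against every row, which is simple and adequate for a handful of playlists. The retrieved chunk texts would then be passed to GPT as context in the generation step.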

Troubleshooting

Error: No transcript available

  • The video has no transcript available (neither manual nor auto-generated captions)
  • Try another video that has captions

Error: API key not found

  • Verify .env file exists
  • Confirm API keys are correctly configured

Error: Quota exceeded

  • YouTube API: 10,000 quota units/day by default (free); a single request can cost more than one unit
  • OpenAI API: Check limits in your account
