A RAG (Retrieval-Augmented Generation) system that extracts YouTube video transcripts, converts them into embeddings, and answers questions about the content using LLMs.
- Automatic ingestion: Extracts transcripts from individual videos or complete playlists
- OpenAI embeddings: Uses `text-embedding-3-small` for fast, inexpensive semantic embeddings
- Local database: SQLite for efficient storage and portability
- Smart search: Finds relevant content using cosine similarity
- Contextual responses: Generates accurate answers with GPT based on found content
- Temporal references: Includes timestamps and direct links to specific moments
git clone <repository-url>
cd youtube-rag-system
pip install -r requirements.txt
- Create your `.env` file:
cp .env.example .env
- Edit `.env` with your API keys:
YOUTUBE_API_KEY=your_youtube_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
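Before running the scripts, it helps to confirm the keys are actually readable. A minimal stdlib-only sketch (the project itself may use `python-dotenv`; this parser is a deliberate simplification):

```python
import os

def load_env(path=".env"):
    """Tiny .env reader (stdlib only); python-dotenv handles more edge cases.
    Returns the KEY=value pairs found in the file, or {} if it is missing."""
    env = {}
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                # Skip blank lines and comments; split on the first '='.
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    env[key.strip()] = value.strip()
    except FileNotFoundError:
        pass
    return env

# Fail early with a clear message instead of a cryptic API error later.
missing = [k for k in ("YOUTUBE_API_KEY", "OPENAI_API_KEY")
           if k not in {**load_env(), **os.environ}]
if missing:
    print(f"Missing keys: {', '.join(missing)}")
```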
YouTube API Key:
- Go to Google Cloud Console
- Create project or select existing one
- Enable YouTube Data API v3
- Create credentials (API Key)
OpenAI API Key:
- Go to OpenAI API
- Create new API key
Individual video:
python ingestion.py "https://youtu.be/aircAruvnKk"
Complete playlist:
python ingestion.py "https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi"
Options:
- `--db-path`: Database path (default: `youtube_embeddings.db`)
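Under the hood, ingestion groups transcript segments into ~500-character chunks while keeping the start timestamp of each chunk, so answers can link back to the exact moment in the video. A sketch of that step (the segment shape `{"text": ..., "start": ...}` matches what `youtube-transcript-api` returns; whether this project uses that library is an assumption):

```python
def chunk_transcript(segments, max_chars=500):
    """Group transcript segments into ~max_chars chunks, preserving the
    start timestamp of the first segment in each chunk.

    segments: list of dicts like {"text": "...", "start": 12.3}
    returns:  list of dicts like {"text": "...", "start": 12.3}
    """
    chunks, buf, start = [], [], None
    for seg in segments:
        if start is None:
            start = seg["start"]  # timestamp of the chunk's first segment
        buf.append(seg["text"])
        # +1 per segment accounts for the joining spaces.
        if sum(len(t) + 1 for t in buf) >= max_chars:
            chunks.append({"text": " ".join(buf), "start": start})
            buf, start = [], None
    if buf:  # flush the trailing partial chunk
        chunks.append({"text": " ".join(buf), "start": start})
    return chunks
```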
Interactive mode:
python rag_query.py
Single query:
python rag_query.py --query "What is backpropagation in neural networks?"
Options:
- `--db-path`: Database path (default: `youtube_embeddings.db`)
- `--top-k`: Number of chunks to retrieve (default: 5)
- `--query`: Query in non-interactive mode
What is a neural network?
How does gradient descent work?
Explain the chain rule in calculus
What are the main components of a transformer?
What's the difference between supervised and unsupervised learning?
Machine Learning:
- 3Blue1Brown Neural Networks:
https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
- StatQuest:
https://youtube.com/playlist?list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF
- Extraction: Gets automatic/manual transcripts from YouTube
- Chunking: Splits transcripts into ~500 character chunks
- Embeddings: Converts each chunk to vector using OpenAI embeddings
- Storage: Saves in SQLite with metadata (timestamps, titles)
- Search: Finds similar chunks using cosine similarity
- Generation: Uses GPT to generate answers based on relevant context
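The search step above can be sketched in a few lines: embed the query, score every stored chunk with cosine similarity, and keep the top-k. This is a self-contained illustration, not the project's actual retrieval code; real embeddings have 1536 dimensions, and the `(text, embedding)` row shape is an assumption about the SQLite schema:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, rows, k=5):
    """Rank stored chunks by similarity to the query embedding.

    rows: (chunk_text, embedding) pairs as they might come out of SQLite.
    returns: the k highest-scoring (score, chunk_text) pairs, best first.
    """
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in rows]
    return sorted(scored, reverse=True)[:k]
```

The retrieved chunks (with their timestamps) are then passed to GPT as context for the final answer.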
Error: No transcript available
- The video has no automatic or manual transcript
- Try another video that has captions enabled
Error: API key not found
- Verify the `.env` file exists
- Confirm API keys are correctly configured
Error: Quota exceeded
- YouTube API: 10,000 quota units/day on the free tier
- OpenAI API: Check limits in your account