A RAG (Retrieval-Augmented Generation) system that extracts YouTube video transcripts, converts them into embeddings, and answers questions about the content using LLMs.
- Automatic ingestion: Extracts transcripts from individual videos or complete playlists
- OpenAI embeddings: Uses `text-embedding-3-small` for fast, inexpensive semantic embeddings
- Local database: SQLite for efficient storage and portability
- Smart search: Finds relevant content using cosine similarity
- Contextual responses: Generates accurate answers with GPT based on found content
- Temporal references: Includes timestamps and direct links to specific moments
git clone <repository-url>
cd youtube-rag-system
pip install -r requirements.txt
- Create your `.env` file:
cp .env.example .env
- Edit `.env` with your API keys:
YOUTUBE_API_KEY=your_youtube_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
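Before running the scripts, it helps to confirm the keys are actually readable. A minimal stdlib-only sketch (the project itself may use `python-dotenv`; this parser is a deliberate simplification):

```python
import os

def load_env(path=".env"):
    """Tiny .env reader (stdlib only); python-dotenv handles more edge cases.
    Returns the KEY=value pairs found in the file, or {} if it is missing."""
    env = {}
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                # Skip blank lines and comments; split on the first '='.
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    env[key.strip()] = value.strip()
    except FileNotFoundError:
        pass
    return env

# Fail early with a clear message instead of a cryptic API error later.
missing = [k for k in ("YOUTUBE_API_KEY", "OPENAI_API_KEY")
           if k not in {**load_env(), **os.environ}]
if missing:
    print(f"Missing keys: {', '.join(missing)}")
```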
YouTube API Key:
- Go to Google Cloud Console
- Create project or select existing one
- Enable YouTube Data API v3
- Create credentials (API Key)
OpenAI API Key:
- Go to OpenAI API
- Create new API key
Individual video:
python ingestion.py "https://youtu.be/aircAruvnKk"
Complete playlist:
python ingestion.py "https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi"
Options:
- `--db-path`: Database path (default: `youtube_embeddings.db`)
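Under the hood, ingestion groups transcript segments into ~500-character chunks while keeping the start timestamp of each chunk, so answers can link back to the exact moment in the video. A sketch of that step (the segment shape `{"text": ..., "start": ...}` matches what `youtube-transcript-api` returns; whether this project uses that library is an assumption):

```python
def chunk_transcript(segments, max_chars=500):
    """Group transcript segments into ~max_chars chunks, preserving the
    start timestamp of the first segment in each chunk.

    segments: list of dicts like {"text": "...", "start": 12.3}
    returns:  list of dicts like {"text": "...", "start": 12.3}
    """
    chunks, buf, start = [], [], None
    for seg in segments:
        if start is None:
            start = seg["start"]  # timestamp of the chunk's first segment
        buf.append(seg["text"])
        # +1 per segment accounts for the joining spaces.
        if sum(len(t) + 1 for t in buf) >= max_chars:
            chunks.append({"text": " ".join(buf), "start": start})
            buf, start = [], None
    if buf:  # flush the trailing partial chunk
        chunks.append({"text": " ".join(buf), "start": start})
    return chunks
```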
Interactive mode:
python rag_query.py
Single query:
python rag_query.py --query "What is backpropagation in neural networks?"
Options:
- `--db-path`: Database path (default: `youtube_embeddings.db`)
- `--top-k`: Number of chunks to retrieve (default: 5)
- `--query`: Query in non-interactive mode
What is a neural network?
How does gradient descent work?
Explain the chain rule in calculus
What are the main components of a transformer?
What's the difference between supervised and unsupervised learning?
Machine Learning:
- 3Blue1Brown Neural Networks:
https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
- StatQuest:
https://youtube.com/playlist?list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF
- Extraction: Gets automatic/manual transcripts from YouTube
- Chunking: Splits transcripts into ~500 character chunks
- Embeddings: Converts each chunk to vector using OpenAI embeddings
- Storage: Saves in SQLite with metadata (timestamps, titles)
- Search: Finds similar chunks using cosine similarity
- Generation: Uses GPT to generate answers based on relevant context
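The search step above can be sketched in a few lines: embed the query, score every stored chunk with cosine similarity, and keep the top-k. This is a self-contained illustration, not the project's actual retrieval code; real embeddings have 1536 dimensions, and the `(text, embedding)` row shape is an assumption about the SQLite schema:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, rows, k=5):
    """Rank stored chunks by similarity to the query embedding.

    rows: (chunk_text, embedding) pairs as they might come out of SQLite.
    returns: the k highest-scoring (score, chunk_text) pairs, best first.
    """
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in rows]
    return sorted(scored, reverse=True)[:k]
```

The retrieved chunks (with their timestamps) are then passed to GPT as context for the final answer.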
Error: No transcript available
- The video has no automatic or manual transcript
- Try another video that has captions enabled
Error: API key not found
- Verify the `.env` file exists
- Confirm API keys are correctly configured
Error: Quota exceeded
- YouTube API: 10,000 quota units/day on the free tier
- OpenAI API: Check limits in your account