Transcribify is a Python package that helps you download audio from YouTube, transcribe it using OpenAI's Whisper, generate embeddings with Sentence Transformers, and store and query the results in LanceDB. It is aimed at building searchable video and audio repositories.
- 🎥 Download Audio: Extract and download audio from YouTube videos.
- 📝 Transcribe: Convert audio files to text with OpenAI's Whisper.
- 🤖 Generate Embeddings: Use Sentence Transformers to create semantic embeddings of transcripts.
- 📦 Store & Query: Store transcripts and embeddings in LanceDB for efficient search and retrieval.
- 🔍 Search Transcripts: Query stored transcripts using semantic similarity.
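To make "semantic similarity" concrete, here is a minimal, standard-library sketch of ranking stored rows against a query vector by cosine similarity. It is an illustration only: the toy three-dimensional vectors, `cosine_similarity`, and `rank_by_similarity` are hypothetical names invented for this example, and the distance metric that Transcribify or LanceDB actually uses is not stated here.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, rows, top_k=3):
    """Return the top_k (score, text) pairs, best match first."""
    scored = [
        (cosine_similarity(query_vec, row["embedding"]), row["text"])
        for row in rows
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy rows shaped like {"text": ..., "embedding": ...}; real embeddings
# have hundreds of dimensions and come from an embedding model.
rows = [
    {"text": "intro to databases", "embedding": [1.0, 0.0, 0.0]},
    {"text": "cooking pasta", "embedding": [0.0, 1.0, 0.0]},
    {"text": "vector search basics", "embedding": [0.9, 0.1, 0.0]},
]
query = [1.0, 0.0, 0.0]
top = rank_by_similarity(query, rows, top_k=2)
for score, text in top:
    print(f"{score:.3f}  {text}")
```

A vector database does the same kind of ranking, but with an index so that it does not have to compare the query against every row.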
Install Transcribify directly from PyPI:
pip install transcribify
Requirements:
- Python 3.7 or higher
- Dependencies:
  - whisper
  - sentence-transformers
  - lancedb
  - yt-dlp
  - click
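If you want to confirm that an environment meets these requirements before running anything, a small check like the following may help. It is a sketch, not part of Transcribify: `python_version_ok` and `missing_modules` are hypothetical helpers, and the import names in `REQUIRED_MODULES` are the usual ones for these packages (for example, `yt-dlp` imports as `yt_dlp`, and OpenAI's Whisper, published on PyPI as `openai-whisper`, imports as `whisper`).

```python
import importlib.util
import sys

# Import names assumed for the dependencies listed above.
REQUIRED_MODULES = ["whisper", "sentence_transformers", "lancedb", "yt_dlp", "click"]


def python_version_ok(minimum=(3, 7)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version OK:", python_version_ok())
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

`find_spec` only locates a module without importing it, so the check stays fast even when heavy packages are installed.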
Transcribify comes with a command-line interface for easy interaction.
Download audio from a YouTube video:
transcribify download <youtube_url> --output_dir <output_directory>
Example:
transcribify download https://www.youtube.com/watch?v=abcd1234 --output_dir downloads
Transcribe an audio file:
transcribify transcribe <audio_file>
Example:
transcribify transcribe downloads/example.mp3
Process an audio file and store the result in a LanceDB table:
transcribify process <audio_file> --db_path <path_to_db> --table_name <table_name>
Example:
transcribify process downloads/example.mp3 --db_path lancedb --table_name transcripts
Search stored transcripts:
transcribify search <query> --db_path <path_to_db> --table_name <table_name>
Example:
transcribify search "What is this video about?" --db_path lancedb --table_name transcripts
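If you would rather drive the CLI from a script, one approach is to build each command as an argument list and hand it to `subprocess.run`. The sketch below only uses the subcommands and flags shown above; the helper names (`download_cmd`, `process_cmd`, `search_cmd`, `render`) are hypothetical, and the actual `subprocess.run` call is left commented out so the example runs without Transcribify installed.

```python
import shlex
# import subprocess  # needed only if you uncomment the run() call below


def download_cmd(url, output_dir):
    """Argument list for `transcribify download`."""
    return ["transcribify", "download", url, "--output_dir", output_dir]


def process_cmd(audio_file, db_path, table_name):
    """Argument list for `transcribify process`."""
    return ["transcribify", "process", audio_file,
            "--db_path", db_path, "--table_name", table_name]


def search_cmd(query, db_path, table_name):
    """Argument list for `transcribify search`."""
    return ["transcribify", "search", query,
            "--db_path", db_path, "--table_name", table_name]


def render(cmd):
    """Render an argument list as a shell-safe string for logging."""
    return " ".join(shlex.quote(part) for part in cmd)


commands = [
    download_cmd("https://www.youtube.com/watch?v=abcd1234", "downloads"),
    process_cmd("downloads/example.mp3", "lancedb", "transcripts"),
    search_cmd("What is this video about?", "lancedb", "transcripts"),
]

for cmd in commands:
    print(render(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually execute
```

Passing a list rather than a single string avoids shell quoting problems with URLs and multi-word queries.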
You can also use Transcribify programmatically in your Python scripts.
from transcribify.youtube_downloader import download_audio_from_youtube
from transcribify.transcriber import transcribe_audio
from transcribify.embedder import generate_embeddings
from transcribify.lancedb_manager import LanceDBManager
from transcribify.query_engine import query_lancedb
# Step 1: Download audio from YouTube
audio_path = download_audio_from_youtube("https://www.youtube.com/watch?v=abcd1234")
# Step 2: Transcribe audio
transcript = transcribe_audio(audio_path)
# Step 3: Generate embeddings
embeddings = generate_embeddings([transcript])
# Step 4: Store transcript and embeddings in LanceDB
db_manager = LanceDBManager(db_path="lancedb")
db_manager.insert_data(
    table_name="transcripts",
    data=[{"text": transcript, "embedding": embeddings[0]}],
)
# Step 5: Search for transcripts
results = query_lancedb("What is this video about?", db_path="lancedb", table_name="transcripts")
print(results)
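When storing more than one transcript, it can help to pair each text with its embedding before inserting. The helper below is a hypothetical sketch, not part of Transcribify. It assumes that `generate_embeddings` returns one vector per input text in the same order (as the `embeddings[0]` lookup above suggests) and that `insert_data` accepts several rows in one list (as its list-shaped `data` argument suggests); neither is confirmed here.

```python
def make_records(texts, embeddings):
    """Pair each transcript with its embedding in the row shape used above."""
    if len(texts) != len(embeddings):
        raise ValueError(
            f"got {len(texts)} texts but {len(embeddings)} embeddings"
        )
    return [
        {"text": text, "embedding": list(vector)}
        for text, vector in zip(texts, embeddings)
    ]


# Toy data standing in for real transcripts and model output.
texts = ["first transcript", "second transcript"]
vectors = [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)]
records = make_records(texts, vectors)
print(records)

# With Transcribify installed, the records could then be passed on
# (assumption: insert_data accepts multiple rows at once):
# db_manager.insert_data(table_name="transcripts", data=records)
```

Checking the lengths up front turns a silent misalignment between texts and vectors into an immediate error.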
This project is licensed under the MIT License. See the LICENSE file for details.
- OpenAI for the amazing Whisper transcription model.
- Hugging Face for Sentence Transformers.
- LanceDB for the efficient database engine for machine learning use cases.
- yt-dlp for YouTube audio/video downloads.