🎯 Goal: Generate a text dataset from a playlist of YouTube videos.
sudo apt -y update
sudo apt -y install ffmpeg
- Change directory into the data-gen folder:
cd data-gen
pip install -r requirements.txt
For your convenience we'll use a playlist that we've already created.
Playlist Name | Playlist ID |
---|---|
Startup Interviews | PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2 |
If you'd like to create your own YouTube playlist, ensure your playlist is public or unlisted.
Then extract the playlist ID from the playlist URL, which has the format: https://www.youtube.com/playlist?list={YOUR_PLAYLIST_ID}
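If you'd rather not copy the ID by hand, it can be pulled out of the URL's `list` query parameter. The snippet below is a minimal sketch using only the Python standard library; the helper name `extract_playlist_id` is made up for this example and is not part of this repository.

```python
from urllib.parse import urlparse, parse_qs


def extract_playlist_id(url: str) -> str:
    """Return the value of the `list` query parameter from a YouTube URL.

    Hypothetical helper for illustration; not part of this repository.
    """
    query = parse_qs(urlparse(url).query)
    values = query.get("list")
    if not values:
        raise ValueError(f"No playlist ID ('list' parameter) found in: {url}")
    return values[0]


playlist_url = "https://www.youtube.com/playlist?list=PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2"
print(extract_playlist_id(playlist_url))
```

The same parameter also appears in watch URLs opened from a playlist, so the helper should work for those as well.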
The following command will download all videos from the playlist as audio files and split them into chapters, if available:
./download.sh REPLACE_WITH_PLAYLIST_ID
For example, using the Startup Interviews playlist:
./download.sh "PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2"
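The contents of download.sh are not shown here, so the following is only a rough sketch of how such a download could be expressed, assuming the `yt-dlp` command-line tool is installed (an assumption; the script may use a different tool or different options). The function name `build_download_command` and the `audio` output directory are hypothetical. ffmpeg, installed earlier, is what `yt-dlp` relies on for audio extraction and chapter splitting.

```python
import shlex
from typing import List


def build_download_command(playlist_id: str, output_dir: str = "audio") -> List[str]:
    """Build a yt-dlp command that downloads a playlist as audio, split by chapter.

    Hypothetical helper: this is an assumed equivalent, not the actual
    contents of download.sh.
    """
    url = f"https://www.youtube.com/playlist?list={playlist_id}"
    return [
        "yt-dlp",
        "--extract-audio",         # keep only the audio track (needs ffmpeg)
        "--audio-format", "mp3",   # convert the audio to mp3
        "--split-chapters",        # also write one file per chapter, if chapters exist
        "--paths", output_dir,     # directory to save files in (assumed name)
        url,
    ]


command = build_download_command("PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2")
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```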
First, create a new Deepgram API key.
❓ How to create a new Deepgram API key?
- Click API Keys in the left sidebar.
- Click the Create API Key button.
- Select Member and click the Create Key button.
- Copy the API key.
Then, set your Deepgram API key as an environment variable:
export DEEPGRAM_API_KEY=REPLACE_WITH_YOUR_API_KEY
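A forgotten export or a leftover placeholder value is an easy mistake to make at this step. The sketch below shows one way a script could check the variable before doing any work. The helper `get_deepgram_api_key` is hypothetical, and transcribe.py may handle this differently.

```python
import os

PLACEHOLDER = "REPLACE_WITH_YOUR_API_KEY"


def get_deepgram_api_key(env=None) -> str:
    """Read DEEPGRAM_API_KEY from the environment and fail early if it is unusable.

    Hypothetical helper for illustration; not taken from transcribe.py.
    """
    env = os.environ if env is None else env
    key = env.get("DEEPGRAM_API_KEY", "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            "DEEPGRAM_API_KEY is not set. Run: export DEEPGRAM_API_KEY=<your key>"
        )
    return key
```

Passing a plain dictionary as `env` makes the check easy to try without touching your real environment.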
Now you can run the script to convert YouTube videos to text:
python3 transcribe.py REPLACE_WITH_PLAYLIST_ID
For example, using the Startup Interviews playlist:
python3 transcribe.py "PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2"
You now have .json files containing the transcriptions of each video in the playlist!
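To turn those files into plain text, something like the sketch below may help. It assumes each .json file holds a Deepgram pre-recorded response, where the transcript lives at `results.channels[0].alternatives[0].transcript`. That layout is an assumption about what transcribe.py writes, so adjust the path if your files look different. The function names and the `transcripts` directory are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict


def extract_transcript(response: dict) -> str:
    """Pull the transcript text out of a Deepgram-style response dict.

    Assumes the shape results -> channels[0] -> alternatives[0] -> transcript.
    """
    return response["results"]["channels"][0]["alternatives"][0]["transcript"]


def load_transcripts(directory: str) -> Dict[str, str]:
    """Map each .json file name (without extension) in `directory` to its transcript."""
    transcripts = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            transcripts[path.stem] = extract_transcript(json.load(f))
    return transcripts


# Example usage (the directory name is an assumption):
# for name, text in load_transcripts("transcripts").items():
#     print(name, len(text.split()), "words")
```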