Data Generation

🎯 Goal: Generate a text dataset from a playlist of YouTube videos.

Prerequisites

Deepgram account
OpenAI account
ffmpeg installed (e.g. sudo apt install ffmpeg)

sudo apt -y update
sudo apt -y install ffmpeg

Change into the data-gen directory:

cd data-gen

1. Install Dependencies

pip install -r requirements.txt

2. Create a YouTube Playlist

For convenience, we'll use a playlist that we've already created.

Playlist Name	Playlist ID
Startup Interviews	`PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2`

If you'd like to create your own YouTube playlist, make sure it is public or unlisted. Then take the playlist ID from the playlist URL, which has the format: https://www.youtube.com/playlist?list={YOUR_PLAYLIST_ID}
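If you would rather not copy the ID by hand, it can be read from the `list` query parameter of the URL. The following is a minimal sketch using only the Python standard library; the function name `extract_playlist_id` is hypothetical and not part of this repository.

```python
from urllib.parse import urlparse, parse_qs


def extract_playlist_id(url: str) -> str:
    """Return the value of the `list` query parameter from a YouTube URL."""
    query = parse_qs(urlparse(url).query)
    values = query.get("list")
    if not values:
        raise ValueError(f"No playlist ID found in URL: {url}")
    return values[0]


if __name__ == "__main__":
    print(extract_playlist_id(
        "https://www.youtube.com/playlist?list=PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2"
    ))
```

The same function also works on a video URL that carries a `list` parameter, since it only inspects the query string.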

3. Download YouTube videos as audio from the Playlist

The following command will download all videos from the playlist as audio files and split them into chapters, if available:

./download.sh REPLACE_WITH_PLAYLIST_ID

For example, using the Startup Interviews playlist:

./download.sh "PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2"
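The contents of download.sh are not shown here. For readers who want to see what such a step might involve, the sketch below builds a roughly equivalent command, assuming the `yt-dlp` command-line tool is installed. The function name, the `audio` output directory, and the choice of mp3 are all illustrative assumptions, not what download.sh necessarily does.

```python
import shlex


def build_download_command(playlist_id: str, out_dir: str = "audio") -> list[str]:
    """Build a yt-dlp command that saves a playlist as per-chapter audio files."""
    url = f"https://www.youtube.com/playlist?list={playlist_id}"
    return [
        "yt-dlp",
        "--extract-audio",          # keep only the audio track (uses ffmpeg)
        "--audio-format", "mp3",    # assumed format; download.sh may differ
        "--split-chapters",         # also write one file per chapter, if any
        "--paths", out_dir,         # hypothetical output directory
        url,
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so it can be inspected first.
    print(shlex.join(build_download_command("PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2")))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)`.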

4. Transcribe Audio to Text

First, create a new Deepgram API key.

❓ How to create a new Deepgram API key?

Click API Keys in the left sidebar.

Click the Create API Key button.

Select Member and click the Create Key button.

Copy the API key.

Then, set your Deepgram API key as an environment variable:

export DEEPGRAM_API_KEY=REPLACE_WITH_YOUR_API_KEY
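A script can then pick the key up from the environment. The sketch below shows one way to read it and fail early with a clear message if it is missing or still set to the placeholder; the helper name `get_deepgram_api_key` is hypothetical, and transcribe.py may handle this differently.

```python
import os


def get_deepgram_api_key(env=None) -> str:
    """Read DEEPGRAM_API_KEY from the environment, or raise if it is unset."""
    env = os.environ if env is None else env
    key = env.get("DEEPGRAM_API_KEY", "").strip()
    if not key or key == "REPLACE_WITH_YOUR_API_KEY":
        raise RuntimeError(
            "DEEPGRAM_API_KEY is not set. Run: export DEEPGRAM_API_KEY=<your key>"
        )
    return key
```

Accepting an optional `env` mapping keeps the helper easy to test without touching the real environment.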

Now you can run the script to transcribe the playlist's audio to text:

python3 transcribe.py REPLACE_WITH_PLAYLIST_ID

For example, using the Startup Interviews playlist:

python3 transcribe.py "PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2"

You now have .json files containing the transcriptions of each video in the playlist!
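To turn those files into plain text, something like the following may work. It is a sketch that assumes each .json file holds a Deepgram pre-recorded transcription response, where the text sits under `results.channels[].alternatives[].transcript`; if transcribe.py writes a different shape, the lookup needs adjusting. The function names and the directory argument are hypothetical.

```python
import json
from pathlib import Path


def transcript_text(response: dict) -> str:
    """Join the top alternative's transcript from each channel of a response."""
    channels = response.get("results", {}).get("channels", [])
    parts = []
    for channel in channels:
        alternatives = channel.get("alternatives", [])
        if alternatives:
            # The first alternative is taken as the preferred transcript.
            parts.append(alternatives[0].get("transcript", ""))
    return "\n".join(part for part in parts if part)


def load_transcripts(directory: str) -> dict[str, str]:
    """Map each .json file name in `directory` to its transcript text."""
    return {
        path.name: transcript_text(json.loads(path.read_text(encoding="utf-8")))
        for path in sorted(Path(directory).glob("*.json"))
    }
```

Calling `load_transcripts(".")` from the folder that contains the .json files returns a dictionary keyed by file name, which is a convenient starting point for building the text dataset.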
