Data Generation

🎯 Goal: Generate a text dataset from a playlist of YouTube videos.

Prerequisites

  • Install ffmpeg:
sudo apt -y update
sudo apt -y install ffmpeg
  • Change directory into the data-gen folder:
cd data-gen

1. Install Dependencies

pip install -r requirements.txt

2. Create a YouTube Playlist

For your convenience we'll use a playlist that we've already created.

Playlist Name      | Playlist ID
Startup Interviews | PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2

If you'd like to create your own YouTube playlist, make sure it is public or unlisted. Then copy the playlist ID from the playlist URL, which has the format: https://www.youtube.com/playlist?list={YOUR_PLAYLIST_ID}
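If you would rather not copy the ID by hand, the short sketch below pulls the `list` query parameter out of a playlist URL. The function name `extract_playlist_id` is hypothetical and not part of this repository; it uses only the Python standard library.

```python
from urllib.parse import urlparse, parse_qs


def extract_playlist_id(url: str) -> str:
    """Return the value of the `list` query parameter from a YouTube URL."""
    query = parse_qs(urlparse(url).query)
    values = query.get("list")
    if not values:
        raise ValueError(f"No playlist ID found in URL: {url}")
    return values[0]


if __name__ == "__main__":
    url = "https://www.youtube.com/playlist?list=PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2"
    print(extract_playlist_id(url))
```

The same helper should also work on a video URL that carries a `list=` parameter, since it only inspects the query string.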

3. Download YouTube videos as audio from the Playlist

The following command will download all videos from the playlist as audio files and split them into chapters, if available:

./download.sh REPLACE_WITH_PLAYLIST_ID

For example, using the Startup Interviews playlist:

./download.sh "PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2"
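The contents of download.sh are not shown in this README. As a rough illustration of what such a step can look like, the sketch below builds a yt-dlp command that extracts audio and splits it by chapter. The choice of yt-dlp, the mp3 format, the `audio` output folder, and the function name `build_download_command` are all assumptions, not the script's actual behavior.

```python
from __future__ import annotations

import shlex


def build_download_command(playlist_id: str, output_dir: str = "audio") -> list[str]:
    """Build a yt-dlp command line for a playlist (assumed tool, not download.sh itself)."""
    url = f"https://www.youtube.com/playlist?list={playlist_id}"
    return [
        "yt-dlp",
        "--extract-audio",          # keep only the audio track (requires ffmpeg)
        "--audio-format", "mp3",    # assumed target format
        "--split-chapters",         # write one file per chapter when chapters exist
        "--output", f"{output_dir}/%(title)s.%(ext)s",
        url,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is downloaded here.
    print(shlex.join(build_download_command("PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2")))
```

Printing the command rather than executing it keeps the sketch safe to run; pass the list to `subprocess.run` if you want to try it with yt-dlp installed.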

4. Transcribe Audio to Text

First, create a new Deepgram API key.

❓ How to create a new Deepgram API key?
  1. Click API Keys in the left sidebar.
  2. Click the Create API Key button.
  3. Select Member and click the Create Key button.
  4. Copy the API key.

Then, set your Deepgram API key as an environment variable:

export DEEPGRAM_API_KEY=REPLACE_WITH_YOUR_API_KEY
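A missing or placeholder key is an easy mistake to make at this step. The sketch below shows one way a script could check the variable before doing any work. The helper `get_deepgram_api_key` is hypothetical; whether transcribe.py performs a similar check is not stated here.

```python
import os


def get_deepgram_api_key(env=None) -> str:
    """Read DEEPGRAM_API_KEY and fail early if it is missing or still the placeholder."""
    env = os.environ if env is None else env
    key = env.get("DEEPGRAM_API_KEY", "").strip()
    if not key or key == "REPLACE_WITH_YOUR_API_KEY":
        raise RuntimeError(
            "DEEPGRAM_API_KEY is not set. Run: export DEEPGRAM_API_KEY=<your key>"
        )
    return key


if __name__ == "__main__":
    try:
        get_deepgram_api_key()
        print("DEEPGRAM_API_KEY is set.")
    except RuntimeError as error:
        print(error)
```

The optional `env` argument exists only so the check can be exercised with a plain dictionary instead of the real environment.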

Now you can run the script to transcribe the downloaded audio to text:

python3 transcribe.py REPLACE_WITH_PLAYLIST_ID

For example, using the Startup Interviews playlist:

python3 transcribe.py "PLZQTcICOilg6c4DXPE9LOGnFgUszSBGg2"

You now have .json files containing the transcriptions of each video in the playlist!
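To turn those files into plain text, something like the sketch below may help. It assumes each .json file holds a Deepgram pre-recorded response, where the transcript sits under `results.channels[0].alternatives[0].transcript`. If transcribe.py writes a different layout, adjust the lookup accordingly. The function names and the `transcripts` folder are hypothetical.

```python
import json
from pathlib import Path


def transcript_from_response(response: dict) -> str:
    """Return the first channel's top transcript, or "" if the shape differs."""
    channels = response.get("results", {}).get("channels", [])
    if not channels:
        return ""
    alternatives = channels[0].get("alternatives", [])
    if not alternatives:
        return ""
    return alternatives[0].get("transcript", "")


def load_transcripts(directory: str) -> dict:
    """Map each .json file name in `directory` to its transcript text."""
    transcripts = {}
    for path in sorted(Path(directory).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        transcripts[path.name] = transcript_from_response(data)
    return transcripts


if __name__ == "__main__":
    # "transcripts" is an assumed folder name; point this at your output folder.
    for name, text in load_transcripts("transcripts").items():
        print(f"{name}: {len(text.split())} words")
```

Files that do not match the assumed layout yield an empty string rather than an error, which makes it easy to spot them in the word counts.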