This American Life Audio Dataset Downloader

This script downloads the audio data used in [1]. The dataset consists of 663 podcasts from the This American Life radio program, aired from 1995 to 2020, covering 637 hours of audio with an average of 18 unique speakers per conversation.

The script looks for each file in multiple sources, because many of the links provided by the authors are dead.
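The fallback idea can be sketched as follows. This is a minimal illustration, not the code in TAL_download_audio.py: the names download_with_fallback and fetch_url are hypothetical, and the list of candidate URLs is assumed to be supplied by the caller.

```python
import urllib.request
from pathlib import Path
from urllib.error import URLError


def fetch_url(url, timeout=30.0):
    """Return the body of `url`; urlopen raises on HTTP errors such as 404."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_with_fallback(urls, destination, fetch=fetch_url):
    """Try each candidate URL in order and save the first one that succeeds.

    Returns the URL that worked, or None if every source failed.
    (Hypothetical helper, not part of the original script.)
    """
    destination = Path(destination)
    for url in urls:
        try:
            data = fetch(url)
        except (URLError, OSError, ValueError):
            # Dead link, timeout or malformed URL: move on to the next source.
            continue
        destination.parent.mkdir(parents=True, exist_ok=True)
        destination.write_bytes(data)
        return url
    return None
```

Passing the fetch function as a parameter keeps the retry logic separate from the network call, so the ordering of sources can be checked without touching the network.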

The data is divided into train, valid and test folders. There is an option for converting the original mp3 files to wav.
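For readers who want to convert files by hand, the conversion step can be sketched with ffmpeg, which the installation step below installs. This is an assumption about how such a conversion might look, not the script's actual implementation; build_ffmpeg_command and convert_to_wav are hypothetical names, and the default ffmpeg output settings are used.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(mp3_path, wav_path=None):
    """Build the ffmpeg argument list for converting one mp3 file to wav."""
    mp3_path = Path(mp3_path)
    # By default, write the wav file next to the mp3 with the same stem.
    wav_path = Path(wav_path) if wav_path else mp3_path.with_suffix(".wav")
    # -y overwrites an existing output file; -i names the input file.
    return ["ffmpeg", "-y", "-i", str(mp3_path), str(wav_path)]


def convert_to_wav(mp3_path, wav_path=None):
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(mp3_path, wav_path), check=True)
```

Keep in mind that wav files are uncompressed and therefore much larger than the original mp3 files.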

Since the audio files are copyrighted, they can't be distributed, so you can keep them only as a private dataset. The main goal of this notebook is to create a private dataset on Kaggle, but it can also be used to download the data locally. Since dataset sizes on Kaggle are limited to 20GB, I've included an option for splitting the data into four parts so that it fits when in wav format.
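To illustrate why splitting helps with the size limit, here is a small sketch that groups files into parts that each stay under a byte limit. It is not how the script defines its four parts (those are predefined); split_by_size is a hypothetical helper and the file sizes are assumed to be known in advance, for example from os.path.getsize.

```python
def split_by_size(file_sizes, max_bytes):
    """Group (name, size) pairs, in order, into parts no larger than max_bytes.

    Hypothetical helper for illustration; not part of the original script.
    """
    parts, current, current_size = [], [], 0
    for name, size in file_sizes:
        if size > max_bytes:
            raise ValueError(f"{name} alone exceeds the size limit")
        if current and current_size + size > max_bytes:
            # The current part is full: close it and start a new one.
            parts.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        parts.append(current)
    return parts
```

Called with the sizes of the converted wav files and a limit of 20 * 10**9 bytes (assuming the 20GB limit is counted in decimal gigabytes), this returns one list of file names per uploadable part.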

[1] Mao, H. H., Li, S., McAuley, J., & Cottrell, G. (2020). Speech Recognition and Multi-Speaker Diarization of Long Conversations. INTERSPEECH. https://arxiv.org/pdf/2005.08072.pdf

Requirements

Python 3.8

Installation

apt-get update && apt-get install -y ffmpeg
git clone https://github.com/jovistos/TALAD
cd TALAD
# activate your virtualenv
pip install -r requirements.txt

Examples

# download the test set
python3 TAL_download_audio.py -p <absolute_folder_path_to_download_the_data_in> -d test

# download the test, valid and train sets, and convert the files to wav
python3 TAL_download_audio.py -p <absolute_folder_path_to_download_the_data_in> -d test valid train -w True

# download the first of the four parts of the train set and convert it to wav (less than 20GB)
python3 TAL_download_audio.py -p <absolute_folder_path_to_download_the_data_in> -d train_part_1 -w True

