The Self-dialogue Corpus - a collection of self-dialogues across music, movies and sports

SoumiDas/self_dialogue_corpus

 
 

The Self-dialogue Corpus

This is an early release of the Self-dialogue Corpus containing 24,165 conversations, or 3,653,313 words, across 23 topics.

Using the data

  • corpus contains the raw CSVs from Amazon Mechanical Turk, sorted by individual tasks (topics);
  • blocked_workers.txt lists workers who did not comply with the task requirements; these workers are omitted by default;
  • get_data.py is a preprocessing script that formats the CSVs into text and accepts various options (see below).
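As a rough illustration of the layout described above, the following sketch lists the topic subdirectories of corpus and counts the CSV files in each. It assumes that every topic is a direct subdirectory of corpus and that the raw files carry a .csv extension. The function name list_topics is hypothetical and not part of this repository.

```python
from pathlib import Path


def list_topics(corpus_dir):
    """Return {topic_name: number_of_csv_files} for each topic subdirectory.

    Assumptions (not confirmed by the repository): topics are direct
    subdirectories of corpus_dir, and the raw files end in .csv.
    """
    root = Path(corpus_dir)
    counts = {}
    # Sort so the result is stable across platforms.
    for topic_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[topic_dir.name] = sum(1 for _ in topic_dir.glob("*.csv"))
    return counts
```

Calling list_topics("corpus") from the repository root would then return a dictionary mapping each topic name to its number of CSV files.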

get_data.py

Example usage: python get_data.py corpus formatted_corpus.

Optional arguments:

  • --output-naming selects whether output files are named with integers (integer) or by assignment ID (assignment_id);
  • --remove-punctuation removes punctuation from the output;
  • --set-case sets case of output to original, upper or lower;
  • --exclude-topic excludes any of the topics (or subdirectories of corpus), e.g. --exclude-topic music;
  • --include-only includes only the given topics, e.g. --include-only music.
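The options above can be combined in a single call. The sketch below assembles such a command line in Python using only the flags documented here. The helper name build_get_data_command is hypothetical, and two points are assumptions: that --remove-punctuation is a switch taking no value, and that --exclude-topic and --include-only each accept one topic name as in the examples above.

```python
def build_get_data_command(input_dir, output_dir, output_naming=None,
                           remove_punctuation=False, set_case=None,
                           exclude_topic=None, include_only=None):
    """Assemble an argument list for get_data.py from the documented flags.

    Hypothetical helper: it only builds the list and does not run anything.
    """
    cmd = ["python", "get_data.py", input_dir, output_dir]
    if output_naming is not None:
        cmd += ["--output-naming", output_naming]  # integer or assignment_id
    if remove_punctuation:
        cmd.append("--remove-punctuation")  # assumed to take no value
    if set_case is not None:
        cmd += ["--set-case", set_case]  # original, upper or lower
    if exclude_topic is not None:
        cmd += ["--exclude-topic", exclude_topic]
    if include_only is not None:
        cmd += ["--include-only", include_only]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root, for example with build_get_data_command("corpus", "formatted_corpus", set_case="lower", exclude_topic="music").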

Citation

For research using this data, please cite:

@article{krause2017edina,
  title={Edina: Building an Open Domain Socialbot with Self-dialogues},
  author={Krause, Ben and Damonte, Marco and Dobre, Mihai and Duma, Daniel and Fainberg, Joachim and Fancellu, Federico and Kahembwe, Emmanuel and Cheng, Jianpeng and Webber, Bonnie},
  journal={arXiv preprint arXiv:1709.09816},
  year={2017}
}
