The Self-dialogue Corpus

This is an early release of the Self-dialogue Corpus containing 24,165 conversations, or 3,653,313 words, across 23 topics.

Using the data

corpus contains the raw CSVs from Amazon Mechanical Turk, sorted by individual tasks (topics);
blocked_workers.txt lists workers who did not comply with the task requirements and are omitted by default;
get_data.py is a preprocessing script that formats the CSVs into text, with various options (see below).
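
For readers who want to work with the raw CSVs directly, the following is a minimal sketch of how blocked workers might be filtered out. It is not taken from get_data.py. It assumes that blocked_workers.txt holds one worker ID per line and that the CSVs carry a `WorkerId` column; both are assumptions to check against the actual files, and the function names are hypothetical.

```python
import csv
import io


def load_blocked(lines):
    """Return the set of non-empty, stripped worker IDs (assumed one per line)."""
    return {line.strip() for line in lines if line.strip()}


def filter_rows(csv_file, blocked, worker_column="WorkerId"):
    """Yield CSV rows whose worker is not in the blocked set.

    `worker_column` is an assumption; check the header of the real CSVs.
    """
    for row in csv.DictReader(csv_file):
        if row.get(worker_column) not in blocked:
            yield row


# In-memory stand-ins for blocked_workers.txt and one raw CSV (made-up data).
blocked = load_blocked(io.StringIO("W2\n\n"))
raw = io.StringIO("WorkerId,AssignmentId\nW1,A1\nW2,A2\nW3,A3\n")
kept = [row["AssignmentId"] for row in filter_rows(raw, blocked)]
print(kept)
```

With real data, the `io.StringIO` objects would be replaced by files opened from blocked_workers.txt and from a CSV under corpus.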

`get_data.py`

Example usage: `python get_data.py corpus formatted_corpus`

Optional arguments:

--output-naming names output files with integers (integer) or by assignment ID (assignment_id);
--remove-punctuation removes punctuation from the output;
--set-case sets the case of the output to original, upper or lower;
--exclude-topic excludes the given topics (subdirectories of corpus), e.g. --exclude-topic music;
--include-only includes only the given topics, e.g. --include-only music.
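
The options above can be combined in a single call. The sketch below assembles such a command line from Python and is only illustrative: `build_command` is a hypothetical helper, not part of the repository. It assumes that `--remove-punctuation` is a plain switch and that the topic options take one topic name each, as in the examples above.

```python
import shlex

NAMING_CHOICES = {"integer", "assignment_id"}
CASE_CHOICES = {"original", "upper", "lower"}


def build_command(input_dir, output_dir, output_naming=None,
                  remove_punctuation=False, set_case=None,
                  exclude_topic=None, include_only=None):
    """Assemble an argument list for get_data.py from the documented options."""
    if output_naming is not None and output_naming not in NAMING_CHOICES:
        raise ValueError(f"unknown naming scheme: {output_naming!r}")
    if set_case is not None and set_case not in CASE_CHOICES:
        raise ValueError(f"unknown case setting: {set_case!r}")

    cmd = ["python", "get_data.py", input_dir, output_dir]
    if output_naming is not None:
        cmd += ["--output-naming", output_naming]
    if remove_punctuation:
        # Assumption: this option is a switch that takes no value.
        cmd.append("--remove-punctuation")
    if set_case is not None:
        cmd += ["--set-case", set_case]
    if exclude_topic is not None:
        cmd += ["--exclude-topic", exclude_topic]
    if include_only is not None:
        cmd += ["--include-only", include_only]
    return cmd


cmd = build_command("corpus", "formatted_corpus",
                    set_case="lower", exclude_topic="music")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root to run the script.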

Citation

For research using this data, please cite:

@article{krause2017edina,
  title={Edina: Building an Open Domain Socialbot with Self-dialogues},
  author={Krause, Ben and Damonte, Marco and Dobre, Mihai and Duma, Daniel and Fainberg, Joachim and Fancellu, Federico and Kahembwe, Emmanuel and Cheng, Jianpeng and Webber, Bonnie},
  journal={arXiv preprint arXiv:1709.09816},
  year={2017}
}
