This repository contains the datasets for the shared task of the Automatic Summarization for Creative Writing (Creative-Summ) workshop at COLING 2022.
More information can be found at https://creativesumm.github.io/sharedtask.
The CreativeSumm 2022 shared task is divided into four sub-tasks, namely:
- summarization of chapters from novels
- summarization of movie scripts
- summarization of primetime television transcripts
- summarization of daytime, “soap opera” transcripts
The training data for each sub-task comes from existing, well-established datasets (see below), but for the movie and television sub-tasks we will provide new, unseen test inputs for evaluation.
This dataset pairs chapters of novels released as part of Project Gutenberg with corresponding summaries. For this shared task, we provide the novel chapters here. We unfortunately cannot provide the summaries, as the study guide websites are copyrighted. Each novel chapter in the provided data, however, does have a link to the page where the summary text may be found.
Please see the associated papers, Ladhak et al. (2020) and Kryściński et al. (2021), for more information on how the summaries were collected.
Notes:
- For the novel chapter summarization task, please do not use the test splits (of either NovelChapter or BookSum) for either training or development.
- The links and alignments provided in this repo already exclude test set books; make sure you do the same!
- We will use the test set of BookSum as part of the final evaluation. We may or may not provide new, unseen test inputs for the final evaluation as well.
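As an illustration of the first two notes, here is a minimal sketch of filtering test-set books out of whatever chapter records you assemble. The record layout (a list of dicts with a `book_id` field) and the function name `exclude_test_books` are assumptions made for this example; they are not the format used in this repository, so adapt the key to your own data.

```python
from typing import Iterable, List, Set


def exclude_test_books(
    records: Iterable[dict],
    test_book_ids: Set[str],
    key: str = "book_id",  # hypothetical field name; change to match your data
) -> List[dict]:
    """Return only the records whose book is NOT in the test split."""
    return [record for record in records if record[key] not in test_book_ids]


# Toy usage with made-up identifiers.
chapters = [
    {"book_id": "book-a", "chapter": 1},
    {"book_id": "book-b", "chapter": 1},
    {"book_id": "book-a", "chapter": 2},
]
usable = exclude_test_books(chapters, test_book_ids={"book-b"})
```

Filtering by book rather than by chapter keeps every chapter of a test book out of training and development, which is what the notes above ask for.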
This dataset pairs movie transcripts with their corresponding Wikipedia summaries. The data may be downloaded from here. See the main repository for additional information. We've split the dataset into train and validation splits, and the list of movies associated with each split can be found here.
NOTE: The input for this task is the movie script (the script.txt file) and the target summary is the plain text synopsis from Wikipedia (the processed/wikiplot.txt file).
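A minimal sketch of reading one input/target pair is shown below. It assumes that each movie lives in its own directory containing `script.txt` and `processed/wikiplot.txt`; the per-movie directory layout, the UTF-8 encoding, and the function name `load_movie_pair` are assumptions for illustration, so check them against the downloaded data.

```python
from pathlib import Path
from typing import Tuple


def load_movie_pair(movie_dir) -> Tuple[str, str]:
    """Read (script, summary) for one movie directory.

    Assumes the layout <movie_dir>/script.txt and
    <movie_dir>/processed/wikiplot.txt, encoded as UTF-8.
    """
    movie_dir = Path(movie_dir)
    script = (movie_dir / "script.txt").read_text(encoding="utf-8")
    summary = (movie_dir / "processed" / "wikiplot.txt").read_text(encoding="utf-8")
    return script, summary
```

A call such as `script, summary = load_movie_pair("path/to/one_movie")` (the path is a placeholder) would then give the model input and the reference summary for that movie.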
This dataset pairs TV transcripts from primetime shows with their corresponding Wikipedia summaries. We will use the version of this data associated with the SCROLLS Benchmark (Shaham et al., 2022), and you may download the data there. Please see the notes below for important additional information!
Notes:
- You may use any part of this dataset for training, since we will be providing new inputs for evaluation.
- We recommend training on the training+validation sets, and using the test set for validation.
- Since the test set outputs are not easily downloaded from the SCROLLS website, we make them available here.
This dataset pairs soap opera transcripts with summaries written by TV Megasite contributors. We have preprocessed the data so that it is in the same format as the primetime (Forever Dreaming) data above (i.e., it follows SCROLLS conventions), and it may be downloaded here.
Notes:
- You may use any part of this dataset for training, since we will be providing new inputs for evaluation.
- We recommend training on the training+validation sets, and using the test set for validation.
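The recommended split usage (which applies to both television datasets) can be sketched as follows. This is only an illustration under stated assumptions: it assumes the data is stored as JSON Lines with one example per line, and the file names `train.jsonl`, `validation.jsonl`, and `test.jsonl` as well as the helper names are hypothetical, so adjust them to the files you actually downloaded.

```python
import json
from pathlib import Path
from typing import Dict, List


def read_jsonl(path) -> List[dict]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def build_splits(data_dir) -> Dict[str, List[dict]]:
    """Train on train+validation and hold out the original test set as dev."""
    data_dir = Path(data_dir)
    # File names below are assumptions, not confirmed by this repository.
    train = read_jsonl(data_dir / "train.jsonl")
    train += read_jsonl(data_dir / "validation.jsonl")
    dev = read_jsonl(data_dir / "test.jsonl")
    return {"train": train, "dev": dev}
```

Because the final evaluation uses new, unseen inputs, repurposing the original test set for validation does not leak evaluation data.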
Mingda Chen, Zewei Chu, Sam Wiseman, Kevin Gimpel. 2022. SummScreen: A Dataset for Abstractive Screenplay Summarization. In ACL.
Philip John Gorinski, Mirella Lapata. 2015. Movie Script Summarization as Graph-based Scene Extraction. In NAACL.
Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev. 2021. BookSum: A Collection of Datasets for Long-form Narrative Summarization.
Faisal Ladhak, Bryan Li, Yaser Al-Onaizan, Kathleen McKeown. 2020. Exploring Content Selection in Summarization of Novel Chapters. In ACL.