This is the official page for the paper SuMe: A Dataset Towards Summarizing Biomedical Mechanisms, accepted at LREC 2022.
SuMe is the first dataset for summarizing biomedical mechanisms and the underlying relations between entities. It contains 22K mechanism summarization instances collected semi-automatically, along with an evaluation partition of 125 instances corrected by domain experts. In addition, it contains a larger set of 611K abstracts for conclusion generation, which we use as a pretraining task for mechanism generation models.
The following shows an example entry in the SuMe dataset (some supporting text was removed to save space). The input is the supporting sentences together with the two main entities. The output is the relation type and a sentence concluding the mechanism underlying their relationship; a sketch of one entry's structure is shown below.
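The sketch below illustrates the input/output structure of a single SuMe instance. The field names and placeholder values are illustrative assumptions, not the exact release schema:

```python
# Illustrative sketch of one SuMe instance (field names are assumed,
# not the exact schema of the released files).
example_entry = {
    # Input: supporting sentences from the abstract plus the two main entities.
    "supporting_sentences": [
        "First supporting sentence from the abstract ...",
        "Second supporting sentence from the abstract ...",
    ],
    "regulator": "entity A",   # the first (regulator) entity
    "regulated": "entity B",   # the second (regulated) entity
    # Output: the relation type and the concluding mechanism sentence.
    "relation_type": "positive regulation",
    "conclusion_sentence": "Entity A promotes ... thereby regulating entity B.",
}
```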
We construct SuMe using biomedical abstracts from the PubMed open access subset. Starting from 1.1M scientific papers, we applied the following sequence of bootstrapping steps to prepare the SuMe dataset (a code sketch of these steps follows the list).
- Finding Conclusion Sentences
- Extracting Main Entities & Relation. We run the biomedical relation extractor REACH, which identifies entities and the relations between them.
- Filtering for Mechanism Sentences. We separate out the abstracts whose conclusion sentences are predicted to be non-mechanism conclusions; these serve as additional related data for pretraining the generation models we eventually train for the mechanism summarization task.

Dataset Statistics: Each subset contains a number of unique abstracts; each instance consists of a supporting set, a mechanism sentence, and a pair of entities. The first entity is called the regulator entity (regulator) and the second the regulated entity (regulated).
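The following is a minimal sketch of the three bootstrapping steps above, under assumed field names. The cue-phrase heuristic and the stub functions are placeholders for illustration, not the paper's actual classifiers or the REACH API:

```python
# Hypothetical cue phrases; the paper's conclusion detector is more involved.
CONCLUSION_CUES = ("we conclude", "these results suggest", "our findings indicate")

def is_conclusion_sentence(sentence: str) -> bool:
    # Crude cue-phrase heuristic standing in for step 1's conclusion detector.
    return sentence.lower().startswith(CONCLUSION_CUES)

def extract_entities_and_relation(sentence: str):
    # Placeholder: in the real pipeline this step calls the REACH extractor;
    # here we assume it returns (regulator, regulated, relation_type) or None.
    return None

def is_mechanism_conclusion(sentence: str) -> bool:
    # Placeholder for the mechanism-vs-non-mechanism conclusion classifier.
    return True

def bootstrap(abstracts):
    """Split abstracts into SuMe instances and pretraining instances."""
    sume, pretraining = [], []
    for sentences in abstracts:  # each abstract is a list of sentences
        conclusion = next((s for s in sentences if is_conclusion_sentence(s)), None)
        if conclusion is None:
            continue  # step 1: keep only abstracts with a conclusion sentence
        extraction = extract_entities_and_relation(conclusion)
        if extraction is None:
            continue  # step 2: require two main entities and a relation
        regulator, regulated, relation = extraction
        instance = {
            "supporting_sentences": [s for s in sentences if s != conclusion],
            "regulator": regulator,
            "regulated": regulated,
            "relation_type": relation,
            "conclusion_sentence": conclusion,
        }
        # step 3: non-mechanism conclusions become pretraining data instead
        (sume if is_mechanism_conclusion(conclusion) else pretraining).append(instance)
    return sume, pretraining
```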
The dataset contains four different subsets.
- The training set, with about 21k abstracts, can be downloaded here (a minimal loading sketch follows this list).
- The validation set, with about 1k abstracts, used for hyperparameter tuning, can be found here.
- The test set is accessible here.
- The best model, pretrained on the pretraining data and then fine-tuned on the training set, is accessible here.
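As a convenience, here is a minimal loading sketch. It assumes each split is distributed as JSON Lines with the fields shown earlier; the file name is hypothetical and the actual release format may differ:

```python
import json

def load_split(path: str):
    # Read one JSON object per line (assumed JSONL format).
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

train = load_split("sume_train.jsonl")  # hypothetical file name
print(len(train), "training instances")
```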
The dataset is collected using the NIH open access directory for PMC papers. We generally follow their license, as mentioned here.
Please use the following BibTeX entry:
@inproceedings{bastan-etal-2022-sume,
title = "{S}u{M}e: A Dataset Towards Summarizing Biomedical Mechanisms",
author = "Bastan, Mohaddeseh and
Shankar, Nishant and
Surdeanu, Mihai and
Balasubramanian, Niranjan",
booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.lrec-1.748",
pages = "6922--6931",
}