This is the state-change relation extraction dataset of paper "Enhanced Distant Supervision with State-Change Information for Relation Extraction" in LREC 2022.
This git repo requires access to the LDC2011T07 (English Gigaword Fifth Edition) dataset, and we provide manual annotations for our usecase. In addition, we provide indices of the data and script so that the actual data can be retrieved from Gigaword corpus.
You can run the ./get_original.py
to get the actual sentences, tokens and subject-object names in sentences by giving the dataset directory as input path.
You can run:
conda create --name myenv python=3.9.12
conda activate myenv
conda install pip
pip install -r ./requirements.txt
conda install -c conda-forge cupy
python -m spacy download en_core_web_trf
python ./get_original.py <path_to_gigaword_unzipped_directory> <path_where_you_want_to_save_the_data>
In ./data
, there are 5 folders :
train
contains all the training data with the format - train_windowsize_relationtype.txt which contains all the positives referring to a particular relation. It also contains a train_negatives files which contains 10k fixed negatives which are to be used in addition to the positives for training each sceanrio.val
contains all the static validation data for each relation type.test
contains all the static test data for each relation type.dynamic_val
contains all the dynamic validation data for each relation type.dynamic_test
contains all the dynamic test data for each relation type.
- 0.txt represents P26@start
- 1.txt represents P26@end
- 3.txt represents P35@start
- 7.txt represents P54@start
- 5.txt represents P463@start
- Merging 0.txt and 1.txt represents P26
- 3.txt represents P35
- 7.txt represents P54
- 5.txt represents P463
The code is released under the under terms of the Apache-2.0 License.