The TACRED dataset is a leading corpus for Relation Extraction model development, analysis and benchmarking (see, for example, "Relation extraction on TACRED").
TACRED, available for download from the LDC TACRED webpage in JSON format, contains spans for the subject and object identified in each sentence, together with one of 41 relation types, or a `no_relation` label if no relation holds between them. Additionally, TACRED contains POS tags, named entities and UD parses produced by the Stanford CoreNLP parser.
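For orientation, the following sketch shows how a single record can be inspected; it assumes the JSON files are placed under `/target/dir/data` as in the steps below, and relies on the standard TACRED field names (`relation`, `token`, `subj_start`, `subj_end`, `stanford_pos`):

```python
import json

# Peek at a single TACRED record. subj_start/subj_end are inclusive
# token indices into the 'token' list.
with open('/target/dir/data/train.json') as f:
    records = json.load(f)

r = records[0]
print(r['relation'])                                  # e.g. 'no_relation'
print(r['token'][r['subj_start']:r['subj_end'] + 1])  # subject span tokens
print(list(zip(r['token'], r['stanford_pos']))[:5])   # tokens with POS tags
```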
The aim of this project is to support enrichment of the TACRED dataset with additional semantic and syntactic attributes, where such additional attributes are required by downstream models.
Two enrichment steps are implemented:
The first step concerns UCCA annotation. The Universal Conceptual Cognitive Annotation (UCCA) framework is a multi-layered system for semantic representation that seeks to capture the semantic, rather than syntactic, patterns expressed through linguistic utterances. The `ucca_enrichment` module accepts TACRED JSON as input and produces JSON output containing all original properties plus a set of new UCCA-related properties.
TUPA, the standard UCCA parser, uses the SpaCy NLP pipeline for basic NLP tasks, including tokenization. SpaCy's default tokenization diverges considerably from TACRED's given tokenization. Additionally, the TACRED dataset contains some obvious tokenization errors; for example, there are over 130 entries in which two sentences have been merged into one by erroneously fusing the last token of the first sentence, its period, and the first token of the next sentence into a single token.
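A minimal sketch of the kind of mismatch involved, assuming spaCy with the `en_core_web_sm` model is installed; the token list is invented for illustration, and the exact splits depend on the spaCy version:

```python
import spacy

# Join a given (TACRED-style) token list with spaces and re-tokenize it
# with spaCy's defaults; hyphenated compounds typically come apart.
nlp = spacy.load('en_core_web_sm')

given_tokens = ['A', 'well-known', 'executive', 'resigned', '.']
spacy_tokens = [t.text for t in nlp(' '.join(given_tokens))]

print(given_tokens == spacy_tokens)  # typically False
print(spacy_tokens)                  # e.g. ['A', 'well', '-', 'known', ...]
```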
To address these tokenization concerns, the second enrichment step re-parses all TACRED sentences with the Stanford CoreNLP parser, configuring it to adhere to the tokenization produced by the TUPA parser. This results in a tokenization-aligned `corenlp_pos`, `corenlp_head`, and `corenlp_ner` attribute set.
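As a quick illustration of the aligned attribute set (a sketch only; it reads the `train2` file produced in step 3 below), the three lists are parallel, with one entry per token of the re-parsed sentence:

```python
import json

# Read one enriched record and check that the corenlp_* lists line up:
# one POS tag, one head index and one NER label per token.
with open('/target/dir/data/train2') as f:
    record = json.loads(f.readline())

print(len(record['corenlp_pos']) ==
      len(record['corenlp_head']) ==
      len(record['corenlp_ner']))  # expected: True
```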
The modules have been tested on the following environment:
- Debian 10 (should work on other Linux flavors as well) with at least 16 GB of RAM
- Python 3.7.3
- OpenJDK 11.0.8 64bits (for the CoreNLP server)
- CUDA versions 10.0 and 10.1
- NVIDIA RTX 2070 and RTX 2080 GPUs
It is strongly recommended to follow the setup steps without deviation. Make sure to replace `/path/to/virtual/env` and `/target/dir` with your directories of choice.
```
python3 -m venv /path/to/virtual/env
source /path/to/virtual/env/bin/activate
pip install --upgrade pip
pip install wheel
pip install git+https://github.com/yyellin/tacred-enrichment.git
wget -O /target/dir/stanford-corenlp-full-2018-10-05.zip http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip /target/dir/stanford-corenlp-full-2018-10-05.zip -d /target/dir/
mkdir /target/dir/tupa-model ; cd /target/dir/tupa-model; curl -LO https://github.com/huji-nlp/tupa/releases/download/v1.4.0/bert_multilingual_layers_4_layers_pooling_weighted_align_sum.tar.gz; cd -
tar -zxvf /target/dir/tupa-model/bert_multilingual_layers_4_layers_pooling_weighted_align_sum.tar.gz -C /target/dir/tupa-model
```
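An optional sanity check, before continuing, that the package landed in the active environment:

```python
# Run inside the activated virtual env; a failed import means the
# pip install step above did not complete successfully.
import tacred_enrichment
print(tacred_enrichment.__file__)
```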
- Download the TACRED JSON files. For the purpose of these instructions `/target/dir/data` will be the designated directory; replace with your directory of choice.
- Ensure that you have activated the virtual env by running:

```
source /path/to/virtual/env/bin/activate
```
Step 1: Convert the original JSON file format into a "JSON line" format, in which there is one valid JSON value per line, each line representing a single sentence.
```
python -m tacred_enrichment.extra.json_to_lines_of_json --input /target/dir/data/train.json --output /target/dir/data/train
python -m tacred_enrichment.extra.json_to_lines_of_json --input /target/dir/data/dev.json --output /target/dir/data/dev
python -m tacred_enrichment.extra.json_to_lines_of_json --input /target/dir/data/test.json --output /target/dir/data/test
```
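Conceptually, this conversion amounts to the following (a sketch of the idea, not the module's actual implementation):

```python
import json

# Read TACRED's single JSON array and emit one JSON object per line.
with open('/target/dir/data/train.json') as src, \
        open('/target/dir/data/train', 'w') as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + '\n')
```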
Step 2: Produce a set of "JSON line" files in which each sentence contains the UCCA properties.
```
python -m tacred_enrichment.ucca_enrichment /target/dir/tupa-model/bert_multilingual_layers_4_layers_pooling_weighted_align_sum --input /target/dir/data/train --output /target/dir/data/train1
python -m tacred_enrichment.ucca_enrichment /target/dir/tupa-model/bert_multilingual_layers_4_layers_pooling_weighted_align_sum --input /target/dir/data/dev --output /target/dir/data/dev1
python -m tacred_enrichment.ucca_enrichment /target/dir/tupa-model/bert_multilingual_layers_4_layers_pooling_weighted_align_sum --input /target/dir/data/test --output /target/dir/data/test1
```
Note: on my setup, step 2 takes around 21 hours to complete.
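To see exactly which properties step 2 added, one can diff the keys of a record before and after enrichment (illustrative only; it assumes both files keep the same record order):

```python
import json

# Compare the first record of the input ('train') and output ('train1')
# "JSON line" files; the set difference is the UCCA-related additions.
with open('/target/dir/data/train') as before_f, \
        open('/target/dir/data/train1') as after_f:
    before = set(json.loads(before_f.readline()))
    after = set(json.loads(after_f.readline()))

print(sorted(after - before))
```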
Step 3: Produce a second set of "JSON line" files in which each sentence contains CoreNLP properties computed over the UCCA parser's tokenization. Choose an available port for the CoreNLP server; the commands below use port 9000. Note that the java command ends with an ampersand, so that the StanfordCoreNLPServer runs in the background and remains available for the subsequent python commands.
```
java -Djava.net.preferIPv4Stack=true -cp '/target/dir/stanford-corenlp-full-2018-10-05/*' edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 -threads 2 -maxCharLength 100000 > /dev/null &
python -m tacred_enrichment.corenlp_enrichment localhost 9000 --lines --input /target/dir/data/train1 --output /target/dir/data/train2
python -m tacred_enrichment.corenlp_enrichment localhost 9000 --lines --input /target/dir/data/dev1 --output /target/dir/data/dev2
python -m tacred_enrichment.corenlp_enrichment localhost 9000 --lines --input /target/dir/data/test1 --output /target/dir/data/test2
```
Note: on my setup, step 3 takes around 3.5 hours to complete.
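The idea of honoring a pre-supplied tokenization can be reproduced directly against the running server; this is a hedged sketch using the documented `tokenize.whitespace` and `ssplit.eolonly` server properties, not necessarily how `corenlp_enrichment` issues its requests:

```python
import json
import requests

# Ask the CoreNLP server to treat whitespace-separated input as the
# final tokenization, then read back POS tags, NER labels and heads.
props = {
    'annotators': 'tokenize,ssplit,pos,ner,depparse',
    'tokenize.whitespace': 'true',
    'ssplit.eolonly': 'true',
    'outputFormat': 'json',
}
tokens = ['Barack', 'Obama', 'visited', 'Paris', '.']
resp = requests.post('http://localhost:9000/',
                     params={'properties': json.dumps(props)},
                     data=' '.join(tokens).encode('utf-8'))
sentence = resp.json()['sentences'][0]

print([t['pos'] for t in sentence['tokens']])  # one POS tag per given token
print([t['ner'] for t in sentence['tokens']])  # one NER label per given token
print([(d['dep'], d['governor']) for d in sentence['basicDependencies']])
```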
Convert the "JSON line" format back into standard JSON. Backup your original train.json, dev.json and test.json, as the following steps will overwrite them:
```
python -m tacred_enrichment.extra.lines_of_json_to_json --input /target/dir/data/train2 --output /target/dir/data/train.json
python -m tacred_enrichment.extra.lines_of_json_to_json --input /target/dir/data/dev2 --output /target/dir/data/dev.json
python -m tacred_enrichment.extra.lines_of_json_to_json --input /target/dir/data/test2 --output /target/dir/data/test.json
```
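This final conversion is the conceptual inverse of step 1 (again a sketch, not the module's code):

```python
import json

# Collect one JSON value per line back into a single JSON array,
# restoring TACRED's original file layout.
with open('/target/dir/data/train2') as src, \
        open('/target/dir/data/train.json', 'w') as dst:
    json.dump([json.loads(line) for line in src], dst)
```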
All work contained in this package is licensed under the Apache License, Version 2.0.
While some parts of the Python code are adequately documented, others are not. I am happy to answer questions from all potential users of this module.