Relation extraction (RE) is the task of discovering relations between entities in weakly structured text. RE has many applications, such as knowledge-base population, question answering, and summarization. However, despite the increasing number of studies, there is a lack of cross-domain evaluation research. The purpose of this work is to explore how models can be adapted to changing entity types.
There are several ways to deal with the changing types of entities:
- **Fine-tuning**: We can retrain our model on newly obtained data, but the main problem is getting and annotating new documents. For more details, please refer to the following studies: Domain Adaptation for Relation Extraction with Domain Adversarial Neural Network and Instance Weighting for Domain Adaptation in NLP.
- **Ignoring**: Another way is to build a model that does not use any information about entity types and just marks entities in the text with special tokens (e.g. [SUBJ] and [/SUBJ], [OBJ] and [/OBJ], etc.). For more details, please refer to An Improved Baseline for Sentence-level Relation Extraction and A Generative Model for Relation Extraction and Classification.
- **Mapping**: A reasonable way is to build a mapping from the model's entity types to those of the new domain. However, there may be situations when it is impossible to build an unambiguous mapping (e.g. the diagram below, where the new type NUMBER corresponds to 5 old ones). In the case of an ambiguous mapping, we can try all suitable mappings, but if there are $N$ entities and $M$ candidates for each of them, $M^N$ model runs are required.
- **Diversified training**: The key point of this method is to replace the original entity types with corresponding synonyms (a small illustrative sketch of the Ignoring and Diversified training methods is given after the diagram below).
```mermaid
flowchart TB
subgraph a["New unknown entity type"]
subgraph b[" "]
num([NUMBER])
end
end
subgraph c["Familiar entity types"]
subgraph d[" "]
num(["NUMBER"]) --> quantity(["QUANTITY"])
num(["NUMBER"]) --> percent(["PERCENT"])
num(["NUMBER"]) --> ordinal(["ORDINAL"])
num(["NUMBER"]) --> cardinal(["CARDINAL"])
num(["NUMBER"]) --> money(["MONEY"])
end
end
```
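To make the Ignoring and Diversified training methods concrete, here is a minimal sketch; the `TYPE_SYNONYMS` table and the helper names are illustrative assumptions, not code from this repository (only the [SUBJ]/[OBJ] marker idea comes from the papers cited above):

```python
import random
from typing import List, Tuple

# Hypothetical synonym table for diversified training: each original entity type
# is replaced at training time by one of several interchangeable type names.
TYPE_SYNONYMS = {
    "PERSON": ["PERSON", "HUMAN", "INDIVIDUAL"],
    "ORGANIZATION": ["ORGANIZATION", "ORG", "COMPANY", "INSTITUTION"],
}


def mark_entities(tokens: List[str], subj: Tuple[int, int], obj: Tuple[int, int]) -> List[str]:
    """Ignoring: wrap subject/object spans with type-agnostic markers.

    `subj` and `obj` are (start, end) token indices, end exclusive.
    """
    s_start, s_end = subj
    o_start, o_end = obj
    marked = []
    for i, token in enumerate(tokens):
        if i == s_start:
            marked.append("[SUBJ]")
        if i == o_start:
            marked.append("[OBJ]")
        marked.append(token)
        if i == s_end - 1:
            marked.append("[/SUBJ]")
        if i == o_end - 1:
            marked.append("[/OBJ]")
    return marked


def diversify_type(entity_type: str) -> str:
    """Diversified training: sample a synonym for the original entity type."""
    return random.choice(TYPE_SYNONYMS.get(entity_type, [entity_type]))


print(mark_entities("Bill Gates founded Microsoft".split(), (0, 2), (3, 4)))
# ['[SUBJ]', 'Bill', 'Gates', '[/SUBJ]', 'founded', '[OBJ]', 'Microsoft', '[/OBJ]']
```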
The two main requirements of this work are:
- the adaptation method must be applicable without additional data;
- the inference speed of the model must remain unchanged.

Therefore, we will only compare the **Ignoring** and **Diversified training** methods.
Results, F1-score (%):

| Adaptation method | DocRED (SSAN-Adapt) | DocRED (DocUNet) | Re-TACRED (BERT_base) |
|---|---|---|---|
| Ignoring | 54.32 ± 0.05 | 60.24 ± 0.08 | 76.64 ± 0.37 |
| Diversified training | 51.62 ± 0.16 | 60.13 ± 0.08 | 74.96 ± 0.20 |
As we can see, the Ignoring method gives the best results.
The base classes are divided into 4 main categories:
- Examples' features:
  - Word
  - Mention
  - FactClass
  - AbstractFact
  - EntityFact
  - RelationFact
- Examples:
  - Document
  - PreparedDocument
  - AbstractDataset
- Models:
  - AbstractModel
- Utilities:
  - ModelManager
  - AbstractLoader
```mermaid
classDiagram
direction TB
ModelManager "1" --> "1" AbstractModel : init, train and evaluate
ModelManager "1" --> "1" AbstractLoader : use to load documents
AbstractModel ..> AbstractDataset : use
AbstractDataset "1" o-- "1..*" Document : process docs
AbstractDataset "1" o-- "1..*" PreparedDocument : convert to prepared docs
AbstractModel <|-- SSANAdapt
AbstractModel <|-- DocUNet
AbstractModel <|-- BertBaseline
AbstractLoader <|-- DocREDLoader
AbstractLoader <|-- TacredLoader
class ModelManager{
+config: ManagerConfig
+loader: AbstractLoader
+model: AbstractModel
+train()
+evaluate()
+predict()
+test()
+save_model()
}
class AbstractLoader{
<<Abstract>>
+load(path: Path)* Iterator[Document]
}
class AbstractModel{
<<Abstract>>
+prepare_dataset(documents)*
+forward(*args, **kwargs)*
+evaluate(dataloader: DataLoader, output_path: Path)*
+predict(documents: List[Document], dataloader: DataLoader, output_path: Path)*
+test(dataloader: DataLoader, output_path: Path)*
}
class AbstractDataset{
<<Abstract>>
#documents: Tuple[Document]
#prepared_docs: List[PreparedDocument]
+prepare_documents()*
}
class PreparedDocument{
<<NamedTuple>>
}
```
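The diagram above can be read as a set of Python contracts. The following is only a rough sketch of those contracts; basing the classes on `torch.nn.Module` and `torch.utils.data.Dataset` is an assumption, and the real signatures live in the repository:

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Iterator, List, Tuple

import torch
from torch.utils.data import DataLoader, Dataset


class AbstractLoader(ABC):
    """Turns raw dataset files into Document objects."""

    @abstractmethod
    def load(self, path: Path) -> Iterator["Document"]:
        ...


class AbstractDataset(Dataset, ABC):
    """Keeps source Documents and their model-ready PreparedDocument counterparts."""

    def __init__(self, documents: Tuple["Document", ...]):
        self._documents = documents
        self._prepared_docs: List["PreparedDocument"] = []

    @abstractmethod
    def prepare_documents(self) -> None:
        ...


class AbstractModel(torch.nn.Module, ABC):
    """Interface implemented by SSAN-Adapt, DocUNet and the BERT baseline."""

    @abstractmethod
    def prepare_dataset(self, documents) -> AbstractDataset:
        ...

    @abstractmethod
    def evaluate(self, dataloader: DataLoader, output_path: Path) -> None:
        ...

    @abstractmethod
    def predict(self, documents, dataloader: DataLoader, output_path: Path) -> None:
        ...

    @abstractmethod
    def test(self, dataloader: DataLoader, output_path: Path) -> None:
        ...
```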
```mermaid
classDiagram
direction TB
Document "1" o-- "1..*" Word
Document "1" o-- "1..*" AbstractFact
AbstractFact <|-- EntityFact
AbstractFact <|-- RelationFact
Word "1..*" --o "1" Mention
Mention "1" --o "1..*" EntityFact : fact is mentioned in
AbstractFact "1" --> "1" FactClass : is a
class Document{
+doc_id: str
+text: str
+words: Tuple[Span]
+sentences: Tuple[Tuple[Span]]
+entity_facts: Tuple[EntityFact]
+relation_facts: Tuple[RelationFact]
+coreference_chains: Dict[int, Tuple[EntityFact]]
+add_relation_facts(facts: Iterable[RelationFact])
}
class Word{
+start_idx: int
+end_idx: int
}
class Mention{
+words: Tuple[Word]
}
class FactClass{
<<Enumeration>>
ENTITY
RELATION
}
class AbstractFact{
<<Abstract>>
+name: str
+type_id: str
+fact_class: FactClass
}
class EntityFact{
+coreference_id: int
+mentions: FrozenSet[Mention]
}
class RelationFact{
+from_fact: EntityFact
+to_fact: EntityFact
}
```
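For reference, the fact-level classes from the diagram could be approximated with plain dataclasses along these lines; this is a loose sketch, and the dataclass/frozen choices are assumptions rather than the repository's exact code:

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Tuple


class FactClass(Enum):
    ENTITY = "entity"
    RELATION = "relation"


@dataclass(frozen=True)
class Word:
    start_idx: int  # character offsets of the word in the document text
    end_idx: int


@dataclass(frozen=True)
class Mention:
    words: Tuple[Word, ...]  # contiguous words that form one textual mention


@dataclass(frozen=True)
class AbstractFact:
    name: str
    type_id: str          # entity type (possibly a diversified synonym) or relation label
    fact_class: FactClass


@dataclass(frozen=True)
class EntityFact(AbstractFact):
    coreference_id: int          # facts sharing an id form one coreference chain
    mentions: FrozenSet[Mention]


@dataclass(frozen=True)
class RelationFact(AbstractFact):
    from_fact: EntityFact  # subject entity
    to_fact: EntityFact    # object entity
```

A Document then aggregates the words, sentences, EntityFacts and RelationFacts, and exposes coreference chains that group EntityFacts sharing the same coreference_id.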
You need to implement a new loader (inherited from the AbstractLoader class) that converts new dataset examples into instances of the Document class; a rough sketch is given below.
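For example, a loader for a hypothetical JSON-lines corpus might look roughly like this (the file format, field names and class name are invented for illustration; import paths for AbstractLoader and Document are omitted):

```python
import json
from pathlib import Path
from typing import Iterator


class MyCorpusLoader(AbstractLoader):
    """Hypothetical loader for a JSON-lines file with one example per line."""

    def load(self, path: Path) -> Iterator[Document]:
        with path.open("r", encoding="utf-8") as file:
            for line in file:
                yield self._to_document(json.loads(line))

    def _to_document(self, example: dict) -> Document:
        # Build Word spans, EntityFacts and RelationFacts from the raw example
        # and assemble them into a Document that any model can consume.
        ...
```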
You need to implement a new model (inherited from the AbstractModel class) and a corresponding set of datasets (inherited from the AbstractDataset class); a rough skeleton is given below.
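A new model then only has to fill in the abstract methods. Very roughly (class names and the body of each method are placeholders, not the repository's real implementation):

```python
class MyDataset(AbstractDataset):
    """Hypothetical dataset that tensorizes Documents for MyModel."""

    def prepare_documents(self) -> None:
        # Convert each stored Document into a PreparedDocument
        # (token ids, entity masks, relation labels, ...).
        ...


class MyModel(AbstractModel):
    """Hypothetical model skeleton; only the interface is shown."""

    def prepare_dataset(self, documents) -> MyDataset:
        return MyDataset(tuple(documents))

    def forward(self, *args, **kwargs):
        ...

    def evaluate(self, dataloader, output_path):
        ...

    def predict(self, documents, dataloader, output_path):
        ...

    def test(self, dataloader, output_path):
        ...
```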
- `bash scripts/download_datasets.sh`
- In order to download the original TACRED dataset, visit the LDC TACRED webpage. If you are an LDC member, the access is free; otherwise, an access fee of $25 is needed. In addition to the original version of TACRED, we also use the new label-corrected version of the TACRED dataset, which fixes a substantial portion of the dev/test labels in the original release. For more details, see the TACRED Revisited paper and their original code base.
After downloading and processing:
- move the tacred folder to the `./etc/datasets` folder
- put all patched files in the `./etc/dataset/tacred/data/json` directory
```bash
# Build the Docker image and start a container with GPU access
cd path/to/project
docker build ./
docker run -it --gpus=all __image_id__ /bin/bash

# Install the dependencies and run training/evaluation
pip3 install -r requirements/requirements.txt
bash scripts/main.sh -c path/to/config -v __gpu_id__ -s __seed__ -o path/to/model/output/dir
```