Change of Relation Extraction's Entity Domain

Relation extraction (RE) is the task of discovering relations between entities in weakly structured text. There are many applications for RE, such as knowledge-base population, question answering, summarization, and so on. However, despite the increasing number of studies, there is a lack of cross-domain evaluation research. The purpose of this work is to explore how models can be adapted to changing entity types.

1. Adaptation methods

There are several ways to deal with the changing types of entities:

  1. Fine-tuning

    We can retrain our model on newly obtained data, but the main problem is obtaining and annotating new documents. For more details, please refer to the following studies: Domain Adaptation for Relation Extraction with Domain Adversarial Neural Network and Instance Weighting for Domain Adaptation in NLP.

  2. Ignoring

    Another way is to build a model that does not use any information about entity types and simply marks entities in the text with special tokens (e.g. [SUBJ] and [/SUBJ], [OBJ] and [/OBJ], etc.). For more details, please refer to An Improved Baseline for Sentence-level Relation Extraction and A Generative Model for Relation Extraction and Classification.

  3. Mapping

    A reasonable way is to build a mapping from the model's entity types to those of the new domain. However, there may be situations when it is impossible to build an unambiguous mapping (e.g. the diagram below, where the new type NUMBER corresponds to 5 old ones).

    In the case of an ambiguous mapping, we can try all suitable mappings, but if there are $N$ entities and $M$ candidates for each of them, $M^N$ model runs are required.

  4. Diversified training

    The last method is called diversified training. Its key idea is to replace the original entity types with corresponding synonyms.

```mermaid
flowchart TB
    subgraph a["New unknown entity type"]
        subgraph b[" "]
            num([NUMBER])
        end
    end
    subgraph c["Familiar entity types"]
        subgraph d[" "]
            num(["NUMBER"]) --> quantity(["QUANTITY"])
            num(["NUMBER"]) --> percent(["PERCENT"])
            num(["NUMBER"]) --> ordinal(["ORDINAL"])
            num(["NUMBER"]) --> cardinal(["CARDINAL"])
            num(["NUMBER"]) --> money(["MONEY"])
        end
    end
```
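The combinatorial blow-up of trying all suitable mappings can be illustrated with a short sketch (the type names and the candidate table are illustrative assumptions, not the project's actual mapping interface):

```python
from itertools import product

# Old-domain candidates for each new-domain entity type. Hypothetical example
# based on the diagram above: NUMBER maps ambiguously to five old types.
candidates = {
    "NUMBER_1": ["QUANTITY", "PERCENT", "ORDINAL", "CARDINAL", "MONEY"],
    "NUMBER_2": ["QUANTITY", "PERCENT", "ORDINAL", "CARDINAL", "MONEY"],
}

# With N entities and M candidates for each, enumerating every assignment
# yields M ** N candidate mappings, each requiring a separate model run.
entities = list(candidates)
mappings = [dict(zip(entities, combo))
            for combo in product(*(candidates[e] for e in entities))]

print(len(mappings))  # 5 ** 2 = 25 model runs for just two entities
```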

The two main requirements of this work are:

  • applicability of the adaptation method without additional data;
  • no change in the model's inference speed.

Therefore, we will only compare the Ignoring and Diversified training methods.
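A minimal sketch of the two compared preprocessing strategies, assuming token-level spans and a hypothetical synonym table (the project's actual marker tokens and synonym lists may differ):

```python
import random


def mark_entities(tokens, subj, obj):
    """Ignoring: wrap (start, end) token spans (end exclusive) in special
    markers, discarding entity-type information entirely."""
    markers = {subj[0]: "[SUBJ]", subj[1]: "[/SUBJ]",
               obj[0]: "[OBJ]", obj[1]: "[/OBJ]"}
    out = []
    for i in range(len(tokens) + 1):
        if i in markers:          # emit a marker before this position
            out.append(markers[i])
        if i < len(tokens):
            out.append(tokens[i])
    return out


# Hypothetical synonym table for diversified training.
SYNONYMS = {
    "MONEY": ["MONEY", "CURRENCY", "AMOUNT"],
    "PERSON": ["PERSON", "HUMAN", "INDIVIDUAL"],
}


def diversify(entity_type, rng=random):
    """Diversified training: swap an entity type for a random synonym;
    unknown types pass through unchanged."""
    return rng.choice(SYNONYMS.get(entity_type, [entity_type]))


tokens = ["Alice", "paid", "25", "dollars"]
print(mark_entities(tokens, subj=(0, 1), obj=(2, 4)))
# → ['[SUBJ]', 'Alice', '[/SUBJ]', 'paid', '[OBJ]', '25', 'dollars', '[/OBJ]']
```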

2. Achieved results

Results, F1-score (%):

| Adaptation method    | SSAN-Adapt (DocRED) | DocUNet (DocRED) | BERT_base (Re-TACRED) |
|----------------------|---------------------|------------------|-----------------------|
| Ignoring             | 54.32 ± 0.05        | 60.24 ± 0.08     | 76.64 ± 0.37          |
| Diversified training | 51.62 ± 0.16        | 60.13 ± 0.08     | 74.96 ± 0.20          |

As we can see, the Ignoring method achieves the best results.

3. Pipeline

The base classes are divided into 4 main categories:

  • Examples' features:
    • Word
    • Mention
    • FactClass
    • AbstractFact
      • EntityFact
      • RelationFact
  • Examples:
    • Document
    • PreparedDocument
    • AbstractDataset
  • Models:
    • AbstractModel
  • Utilities:
    • ModelManager
    • AbstractLoader

Examples' features

```mermaid
classDiagram
direction TB
    ModelManager "1" --> "1" AbstractModel : init, train and evaluate
    ModelManager "1" --> "1" AbstractLoader : use to load documents
    AbstractModel ..> AbstractDataset : use
    AbstractDataset "1" o-- "1..*" Document : process docs
    AbstractDataset "1" o-- "1..*" PreparedDocument : convert to prepared docs

    AbstractModel <|-- SSANAdapt
    AbstractModel <|-- DocUNet
    AbstractModel <|-- BertBaseline

    AbstractLoader <|-- DocREDLoader
    AbstractLoader <|-- TacredLoader

    class ModelManager{
        +config: ManagerConfig
        +loader: AbstractLoader
        +model: AbstractModel
        +train()
        +evaluate()
        +predict()
        +test()
        +save_model()
    }

    class AbstractLoader{
        <<Abstract>>
        +load(path: Path)* Iterator[Document]
    }

    class AbstractModel{
        <<Abstract>>
        +prepare_dataset(documents)*
        +forward(*args, **kwargs)*
        +evaluate(dataloader: DataLoader, output_path: Path)*
        +predict(documents: List[Document], dataloader: DataLoader, output_path: Path)*
        +test(dataloader: DataLoader, output_path: Path)*
    }

    class AbstractDataset{
        <<Abstract>>
        #documents: Tuple[Document]
        #prepared_docs: List[PreparedDocument]
        +prepare_documents()*
    }

    class PreparedDocument{
        <<NamedTuple>>
    }
```
```mermaid
classDiagram
direction TB
   Document "1" o-- "1..*" Word
   Document "1" o-- "1..*" AbstractFact
   AbstractFact <|-- EntityFact
   AbstractFact <|-- RelationFact
   Word "1..*" --o "1" Mention
   Mention "1" --o "1..*" EntityFact : fact is mentioned in
   AbstractFact "1" --> "1" FactClass : is a

   class Document{
      +doc_id: str
      +text: str
      +words: Tuple[Span]
      +sentences: Tuple[Tuple[Span]]
      +entity_facts: Tuple[EntityFact]
      +relation_facts: Tuple[RelationFact]
      +coreference_chains: Dict[int, Tuple[EntityFact]]
      +add_relation_facts(facts: Iterable[RelationFact])
   }

   class Word{
      +start_idx: int
      +end_idx: int
   }

   class Mention{
       +words: Tuple[Word]
   }

   class FactClass{
      <<Enumeration>>
      ENTITY
      RELATION
   }

   class AbstractFact{
      <<Abstract>>
      +name: str
      +type_id: str
      +fact_class: FactClass
   }

   class EntityFact{
      +coreference_id: int
      +mentions: FrozenSet[Mention]
   }

   class RelationFact{
      +from_fact: EntityFact
      +to_fact: EntityFact
   }
```
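The relationships in the diagram above can be sketched with plain dataclasses. This is a simplified approximation for orientation only; the project's real classes carry more fields and validation:

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Tuple


class FactClass(Enum):
    ENTITY = "entity"
    RELATION = "relation"


@dataclass(frozen=True)
class Word:
    start_idx: int
    end_idx: int


@dataclass(frozen=True)
class Mention:
    words: Tuple[Word, ...]


@dataclass(frozen=True)
class EntityFact:
    name: str
    type_id: str
    coreference_id: int
    mentions: FrozenSet[Mention]
    fact_class: FactClass = FactClass.ENTITY


@dataclass(frozen=True)
class RelationFact:
    name: str
    type_id: str
    from_fact: EntityFact
    to_fact: EntityFact
    fact_class: FactClass = FactClass.RELATION


# A NUMBER entity whose single mention covers one word span.
word = Word(start_idx=0, end_idx=5)
num = EntityFact("25", "NUMBER", 0, frozenset({Mention((word,))}))
```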

3.1 How to add a new dataset?

You need to implement a new loader (inherited from the AbstractLoader class) that converts new dataset examples into instances of the Document class.
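For example, a loader for a hypothetical JSON-lines dataset might look like the sketch below. The `AbstractLoader` and `Document` definitions here are minimal stand-ins approximated from the class diagram above, not the project's real classes:

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Iterator


class Document:
    """Stand-in for the project's Document class (real one has more fields)."""

    def __init__(self, doc_id: str, text: str):
        self.doc_id = doc_id
        self.text = text


class AbstractLoader(ABC):
    @abstractmethod
    def load(self, path: Path) -> Iterator[Document]: ...


class MyJsonlLoader(AbstractLoader):
    """Converts one JSON object per line into a Document instance."""

    def load(self, path: Path) -> Iterator[Document]:
        with path.open() as f:
            for line in f:
                example = json.loads(line)
                yield Document(doc_id=example["id"], text=example["text"])


# Smoke test with a temporary one-line dataset file.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    tmp.write('{"id": "doc1", "text": "Alice paid 25 dollars."}\n')

docs = list(MyJsonlLoader().load(Path(tmp.name)))
```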

3.2 How to add a new model?

You need to implement a new model (inherited from the AbstractModel class) and a corresponding set of datasets (inherited from the AbstractDataset class).
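A skeleton of that pairing might look as follows; the abstract interfaces below are stand-ins approximated from the class diagram, and the preparation logic is a placeholder:

```python
from abc import ABC, abstractmethod


class AbstractDataset(ABC):
    """Stand-in approximating the diagram's AbstractDataset interface."""

    def __init__(self, documents):
        self.documents = tuple(documents)
        self.prepared_docs = []

    @abstractmethod
    def prepare_documents(self): ...


class AbstractModel(ABC):
    """Stand-in approximating the diagram's AbstractModel interface."""

    @abstractmethod
    def prepare_dataset(self, documents): ...


class MyDataset(AbstractDataset):
    def prepare_documents(self):
        # Real code would tokenize and tensorize each document here.
        self.prepared_docs = [{"doc": d} for d in self.documents]


class MyModel(AbstractModel):
    def prepare_dataset(self, documents):
        dataset = MyDataset(documents)
        dataset.prepare_documents()
        return dataset


dataset = MyModel().prepare_dataset(["doc_a", "doc_b"])
```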

4. How to run?

4.1 Download datasets

  1. Run `bash scripts/download_datasets.sh`

  2. In order to download the original TACRED dataset, visit the LDC TACRED webpage. If you are an LDC member, the access will be free; otherwise, an access fee of $25 is needed. In addition to the original version of TACRED, you should also use the new label-corrected version of the TACRED dataset, which fixed a substantial portion of the dev/test labels in the original release. For more details, see the TACRED Revisited paper and their original code base.

    After downloading and processing:

    • move the `tacred` folder to the `./etc/datasets` folder
    • put all patched files in the `./etc/datasets/tacred/data/json` directory

4.2.1 Build docker container (optional)

  1. `cd path/to/project`
  2. `docker build ./`
  3. `docker run -it --gpus=all __image_id__ /bin/bash`

4.2.2 Or, instead of building a container, install the requirements

`pip3 install -r requirements/requirements.txt`

4.3 Start training

`bash scripts/main.sh -c path/to/config -v __gpu_id__ -s __seed__ -o path/to/model/output/dir`
