We introduce ADELIE (Aligning large language moDELs on Information Extraction), an aligned LLM that effectively solves various IE tasks, including closed IE, open IE, and on-demand IE. We first collect and construct IEInstruct, a high-quality alignment corpus for IE. We then train ADELIE-SFT using instruction tuning on IEInstruct. We further train ADELIE-SFT with the direct preference optimization (DPO) objective, resulting in ADELIE-DPO. Extensive experiments on various held-out IE datasets demonstrate that our models (ADELIE-SFT and ADELIE-DPO) achieve state-of-the-art (SoTA) performance among open-source models. We also evaluate the general capabilities of the ADELIE models, and the results show no noticeable decline.
- 📖 Paper: ADELIE: Aligning Large Language Models on Information Extraction
- 🐧 ADELIE in the 🤗HuggingFace Hub: THU-KEG/ADELIE
- 🌟 IEInstruct and IEFeedback: Datasets
The code repository is based on PyTorch and Transformers. Please use the following command to install the necessary dependencies: `pip install -r requirements.txt`
We release two ADELIE models based on Llama-2 (7B). The models are available in the 🤗HuggingFace Hub.
Model | IE Average F1 (%) | General Average Score (%) | 🤗HuggingFace Hub |
---|---|---|---|
ADELIE-SFT | 47.5 | 53.5 | ADELIE-SFT |
ADELIE-DPO | 47.7 | 53.8 | ADELIE-DPO |
ADELIE-SFT is trained on IEInstruct.
It is then further trained with the direct preference optimization (DPO) objective on IEFeedback, resulting in ADELIE-DPO.
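The released checkpoints can presumably be loaded like any other causal language model in Transformers. The following is a minimal sketch, not the authors' inference code: the repository name `THU-KEG/ADELIE-SFT` is an assumption based on the Hub organization and the model name in the table, and `build_prompt` is a hypothetical prompt layout, so check the model card for the actual template.

```python
def load_adelie(model_id: str = "THU-KEG/ADELIE-SFT"):
    """Load tokenizer and model from the Hub (model_id is an assumed repository name)."""
    # Imported lazily so build_prompt can be used without Transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model


def build_prompt(instruction: str, text: str) -> str:
    """Hypothetical plain-text prompt layout; the real template may differ."""
    return f"{instruction.strip()}\n\nText: {text.strip()}\n\nAnswer:"


def extract(instruction: str, text: str, max_new_tokens: int = 128) -> str:
    """Generate an answer for one instruction/text pair."""
    tokenizer, model = load_adelie()
    inputs = tokenizer(build_prompt(instruction, text), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated answer is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `extract("Extract all person names.", "Bob lives in Paris.")` would download the checkpoint on first use, which requires network access and enough memory for a 7B model.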
Among our training and testing tasks, the copyright of TACRED, ACE 2005, and RichERE belongs to LDC, and we access them through our LDC membership. All the other datasets are open-sourced, and we strictly adhere to their licenses.
We removed the non-open-source datasets from IEInstruct and IEFeedback and made these two training datasets public. You can download the data from ADELIE Datasets.
To access the full version of IEInstruct and the evaluation datasets, first prepare the entire raw dataset as described in data/Readme.md, then proceed with the following instructions:
#Generate a unified data format
sh ./scripts/generate_unified_data.sh
#Generate IEInstruct mixture
sh ./scripts/generate_mixtural_train_data.sh
#Generate sampled data
sh ./scripts/generate_dpo_sample_data.sh
#Sample output from ADELIE-SFT
sh ./train4llama/scripts/predict.sh
#Generate IEFeedback mixture
sh ./scripts/generate_mixtural_dpo_data.sh
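Because each step consumes the output of the previous one, it can help to run them from a single driver that stops at the first failure. The sketch below is one possible wrapper, not part of the repository: `PIPELINE` and `run_pipeline` are hypothetical names, and only the script paths are taken from the steps above.

```python
import subprocess

# Script paths are copied from the steps above, in the order listed there.
PIPELINE = [
    ("Generate a unified data format", "./scripts/generate_unified_data.sh"),
    ("Generate IEInstruct mixture", "./scripts/generate_mixtural_train_data.sh"),
    ("Generate sampled data", "./scripts/generate_dpo_sample_data.sh"),
    ("Sample output from ADELIE-SFT", "./train4llama/scripts/predict.sh"),
    ("Generate IEFeedback mixture", "./scripts/generate_mixtural_dpo_data.sh"),
]


def run_pipeline(steps=PIPELINE, dry_run=False):
    """Run each script in order; return the list of commands that were issued."""
    executed = []
    for index, (description, script) in enumerate(steps, start=1):
        command = ["sh", script]
        print(f"[{index}/{len(steps)}] {description}: {' '.join(command)}")
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never run on incomplete data.
            subprocess.run(command, check=True)
        executed.append(command)
    return executed
```

Calling `run_pipeline(dry_run=True)` only prints the commands, which is a cheap way to confirm the order before starting the actual data generation from the repository root.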
First, generate the ADELIE dataset as described above.
Then train ADELIE-SFT and ADELIE-DPO by running the following commands.
# ADELIE-SFT:
sh train4llama/scripts/finetune_with_accelerate.sh
# ADELIE-DPO:
sh train4llama/scripts/dpo_train_with_accelerate.sh
Please note that the training data for DPO includes ADELIE-SFT generation. Therefore, upon completing the ADELIE-SFT training, it is necessary to generate DPO training data following the method mentioned above for IEFeedback dataset generation.
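For readers unfamiliar with the DPO objective, the per-example loss can be written in a few lines. This is a sketch of the standard DPO formulation for a single preference pair, not the repository's training code: the function name is hypothetical, the inputs are summed sequence log-probabilities, and `beta=0.1` is a commonly used default rather than a value confirmed by this README.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of responses."""
    # Log-ratio of the policy to the frozen reference model for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Loss is -log(sigmoid(margin)), computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The reference model here corresponds to ADELIE-SFT, which is why its generations and checkpoint must exist before DPO training starts. The loss falls as the policy assigns relatively more probability to the chosen response than the reference does.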
Our training code is based on open-instruct.
We have publicly released preprocessed test datasets for evaluation of IE capabilities, excluding the RichERE dataset. Execute the following command to perform IE ability testing.
Note: For the on-demand IE and open IE datasets, you must first download the raw data from ODIE and ROBUST, respectively, and place them in the data directory before running the evaluation.
sh ./train4llama/scripts/eval.sh
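The results table reports F1 scores. As an illustration of how such a score is typically computed for extraction tasks, here is a minimal micro-F1 sketch over sets of predicted and gold items. It is an assumption for illustration only: the repository's evaluation script may match and aggregate differently per task, and `micro_f1` is a hypothetical helper.

```python
def micro_f1(predictions, golds):
    """Micro-averaged F1 over per-example collections of hashable items."""
    true_positives = num_predicted = num_gold = 0
    for predicted, gold in zip(predictions, golds):
        predicted_set, gold_set = set(predicted), set(gold)
        # An item counts as correct only on an exact match.
        true_positives += len(predicted_set & gold_set)
        num_predicted += len(predicted_set)
        num_gold += len(gold_set)
    precision = true_positives / num_predicted if num_predicted else 0.0
    recall = true_positives / num_gold if num_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, items could be `(span, type)` tuples for named entity recognition, so a correct span with the wrong type counts as both a false positive and a false negative.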
@misc{qi2024adelie,
title={ADELIE: Aligning Large Language Models on Information Extraction},
author={Yunjia Qi and Hao Peng and Xiaozhi Wang and Bin Xu and Lei Hou and Juanzi Li},
year={2024},
eprint={2405.05008},
archivePrefix={arXiv},
primaryClass={cs.CL}
}