MaintIE: A Fine-Grained Annotation Schema and Benchmark for Information Extraction from Low-Quality Maintenance Short Texts
This repository contains the data, models, and code accompanying the paper titled MaintIE: A Fine-Grained Annotation Schema and Benchmark for Information Extraction from Low-Quality Maintenance Short Texts published in LREC-Coling 2024.
Maintenance work orders (MWOs) are concise, information-rich, user-generated technical texts capturing data on the state of, and work on, machines, infrastructure, and other engineered assets. These assets are the foundation of our modern economy. The information captured in MWOs is vital for asset management decision-making but is challenging to extract and use at scale.
This repository contains MaintIE, a multi-level, fine-grained annotation scheme for entity recognition and relation extraction, consisting of 5 top-level classes (PhysicalObject, State, Process, Activity, and Property), 224 leaf entities, and 6 relations tailored to MWOs. Using MaintIE, we have curated a multi-annotator, high-quality, fine-grained corpus of 1,076 annotated texts. Additionally, we present a coarse-grained corpus of 7,000 texts and consider its performance for bootstrapping and enhancing fine-grained information extraction. Using these corpora, we provide model performance measures for benchmarking automated entity recognition and relation extraction. The MaintIE scheme, corpus, and models are available under the MIT license, encouraging further community exploration and innovation in extracting valuable insights from MWOs.
The MaintIE annotation scheme is described in the Scheme section of the repository.
The annotated MaintIE corpus is composed of two sub-corpora: 1) the Fine-Grained Expert-Annotated corpus (`./data/gold_release.json`), and 2) the Coarse-Grained Large-Scale corpus (`./data/silver_release.json`). Statistics of the MaintIE corpora, including the top-level entities and relations in these two sub-corpora, are outlined below.
Both corpora consist of items, each pairing an MWO text with its annotated entities and relations. An example item is shown below.
    [
      {
        "text": "change out engine",
        "tokens": ["change", "out", "engine"],
        "entities": [
          {
            "start": 0,
            "end": 2,
            "type": "Activity/MaintenanceActivity/Replace"
          },
          {
            "start": 2,
            "end": 3,
            "type": "PhysicalObject/DrivingObject/CombustionEngine"
          }
        ],
        "relations": [
          {
            "head": 0,
            "tail": 1,
            "type": "hasParticipant/hasPatient"
          }
        ]
      }
    ]
Where the fields correspond to:

- `text`: Lexically normalised and sanitised MWO short text (string)
- `tokens`: Tokenized text (Array[string])
- `entities`: Entities corresponding to tokens in `tokens` (Array[Object])
  - `start`: Start token index of the entity span (integer)
  - `end`: End token index of the entity span (integer)
  - `type`: Entity type corresponding to the entity span (string)
- `relations`: Relations between entities in `entities` (Array[Object])
  - `head`: Index of the head entity (integer)
  - `tail`: Index of the tail entity (integer)
  - `type`: Relation type corresponding to the relation (string)
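As a quick orientation, the following minimal sketch loads the gold corpus and resolves each annotation back to its surface form, assuming the file paths above and end-exclusive token indices:

```python
import json

# Load the fine-grained (gold) corpus; the path assumes the repository layout above.
with open("./data/gold_release.json", "r", encoding="utf-8") as f:
    corpus = json.load(f)

item = corpus[0]
print(item["text"])

# Entities reference token spans; indices are assumed to be end-exclusive.
for entity in item["entities"]:
    surface = " ".join(item["tokens"][entity["start"]:entity["end"]])
    print(f"{surface} -> {entity['type']}")

# Relations reference entities by their position in the "entities" list.
for relation in item["relations"]:
    head = item["entities"][relation["head"]]["type"]
    tail = item["entities"][relation["tail"]]["type"]
    print(f"{head} --{relation['type']}--> {tail}")
```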
Before undergoing semantic annotation, the MaintIE corpus underwent two primary preprocessing steps: normalisation and sanitisation.
Normalisation involved:
- Converting non-canonical words to their canonical forms.
- Expanding abbreviations.
- Correcting character casing.
Sanitisation ensured the privacy and relevance of the data by:

- Masking sensitive information with the token `<sensitive>`.
- Representing non-semantic data, such as IDs, numbers, and dates, with the respective tokens: `<id>`, `<num>`, and `<date>` (see the sketch after this list).
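The sketch below illustrates one way such masking could be implemented with regular expressions. The patterns, placeholder handling, and function name are assumptions for illustration only, not the preprocessing code used to build MaintIE.

```python
import re

# Illustrative masking rules only; these regexes are assumptions, not the MaintIE pipeline.
# Sensitive terms (masked with <sensitive>) would additionally need a domain-specific list.
MASKING_RULES = [
    (re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"), "<date>"),  # e.g. 12/03/2021
    (re.compile(r"\b[A-Za-z]{1,3}\d{3,}\b"), "<id>"),              # e.g. PMP1234
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<num>"),                   # standalone numbers
]


def sanitise(text: str) -> str:
    """Replace non-semantic substrings with placeholder tokens."""
    for pattern, token in MASKING_RULES:
        text = pattern.sub(token, text)
    return text


print(sanitise("replace seal on pump PMP1234 12/03/2021"))
# -> replace seal on pump <id> <date>
```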
For examples illustrating this transformation process, refer to the overview section.
| Measure | Value |
|---|---|
| Total Texts | 8,076 (1,076 + 7,000) |
| Total Tokens | 43,674 |
| Unique Tokens (Vocabulary) | 2,409 |
| Minimum Tokens / Text | 1 |
| Maximum Tokens / Text | 13 |
| Average Tokens / Text | 5.4 |
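These corpus-level statistics can be recomputed directly from the two release files; the minimal sketch below assumes the file paths given above.

```python
import json

# Combine both sub-corpora; paths follow the repository layout described above.
items = []
for path in ("./data/gold_release.json", "./data/silver_release.json"):
    with open(path, "r", encoding="utf-8") as f:
        items.extend(json.load(f))

token_counts = [len(item["tokens"]) for item in items]
vocabulary = {token for item in items for token in item["tokens"]}

print("Total Texts:", len(items))
print("Total Tokens:", sum(token_counts))
print("Unique Tokens (Vocabulary):", len(vocabulary))
print("Min / Max / Avg Tokens per Text:",
      min(token_counts), max(token_counts),
      round(sum(token_counts) / len(token_counts), 1))
```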
The Fine-Grained Expert-Annotated corpus (gold standard) comprises 1,076 texts, each double-annotated by two domain experts. The tables below contain the distribution of top-level entities and relations in this corpus.
Entities:

| # | Entity Type | Total Count | Total % | Unique Count | Unique % |
|---|---|---|---|---|---|
| 1 | Activity | 278 | 9.0 | 27 | 9.0 |
| 2 | PhysicalObject | 1,994 | 58.7 | 222 | 73.8 |
| 3 | Process | 146 | 4.0 | 9 | 3.0 |
| 4 | Property | 35 | 1.0 | 1 | 0.3 |
| 5 | State | 438 | 12.9 | 42 | 14.0 |
| | Total | 3,397 | 100 | 301 | 100 |

Relations:

| # | Relation Type | Total Count | Total % | Unique Count | Unique % |
|---|---|---|---|---|---|
| 1 | contains | 38 | 1.6 | 15 | 0.9 |
| 2 | hasPart | 533 | 22.8 | 417 | 23.7 |
| 3 | hasParticipant/hasAgent | 166 | 7.1 | 127 | 7.2 |
| 4 | hasParticipant/hasPatient | 1,206 | 51.5 | 936 | 53.3 |
| 5 | hasProperty | 34 | 1.5 | 28 | 1.6 |
| 6 | isA | 364 | 15.5 | 234 | 13.3 |
| | Total | 2,341 | 100 | 1,757 | 100 |
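The "Total" columns for either sub-corpus can be recomputed by truncating each hierarchical entity label to its top level; a minimal sketch follows (the path is assumed; swap in `silver_release.json` for the coarse-grained corpus).

```python
import json
from collections import Counter

# Path assumes the repository layout described above.
with open("./data/gold_release.json", "r", encoding="utf-8") as f:
    corpus = json.load(f)

entity_counts, relation_counts = Counter(), Counter()
for item in corpus:
    for entity in item["entities"]:
        # e.g. "Activity/MaintenanceActivity/Replace" -> "Activity"
        entity_counts[entity["type"].split("/")[0]] += 1
    for relation in item["relations"]:
        # Relation labels (e.g. "hasParticipant/hasPatient") are kept in full.
        relation_counts[relation["type"]] += 1

for name, counts in (("Entity", entity_counts), ("Relation", relation_counts)):
    total = sum(counts.values())
    for label, count in counts.most_common():
        print(f"{name} {label}: {count} ({100 * count / total:.1f}%)")
```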
The Coarse-Grained Large-Scale corpus (silver standard) comprises 7,000 texts annotated by a deep learning model trained on the fine-grained corpus and subsequently reviewed and validated by a single domain expert. The table below contains the distribution of top-level entities and relations in this corpus.
Entities:

| # | Entity Type | Total Count | Total % | Unique Count | Unique % |
|---|---|---|---|---|---|
| 1 | Activity | 5,045 | 23.0 | 373 | 11.0 |
| 2 | PhysicalObject | 13,472 | 61.0 | 2,379 | 72.0 |
| 3 | Process | 728 | 3.0 | 118 | 4.0 |
| 4 | Property | 130 | 1.0 | 32 | 1.0 |
| 5 | State | 2,747 | 12.0 | 396 | 12.0 |
| | Total | 22,122 | 100 | 3,298 | 100 |

Relations:

| # | Relation Type | Total Count | Total % | Unique Count | Unique % |
|---|---|---|---|---|---|
| 1 | contains | 178 | 1.2 | 137 | 1.1 |
| 2 | hasPart | 3,873 | 25.5 | 3,290 | 25.5 |
| 3 | hasParticipant/hasAgent | 789 | 5.6 | 716 | 5.6 |
| 4 | hasParticipant/hasPatient | 7,761 | 52.3 | 6,745 | 52.3 |
| 5 | hasProperty | 123 | 0.9 | 116 | 0.9 |
| 6 | isA | 2,512 | 14.7 | 1,894 | 14.7 |
| | Total | 15,200 | 100 | 12,898 | 100 |
We've conducted experiments using token-classification (SpERT) and sequence-to-sequence (REBEL) models to enable automatic information extraction from MWO short texts. For comprehensive details about the models, their training methodologies, and steps to reproduce our experiments, kindly refer to the Models section in this repository.
The detailed results of the entity and relation extraction models are provided in the Results section. It includes per-class entity and relation F1 scores, both micro and macro, alongside other key evaluation metrics. These results are crucial in understanding the performance of the models in identifying entities and relationships within a given text corpus.
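For readers scoring their own predictions in the same spirit, the sketch below computes exact-match per-class, micro, and macro entity F1 over items in the corpus format described earlier. It is an illustration under the assumption of exact (start, end, type) matching, not the evaluation script behind the reported results.

```python
from collections import defaultdict


def entity_f1(gold_items, pred_items):
    """Exact-match entity F1: per-class scores plus micro and macro averages."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in zip(gold_items, pred_items):
        gold_set = {(e["start"], e["end"], e["type"]) for e in gold["entities"]}
        pred_set = {(e["start"], e["end"], e["type"]) for e in pred["entities"]}
        for _, _, label in pred_set & gold_set:
            tp[label] += 1
        for _, _, label in pred_set - gold_set:
            fp[label] += 1
        for _, _, label in gold_set - pred_set:
            fn[label] += 1

    def f1(t, p, n):
        precision = t / (t + p) if t + p else 0.0
        recall = t / (t + n) if t + n else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    per_class = {label: f1(tp[label], fp[label], fn[label])
                 for label in set(tp) | set(fp) | set(fn)}
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(per_class.values()) / len(per_class) if per_class else 0.0
    return per_class, micro, macro
```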
This project is released under the MIT License. See the LICENSE file for detailed licensing information.
Feedback and contributions are always appreciated. If you encounter discrepancies in the corpora or see opportunities for model enhancement, please submit a pull request for review. If you have any questions or need clarification about the contents of this repository, please reach out to us.
For any specific inquiries or discussions, kindly get in touch:
If you use MaintIE, please cite us!
    @inproceedings{bikaun-etal-2024-maintie-fine,
        title = "{M}aint{IE}: A Fine-Grained Annotation Schema and Benchmark for Information Extraction from Maintenance Short Texts",
        author = "Bikaun, Tyler K. and
          French, Tim and
          Stewart, Michael and
          Liu, Wei and
          Hodkiewicz, Melinda",
        editor = "Calzolari, Nicoletta and
          Kan, Min-Yen and
          Hoste, Veronique and
          Lenci, Alessandro and
          Sakti, Sakriani and
          Xue, Nianwen",
        booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
        month = may,
        year = "2024",
        address = "Torino, Italia",
        publisher = "ELRA and ICCL",
        url = "https://aclanthology.org/2024.lrec-main.954",
        pages = "10939--10951",
    }