MUSC-TBIC/2010-i2b2-VA-Challenge-Corpus-Augmentations

This repository provides documentation to explain our augmentations of a subset of the 2010 i2b2/VA Challenge corpus to include any laboratory test values associated with an already annotated laboratory test name. These laboratory test values are also linked through annotation relations with their respective laboratory test name.

Due to the conditions of the data use agreement (DUA) of the original corpus, you are accessing these annotations via Harvard Medical School's Department of Biomedical Informatics (DBMI) data portal. This documentation for these annotations (but not the annotations themselves) is replicated in the following repository: https://github.com/MUSC-TBIC/2010-i2b2-VA-Challenge-Corpus-Augmentations

Annotation Process

Heider et al. [1] present further details on the annotation process. We began with the normalized concept annotations available in the 2019 n2c2 NLP challenge corpus [2]. These annotations included problems, tests, and treatments. We wanted only the annotations related to laboratory test names, so we first kept every annotation whose concept unique identifier (CUI) has one of the following semantic types:

Laboratory Procedure (T059)
Laboratory or Test Result (T034)

These annotations were automatically added to the laboratory test name list. We then used a second filter to find all annotations with a semantic type from the following list:

Antibiotic (T195)
Bacterium (T007)
Biologically Active Substance (T123)
Fungus (T004)
Immunologic Factor (T129)
Pharmacologic Substance (T121)
Virus (T005)
Vitamin (T127)

This second set of annotations was manually reviewed for inclusion (or exclusion).
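The two-stage filter described above can be sketched as follows. This is a minimal illustration rather than the code used to build the corpus: the `Annotation` record, the `triage` function, and the placeholder CUIs in the usage note are hypothetical, and the choice to let the automatic list take precedence when a CUI carries semantic types from both lists is an assumption.

```python
from dataclasses import dataclass

# Semantic types whose annotations are added to the list automatically.
AUTO_INCLUDE = {"T059", "T034"}
# Semantic types whose annotations are queued for manual review.
MANUAL_REVIEW = {"T195", "T007", "T123", "T004", "T129", "T121", "T005", "T127"}


@dataclass
class Annotation:
    """Hypothetical record: a text span, its CUI, and the CUI's semantic types."""
    text: str
    cui: str
    semantic_types: frozenset


def triage(annotations):
    """Split annotations into (auto-included, needs-manual-review) lists."""
    auto, review = [], []
    for ann in annotations:
        if ann.semantic_types & AUTO_INCLUDE:
            # Assumption: the automatic list wins if both lists match.
            auto.append(ann)
        elif ann.semantic_types & MANUAL_REVIEW:
            review.append(ann)
        # Everything else is not considered a laboratory test name candidate.
    return auto, review
```

For example, calling `triage` on an annotation typed T059 and another typed T195 places the first in the automatic list and the second in the review list; the CUIs passed in (for instance `"CUI-A"`) can be any placeholder strings for the purpose of the sketch.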

Between these two lists, we had a subset of all original annotations that related to laboratory test names (and their CUIs). The text surrounding each of these annotations was automatically searched for potential laboratory test values. The first search criterion to succeed supplied the primary laboratory test value to associate with a given laboratory test name. The three search criteria were:

the nearest numeric expression (e.g., '10,000', '30%') up to 50 characters after the concept
the nearest numeric expression up to 10 characters before the concept
the closest categorical value (e.g., 'positive' or 'normal') up to 50 characters away from the concept
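The three criteria can be sketched as a single lookup that tries each in turn. This is an illustrative approximation, not the original search code: the function name `find_value`, the `NUMERIC` pattern, and the short `CATEGORICAL` word list are assumptions, as is measuring the window from the edges of the concept span.

```python
import re

# Assumed pattern for numeric expressions such as '10,000', '30%', or '4.2'.
NUMERIC = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")
# Assumed, non-exhaustive list of categorical values.
CATEGORICAL = re.compile(r"\b(?:positive|negative|normal|abnormal)\b", re.IGNORECASE)


def find_value(text, start, end, after=50, before=10, around=50):
    """Return (value_start, value_end, value_text) for the first criterion
    that succeeds for the concept at text[start:end], or None."""
    numbers = list(NUMERIC.finditer(text))

    # Criterion 1: nearest numeric expression starting within `after`
    # characters after the concept.
    following = [m for m in numbers if end <= m.start() < end + after]
    if following:
        m = following[0]
        return m.start(), m.end(), m.group()

    # Criterion 2: nearest numeric expression lying within `before`
    # characters before the concept.
    preceding = [m for m in numbers
                 if m.end() <= start and m.start() >= start - before]
    if preceding:
        m = preceding[-1]  # the last match is the closest to the concept
        return m.start(), m.end(), m.group()

    # Criterion 3: closest categorical value within `around` characters
    # on either side of the concept.
    def distance(m):
        # Gap in characters between the match and the concept span.
        return m.start() - end if m.start() >= end else start - m.end()

    nearby = [m for m in CATEGORICAL.finditer(text)
              if (m.start() >= end or m.end() <= start) and distance(m) < around]
    if nearby:
        m = min(nearby, key=distance)
        return m.start(), m.end(), m.group()
    return None
```

Given the text `"WBC 10,000 and Hgb was normal"`, the concept `WBC` (offsets 0 to 3) is paired with `'10,000'` by the first criterion, while `Hgb` (offsets 15 to 18) has no number within range and falls through to the categorical value `'normal'`. A heuristic of this kind will make mistakes, which is why the manual verification step matters.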

These automatically discovered laboratory test values (and their relation arcs back to laboratory test names) were manually verified using WebAnno. Any missed laboratory test values were added. Any incorrectly linked laboratory test values were removed.

A single individual did all of the manual verification. As such, no inter-annotator agreement can be reported.

Directory Structure

The abstract for Heider et al. [1] is included in the docs folder. (The two other publications, which document the corpus creation process of the antecedent versions, can be retrieved via the provided links.)

The notes (and annotations) are divided 50/50 into train and test sets, following Luo et al.'s [2] corpus. The original notes (in plain text format) are under the *_note folders. The brat .ann formatted annotation files are in the *_ann folders.

The configuration files used for manual verification in WebAnno are provided in the webanno directory.

.
|-- docs
|-- test
|   |-- test_ann
|   `-- test_note
|-- train
|   |-- train_ann
|   `-- train_note
`-- webanno
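The .ann files in the *_ann folders use brat's standoff format, in which text-bound annotations (IDs starting with `T`) and binary relations (IDs starting with `R`) appear as tab-separated lines. A minimal reader might look like the sketch below. The function name `parse_ann` is hypothetical, and the labels `LabName`, `LabValue`, and `HasValue` in the usage note are placeholders, since the actual label names are defined by the corpus and its WebAnno configuration.

```python
def parse_ann(ann_text):
    """Parse brat standoff text into (entities, relations).

    entities maps an ID such as 'T1' to its type, character spans, and text.
    relations is a list of dicts holding the relation type and its arguments.
    Other annotation kinds (attributes, notes, and so on) are skipped.
    """
    entities, relations = {}, []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # Text-bound annotation: ID <TAB> TYPE START END <TAB> TEXT
            label, _, offsets = fields[1].partition(" ")
            # Discontinuous spans are separated by ';'.
            spans = [tuple(int(n) for n in part.split())
                     for part in offsets.split(";")]
            entities[ann_id] = {
                "type": label,
                "spans": spans,
                "text": fields[2] if len(fields) > 2 else "",
            }
        elif ann_id.startswith("R"):
            # Relation: ID <TAB> TYPE Arg1:T1 Arg2:T2
            label, *args = fields[1].split()
            roles = dict(arg.split(":", 1) for arg in args)
            relations.append({"type": label, **roles})
    return entities, relations
```

Feeding it the three lines `T1<TAB>LabName 0 3<TAB>WBC`, `T2<TAB>LabValue 4 10<TAB>10,000`, and `R1<TAB>HasValue Arg1:T1 Arg2:T2` yields two entities and one relation linking `T1` to `T2`. To process the corpus, the same function can be applied to the contents of each file under train/train_ann or test/test_ann.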

Links

Documentation repository (documentation only, not the annotations themselves): https://github.com/MUSC-TBIC/2010-i2b2-VA-Challenge-Corpus-Augmentations
Supplemental script repository (within the n2c2 subdirectory): https://github.com/MUSC-TBIC/corpus-utils
Original challenge overview page: 2019 n2c2 Shared-Task and Workshop Track 3: n2c2/UMass Track on Clinical Concept Normalization
Data page: n2c2 2019 - Track 3: Clinical Concept Normalization

References

1. Heider PM, Kim Y, Meystre SM. Semi-Automated Corpus Augmentation Methods for Enriching Laboratory Test Names with Value Annotations. AMIA Informatics Summit. 2021.

2. Luo YF, Sun W, Rumshisky A. MCN: A Comprehensive Corpus for Medical Concept Normalization. Journal of Biomedical Informatics. 2019 Feb 22:103132.

3. Uzuner O, South BR, Shen S, DuVall SL. 2010 i2b2/VA Challenge on Concepts, Assertions, and Relations in Clinical Text. Journal of the American Medical Informatics Association. 2011 Sep;18(5):552-556.
