experiment_replication

Loading the data.

Unfortunately, PHI laws and agreements make distributing clinical datasets difficult. To ease the friction required to replicate our experiments, we provide a preprocessing script that takes as input the un-zipped dataset downloads from the data distributor and caches the tensors required to train our multitasking models.

Fortunately, two datasets are not covered by any PHI license and so are included in the package:

  1. quaero-2014
  2. med-rqe

To run preprocess_datasets.py, you will need to set the directory of your installed BERT model (or, alternatively, a transformers-library-compatible model alias). Set this by updating bert_weight_directory in preprocess_datasets.py. Running preprocess_datasets.py with the included datasets will work out of the box, and you will be able to multitask train on the two tasks (quaero_2014 and med_rqe).
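
For reference, the edit could look something like the snippet below; the exact context around the variable inside preprocess_datasets.py is assumed here, and the Bio_ClinicalBERT alias is only one example of a transformers-compatible choice.

# Inside preprocess_datasets.py: point this at a local model directory,
# or at any model alias the transformers library can resolve.
bert_weight_directory = "/path/to/your/bert_model"  # e.g. "emilyalsentzer/Bio_ClinicalBERT"

After saving that change, running python preprocess_datasets.py will cache the tensors for the included tasks.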

To fully replicate the MT-Clinical-BERT training, you must download the remaining datasets yourself. Med-NLI can be found here: med-nli, and the rest of the datasets can be found at the DBMI Data Portal.

Datasets should be un-zipped into the appropriate sub-directory of raw_datasets. Once they are un-zipped in the correct directories, preprocess_datasets.py will handle the rest, with the exception of the i2b2-2010 dataset. For i2b2-2010, combine the extracted raw text files with the respective 'concept' sub-folders for both the training and testing data, then run the included con_to_brat.py script. You will need to run it three times: once for each training sub-set and once for the testing set.

python con_to_brat.py ner/i2b2_2010/concept_assertion_relation_training_data/beth/concept ner/i2b2_2010/concept_assertion_relation_training_data/beth/concept
python con_to_brat.py ner/i2b2_2010/concept_assertion_relation_training_data/partners/concept ner/i2b2_2010/concept_assertion_relation_training_data/partners/concept
python con_to_brat.py ner/i2b2_2010/reference_standard_for_test_data/concepts/ ner/i2b2_2010/reference_standard_for_test_data/concepts/
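
The "combine" step above just means each raw note needs to sit next to its .con annotations so the converter can pair them. Below is a minimal sketch of one way to do that for the training sets; the txt sub-folder name is an assumption based on the usual i2b2-2010 extraction layout, so adjust the paths to whatever your archives actually contain.

import shutil
from pathlib import Path

# Copy each raw note (.txt) into the matching concept folder so that
# con_to_brat.py sees the text and the .con annotations side by side.
training_root = Path("ner/i2b2_2010/concept_assertion_relation_training_data")
for site in ("beth", "partners"):
    txt_dir = training_root / site / "txt"        # assumed location of the raw notes
    concept_dir = training_root / site / "concept"
    for note in txt_dir.glob("*.txt"):
        shutil.copy(note, concept_dir / note.name)

The test notes under test_data can be combined with reference_standard_for_test_data/concepts in the same way.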

For reference, your raw_datasets directory should look roughly like this:

├── __init__.py
├── language.py
├── ner
│   ├── data.py
│   ├── i2b2_2010
│   │   ├── concept_assertion_relation_training_data
│   │   ├── concept_assertion_relation_training_data.tar.gz
│   │   ├── reference_standard_for_test_data
│   │   ├── reference_standard_for_test_data.tar.gz
│   │   ├── test_data
│   │   └── test_data.tar.gz
│   ├── i2b2_2012
│   │   ├── 2012-07-15.original-annotation.release
│   │   ├── 2012-07-15.original-annotation.release.tar.gz
│   │   ├── 2012-08-23.test-data.groundtruth.tar.gz
│   │   └── ground_truth
│   ├── i2b2_2014
│   │   ├── 2014_training-PHI-Gold-Set1.tar.gz
│   │   ├── testing-PHI-Gold-fixed
│   │   ├── testing-PHI-Gold-fixed.tar.gz
│   │   ├── training-PHI-Gold-Set1
│   │   ├── training-PHI-Gold-Set2
│   │   └── training-PHI-Gold-Set2.tar.gz
│   ├── __init__.py
│   ├── n2c2_2018
│   │   ├── gold-standard-test-data
│   │   ├── gold-standard-test-data.zip
│   │   ├── training_20180910
│   │   └── training_20180910.zip
│   └── quaero_frenchmed_2014
│       ├── test
│       └── train
├── nli
│   ├── data.py
│   ├── __init__.py
│   ├── med_nli
│   │   ├── mli_test_v1.jsonl
│   │   └── mli_train_v1.jsonl
│   └── med_rqe
│       ├── Readme.txt
│       ├── RQE_test_pairs_AMIA2016.xml
│       └── RQE_train_pairs_AMIA2016.xml
└── similarity
    ├── data.py
    ├── __init__.py
    ├── n2c2_2019
    │   ├── clinicalSTS2019.test.labels.txt
    │   ├── clinicalSTS2019.test.txt
    │   └── clinicalSTS2019.train.txt
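
As a quick sanity check before preprocessing, something along these lines will walk the layout above and flag anything missing; the directory list is taken straight from the tree, but the helper itself is illustrative and not part of the package.

from pathlib import Path

# Expected dataset directories, taken from the tree above.
expected = [
    "ner/i2b2_2010/concept_assertion_relation_training_data",
    "ner/i2b2_2010/reference_standard_for_test_data",
    "ner/i2b2_2012/2012-07-15.original-annotation.release",
    "ner/i2b2_2014/training-PHI-Gold-Set1",
    "ner/n2c2_2018/training_20180910",
    "ner/quaero_frenchmed_2014/train",
    "nli/med_nli",
    "nli/med_rqe",
    "similarity/n2c2_2019",
]

raw = Path("raw_datasets")
for rel in expected:
    status = "ok" if (raw / rel).is_dir() else "MISSING"
    print(f"{status:8s} {rel}")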