COVID-Bert4Classification

Use BERT to solve a multi-label text classification problem.

Architecture

The model is a BERT encoder followed by a linear classifier, trained with the BP-MLL loss function.

Before using the model

Set up the environment

# This script will automatically download all the packages and models needed.
python set_env.py

Train the model

Once the pretrained BERT model has been downloaded, run:

make
tail -f logger.log  # read the log file

Evaluate the model

make evaluate
cd result  # the HTML results are written to result/

Predict labels of the texts

make predict  # just for test

To use in a Python script, use the code below

>>> from predict import Prediction  # assumes the session is started in the repository root

>>> text = "Some text here"
>>> Prediction.predict(text)
{'Treatment': Label('has_label'=True, 'prob'=0.98), ...}

>>> texts = ["many", "texts", "here"]
>>> Prediction.predict(texts)
[{'Treatment': Label('has_label'=True, 'prob'=0.98), ...}, ...]

Add entries to MongoDB

python insert_to_db.py

Clear the models in model/ and the results in result/

make clean

Explanation of the parameters in config/config.json

IO

Where the trained model is saved and loaded; used by the load function in model.py

  • model_dir: path to the directory where the trained model (not the pretrained model) is saved

HyperParam

Hyperparameters that control the training process, used in train.py

  • batch_size: the number of samples fed into the model at a time (should not be too large, or the model will need too much memory)
  • lr: learning rate (typical values are 5e-5, 3e-5 or 1e-5)
  • epoch: number of passes over the whole training set (depends on the size of the training set)
  • accumulation_step: number of batches over which gradients are accumulated before each optimizer step; this approximates training with a larger batch (the effective batch size is approximately batch_size * accumulation_step)
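The gradient accumulation idea can be illustrated with a small, framework-free sketch. This is not the code in train.py: the toy model (a one-parameter least-squares fit), the sample data and the function names are all assumptions made for illustration. It checks that averaging the gradients of accumulation_step micro-batches gives the same gradient as one batch of size batch_size * accumulation_step.

```python
# Toy illustration of gradient accumulation (not the repository's train.py).
# Model: y_hat = w * x, loss: mean squared error over the batch.

def mse_grad(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, batch_size, accumulation_step):
    """Average the gradients of `accumulation_step` equally sized micro-batches."""
    grad = 0.0
    for i in range(accumulation_step):
        micro = data[i * batch_size:(i + 1) * batch_size]
        # Scaling by 1 / accumulation_step makes the accumulated value equal
        # to the gradient of the mean loss over the combined large batch.
        grad += mse_grad(w, micro) / accumulation_step
    return grad

data = [(float(x), 3.0 * x + 1.0) for x in range(8)]  # hypothetical samples
w = 0.5
large = mse_grad(w, data)  # one batch of 8 samples
small = accumulated_grad(w, data, batch_size=2, accumulation_step=4)
print(abs(large - small) < 1e-9)
```

In a real training loop the same effect is usually obtained by dividing the loss by accumulation_step before the backward pass and stepping the optimizer only every accumulation_step batches.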

Loss

Parameters for the loss function (here we use the BP-MLL loss function), used for initializing the bp_mll loss function in bp_mll.py

  • bias: the weight of positive label and negative label (refer to this paper)
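For orientation, the BP-MLL loss in its commonly cited form (Zhang and Zhou, 2006) is, for one sample, the average of exp(-(c_k - c_l)) over all pairs of a positive label k and a negative label l. The sketch below implements that textbook form in plain Python. It is not the code in bp_mll.py, and it deliberately leaves out the bias parameter, whose exact role is not spelled out here; the function name and the handling of the degenerate case are assumptions.

```python
import math

def bp_mll_loss(outputs, targets):
    """BP-MLL pairwise ranking loss for a single sample.

    outputs: list of real-valued scores, one per label
    targets: list of 0/1 ground-truth flags, same length as outputs
    """
    pos = [c for c, t in zip(outputs, targets) if t == 1]
    neg = [c for c, t in zip(outputs, targets) if t == 0]
    if not pos or not neg:
        # The pairwise loss is undefined without at least one positive and
        # one negative label; returning 0.0 is an assumption of this sketch.
        return 0.0
    total = sum(math.exp(-(ck - cl)) for ck in pos for cl in neg)
    return total / (len(pos) * len(neg))

good = bp_mll_loss([0.9, 0.1, 0.8], [1, 0, 1])  # positives scored above the negative
bad = bp_mll_loss([0.1, 0.9, 0.2], [1, 0, 1])   # positives scored below the negative
print(good < bad)
```

The loss is small when every positive label is scored above every negative label, which is why it suits multi-label ranking.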

Network

Parameters used for initializing the neural network, used in model.py

  • pretrained_model: path to the directory containing the pretrained BERT model (here we use SciBERT by Allen AI)
  • hidden_size: the hidden size of the pretrained model (768 here)
  • dropout_prob: the probability that the dropout layer drops an element of the input tensor
  • label_num: the total number of labels

Dataset

Used in dataset.py for loading the dataset from file

  • tokenizer_path: path to the tokenizer; for a pretrained model, keep it the same as IO.model_dir
  • dataset_path: path to the JSON file where the annotated data is stored
  • text_key: key of the text for each entry
  • label_key: key of the labels for each entry
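As a rough illustration of how dataset_path, text_key and label_key fit together, the sketch below reads a JSON file and pulls the text and labels out of each entry. It is not the code in dataset.py. It assumes the file holds a JSON array of objects, and the key names ("abstract", "tags") and sample entries are hypothetical.

```python
import json
import os
import tempfile

def load_entries(dataset_path, text_key, label_key):
    """Return (texts, labels) read from a JSON file holding a list of entries."""
    with open(dataset_path, encoding="utf-8") as f:
        entries = json.load(f)
    texts = [entry[text_key] for entry in entries]
    labels = [entry[label_key] for entry in entries]
    return texts, labels

# Hypothetical data: the key names and label values are made up for this example.
sample = [
    {"abstract": "Some text here", "tags": ["Treatment"]},
    {"abstract": "Another text", "tags": []},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    texts, labels = load_entries(path, text_key="abstract", label_key="tags")

print(texts)
print(labels)
```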

Predict

Used for evaluation and prediction

  • position_threshold: minimum output probability for a label to be predicted as positive (predicted_label = output_probability > position_threshold ? 1 : 0)
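The thresholding rule can be written out in a few lines of Python. This is a minimal sketch rather than the repository's prediction code; the function name, label names and probabilities are hypothetical.

```python
def apply_threshold(probabilities, position_threshold):
    """Map each label's probability to 1 if it exceeds the threshold, else 0."""
    return {
        label: int(prob > position_threshold)
        for label, prob in probabilities.items()
    }

# Hypothetical label names and probabilities.
probs = {"Treatment": 0.98, "Diagnosis": 0.40, "Prevention": 0.50}
print(apply_threshold(probs, position_threshold=0.5))
# → {'Treatment': 1, 'Diagnosis': 0, 'Prevention': 0}
```

Because the comparison is strict, a probability exactly equal to position_threshold is treated as negative.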

TODO

  • Can only reload the model on a single GPU server.
  • Unable to load the optimizer for continuing training.
  • Better performance for training on multiple GPUs.