Language Model Finetuning on Token Classification Task




We provide a finetuning script (./scripts/downstream/train_token_classification_lm_finetuning.py) to finetune our pretrained language model on token classification tasks, namely Named-entity recognition (NER) and Part-of-Speech (POS) tagging, on the thainer and lst20 datasets.

The arguments for train_token_classification_lm_finetuning.py are as follows:


Arguments:

  • --model_name_or_path :

    The pretrained model checkpoint used for weight initialization.

    Otherwise, specify the name of another public language model (currently, mbert and xlmr are supported).

  • --tokenizer_name_or_path :

    The directory containing the tokenizer's vocabulary files. Otherwise, specify the name of the public model whose tokenizer should be loaded.

  • --dataset_name :

    The name of the token classification dataset to finetune on (e.g. thainer, lst20).

  • --label_name :

    The target labels of the token classification dataset: ner_tags for Named-entity tagging or pos_tags for Part-of-Speech tagging.

  • --tokenizer_type :

    The type of tokenizer: ThaiRobertaTokenizer, ThaiWordsNewmmTokenizer, ThaiWordsSyllableTokenizer,

    FakeSefrCutTokenizer, or CamembertTokenizer (for roberthai-95g-spm).

    Otherwise, use AutoTokenizer for public models.

  • --output_dir :

    The directory to store the finetuned model checkpoints.

  • --lst20_data_dir :

    The directory of the LST20 dataset, as lst20 must be downloaded manually.

  • --per_device_train_batch_size : The training batch size per device

  • --per_device_eval_batch_size : The evaluation batch size per device

  • --space_token : The custom token that replaces the space token in the texts, as some models use a custom space token (default: "<_>"). For mbert and xlmr, specify the space token as " " (see the sketch after this list).

  • --max_length: The maximum length of text inputs passed to the model. The max length should not exceed the maximum positional embeddings or the maximum sequence length that the language model was pretrained on.

  • --num_train_epochs: Number of epochs to finetune model (default: 5)

  • --learning_rate: The value of peak learning rate (default: 1e-05)

  • --weight_decay : The value of weight decay (default: 0.01)

  • --warmup_steps: The number of steps to warmup learning rate (default: 0)

  • --no_cuda: Append "--no_cuda" to use only CPUs during finetuning (default: False)

  • --fp16: Append "--fp16" to use FP16 mixed-precision training (default: False)

  • --metric_for_best_model: The metric to select the best model based on validation set (default: loss)

  • --greater_is_better: Whether the best model is the one with the greater or the lower value of the specified metric (default: False if the metric_for_best_model is not "loss")

  • --logging_steps : The interval of training steps at which to perform logging (default: 10)

  • --seed : The seed value (default: 2020)

  • --fp16_opt_level : The optimization level (opt_level) for FP16 mixed-precision training (default: O1)

  • --gradient_accumulation_steps : The number of steps to accumulate gradients (default: 1, no gradient accumulation)

  • --adam_epsilon : Value of Adam epsilon (default: 1e-05)

  • --max_grad_norm : The maximum gradient norm, used for gradient clipping (default: 1.0)

  • --lowercase : Append "--lowercase" to convert all input texts to lowercase, as some models may only support uncased text (default: False)

  • --run_name : The run name used when logging the experiment to wandb.com (default: False)
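
To make the effect of --space_token and --lowercase concrete, here is a minimal, illustrative sketch of the kind of substitution these flags imply for the pre-tokenized inputs of token classification datasets. This is not the script's actual preprocessing code, and preprocess_tokens is a hypothetical helper.

```python
# Illustrative sketch only; not the script's actual preprocessing code.
def preprocess_tokens(tokens, space_token="<_>", lowercase=False):
    # Hypothetical helper: replace space tokens with the model-specific
    # space token and optionally lowercase the remaining tokens.
    processed = []
    for tok in tokens:
        if tok == " ":
            processed.append(space_token)  # e.g. "<th_roberta_space_token>" for roberthai-95g-spm
        else:
            processed.append(tok.lower() if lowercase else tok)
    return processed

print(preprocess_tokens(["NASA", " ", "สำรวจ", "ดาวอังคาร"],
                        space_token="<th_roberta_space_token>", lowercase=True))
# ['nasa', '<th_roberta_space_token>', 'สำรวจ', 'ดาวอังคาร']
```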


Example


  1. Finetuning the roberthai-95g-spm model on the NER tagging task of the thainer dataset.

    The following script will finetune the roberthai-95g-spm pretrained model from checkpoint-320000.

    The script will finetune the model with FP16 mixed-precision training on 2 GPUs (IDs: 1,2). The train and validation batch size is 16 with no gradient accumulation. The model will be evaluated on the validation set every 250 steps. During finetuning, the learning rate is warmed up linearly to 5e-05 over 100 steps, then decays linearly to zero. The maximum sequence length passed to the model is 510 tokens (as produced by the specified tokenizer); longer sequences are truncated to max_length. Note that --lowercase is appended to the argument list, as roberthai-95g-spm only supports uncased (all-lowercase) text. The space token is set to "<th_roberta_space_token>", as this is the token the model uses for spaces.

    cd ./scripts/downstream
    CUDA_VISIBLE_DEVICES=1,2 python train_token_classification_lm_finetuning.py \
    --tokenizer_type CamembertTokenizer \
    --tokenizer_name_or_path /workspace/checkpoints/roberthai-95g-spm/tokenizer_folder \
    --model_name_or_path /workspace/checkpoints/roberthai-95g-spm/model/checkpoint-320000 \
    --dataset_name thainer \
    --label_name ner_tags \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --learning_rate 5e-5 \
    --warmup_steps 100  \
    --logging_steps 50 \
    --eval_steps 250 \
    --max_steps 1000  \
    --evaluation_strategy steps \
    --output_dir /workspacex/checkpoints/roberthai-95g-spm/finetuned/thainer/ner/v1 \
    --do_train \
    --do_eval \
    --max_length 510 \
    --fp16 \
    --space_token "<th_roberta_space_token>" \
    --lowercase
    
    Log output:
    
    01/15/2021 09:45:09 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 2distributed training: False, 16-bits training: True
    01/15/2021 09:45:09 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='/workspacex/checkpoints/roberthai-95g-spm/finetuned/thainer/ner/v1', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=False, evaluation_strategy=<EvaluationStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=16, per_device_eval_batch_size=16, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=1000, warmup_steps=100, logging_dir='runs/Jan15_09-45-09_IST-DGX01', logging_first_step=False, logging_steps=50, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=250, dataloader_num_workers=0, past_index=-1, run_name='/workspacex/checkpoints/roberthai-95g-spm/finetuned/thainer/ner/v1', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None)
    01/15/2021 09:45:09 - INFO - __main__ -   Data parameters DataTrainingArguments(dataset_name='thainer', label_name='ner_tags', max_length=510)
    01/15/2021 09:45:09 - INFO - __main__ -   Model parameters ModelArguments(model_name_or_path='/workspace/checkpoints/roberthai-95g-spm/model/checkpoint-320000', tokenizer_name_or_path='/workspace/checkpoints/roberthai-95g-spm/tokenizer_folder', tokenizer_type='CamembertTokenizer')
    01/15/2021 09:45:09 - INFO - __main__ -   Custom args CustomArguments(no_train_report=False, no_eval_report=False, no_test_report=False, lst20_data_dir=None, space_token='<th_roberta_space_token>', lowercase=True)
    Model name '/workspace/checkpoints/roberthai-95g-spm/tokenizer_folder' not found in model shortcut name list (camembert-base). Assuming '/workspace/checkpoints/roberthai-95g-spm/tokenizer_folder' is a path, a model identifier, or url to a directory containing tokenizer files.
    Didn't find file /workspace/checkpoints/roberthai-95g-spm/tokenizer_folder/added_tokens.json. We won't load it.
    Didn't find file /workspace/checkpoints/roberthai-95g-spm/tokenizer_folder/special_tokens_map.json. We won't load it.
    Didn't find file /workspace/checkpoints/roberthai-95g-spm/tokenizer_folder/tokenizer_config.json. We won't load it.
    Didn't find file /workspace/checkpoints/roberthai-95g-spm/tokenizer_folder/tokenizer.json. We won't load it.
    loading file /workspace/checkpoints/roberthai-95g-spm/tokenizer_folder/sentencepiece.bpe.model
    loading file None
    loading file None
    loading file None
    loading file None
    01/15/2021 09:45:09 - INFO - __main__ -   [INFO] space_token = `<th_roberta_space_token>`
    Reusing dataset thainer (/root/.cache/huggingface/datasets/thainer/thainer/1.3.0/e0a86672e5ad057c1093708597cdda3671a76e9b053d210a32205406726cca92)
    Loading cached processed dataset at /root/.cache/huggingface/datasets/thainer/thainer/1.3.0/e0a86672e5ad057c1093708597cdda3671a76e9b053d210a32205406726cca92/cache-fac20625c90fe862.arrow
    Loading cached split indices for dataset at /root/.cache/huggingface/datasets/thainer/thainer/1.3.0/e0a86672e5ad057c1093708597cdda3671a76e9b053d210a32205406726cca92/cache-e1c5648ecd5c184a.arrow and /root/.cache/huggingface/datasets/thainer/thainer/1.3.0/e0a86672e5ad057c1093708597cdda3671a76e9b053d210a32205406726cca92/cache-cf0c77b9ce362f6d.arrow
    Loading cached split indices for dataset at /root/.cache/huggingface/datasets/thainer/thainer/1.3.0/e0a86672e5ad057c1093708597cdda3671a76e9b053d210a32205406726cca92/cache-e1f36698c1dabb82.arrow and /root/.cache/huggingface/datasets/thainer/thainer/1.3.0/e0a86672e5ad057c1093708597cdda3671a76e9b053d210a32205406726cca92/cache-0132859955c1ebe7.arrow
    Loading cached split indices for dataset at /root/.cache/huggingface/datasets/thainer/thainer/1.3.0/e0a86672e5ad057c1093708597cdda3671a76e9b053d210a32205406726cca92/cache-6556fccfbcd0cbf4.arrow and /root/.cache/huggingface/datasets/thainer/thainer/1.3.0/e0a86672e5ad057c1093708597cdda3671a76e9b053d210a32205406726cca92/cache-eb99b34850b9ceb8.arrow
    loading configuration file /workspace/checkpoints/roberthai-95g-spm/model/checkpoint-320000/config.json
    Model config RobertaConfig {
    "architectures": [
        "RobertaForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "bos_token_id": 0,
    "eos_token_id": 2,
    "gradient_checkpointing": false,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1",
        "2": "LABEL_2",
        "3": "LABEL_3",
        "4": "LABEL_4",
        "5": "LABEL_5",
        "6": "LABEL_6",
        "7": "LABEL_7",
        "8": "LABEL_8",
        "9": "LABEL_9",
        "10": "LABEL_10",
        "11": "LABEL_11",
        "12": "LABEL_12",
        "13": "LABEL_13",
        "14": "LABEL_14",
        "15": "LABEL_15",
        "16": "LABEL_16",
        "17": "LABEL_17",
        "18": "LABEL_18",
        "19": "LABEL_19",
        "20": "LABEL_20",
        "21": "LABEL_21",
        "22": "LABEL_22",
        "23": "LABEL_23",
        "24": "LABEL_24",
        "25": "LABEL_25",
        "26": "LABEL_26",
        "27": "LABEL_27"
    },
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1,
        "LABEL_10": 10,
        "LABEL_11": 11,
        "LABEL_12": 12,
        "LABEL_13": 13,
        "LABEL_14": 14,
        "LABEL_15": 15,
        "LABEL_16": 16,
        "LABEL_17": 17,
        "LABEL_18": 18,
        "LABEL_19": 19,
        "LABEL_2": 2,
        "LABEL_20": 20,
        "LABEL_21": 21,
        "LABEL_22": 22,
        "LABEL_23": 23,
        "LABEL_24": 24,
        "LABEL_25": 25,
        "LABEL_26": 26,
        "LABEL_27": 27,
        "LABEL_3": 3,
        "LABEL_4": 4,
        "LABEL_5": 5,
        "LABEL_6": 6,
        "LABEL_7": 7,
        "LABEL_8": 8,
        "LABEL_9": 9
    },
    "layer_norm_eps": 1e-12,
    "max_position_embeddings": 512,
    "model_type": "roberta",
    "num_attention_head": 12,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "pad_token_id": 1,
    "type_vocab_size": 1,
    "vocab_size": 25005
    }
    
    loading weights file /workspace/checkpoints/roberthai-95g-spm/model/checkpoint-320000/pytorch_model.bin
    Some weights of the model checkpoint at /workspace/checkpoints/roberthai-95g-spm/model/checkpoint-320000 were not used when initializing RobertaForTokenClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
    - This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
    - This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at /workspace/checkpoints/roberthai-95g-spm/model/checkpoint-320000 and are newly initialized: ['classifier.weight', 'classifier.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    max_steps is given, it will override any value given in num_train_epochs
    The following columns in the training set don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: old_positions.
    The following columns in the evaluation set don't have a corresponding argument in `RobertaForTokenClassification.forward` and have been ignored: old_positions.
    Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.
    
    Defaults for this optimization level are:
    enabled                : True
    opt_level              : O1
    cast_model_type        : None
    patch_torch_functions  : True
    keep_batchnorm_fp32    : None
    master_weights         : None
    loss_scale             : dynamic
    Processing user overrides (additional kwargs that are not None)...
    After processing overrides, optimization options are:
    enabled                : True
    opt_level              : O1
    cast_model_type        : None
    patch_torch_functions  : True
    keep_batchnorm_fp32    : None
    master_weights         : None
    loss_scale             : dynamic
    ***** Running training *****
    Num examples = 5077
    Num Epochs = 7
    Instantaneous batch size per device = 16
    Total train batch size (w. parallel, distributed & accumulation) = 32
    Gradient Accumulation steps = 1
    Total optimization steps = 1000
    Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
    wandb: Offline run mode, not syncing to the cloud.
    wandb: W&B syncing is set to `offline` in this directory.  Run `wandb online` to enable cloud syncing.
    0%|          | 0/1000 [00:00<?, ?it/s]
    0%|          | 1/1000 [01:26<23:55:50, 86.24s/it]Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
    0%|          | 4/1000 [01:27<8:14:06, 29.77s/it] Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
    1%|          | 8/1000 [01:28<2:01:03,  7.32s/it]Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
    2%|▎         | 25/1000 [01:44<14:56,  1.09it/s]Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
    4%|▍         | 44/1000 [01:51<04:05,  3.89it/s]Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
    5%|▌         | 50/1000 [01:52<03:41,  4.29it/s]
    {'loss': 1.863900146484375, 'learning_rate': 2.5e-05, 'epoch': 0.31446540880503143}
    10%|█         | 100/1000 [02:18<38:19,  2.55s/it]
    {'loss': 0.46619827270507813, 'learning_rate': 5e-05, 'epoch': 0.6289308176100629}
    {'loss': 0.18838241577148437, 'learning_rate': 4.722222222222222e-05, 'epoch': 0.943396226}
    15%|█▌        | 150/1000 [02:31<03:53,  3.64it/s]
    20%|██        | 200/1000 [02:52<33:10,  2.49s/it]
    {'loss': 0.12642303466796875, 'learning_rate': 4.4444444444444447e-05, 'epoch': 1.25786163 
    20%|██        | 200/1000 [02:52<33:10,  2.49s/it]
    {'loss': 0.1191162109375, 'learning_rate': 4.166666666666667e-05, 'epoch': 1.5723270440251573}
    25%|██▌       | 250/1000 [03:05<03:03,  4.08it/s]
    
    ***** Running Evaluation *****
    Num examples = 635
    Batch size = 32
    
    01/15/2021 09:48:36 - INFO - /opt/conda/lib/python3.6/site-packages/datasets/metric.py -   Removing /root/.cache/huggingface/metrics/seqeval/default/default_experiment-1-0.arrow
    {'eval_loss': 0.10173556208610535, 'eval_precision': 0.8637927080944737, 'eval_recall': 0.8817883895131086, 'eval_f1': 0.8726977875593652, 'eval_accuracy': 0.9725851004174542, 'epoch': 1.5723270440251573}
    30%|███       | 300/1000 [03:27<16:32,  1.42s/it]{'loss': 0.10812286376953124, 'learning_rate': 3.888888888888889e-05, 'epoch': 1.8867924528301887}                                       
    34%|███▍      | 341/1000 [03:37<02:52,  3.82it/s]
    35%|███▌      | 350/1000 [03:40<03:06,  3.49it/s]
    40%|███▉      | 399/1000 [03:53<02:34,  3.89it/s]
    50%|█████     | 500/1000 [05:12<23:33,  2.83s/it]
    
    ***** Running Evaluation *****
    Num examples = 635
    Batch size = 32
    
    01/15/2021 09:50:43 - INFO - /opt/conda/lib/python3.6/site-packages/datasets/metric.py -   Removing /root/.cache/huggingface/metrics/seqeval/default/default_experiment-1-0.arrow
    {'eval_loss': 0.08590535074472427, 'eval_precision': 0.878561736770692, 'eval_recall': 0.9094101123595506, 'eval_f1': 0.8937198067632851, 'e 50%|█████     | 500/1000 [05:17<23:33,  2.83s
    /Saving model checkpoint to /workspacex/checkpoints/roberthai-95g-spm/finetuned/thainer/ner/v1/checkpoint-500                                
    
    Configuration saved in /workspacex/checkpoints/roberthai-95g-spm/finetuned/thainer/ner/v1/checkpoint-500/config.json
    Model weights saved in /workspacex/checkpoints/roberthai-95g-spm/finetuned/thainer/ner/v1/checkpoint-500/pytorch_model.bin
    /opt/conda/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
    warnings.warn('Was asked to gather along dimension 0, but all '
    53%|█████▎    | 526/1000 [05:32<02:03,  3.83it/s]
    {'loss': 0.0573553466796875, 'learning_rate': 2.5e-05, 'epoch': 3.459119496855346}
    {'loss': 0.05275115966796875, 'learning_rate': 2.2222222222222223e-05, 'epoch': 3.77358490 60%|██████    | 600/1000 [06:05<28:12,  4.23s/it]
    65%|██████▌   | 650/1000 [06:16<01:21,  4.30it/s]{'loss': 0.05139984130859375, 'learning_rate': 1.9444444444444445e-05, 'epoch': 4.08805031                                                  
    {'loss': 0.0428802490234375, 'learning_rate': 1.6666666666666667e-05, 'epoch': 4.40251572327044}
    71%|███████   | 710/1000 [06:43<01:43,  2.81it/s]
    ***** Running Evaluation *****
      Num examples = 635
    Batch size = 32
    
    {'eval_loss': 0.08580297976732254, 'eval_precision': 0.8899543378995434, 'eval_recall': 0.9124531835205992, 'eval_f1': 0.9010633379565418, 'eval_accuracy': 0.9763442646783942, 'epoch': 4.716981132075472}
    
                                                    {'loss': 0.037841796875, 'learning_rate': 1.1111111111111112e-05, 'epoch': 5.031446540880503}                                            
    85%|████████▌ | 850/1000 [07:27<00:40,  3.69it/s]3333333333334e-06, 'epoch': 5.345911949685535}
    90%|████████▉ | 899/1000 [07:38<00:25,  3.93it/s]
    95%|█████████▌| 950/1000 [08:09<00:12,  3.89it/s]
    100%|█████████▉| 999/1000 [08:21<00:00,  4.36it/s]
    
    
    Chunk-level per-class precision, recall, and F1-score on test set.
    
        Processed: 635 / 635 [ Test Result ]
    
        {
            'accuracy': 0.980321583662611,
            'f1_macro': 0.9132072525127524,
            'f1_micro': 0.8947951273532668,
            'nb_samples': 635,
            'precision_macro': 0.8956733587500255,
            'precision_micro': 0.8749323226854359,
            'recall_macro': 0.9329612501419587,
            'recall_micro': 0.9155807365439094
        }
    
                        precision    recall  f1-score   support
    
                 DATE     0.8955    0.9231    0.9091       195
                EMAIL     1.0000    1.0000    1.0000         1
                  LAW     0.8667    0.8667    0.8667        15
                  LEN     0.8095    0.9444    0.8718        18
             LOCATION     0.8384    0.8913    0.8641       460
                MONEY     0.9804    0.9804    0.9804        51
         ORGANIZATION     0.8731    0.9075    0.8900       584
              PERCENT     0.9333    0.8750    0.9032        16
               PERSON     0.9403    0.9708    0.9553       308
                PHONE     0.8462    0.9167    0.8800        12
                 TIME     0.7714    0.8526    0.8100        95
                  URL     0.8889    1.0000    0.9412         8
                  ZIP     1.0000    1.0000    1.0000         2
    
        micro avg         0.8749    0.9156    0.8948      1765
        macro avg         0.8957    0.9330    0.9132      1765
        weighted avg      0.8759    0.9156    0.8951      1765
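
After finetuning, the checkpoint saved under --output_dir can be loaded for inference with the standard transformers API. The following is a minimal sketch, not part of the repository, assuming the checkpoint and tokenizer paths from the example above; the predicted label names depend on what the finetuning script writes into the saved config, and inputs must be preprocessed with the same space token and lowercasing used during finetuning.

```python
# Minimal inference sketch (assumed paths from the example above; adjust to your setup).
import torch
from transformers import AutoModelForTokenClassification, CamembertTokenizer

ckpt_dir = "/workspacex/checkpoints/roberthai-95g-spm/finetuned/thainer/ner/v1/checkpoint-500"
tokenizer = CamembertTokenizer.from_pretrained("/workspace/checkpoints/roberthai-95g-spm/tokenizer_folder")
model = AutoModelForTokenClassification.from_pretrained(ckpt_dir)
model.eval()

# Input preprocessed as during finetuning: lowercased, spaces replaced by the custom space token.
text = "นายสมชาย<th_roberta_space_token>เดินทางไปเชียงใหม่"

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=510)
with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = logits.argmax(dim=-1)[0].tolist()
# id2label comes from the saved config; it may contain generic names (LABEL_0, ...)
# unless the finetuning script writes the dataset's tag names into the config.
labels = [model.config.id2label[i] for i in pred_ids]
print(list(zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), labels)))
```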