Discontinuous Constituent Parsing as Sequence Labeling - EMNLP 2020 repository
Requirements:
- Ubuntu 18.04
- discodop
- Python 3.6+
- nltk 3.4.5
- pytorch 1.2.0
- transformers 2.5.1
- scikit-learn 0.21.3
Create a virtual environment:
virtualenv --python python3.6 $HOME/env/disco2labels
Activate the virtual environment:
source $HOME/env/disco2labels/bin/activate
To install the dependencies:
pip install -r requirements.txt
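For reference, a requirements.txt consistent with the versions listed above would look roughly like the sketch below (the repository ships its own file; note the PyPI package name for pytorch 1.2.0 is torch):
nltk==3.4.5
torch==1.2.0
transformers==2.5.1
scikit-learn==0.21.3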
To install discodop, follow these instructions (discodop is used to evaluate the models and, optionally, for model selection).
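Once discodop is installed, evaluating a predicted discbracket file against the gold one looks roughly like this (a sketch: the predicted-file path is hypothetical, and proper.prm is the evaluation parameter file used throughout this repository):
discodop eval data/negra/dev.discbracket /tmp/dev_predicted.discbracket proper.prm --fmt=discbracket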
To install the resources used in this paper (e.g. embeddings, PoS taggers or templates for training configurations), execute:
sh download.sh
We also release a few pretrained parsing models; check the pretrained parsing models section.
To encode a treebank into labels, run encode.py. NOTE: currently, only the discbracket format is supported as the input format for the conversion.
cd disco2labels
python encode.py \
--train data/negra/train.discbracket \
--dev data/negra/dev.discbracket \
--test data/negra/test.discbracket \
--output data/negra_sl/pos-pointer/ \
--root_label \
--os \
--disc \
--split_char '{}' \
--disco_encoder pos-pointer \
--check_decode
The output will be three files stored in the previously created directory data/negra_sl/pos-pointer/: train.tsv, dev.tsv and test.tsv. Each file has three columns: word, PoS tag and label. This format is also used to train and run the models.
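For illustration, a fragment of one of these files would look like the sketch below; in the real files the columns are tab-separated, and the label strings here are placeholders rather than actual encoder output:
Das      ART     label_1
kleine   ADJA    label_2
Haus     NN      label_3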
The options to encode a treebank (--disco_encoder) are abs-idx|rel-idx|lehmer|lehmer-inverse|pos-pointer|pos-pointer-reduced. Check the paper for the specifics of each encoding. To encode the treebank with the pos-pointer-reduced strategy, you also need to specify the parameter --path_reduced_tagset, e.g. --path_reduced_tagset resources/tagset_reduction_tiger_negra.txt (for the NEGRA and TIGER German treebanks) or --path_reduced_tagset resources/tagset_reduction_dptb.txt (for the DPTB English treebank).
To check all the parameter options: python encode.py --help
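As a convenience, the sketch below encodes the same treebank once per strategy, reusing the flags from the example above (the per-strategy output directories and the choice of the NEGRA/TIGER tagset reduction are assumptions):
for enc in abs-idx rel-idx lehmer lehmer-inverse pos-pointer pos-pointer-reduced; do
  # pos-pointer-reduced additionally needs the reduced tagset
  extra=""
  if [ "$enc" = "pos-pointer-reduced" ]; then
    extra="--path_reduced_tagset resources/tagset_reduction_tiger_negra.txt"
  fi
  python encode.py \
    --train data/negra/train.discbracket \
    --dev data/negra/dev.discbracket \
    --test data/negra/test.discbracket \
    --output data/negra_sl/$enc/ \
    --root_label --os --disc \
    --split_char '{}' \
    --disco_encoder "$enc" \
    $extra
done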
NOTE: For simplicity, we use a gold encoded file here, but the same applies to predicted output files generated by a model. To decode labels back into trees:
cd disco2labels
python decode.py \
--input data/negra_sl/pos-pointer/train.tsv \
--output /tmp/train_decoded.tsv \
--disc \
--disco_encoder pos-pointer \
--split_char '{}' \
--os
To check all the parameter options: python decode.py --help
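As a quick sanity check, assuming decode.py writes one parenthesized tree per line in the same order as the gold file (the --check_decode flag of encode.py performs a similar verification), the decoded output should match the gold treebank:
diff data/negra/train.discbracket /tmp/train_decoded.tsv && echo "round trip OK"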
We used a modified version of the NCRFpp package, which we include as part of this repository. To train a model:
cd disco2labels
python NCRF/main.py --config resources/ncrfpp_confs/train.negra.pos-pointer.bilstm.config
NOTE: To correctly train a model, check the template at resources/ncrfpp_confs/train.negra.pos-pointer.bilstm.config and adapt the paths to the location of the data and resources on your machine.
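For orientation, the template follows the standard NCRFpp config format; a minimal sketch might look as follows, with illustrative values (always prefer the shipped template):
### Train ###
status=train
train_dir=data/negra_sl/pos-pointer/train.tsv
dev_dir=data/negra_sl/pos-pointer/dev.tsv
test_dir=data/negra_sl/pos-pointer/test.tsv
model_dir=/tmp/ncrfpp.bilstm.negra.pos-pointer
# word_emb_dir should point to an embedding file from the downloaded resources
word_seq_feature=LSTM
use_char=True
use_crf=False
iteration=100
batch_size=8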
We adapted a script originally released by huggingface🤗 to train BERT-based models for discontinuous constituent parsing as sequence labeling.
cd disco2labels
DistilBERT
CUDA_VISIBLE_DEVICES=0 python run_token_classifier.py \
--data_dir data/negra_sl/pos-pointer/ \
--transformer_model distilbert_model \
--transformer_pretrained_model distilbert-base-german-cased \
--task_name sl_tsv \
--model_dir /tmp/negra.pos-pointer.distilbert-base-german-cased.model \
--output_dir /tmp/negra.pos-pointer.distilbert-base-german-cased.output \
--path_gold_parenthesized data/negra/dev.discbracket \
--evalb_param proper.prm \
--label_split_char {} \
--disco_encoder pos-pointer \
--log /tmp/negra.pos-pointer.distilbert-base-german-cased.log \
--learning_rate 1e-5 \
--parsing_paradigm constituency --do_train --do_eval --num_train_epochs 45 --train_batch_size 6 --max_seq_length 240
BERT
CUDA_VISIBLE_DEVICES=0 python run_token_classifier.py \
--data_dir data/negra_sl/pos-pointer/ \
--transformer_model bert_model \
--transformer_pretrained_model bert-base-german-dbmdz-cased \
--task_name sl_tsv \
--model_dir /tmp/negra.pos-pointer.bert-base-german-dbmdz-cased.model \
--output_dir /tmp/negra.pos-pointer.bert-base-german-dbmdz-cased.output \
--path_gold_parenthesized data/negra/dev.discbracket \
--evalb_param proper.prm \
--label_split_char {} \
--disco_encoder pos-pointer \
--log /tmp/negra.pos-pointer.bert-base-german-dbmdz-cased.log \
--learning_rate 1e-5 \
--parsing_paradigm constituency --do_train --do_eval --num_train_epochs 45 --train_batch_size 6 --max_seq_length 240
Some relevant options:
- --transformer_model: bert_model|distilbert_model
- --transformer_pretrained_model: bert-base-german-dbmdz-cased|distilbert-base-german-cased (for German), bert-base-cased|bert-large-cased|distilbert-base-cased (for English)
- --path_reduced_tagset: required when training a model using the pos-pointer-reduced strategy
To check all the options: python run_token_classifier.py --help
To use an uncased model, specify an uncased pretrained checkpoint (e.g. bert-base-german-dbmdz-uncased) and additionally set the --do_lower_case flag:
BERT (uncased)
CUDA_VISIBLE_DEVICES=0 python run_token_classifier.py \
--data_dir data/negra_sl/pos-pointer/ \
--transformer_model bert_model \
--transformer_pretrained_model bert-base-german-dbmdz-uncased \
--task_name sl_tsv \
--model_dir /tmp/negra.pos-pointer.bert-base-german-dbmdz-uncased.model \
--output_dir /tmp/negra.pos-pointer.bert-base-german-dbmdz-uncased.output \
--path_gold_parenthesized data/negra/test.discbracket \
--evalb_param proper.prm \
--label_split_char {} \
--disco_encoder pos-pointer \
--log /tmp/negra.pos-pointer.bert-base-german-dbmdz-uncased.log \
--learning_rate 1e-5 \
--parsing_paradigm constituency \
--do_test --num_train_epochs 45 --train_batch_size 6 --max_seq_length 240 --do_lower_case
To run and evaluate a trained NCRFpp model (here pinned to a single CPU core with taskset):
taskset --cpu-list 1 \
python run_ncrfpp.py \
--test data/negra_sl/pos-pointer/test.tsv \
--gold data/negra/test.discbracket \
--model /tmp/ncrfpp.bilstm.negra.pos-pointer \
--gpu True \
--output /tmp/ncrfpp.bilstm.negra.pos-pointer \
--disco_encoder pos-pointer \
--evalb_param proper.prm \
--os \
--ncrfpp NCRF
To check all the parameter options: python run_ncrfpp.py --help
Alternatively, if you simply want to run the model, you can create a decoding config file (check the template resources/ncrfpp_confs/decode.negra.pos-pointer.bilstm.config):
python NCRF/main.py --config resources/ncrfpp_confs/decode.negra.pos-pointer.bilstm.config
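For orientation, the decode config mirrors the PoS-tagger config shown later in this README; a sketch with illustrative paths (check the shipped template for the real values):
### Decode ###
status=decode
raw_dir=data/negra_sl/pos-pointer/test.tsv
decode_dir=/tmp/ncrfpp.bilstm.negra.pos-pointer.output.tsv
dset_dir=/tmp/ncrfpp.bilstm.negra.pos-pointer.dset
load_model_dir=/tmp/ncrfpp.bilstm.negra.pos-pointer.model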
We can use the same script we used for training the BERT-based models, but with the --do_test option instead.
cd disco2labels
DistilBERT
CUDA_VISIBLE_DEVICES=0 taskset --cpu-list 1 python run_token_classifier.py \
--data_dir data/negra_sl/pos-pointer/ \
--transformer_model distilbert_model \
--transformer_pretrained_model distilbert-base-german-cased \
--task_name sl_tsv \
--model_dir /tmp/negra.pos-pointer.distilbert-base-german-cased.model \
--output_dir /tmp/negra.pos-pointer.distilbert-base-german-cased \
--path_gold_parenthesized data/negra/test.discbracket \
--evalb_param proper.prm \
--label_split_char {} \
--disco_encoder pos-pointer \
--parsing_paradigm constituency --do_test --eval_batch_size 8 --max_seq_length 240
BERT
CUDA_VISIBLE_DEVICES=0 taskset --cpu-list 1 python run_token_classifier.py \
--data_dir data/negra_sl/pos-pointer/ \
--transformer_model bert_model \
--transformer_pretrained_model bert-base-german-dbmdz-cased \
--task_name sl_tsv \
--model_dir /tmp/negra.pos-pointer.bert-base-german-dbmdz-cased.model \
--output_dir /tmp/negra.pos-pointer.bert-base-german-dbmdz-cased \
--path_gold_parenthesized data/negra/test.discbracket \
--evalb_param proper.prm \
--label_split_char {} \
--disco_encoder pos-pointer \
--parsing_paradigm constituency --do_test --eval_batch_size 8 --max_seq_length 240
To check all the parameter options: python run_token_classifier.py --help
We also release the NCRFpp BILSTM PoS tagging models, so you can generate the same predicted PoS tags as we did for the training, development and test sets. You can download these PoS taggers as part of the resources used in this work here.
To run the PoS taggers, you just need to run the model with NCRFpp:
python NCRF/main.py --config resources/ncrf_confs_postaggers/decode.negra.pos.config
where the content of the config file would be something like:
### Decode ###
status=decode
raw_dir=data/postag_datasets/negra/train.tsv
decode_dir=/tmp/negra_train_predpostags.tsv
dset_dir=resources/ncrfpp_postaggers/negra.ncrfpp.sskip.postagger.dset
load_model_dir=resources/ncrfpp_postaggers/negra.ncrfpp.sskip.postagger.model
and data/postag_datasets/negra/train.tsv is a .tsv file with two columns: words and PoS tags (the latter act as the labels in this case).
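For illustration, such a file would look like the sketch below (tab-separated in the real files; STTS-style tags):
Das      ART
kleine   ADJA
Haus     NN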
To generate a new .discbracket file with predicted PoS tags, use the script scripts/discbracket_pred_postags.py:
python scripts/discbracket_pred_postags.py \
--input_disbracket data/negra/train.discbracket \
--input_pred_tags /tmp/negra_train_predpostags.tsv \
--out_disbracket data/negra_pred/train.discbracket
Follow the regular encoding and training processes with the data/negra_pred/train.discbracket file, which now contains the predicted PoS tags. Repeat the process for every split of the treebank, as in the sketch below.
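A sketch for all three splits, assuming the predicted-tag files follow the /tmp/negra_<split>_predpostags.tsv naming used above:
for split in train dev test; do
  python scripts/discbracket_pred_postags.py \
    --input_disbracket data/negra/$split.discbracket \
    --input_pred_tags /tmp/negra_${split}_predpostags.tsv \
    --out_disbracket data/negra_pred/$split.discbracket
done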
To download the NEGRA, TIGER and DPTB NCRFpp BILSTM models trained with the pos-pointer encoding, click here.
To download the NEGRA, TIGER and DPTB BERT models trained with the pos-pointer encoding, click here.
- Add support for formats other than .discbracket.
- Save extra parameters with BERT models so they are easier to load and run later.
- Improve the robustness of the word-to-subword alignment for arbitrary BERT-based models (especially uncased ones).
David Vilares and Carlos Gómez-Rodríguez, Discontinuous Constituent Parsing as Sequence Labeling, to appear at EMNLP-2020. Punta Cana, Dominican Republic (online due to COVID-19).
This work has received funding from the European Research Council (ERC), under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150).