UniTP is a Unified Tokenization and Parsing framework in PyTorch for our two papers in ACL Findings 2021 and TACL 2023. It implements the Neural Combinatory Constituency Parsing (NCCP) family, which additionally performs word segmentation (WS), sentiment analysis (SA), and named entity recognition (NER).
This project is extended from https://github.com/tmu-nlp/nccp.
For models with fastText:
pip install -r requirements/minimal.txt
- Install fastText and configure the values under path tool:fasttext: in 000/manager.yaml (see the sketch below).
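In manager.yaml, a colon-separated path such as tool:fasttext: denotes nested YAML keys. A minimal sketch of that nesting follows; the leaf keys and values (here, hypothetical per-language paths to pre-trained fastText .bin files) are assumptions, so check them against the template shipped as 000/manager.yaml.

tool:
  fasttext:
    # hypothetical leaf entries; align them with the keys already present in 000/manager.yaml
    en: /path/to/cc.en.300.bin
    zh: /path/to/cc.zh.300.bin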
Additional requirements:
pip install -r requirements/full.txt
for NCCP models with huggingface transformers.
- For continuous models, install evalb and configure the values under path tool:evalb: in 000/manager.yaml.
- For discontinuous models, install discontinuous DOP and configure the values under path tool:evalb_lcfrs_prm: in 000/manager.yaml (both tool paths are sketched below).
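Likewise, tool:evalb: and tool:evalb_lcfrs_prm: refer to nested keys in 000/manager.yaml. A minimal sketch, assuming each value is simply a local path to the compiled evalb binary and to a disco-dop-style .prm evaluation parameter file (the actual layout in the template may differ):

tool:
  evalb: /path/to/EVALB/evalb            # assumed: path to the compiled evalb binary
  evalb_lcfrs_prm: /path/to/proper.prm   # assumed: path to a discontinuous evaluation parameter file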
- CB: continuous and binary (models/nccp, +SA)
- CM: continuous and multi-branching (models/accp, +WS, +NER)
- DB: discontinuous and binary (models/dccp)
- DM: discontinuous and multi-branching (models/xccp)
Besides constituency parsing, the continuous models enable SA, WS, and NER. All models can be either monolingual or multilingual.
We provide the configurations of the models in our two papers (i.e., CB and CM, and DB and DM) in 000/manager.yaml.
Please first configure the path data:[corpus]:source_path: for each corpus you have (see the sketch below) and check the additional requirements above.
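As a minimal sketch, data:[corpus]:source_path: expands to nested keys per corpus; the corpus keys and paths below are hypothetical (e.g., ptb and dptb), so match them to the corpus names actually listed in 000/manager.yaml:

data:
  ptb:
    source_path: /path/to/ptb/wsj   # hypothetical location of the PTB WSJ folder
  dptb:
    source_path: /path/to/dptb      # hypothetical location of the converted DPTB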
To train a monolingual parser as in our published papers, you might try the following commands:
# train DB on corpus DPTB on device GPU ID 0.
./manager.py 000 -s db/dptb -g 0
# give an optional folder name [#.test_me] for storage
./manager.py 000 -s db/dptb:test_me -g 0
For multilingual models, examples are:
# train CB on all available corpora (i.e., PTB, CTB, KTB, NPCMJ, and SST (for SA) if configured) on GPU 0 (using default values).
./manager.py 000 -s cb
# train CM on all available corpora (i.e., PTB, CTB, KTB, NPCMJ, and CONLL & IDNER (for NER) if configured).
./manager.py 000 -s cm
# train DM on corpora DPTB and TIGER with pre-trained language models on device GPU 4.
./manager.py 000 -s pre_dm/dptb,tiger -g 4
# or
./manager.py 000 -s pre_dm -g 4
Each trained model is stored at 000/[model]/[#.folder_name] with an entry [#] in 000/[model]/register_and_tests.yaml, where [model] is the model variant and [#] is an integer for the trained model instance. The number [#] is assigned by manager.py, and [folder_name] is the optional name for storage, such as :test_me in the previous example.
To test, you may try:
./manager.py 000 -s [model] -i [#]
We suggest tuning hyperparameters with a trained model:
./manager.py 000 -s [model] -ir [#] -x mp,optuna=[#trials],max=0
If you want to edit the range of hyperparameter exploration, please find the respective file experiments/[model]/operator.py and modify its function _get_optuna_fn.
An exemplary illustration is given in https://github.com/tmu-nlp/nccp. However, because of little demand for visualization, ./visualization.py has become obsolete.
To convert the Penn Treebank into a graphbank or DPTB:
# Output all trees into one XML file,
./ptb_to.py g path_to_ptb_wsj gptb.xml
# or into separate XML files mirroring the wsj folder structure
./ptb_to.py g path_to_ptb_wsj gptb_folder
# Likewise, for the DPTB conversion:
./ptb_to.py d path_to_ptb_wsj dptb.xml
./ptb_to.py d path_to_ptb_wsj dptb_folder