UniTP is a Unified Tokenization and Parsing framework in PyTorch for our two papers in ACL Findings 2021 and TACL 2023. It implements the Neural Combinatory Constituency Parsing (NCCP) family, which additionally performs word segmentation (WS), sentiment analysis (SA), and named entity recognition (NER).
This project is extended from https://github.com/tmu-nlp/nccp.
For models with fastText:
pip install -r requirements/minimal.txt
- Install fastText and configure the values under path tool:fasttext: in 000/manager.yaml (see the sketch below).
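In manager.yaml, a colon-separated path such as tool:fasttext: denotes nested YAML keys. A minimal sketch of that nesting follows; the leaf keys and values (here, hypothetical per-language paths to pre-trained fastText .bin files) are assumptions, so check them against the template shipped as 000/manager.yaml.

tool:
  fasttext:
    # hypothetical leaf entries; align them with the keys already present in 000/manager.yaml
    en: /path/to/cc.en.300.bin
    zh: /path/to/cc.zh.300.bin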
Additional requirements:
pip install -r requirements/full.txt
for NCCP models with huggingface transformers.
- For continuous models, install evalb and configure the values under path tool:evalb: in 000/manager.yaml.
- For discontinuous models, install discontinuous DOP and configure the values under path tool:evalb_lcfrs_prm: in 000/manager.yaml (both tool paths are sketched below).
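Likewise, tool:evalb: and tool:evalb_lcfrs_prm: refer to nested keys in 000/manager.yaml. A minimal sketch, assuming each value is simply a local path to the compiled evalb binary and to a disco-dop-style .prm evaluation parameter file (the actual layout in the template may differ):

tool:
  evalb: /path/to/EVALB/evalb            # assumed: path to the compiled evalb binary
  evalb_lcfrs_prm: /path/to/proper.prm   # assumed: path to a discontinuous evaluation parameter file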
- CB: continuous and binary (models/nccp, +SA)
- CM: continuous and multi-branching (models/accp, +WS, +NER)
- DB: discontinuous and binary (models/dccp)
- DM: discontinuous and multi-branching (models/xccp)
Besides constituency parsing, the continuous models enable SA, WS, and NER. All models can be either monolingual or multilingual.
We provide the configurations of the models in our two papers (i.e., CB and CM, and DB and DM) in 000/manager.yaml.
Please first configure the path data:[corpus]:source_path: for each corpus you have (see the sketch below) and check the additional requirements above.
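As a minimal sketch, data:[corpus]:source_path: expands to nested keys per corpus; the corpus keys and paths below are hypothetical (e.g., ptb and dptb), so match them to the corpus names actually listed in 000/manager.yaml:

data:
  ptb:
    source_path: /path/to/ptb/wsj   # hypothetical location of the PTB WSJ folder
  dptb:
    source_path: /path/to/dptb      # hypothetical location of the converted DPTB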
To train a monolingual parser as in our published papers, you might try the following commands:
# train DB on corpus DPTB on device GPU ID 0.
./manager.py 000 -s db/dptb -g 0
# give an optional folder name [#.test_me] for storage
./manager.py 000 -s db/dptb:test_me -g 0
For multilingual models, examples are:
# train CB on all available corpora (i.e., PTB, CTB, KTB, NPCMJ, and SST (for SA) if configured) on GPU 0 (using default values).
./manager.py 000 -s cb
# train CM on all available corpora (i.e., PTB, CTB, KTB, NPCMJ, and CONLL & IDNER (for NER) if configured).
./manager.py 000 -s cm
# train DM on corpora DPTB and TIGER with pre-trained language models on device GPU 4.
./manager.py 000 -s pre_dm/dptb,tiger -g 4
# or
./manager.py 000 -s pre_dm -g 4
Each trained model is stored at 000/[model]/[#.folder_name] with an entry [#] in 000/[model]/register_and_tests.yaml, where [model] is the model variant and [#] is an integer for the trained model instance. The number [#] is assigned by manager.py, and [folder_name] is the optional name for storage, such as :test_me in the previous example.
To test, you may try:
./manager.py 000 -s [model] -i [#]
We suggest tuning hyperparameters with a trained model:
./manager.py 000 -s [model] -ir [#] -x mp,optuna=[#trials],max=0
If you want to edit the range of hyperparameter exploration, please find the respective file experiments/[model]/operator.py and modify its function _get_optuna_fn.
An exemplary illustration is given in https://github.com/tmu-nlp/nccp. However, because of little demand for visualization, ./visualization.py has become obsolete.
To convert the Penn Treebank into a graphbank or DPTB:
# Output all trees into one XML file,
./ptb_to.py g path_to_ptb_wsj gptb.xml
# or into separate XML files mirroring the wsj folder structure
./ptb_to.py g path_to_ptb_wsj gptb_folder
# Likewise, for the DPTB conversion:
./ptb_to.py d path_to_ptb_wsj dptb.xml
./ptb_to.py d path_to_ptb_wsj dptb_folder