Commit
Enhancing Morphological Analysis with spaCy Pretraining (#188)
* init
* add commands to project yml
* add language variable
* Add more configs
* add german language
* add nl lang
* start evaluation script
* Finish evaluation script
* code adjustments
* edit eval script
* Adjust description and requirements
* Add install requirements command
* add working_env ignore
* Adjustments
* Fix description
* Update readme
* Adjust benchmark readme
* Add static vector training workflow
* set gpu to -1
* Update with model-last.bin for spacy v3.5.2+
* Add pretraining workflow to tests
* Update README

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
1 parent 945d81b, commit 393e79f
Showing 16 changed files with 1,390 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
assets | ||
corpus | ||
data | ||
training | ||
pretraining | ||
metrics | ||
working_env |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
<!-- SPACY PROJECT: AUTO-GENERATED DOCS START (do not remove) --> | ||
|
||
# 🪐 spaCy Project: Enhancing Morphological Analysis with spaCy Pretraining | ||
|
||
This project explores how effective pretraining is for morphological analysis (the morphologizer component) by running experiments on multiple languages. The objective is to demonstrate how pretraining on domain-specific data benefits morphological analysis. We use the OSCAR dataset to pretrain the tok2vec weights and the UD_Treebanks dataset to train a morphologizer component. We then compare models trained with the different pretraining techniques against models trained without any pretraining. | ||
|
||
## 📋 project.yml | ||
|
||
The [`project.yml`](project.yml) defines the data assets required by the | ||
project, as well as the available commands and workflows. For details, see the | ||
[spaCy projects documentation](https://spacy.io/usage/projects). | ||
|
||
### ⏯ Commands | ||
|
||
The following commands are defined by the project. They | ||
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run). | ||
Commands are only re-run if their inputs have changed. | ||
|
||
| Command | Description | | ||
| --- | --- | | ||
| `install_requirements` | Download and install all requirements | | ||
| `download_oscar` | Download a subset of the OSCAR dataset | | ||
| `download_model` | Download the specified spaCy model for vector-objective pretraining | | ||
| `extract_ud` | Extract the ud-treebanks data | | ||
| `convert_ud` | Convert the ud-treebanks data to spaCy's format | | ||
| `train` | Train a morphologizer component without pretrained weights or static vectors | | ||
| `evaluate` | Evaluate the morphologizer component trained without pretrained weights or static vectors | | ||
| `train_static` | Train a morphologizer component with static vectors from a pretrained model | | ||
| `evaluate_static` | Evaluate the trained morphologizer component with static vectors | | ||
| `pretrain_char` | Pretrain a tok2vec component with the character objective | | ||
| `train_char` | Train a morphologizer component with pretrained weights (character-objective) | | ||
| `evaluate_char` | Evaluate the trained morphologizer component with pretrained weights (character-objective) | | ||
| `pretrain_vector` | Pretrain a tok2vec component with the vector objective | | ||
| `train_vector` | Train a morphologizer component with pretrained weights (vector-objective) | | ||
| `evaluate_vector` | Evaluate the trained morphologizer component with pretrained weights (vector-objective) | | ||
| `train_trf` | Train a morphologizer component with transformer embeddings | | ||
| `evaluate_trf` | Evaluate the trained morphologizer component with transformer embeddings | | ||
| `evaluate_metrics` | Evaluate all experiments and create a summary json file | | ||
| `reset_project` | Reset the project to its original state and delete all training progress | | ||
| `reset_training` | Reset the training progress | | ||
| `reset_metrics` | Delete the metrics folder | | ||
|
||
### ⏭ Workflows | ||
|
||
The following workflows are defined by the project. They | ||
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run) | ||
and will run the specified commands in order. Commands are only re-run if their | ||
inputs have changed. | ||
|
||
| Workflow | Steps | | ||
| --- | --- | | ||
| `data` | `download_oscar` → `download_model` → `extract_ud` → `convert_ud` | | ||
| `training` | `train` → `evaluate` | | ||
| `training_static` | `train_static` → `evaluate_static` | | ||
| `training_char` | `pretrain_char` → `train_char` → `evaluate_char` | | ||
| `training_vector` | `pretrain_vector` → `train_vector` → `evaluate_vector` | | ||
| `training_trf` | `train_trf` → `evaluate_trf` | | ||
|
||
### 🗂 Assets | ||
|
||
The following assets are defined by the project. They can | ||
be fetched by running [`spacy project assets`](https://spacy.io/api/cli#project-assets) | ||
in the project directory. | ||
|
||
| File | Source | Description | | ||
| --- | --- | --- | | ||
| `assets/ud-treebanks-v2.5.tgz` | URL | | | ||
|
||
<!-- SPACY PROJECT: AUTO-GENERATED DOCS END (do not remove) --> |
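
The workflows listed in the README table can also be driven from a script. The sketch below is a minimal, hypothetical helper (the names `WORKFLOWS`, `build_run_command`, and `run_workflows` are not part of the project) that builds the `spacy project run` invocation for each workflow and, unless `dry_run` is set, executes them in order. It assumes spaCy is installed in the current interpreter and that the project directory contains the `project.yml`.

```python
import subprocess
import sys
from pathlib import Path

# Workflow names copied from the Workflows table in the README.
WORKFLOWS = [
    "data",
    "training",
    "training_static",
    "training_char",
    "training_vector",
    "training_trf",
]


def build_run_command(name, project_dir="."):
    """Return the argv list for `python -m spacy project run <name> <project_dir>`."""
    return [sys.executable, "-m", "spacy", "project", "run", name, str(Path(project_dir))]


def run_workflows(names, project_dir=".", dry_run=True):
    """Build one command per workflow; print it (dry run) or execute it in order."""
    commands = [build_run_command(name, project_dir) for name in names]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing workflow.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_workflows(WORKFLOWS[:2])` prints the commands for the `data` and `training` workflows without executing them. Before running with `dry_run=False`, the assets would need to be fetched with `spacy project assets`.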
127 changes: 127 additions & 0 deletions
benchmarks/pretraining_morphologizer_oscar/configs/config.cfg
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,127 @@ | ||
[paths] | ||
train = null | ||
dev = null | ||
vectors = null | ||
init_tok2vec = null | ||
log_file = null | ||
raw_text = null | ||
|
||
[system] | ||
gpu_allocator = "pytorch" | ||
seed = 0 | ||
|
||
[nlp] | ||
lang = "en" | ||
pipeline = ["morphologizer"] | ||
batch_size = 64 | ||
disabled = [] | ||
before_creation = null | ||
after_creation = null | ||
after_pipeline_creation = null | ||
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"} | ||
|
||
[components] | ||
|
||
[components.morphologizer] | ||
factory = "morphologizer" | ||
overwrite = false | ||
scorer = {"@scorers":"spacy.morphologizer_scorer.v1"} | ||
|
||
[components.morphologizer.model] | ||
@architectures = "spacy.Tagger.v1" | ||
nO = null | ||
|
||
[components.morphologizer.model.tok2vec] | ||
@architectures = "spacy.Tok2Vec.v2" | ||
|
||
[components.morphologizer.model.tok2vec.embed] | ||
@architectures = "spacy.MultiHashEmbed.v2" | ||
width = ${components.morphologizer.model.tok2vec.encode.width} | ||
attrs = ["ORTH", "SHAPE"] | ||
rows = [5000, 2500] | ||
include_static_vectors = false | ||
|
||
[components.morphologizer.model.tok2vec.encode] | ||
@architectures = "spacy.MaxoutWindowEncoder.v2" | ||
width = 256 | ||
depth = 8 | ||
window_size = 1 | ||
maxout_pieces = 3 | ||
|
||
[corpora] | ||
|
||
[corpora.dev] | ||
@readers = "spacy.Corpus.v1" | ||
path = ${paths.dev} | ||
max_length = 0 | ||
gold_preproc = false | ||
limit = 0 | ||
augmenter = null | ||
|
||
[corpora.train] | ||
@readers = "spacy.Corpus.v1" | ||
path = ${paths.train} | ||
max_length = 0 | ||
gold_preproc = false | ||
limit = 0 | ||
augmenter = null | ||
|
||
[training] | ||
train_corpus = "corpora.train" | ||
dev_corpus = "corpora.dev" | ||
seed = ${system:seed} | ||
gpu_allocator = ${system:gpu_allocator} | ||
dropout = 0.1 | ||
accumulate_gradient = 3 | ||
patience = 2500 | ||
max_epochs = 0 | ||
max_steps = 20000 | ||
eval_frequency = 250 | ||
frozen_components = [] | ||
before_to_disk = null | ||
annotating_components = [] | ||
|
||
[training.batcher] | ||
@batchers = "spacy.batch_by_padded.v1" | ||
discard_oversize = true | ||
get_length = null | ||
size = 2000 | ||
buffer = 256 | ||
|
||
[training.logger] | ||
@loggers = "spacy.ConsoleLogger.v2" | ||
progress_bar = true | ||
output_file = ${paths.log_file} | ||
|
||
|
||
[training.optimizer] | ||
@optimizers = "Adam.v1" | ||
beta1 = 0.9 | ||
beta2 = 0.999 | ||
L2_is_weight_decay = true | ||
L2 = 0.01 | ||
grad_clip = 1.0 | ||
use_averages = true | ||
eps = 0.00000001 | ||
|
||
[training.optimizer.learn_rate] | ||
@schedules = "warmup_linear.v1" | ||
warmup_steps = 250 | ||
total_steps = 20000 | ||
initial_rate = 0.00005 | ||
|
||
[training.score_weights] | ||
|
||
[pretraining] | ||
|
||
[initialize] | ||
vectors = ${paths.vectors} | ||
init_tok2vec = ${paths.init_tok2vec} | ||
vocab_data = null | ||
lookups = null | ||
before_init = null | ||
after_init = null | ||
|
||
[initialize.components] | ||
|
||
[initialize.tokenizer] |
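
The `[training.optimizer.learn_rate]` block in this config uses a linear warmup followed by a linear decay. The following is a plain-Python re-implementation written for illustration, not thinc's own code. It assumes the schedule ramps from 0 to `initial_rate` over `warmup_steps` and then decays linearly to 0 at `total_steps`; the function name `warmup_linear_rate` is hypothetical, and the default arguments are the values from the config.

```python
def warmup_linear_rate(step, initial_rate=0.00005, warmup_steps=250, total_steps=20000):
    """Learning rate at a given step under the assumed warmup-then-decay schedule."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to initial_rate.
        factor = step / max(1, warmup_steps)
    else:
        # Decay phase: scale linearly from initial_rate down to 0 at total_steps.
        factor = max(0.0, (total_steps - step) / max(1.0, total_steps - warmup_steps))
    return factor * initial_rate


# Print the rate at a few representative steps.
for step in (0, 125, 250, 10125, 20000):
    print(step, warmup_linear_rate(step))
```

Under this assumption the rate peaks at `initial_rate` on step 250 and reaches zero on step 20000, which coincides with `max_steps` in the `[training]` block.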
165 changes: 165 additions & 0 deletions
benchmarks/pretraining_morphologizer_oscar/configs/config_pretrain_char.cfg
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,165 @@ | ||
[paths] | ||
train = null | ||
dev = null | ||
vectors = null | ||
init_tok2vec = null | ||
log_file = null | ||
raw_text = null | ||
|
||
[system] | ||
gpu_allocator = "pytorch" | ||
seed = 0 | ||
|
||
[nlp] | ||
lang = "en" | ||
pipeline = ["morphologizer"] | ||
batch_size = 64 | ||
disabled = [] | ||
before_creation = null | ||
after_creation = null | ||
after_pipeline_creation = null | ||
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"} | ||
|
||
[components] | ||
|
||
[components.morphologizer] | ||
factory = "morphologizer" | ||
overwrite = false | ||
scorer = {"@scorers":"spacy.morphologizer_scorer.v1"} | ||
|
||
[components.morphologizer.model] | ||
@architectures = "spacy.Tagger.v1" | ||
nO = null | ||
|
||
[components.morphologizer.model.tok2vec] | ||
@architectures = "spacy.Tok2Vec.v2" | ||
|
||
[components.morphologizer.model.tok2vec.embed] | ||
@architectures = "spacy.MultiHashEmbed.v2" | ||
width = ${components.morphologizer.model.tok2vec.encode.width} | ||
attrs = ["ORTH", "SHAPE"] | ||
rows = [5000, 2500] | ||
include_static_vectors = true | ||
|
||
[components.morphologizer.model.tok2vec.encode] | ||
@architectures = "spacy.MaxoutWindowEncoder.v2" | ||
width = 256 | ||
depth = 8 | ||
window_size = 1 | ||
maxout_pieces = 3 | ||
|
||
[corpora] | ||
|
||
[corpora.dev] | ||
@readers = "spacy.Corpus.v1" | ||
path = ${paths.dev} | ||
max_length = 0 | ||
gold_preproc = false | ||
limit = 0 | ||
augmenter = null | ||
|
||
[corpora.train] | ||
@readers = "spacy.Corpus.v1" | ||
path = ${paths.train} | ||
max_length = 0 | ||
gold_preproc = false | ||
limit = 0 | ||
augmenter = null | ||
|
||
[corpora.pretrain] | ||
@readers = "spacy.JsonlCorpus.v1" | ||
path = ${paths.raw_text} | ||
min_length = 5 | ||
max_length = 500 | ||
limit = 0 | ||
|
||
[training] | ||
train_corpus = "corpora.train" | ||
dev_corpus = "corpora.dev" | ||
seed = ${system:seed} | ||
gpu_allocator = ${system:gpu_allocator} | ||
dropout = 0.1 | ||
accumulate_gradient = 3 | ||
patience = 2500 | ||
max_epochs = 0 | ||
max_steps = 20000 | ||
eval_frequency = 500 | ||
frozen_components = [] | ||
before_to_disk = null | ||
annotating_components = [] | ||
|
||
[training.batcher] | ||
@batchers = "spacy.batch_by_padded.v1" | ||
discard_oversize = true | ||
get_length = null | ||
size = 2000 | ||
buffer = 256 | ||
|
||
[training.logger] | ||
@loggers = "spacy.ConsoleLogger.v2" | ||
progress_bar = true | ||
output_file = ${paths.log_file} | ||
|
||
|
||
[training.optimizer] | ||
@optimizers = "Adam.v1" | ||
beta1 = 0.9 | ||
beta2 = 0.999 | ||
L2_is_weight_decay = true | ||
L2 = 0.01 | ||
grad_clip = 1.0 | ||
use_averages = true | ||
eps = 0.00000001 | ||
|
||
[training.optimizer.learn_rate] | ||
@schedules = "warmup_linear.v1" | ||
warmup_steps = 250 | ||
total_steps = 20000 | ||
initial_rate = 0.00005 | ||
|
||
[training.score_weights] | ||
|
||
[pretraining] | ||
max_epochs = 1000 | ||
dropout = 0.2 | ||
n_save_every = 0 | ||
n_save_epoch = 1 | ||
component = "morphologizer" | ||
layer = "tok2vec" | ||
corpus = "corpora.pretrain" | ||
|
||
[pretraining.batcher] | ||
@batchers = "spacy.batch_by_words.v1" | ||
size = 3000 | ||
discard_oversize = false | ||
tolerance = 0.2 | ||
get_length = null | ||
|
||
[pretraining.objective] | ||
@architectures = "spacy.PretrainCharacters.v1" | ||
maxout_pieces = 3 | ||
hidden_size = 300 | ||
n_characters = 4 | ||
|
||
[pretraining.optimizer] | ||
@optimizers = "Adam.v1" | ||
beta1 = 0.9 | ||
beta2 = 0.999 | ||
L2_is_weight_decay = true | ||
L2 = 0.01 | ||
grad_clip = 1.0 | ||
use_averages = true | ||
eps = 1e-8 | ||
learn_rate = 0.001 | ||
|
||
[initialize] | ||
vectors = ${paths.vectors} | ||
init_tok2vec = ${paths.init_tok2vec} | ||
vocab_data = null | ||
lookups = null | ||
before_init = null | ||
after_init = null | ||
|
||
[initialize.components] | ||
|
||
[initialize.tokenizer] |
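
The `[corpora.pretrain]` block reads raw text through `spacy.JsonlCorpus.v1`, which expects a JSONL file with one JSON object per line and the text stored under a `text` key. A minimal sketch for producing such a file from an iterable of strings follows. The helper name `write_raw_text_jsonl` and the output file name are hypothetical, and the project's own OSCAR download script may prepare the data differently.

```python
import json
from pathlib import Path


def write_raw_text_jsonl(texts, path):
    """Write one {"text": ...} object per line, skipping blank texts.

    Returns the number of lines written.
    """
    path = Path(path)
    count = 0
    with path.open("w", encoding="utf8") as f:
        for text in texts:
            text = text.strip()
            if not text:
                continue
            # ensure_ascii=False keeps non-ASCII characters readable in the file.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

The resulting path would be supplied as `paths.raw_text`, for example `python -m spacy pretrain configs/config_pretrain_char.cfg pretraining/ --paths.raw_text raw.jsonl` (the output directory and file name here are illustrative). The `min_length` and `max_length` settings then filter documents by token count when the corpus is read.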