Enhancing Morphological Analysis with spaCy Pretraining (#188)
* init

* add commands to project yml

* add language variable

* Add more configs

* add german language

* add nl lang

* start evaluation script

* Finish evaluation script

* code adjustments

* edit eval script

* Adjust description and requirements

* Add install requirements command

* add working_env ignore

* Adjustments

* Fix description

* Update readme

* Adjust benchmark readme

* Add static vector training workflow

* set gpu to -1

* Update with model-last.bin for spacy v3.5.2+

* Add pretraining workflow to tests

* Update README

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
thomashacker and adrianeboyd authored Jul 31, 2023
1 parent 945d81b commit 393e79f
Showing 16 changed files with 1,390 additions and 0 deletions.
1 change: 1 addition & 0 deletions benchmarks/README.md
@@ -9,6 +9,7 @@
| [`ner_conll03`](ner_conll03) | Named Entity Recognition (CoNLL-2003) |
| [`ner_embeddings`](ner_embeddings) | Comparing embedding layers in spaCy |
| [`parsing_penn_treebank`](parsing_penn_treebank) | Dependency Parsing (Penn Treebank) |
| [`pretraining_morphologizer_oscar`](pretraining_morphologizer_oscar) | Pretraining Morphologizer |
| [`span-labeling-datasets`](span-labeling-datasets) | Span labeling datasets |
| [`speed`](speed) | Project for speed benchmarking of various pretrained models of different NLP libraries. |
| [`textcat_architectures`](textcat_architectures) | Textcat performance benchmarks |
7 changes: 7 additions & 0 deletions benchmarks/pretraining_morphologizer_oscar/.gitignore
@@ -0,0 +1,7 @@
assets
corpus
data
training
pretraining
metrics
working_env
69 changes: 69 additions & 0 deletions benchmarks/pretraining_morphologizer_oscar/README.md
@@ -0,0 +1,69 @@
<!-- SPACY PROJECT: AUTO-GENERATED DOCS START (do not remove) -->

# 🪐 spaCy Project: Enhancing Morphological Analysis with spaCy Pretraining

This project explores how effective pretraining is for morphological analysis (the `morphologizer` component) by running experiments on multiple languages. The objective is to demonstrate how pretraining the tok2vec layer on domain-specific data benefits morphological analysis. We use the OSCAR dataset to pretrain the tok2vec layer and the UD treebanks (v2.5) to train a morphologizer component. We then evaluate the different pretraining techniques and compare them with models trained without any pretraining.

## 📋 project.yml

The [`project.yml`](project.yml) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[spaCy projects documentation](https://spacy.io/usage/projects).

### ⏯ Commands

The following commands are defined by the project. They
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run).
Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `install_requirements` | Download and install all requirements |
| `download_oscar` | Download a subset of the OSCAR dataset |
| `download_model` | Download the specified spaCy model for vector-objective pretraining |
| `extract_ud` | Extract the UD treebanks data |
| `convert_ud` | Convert the UD treebanks data to spaCy's format |
| `train` | Train a morphologizer component without pretrained weights and static vectors |
| `evaluate` | Evaluate the trained morphologizer component without pretrained weights and static vectors |
| `train_static` | Train a morphologizer component with static vectors from a pretrained model |
| `evaluate_static` | Evaluate the trained morphologizer component with static vectors |
| `pretrain_char` | Pretrain a tok2vec component with the character objective |
| `train_char` | Train a morphologizer component with pretrained weights (character objective) |
| `evaluate_char` | Evaluate the trained morphologizer component with pretrained weights (character objective) |
| `pretrain_vector` | Pretrain a tok2vec component with the vector objective |
| `train_vector` | Train a morphologizer component with pretrained weights (vector objective) |
| `evaluate_vector` | Evaluate the trained morphologizer component with pretrained weights (vector objective) |
| `train_trf` | Train a morphologizer component with transformer embeddings |
| `evaluate_trf` | Evaluate the trained morphologizer component with transformer embeddings |
| `evaluate_metrics` | Evaluate all experiments and create a summary json file |
| `reset_project` | Reset the project to its original state and delete all training progress |
| `reset_training` | Reset the training progress |
| `reset_metrics` | Delete the metrics folder |

### ⏭ Workflows

The following workflows are defined by the project. They
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run)
and will run the specified commands in order. Commands are only re-run if their
inputs have changed.

| Workflow | Steps |
| --- | --- |
| `data` | `download_oscar` &rarr; `download_model` &rarr; `extract_ud` &rarr; `convert_ud` |
| `training` | `train` &rarr; `evaluate` |
| `training_static` | `train_static` &rarr; `evaluate_static` |
| `training_char` | `pretrain_char` &rarr; `train_char` &rarr; `evaluate_char` |
| `training_vector` | `pretrain_vector` &rarr; `train_vector` &rarr; `evaluate_vector` |
| `training_trf` | `train_trf` &rarr; `evaluate_trf` |
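
The table above maps each workflow name to an ordered list of commands. As a minimal sketch, the same mapping can be written in Python to build the matching `spacy project run` invocations. The `WORKFLOWS` dictionary and the `workflow_argv` helper are hypothetical names used for illustration only; they are not part of this project.

```python
from typing import Dict, List

# Workflow steps copied from the table above. The dictionary itself is
# illustrative; the project defines its workflows in project.yml.
WORKFLOWS: Dict[str, List[str]] = {
    "data": ["download_oscar", "download_model", "extract_ud", "convert_ud"],
    "training": ["train", "evaluate"],
    "training_static": ["train_static", "evaluate_static"],
    "training_char": ["pretrain_char", "train_char", "evaluate_char"],
    "training_vector": ["pretrain_vector", "train_vector", "evaluate_vector"],
    "training_trf": ["train_trf", "evaluate_trf"],
}


def workflow_argv(name: str) -> List[str]:
    """Return the argv that runs one workflow through the spaCy CLI."""
    if name not in WORKFLOWS:
        raise KeyError(f"unknown workflow: {name}")
    return ["python", "-m", "spacy", "project", "run", name]


if __name__ == "__main__":
    # Print each invocation next to the commands it would run, in order.
    for name, steps in WORKFLOWS.items():
        print(" ".join(workflow_argv(name)), "->", " -> ".join(steps))
```

Each argv list can be passed to `subprocess.run` from the project directory. The training workflows presumably need the output of the `data` workflow, so running `data` first is a reasonable assumption.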

### 🗂 Assets

The following assets are defined by the project. They can
be fetched by running [`spacy project assets`](https://spacy.io/api/cli#project-assets)
in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/ud-treebanks-v2.5.tgz` | URL | |

<!-- SPACY PROJECT: AUTO-GENERATED DOCS END (do not remove) -->
127 changes: 127 additions & 0 deletions benchmarks/pretraining_morphologizer_oscar/configs/config.cfg
@@ -0,0 +1,127 @@
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
log_file = null
raw_text = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["morphologizer"]
batch_size = 64
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.morphologizer]
factory = "morphologizer"
overwrite = false
scorer = {"@scorers":"spacy.morphologizer_scorer.v1"}

[components.morphologizer.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.morphologizer.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.morphologizer.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.morphologizer.model.tok2vec.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = false

[components.morphologizer.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system:seed}
gpu_allocator = ${system:gpu_allocator}
dropout = 0.1
accumulate_gradient = 3
patience = 2500
max_epochs = 0
max_steps = 20000
eval_frequency = 250
frozen_components = []
before_to_disk = null
annotating_components = []

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
get_length = null
size = 2000
buffer = 256

[training.logger]
@loggers = "spacy.ConsoleLogger.v2"
progress_bar = true
output_file = ${paths.log_file}


[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
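
The `[training.optimizer.learn_rate]` block in this config uses the `warmup_linear.v1` schedule with `warmup_steps = 250`, `total_steps = 20000` and `initial_rate = 0.00005`. The sketch below shows what those values imply. It assumes the schedule ramps linearly from zero to `initial_rate` during warmup and then decays linearly to zero at `total_steps`; that behavior is an assumption about the schedule, not something stated in this commit. The function name `warmup_linear_rate` is hypothetical.

```python
def warmup_linear_rate(
    step: int,
    initial_rate: float = 0.00005,
    warmup_steps: int = 250,
    total_steps: int = 20000,
) -> float:
    """Learning rate at `step` under a linear warmup and linear decay.

    Assumed shape: 0 -> initial_rate over the warmup steps, then
    initial_rate -> 0 between warmup_steps and total_steps.
    """
    if step < warmup_steps:
        # Warmup phase: fraction of the warmup completed so far.
        factor = step / max(1, warmup_steps)
    else:
        # Decay phase: fraction of the post-warmup budget that remains.
        remaining = total_steps - step
        factor = max(0.0, remaining / max(1.0, total_steps - warmup_steps))
    return factor * initial_rate


if __name__ == "__main__":
    # Print the rate at a few points of the configured 20000-step run.
    for step in (0, 125, 250, 10000, 20000):
        print(step, warmup_linear_rate(step))
```

Under these assumptions the rate peaks at step 250, which matches `eval_frequency = 250`: the first evaluation coincides with the end of warmup.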
@@ -0,0 +1,165 @@
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
log_file = null
raw_text = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["morphologizer"]
batch_size = 64
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.morphologizer]
factory = "morphologizer"
overwrite = false
scorer = {"@scorers":"spacy.morphologizer_scorer.v1"}

[components.morphologizer.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.morphologizer.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.morphologizer.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.morphologizer.model.tok2vec.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = true

[components.morphologizer.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.pretrain]
@readers = "spacy.JsonlCorpus.v1"
path = ${paths.raw_text}
min_length = 5
max_length = 500
limit = 0

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system:seed}
gpu_allocator = ${system:gpu_allocator}
dropout = 0.1
accumulate_gradient = 3
patience = 2500
max_epochs = 0
max_steps = 20000
eval_frequency = 500
frozen_components = []
before_to_disk = null
annotating_components = []

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
get_length = null
size = 2000
buffer = 256

[training.logger]
@loggers = "spacy.ConsoleLogger.v2"
progress_bar = true
output_file = ${paths.log_file}


[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]

[pretraining]
max_epochs = 1000
dropout = 0.2
n_save_every = 0
n_save_epoch = 1
component = "morphologizer"
layer = "tok2vec"
corpus = "corpora.pretrain"

[pretraining.batcher]
@batchers = "spacy.batch_by_words.v1"
size = 3000
discard_oversize = false
tolerance = 0.2
get_length = null

[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4

[pretraining.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 1e-8
learn_rate = 0.001

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
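
This config adds a `[pretraining]` block and a `corpora.pretrain` reader that takes its path from `paths.raw_text`, while `[initialize]` reads `paths.init_tok2vec`. The sketch below shows one way to chain the two steps with the `spacy pretrain` and `spacy train` CLI commands. All file and directory names are hypothetical placeholders, as are the helper names `pretrain_argv` and `train_argv`. The `model-last.bin` file name follows the commit message ("Update with model-last.bin for spacy v3.5.2+").

```python
from pathlib import PurePosixPath
from typing import List


def pretrain_argv(config: str, raw_text: str, out_dir: str, gpu_id: int = -1) -> List[str]:
    """Build the argv for `spacy pretrain`, overriding paths.raw_text."""
    return [
        "python", "-m", "spacy", "pretrain", config, out_dir,
        "--paths.raw_text", raw_text,
        "--gpu-id", str(gpu_id),
    ]


def train_argv(
    config: str, train: str, dev: str, pretrain_dir: str, out_dir: str, gpu_id: int = -1
) -> List[str]:
    """Build the argv for `spacy train`, loading the pretrained tok2vec weights."""
    # Assumption: the pretraining step wrote model-last.bin into pretrain_dir.
    weights = PurePosixPath(pretrain_dir) / "model-last.bin"
    return [
        "python", "-m", "spacy", "train", config,
        "--output", out_dir,
        "--paths.train", train,
        "--paths.dev", dev,
        "--paths.init_tok2vec", str(weights),
        "--gpu-id", str(gpu_id),
    ]


if __name__ == "__main__":
    # Hypothetical paths, chosen only to make the example concrete.
    cfg = "configs/my_pretrain_config.cfg"
    print(" ".join(pretrain_argv(cfg, "data/raw_text.jsonl", "pretraining")))
    print(" ".join(train_argv(cfg, "corpus/train.spacy", "corpus/dev.spacy",
                              "pretraining", "training")))
```

Both helpers default to `--gpu-id -1` (CPU), in line with the "set gpu to -1" change listed in the commit message.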