Enhancing Morphological Analysis with spaCy Pretraining (#188)

* init
* add commands to project yml
* add language variable
* Add more configs
* add german language
* add nl lang
* start evaluation script
* Finish evaluation script
* code adjustments
* edit eval script
* Adjust description and requirements
* Add install requirements command
* add working_env ignore
* Adjustments
* Fix description
* Update readme
* Adjust benchmark readme
* Add static vector training workflow
* set gpu to -1
* Update with model-last.bin for spacy v3.5.2+
* Add pretraining workflow to tests
* Update README

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
explosion · Jul 31, 2023 · 393e79f
1 parent 945d81b

Showing 16 changed files with 1,390 additions and 0 deletions.
diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -9,6 +9,7 @@
 | [`ner_conll03`](ner_conll03) | Named Entity Recognition (CoNLL-2003) |
 | [`ner_embeddings`](ner_embeddings) | Comparing embedding layers in spaCy |
 | [`parsing_penn_treebank`](parsing_penn_treebank) | Dependency Parsing (Penn Treebank) |
+| [`pretraining_morphologizer_oscar`](pretraining_morphologizer_oscar) | Pretraining Morphologizer |
 | [`span-labeling-datasets`](span-labeling-datasets) | Span labeling datasets |
 | [`speed`](speed) | Project for speed benchmarking of various pretrained models of different NLP libraries. |
 | [`textcat_architectures`](textcat_architectures) | Textcat performance benchmarks |

diff --git a/benchmarks/pretraining_morphologizer_oscar/.gitignore b/benchmarks/pretraining_morphologizer_oscar/.gitignore
@@ -0,0 +1,7 @@
+assets
+corpus
+data
+training
+pretraining
+metrics
+working_env
diff --git a/benchmarks/pretraining_morphologizer_oscar/README.md b/benchmarks/pretraining_morphologizer_oscar/README.md
@@ -0,0 +1,69 @@
+<!-- SPACY PROJECT: AUTO-GENERATED DOCS START (do not remove) -->
+
+# 🪐 spaCy Project: Enhancing Morphological Analysis with spaCy Pretraining
+
+This project explores how effective pretraining is for morphological analysis (the `morphologizer` component) by running experiments on multiple languages. The goal is to show how pretraining the tok2vec layer on domain-specific data affects morphologizer performance. We use the OSCAR dataset to pretrain the tok2vec layer and the UD treebanks to train a morphologizer component. We then compare the different pretraining techniques with each other and with models trained without any pretraining.
+
+## 📋 project.yml
+
+The [`project.yml`](project.yml) defines the data assets required by the
+project, as well as the available commands and workflows. For details, see the
+[spaCy projects documentation](https://spacy.io/usage/projects).
+
+### ⏯ Commands
+
+The following commands are defined by the project. They
+can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run).
+Commands are only re-run if their inputs have changed.
+
+| Command | Description |
+| --- | --- |
+| `install_requirements` | Download and install all requirements |
+| `download_oscar` | Download a subset of the OSCAR dataset |
+| `download_model` | Download the specified spaCy model for vector-objective pretraining |
+| `extract_ud` | Extract the UD treebanks data |
+| `convert_ud` | Convert the UD treebanks data to spaCy's format |
+| `train` | Train a morphologizer component without pretrained weights or static vectors |
+| `evaluate` | Evaluate the morphologizer component trained without pretrained weights or static vectors |
+| `train_static` | Train a morphologizer component with static vectors from a pretrained model |
+| `evaluate_static` | Evaluate the trained morphologizer component with static vectors |
+| `pretrain_char` | Pretrain a tok2vec component with the character objective |
+| `train_char` | Train a morphologizer component with pretrained weights (character objective) |
+| `evaluate_char` | Evaluate the trained morphologizer component with pretrained weights (character objective) |
+| `pretrain_vector` | Pretrain a tok2vec component with the vector objective |
+| `train_vector` | Train a morphologizer component with pretrained weights (vector objective) |
+| `evaluate_vector` | Evaluate the trained morphologizer component with pretrained weights (vector objective) |
+| `train_trf` | Train a morphologizer component with transformer embeddings |
+| `evaluate_trf` | Evaluate the trained morphologizer component with transformer embeddings |
+| `evaluate_metrics` | Evaluate all experiments and create a summary json file |
+| `reset_project` | Reset the project to its original state and delete all training progress |
+| `reset_training` | Reset the training progress |
+| `reset_metrics` | Delete the metrics folder |
+
+### ⏭ Workflows
+
+The following workflows are defined by the project. They
+can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run)
+and will run the specified commands in order. Commands are only re-run if their
+inputs have changed.
+
+| Workflow | Steps |
+| --- | --- |
+| `data` | `download_oscar` &rarr; `download_model` &rarr; `extract_ud` &rarr; `convert_ud` |
+| `training` | `train` &rarr; `evaluate` |
+| `training_static` | `train_static` &rarr; `evaluate_static` |
+| `training_char` | `pretrain_char` &rarr; `train_char` &rarr; `evaluate_char` |
+| `training_vector` | `pretrain_vector` &rarr; `train_vector` &rarr; `evaluate_vector` |
+| `training_trf` | `train_trf` &rarr; `evaluate_trf` |
+
+### 🗂 Assets
+
+The following assets are defined by the project. They can
+be fetched by running [`spacy project assets`](https://spacy.io/api/cli#project-assets)
+in the project directory.
+
+| File | Source | Description |
+| --- | --- | --- |
+| `assets/ud-treebanks-v2.5.tgz` | URL |  |
+
+<!-- SPACY PROJECT: AUTO-GENERATED DOCS END (do not remove) -->
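
The workflows listed in the README above can be chained from a small driver script. The following is a minimal sketch, not part of the commit: it assumes spaCy is installed in the current interpreter and that the workflows should run in the order of the README table. The helper names `build_commands` and `run_all` are hypothetical. The `install_requirements` command is not part of any workflow, so it would need to be run separately first.

```python
import subprocess
import sys

# Workflow names copied from the README table; the order is an assumption.
WORKFLOWS = [
    "data",
    "training",
    "training_static",
    "training_char",
    "training_vector",
    "training_trf",
]


def build_commands(workflows, project_dir="."):
    """Build one `spacy project run <name> <project_dir>` command per workflow."""
    return [
        [sys.executable, "-m", "spacy", "project", "run", name, project_dir]
        for name in workflows
    ]


def run_all(workflows, project_dir=".", dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in build_commands(workflows, project_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the chain at the first failing workflow.
            subprocess.run(cmd, check=True)
```

Calling `run_all(WORKFLOWS)` only prints the commands; pass `dry_run=False` to execute them. Because spaCy skips commands whose inputs have not changed, re-running the chain after a partial failure should resume from the first stale step.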
diff --git a/benchmarks/pretraining_morphologizer_oscar/configs/config.cfg b/benchmarks/pretraining_morphologizer_oscar/configs/config.cfg
@@ -0,0 +1,127 @@
+[paths]
+train = null
+dev = null
+vectors = null
+init_tok2vec = null
+log_file = null
+raw_text = null
+
+[system]
+gpu_allocator = "pytorch"
+seed = 0
+
+[nlp]
+lang = "en"
+pipeline = ["morphologizer"]
+batch_size = 64
+disabled = []
+before_creation = null
+after_creation = null
+after_pipeline_creation = null
+tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
+
+[components]
+
+[components.morphologizer]
+factory = "morphologizer"
+overwrite = false
+scorer = {"@scorers":"spacy.morphologizer_scorer.v1"}
+
+[components.morphologizer.model]
+@architectures = "spacy.Tagger.v1"
+nO = null
+
+[components.morphologizer.model.tok2vec]
+@architectures = "spacy.Tok2Vec.v2"
+
+[components.morphologizer.model.tok2vec.embed]
+@architectures = "spacy.MultiHashEmbed.v2"
+width = ${components.morphologizer.model.tok2vec.encode.width}
+attrs = ["ORTH", "SHAPE"]
+rows = [5000, 2500]
+include_static_vectors = false
+
+[components.morphologizer.model.tok2vec.encode]
+@architectures = "spacy.MaxoutWindowEncoder.v2"
+width = 256
+depth = 8
+window_size = 1
+maxout_pieces = 3
+
+[corpora]
+
+[corpora.dev]
+@readers = "spacy.Corpus.v1"
+path = ${paths.dev}
+max_length = 0
+gold_preproc = false
+limit = 0
+augmenter = null
+
+[corpora.train]
+@readers = "spacy.Corpus.v1"
+path = ${paths.train}
+max_length = 0
+gold_preproc = false
+limit = 0
+augmenter = null
+
+[training]
+train_corpus = "corpora.train"
+dev_corpus = "corpora.dev"
+seed = ${system:seed}
+gpu_allocator = ${system:gpu_allocator}
+dropout = 0.1
+accumulate_gradient = 3
+patience = 2500
+max_epochs = 0
+max_steps = 20000
+eval_frequency = 250
+frozen_components = []
+before_to_disk = null
+annotating_components = []
+
+[training.batcher]
+@batchers = "spacy.batch_by_padded.v1"
+discard_oversize = true
+get_length = null
+size = 2000
+buffer = 256
+
+[training.logger]
+@loggers = "spacy.ConsoleLogger.v2"
+progress_bar = true
+output_file = ${paths.log_file}
+
+
+[training.optimizer]
+@optimizers = "Adam.v1"
+beta1 = 0.9
+beta2 = 0.999
+L2_is_weight_decay = true
+L2 = 0.01
+grad_clip = 1.0
+use_averages = true
+eps = 0.00000001
+
+[training.optimizer.learn_rate]
+@schedules = "warmup_linear.v1"
+warmup_steps = 250
+total_steps = 20000
+initial_rate = 0.00005
+
+[training.score_weights]
+
+[pretraining]
+
+[initialize]
+vectors = ${paths.vectors}
+init_tok2vec = ${paths.init_tok2vec}
+vocab_data = null
+lookups = null
+before_init = null
+after_init = null
+
+[initialize.components]
+
+[initialize.tokenizer]
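
To make the `[training.optimizer.learn_rate]` values above concrete, here is a minimal sketch of a linear warmup followed by a linear decay. It assumes the rate ramps from 0 to `initial_rate` over `warmup_steps` and then falls linearly to 0 at `total_steps`; the function `warmup_linear_rate` is a hypothetical helper written for illustration and is not copied from Thinc.

```python
def warmup_linear_rate(step, initial_rate=0.00005, warmup_steps=250, total_steps=20000):
    """Return the learning rate at a given step (defaults taken from config.cfg)."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to initial_rate.
        factor = step / max(1, warmup_steps)
    else:
        # Decay phase: scale linearly from initial_rate down to 0 at total_steps.
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return initial_rate * factor


if __name__ == "__main__":
    for step in (0, 125, 250, 10125, 20000):
        print(step, warmup_linear_rate(step))
```

Under these assumptions the rate peaks at step 250 and reaches zero at step 20000, which matches `max_steps = 20000` in the `[training]` block.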
diff --git a/benchmarks/pretraining_morphologizer_oscar/configs/config_pretrain_char.cfg b/benchmarks/pretraining_morphologizer_oscar/configs/config_pretrain_char.cfg
@@ -0,0 +1,165 @@
+[paths]
+train = null
+dev = null
+vectors = null
+init_tok2vec = null
+log_file = null
+raw_text = null
+
+[system]
+gpu_allocator = "pytorch"
+seed = 0
+
+[nlp]
+lang = "en"
+pipeline = ["morphologizer"]
+batch_size = 64
+disabled = []
+before_creation = null
+after_creation = null
+after_pipeline_creation = null
+tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
+
+[components]
+
+[components.morphologizer]
+factory = "morphologizer"
+overwrite = false
+scorer = {"@scorers":"spacy.morphologizer_scorer.v1"}
+
+[components.morphologizer.model]
+@architectures = "spacy.Tagger.v1"
+nO = null
+
+[components.morphologizer.model.tok2vec]
+@architectures = "spacy.Tok2Vec.v2"
+
+[components.morphologizer.model.tok2vec.embed]
+@architectures = "spacy.MultiHashEmbed.v2"
+width = ${components.morphologizer.model.tok2vec.encode.width}
+attrs = ["ORTH", "SHAPE"]
+rows = [5000, 2500]
+include_static_vectors = true
+
+[components.morphologizer.model.tok2vec.encode]
+@architectures = "spacy.MaxoutWindowEncoder.v2"
+width = 256
+depth = 8
+window_size = 1
+maxout_pieces = 3
+
+[corpora]
+
+[corpora.dev]
+@readers = "spacy.Corpus.v1"
+path = ${paths.dev}
+max_length = 0
+gold_preproc = false
+limit = 0
+augmenter = null
+
+[corpora.train]
+@readers = "spacy.Corpus.v1"
+path = ${paths.train}
+max_length = 0
+gold_preproc = false
+limit = 0
+augmenter = null
+
+[corpora.pretrain]
+@readers = "spacy.JsonlCorpus.v1"
+path = ${paths.raw_text}
+min_length = 5
+max_length = 500
+limit = 0
+
+[training]
+train_corpus = "corpora.train"
+dev_corpus = "corpora.dev"
+seed = ${system:seed}
+gpu_allocator = ${system:gpu_allocator}
+dropout = 0.1
+accumulate_gradient = 3
+patience = 2500
+max_epochs = 0
+max_steps = 20000
+eval_frequency = 500
+frozen_components = []
+before_to_disk = null
+annotating_components = []
+
+[training.batcher]
+@batchers = "spacy.batch_by_padded.v1"
+discard_oversize = true
+get_length = null
+size = 2000
+buffer = 256
+
+[training.logger]
+@loggers = "spacy.ConsoleLogger.v2"
+progress_bar = true
+output_file = ${paths.log_file}
+
+
+[training.optimizer]
+@optimizers = "Adam.v1"
+beta1 = 0.9
+beta2 = 0.999
+L2_is_weight_decay = true
+L2 = 0.01
+grad_clip = 1.0
+use_averages = true
+eps = 0.00000001
+
+[training.optimizer.learn_rate]
+@schedules = "warmup_linear.v1"
+warmup_steps = 250
+total_steps = 20000
+initial_rate = 0.00005
+
+[training.score_weights]
+
+[pretraining]
+max_epochs = 1000
+dropout = 0.2
+n_save_every = 0
+n_save_epoch = 1
+component = "morphologizer"
+layer = "tok2vec"
+corpus = "corpora.pretrain"
+
+[pretraining.batcher]
+@batchers = "spacy.batch_by_words.v1"
+size = 3000
+discard_oversize = false
+tolerance = 0.2
+get_length = null
+
+[pretraining.objective]
+@architectures = "spacy.PretrainCharacters.v1"
+maxout_pieces = 3
+hidden_size = 300
+n_characters = 4
+
+[pretraining.optimizer]
+@optimizers = "Adam.v1"
+beta1 = 0.9
+beta2 = 0.999
+L2_is_weight_decay = true
+L2 = 0.01
+grad_clip = 1.0
+use_averages = true
+eps = 1e-8
+learn_rate = 0.001
+
+[initialize]
+vectors = ${paths.vectors}
+init_tok2vec = ${paths.init_tok2vec}
+vocab_data = null
+lookups = null
+before_init = null
+after_init = null
+
+[initialize.components]
+
+[initialize.tokenizer]
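
The `[corpora.pretrain]` block above reads raw text through `spacy.JsonlCorpus.v1`, which expects newline-delimited JSON with one record per line and the text under a `text` key. The following is a minimal sketch of producing such a file from plain strings. The helper `write_raw_text` and the output path are hypothetical and not part of this commit; the project's own `download_oscar` command may prepare the data differently.

```python
import json
from pathlib import Path


def write_raw_text(texts, path):
    """Write non-empty texts as JSONL records of the form {"text": ...}."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf8") as f:
        for text in texts:
            # Collapse whitespace so each record stays on a single line.
            text = " ".join(text.split())
            if not text:
                continue
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical output location; pass it to the config as paths.raw_text.
    n = write_raw_text(["A first sentence.", "", "A second\nsentence."], "data/raw_sample.jsonl")
    print(f"wrote {n} records")
```

The reader's `min_length = 5` and `max_length = 500` settings filter documents by token count when the corpus is loaded, so no length filtering is done here.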