
Enhancing Morphological Analysis with spaCy Pretraining #188

Merged

Conversation

thomashacker
Contributor

Description

This project explores the effectiveness of pretraining techniques for morphological analysis (the morphologizer) by conducting experiments on multiple languages. The objective is to demonstrate the benefits that pretraining word vectors on domain-specific data brings to morphological analysis performance. We leverage the OSCAR dataset to pretrain vectors for tok2vec and use the UD_Treebanks dataset to train a morphologizer component. We evaluate and compare the performance of different pretraining techniques against models without any pretraining.

Types of change

New project

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • I ran the update scripts in the .github folder, and all the configs and docs are up-to-date.

@thomashacker thomashacker added the enhancement New feature or request label Mar 15, 2023
@thomashacker
Contributor Author

thomashacker commented Mar 15, 2023

This project was finished some time ago, but we wanted to publish it with a post about pretraining. However, we never had time to do the post due to other priorities. It would make more sense to publish the project first and work on the blog post later.

For the experiments, I think it would be good to set up some GPU VMs to get some good statistics on how much pretraining can boost performance. My local experiments already indicated such a boost.

@adrianeboyd
Contributor

Very nice! I'm testing it out a bit while looking at supporting floret vectors.

Because I was mainly looking at vectors, I was expecting to see a comparison for vectors vs. pretraining+vectors, but training with just vectors doesn't seem to be included?

@adrianeboyd
Contributor

Just a note that the pretraining+training workflows could use a few config adjustments. I think having the number of pretraining epochs for each type of pretraining as a setting would make it possible to run the pretraining+training workflows without having to specify the pretraining model name. (It's possible there's some annoying off-by-1 math/naming involved, or maybe you could add a script that picks the largest filename from the directory for the config override.)

@adrianeboyd
Contributor

There is annoying off-by-one naming for sure, argh.

@thomashacker
Contributor Author

I've added a workflow for training the morphologizer with only static vectors, which makes five different combinations (for each of the three languages):

  • No pretraining, No static vectors
  • Only static vectors
  • Character Objective Pretraining
  • Vector Objective Pretraining
  • Transformer
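
For reference, the character and vector objectives correspond to different `[pretraining.objective]` blocks in the spaCy config. A sketch along these lines (the parameter values shown are illustrative defaults, not necessarily this project's exact settings):

```ini
# Character objective: predict surrounding characters of each token
[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4

# Vector objective: predict the token's static vector (requires vectors)
# [pretraining.objective]
# @architectures = "spacy.PretrainVectors.v1"
# maxout_pieces = 3
# hidden_size = 300
# loss = "cosine"
```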

I've run all experiments for English on my GPU. Here are the results for future reference:

| Label | no_pretraining | static | character_objective | vector_objective | transformer |
| --- | --- | --- | --- | --- | --- |
| pos_acc | 0.90 (0.00) | 0.94 (0.04) | 0.90 (0.00) | 0.94 (0.04) | 0.97 (0.07) |
| morph_micro_p | 0.92 (0.00) | 0.96 (0.04) | 0.92 (0.00) | 0.96 (0.04) | 0.98 (0.06) |
| morph_micro_f | 0.91 (0.00) | 0.95 (0.04) | 0.91 (-0.00) | 0.95 (0.04) | 0.98 (0.06) |
| morph_micro_r | 0.90 (0.00) | 0.95 (0.04) | 0.90 (-0.00) | 0.95 (0.04) | 0.97 (0.07) |
| PronType | 0.98 (0.00) | 0.99 (0.00) | 0.98 (0.00) | 0.99 (0.00) | 0.99 (0.01) |
| Number | 0.89 (0.00) | 0.95 (0.06) | 0.89 (-0.00) | 0.96 (0.06) | 0.98 (0.09) |
| Mood | 0.90 (0.00) | 0.94 (0.04) | 0.90 (0.00) | 0.93 (0.04) | 0.97 (0.08) |
| Tense | 0.88 (0.00) | 0.95 (0.07) | 0.88 (0.01) | 0.95 (0.07) | 0.97 (0.10) |
| VerbForm | 0.88 (0.00) | 0.93 (0.05) | 0.88 (0.00) | 0.93 (0.05) | 0.97 (0.09) |
| Gender | 0.98 (0.00) | 0.99 (0.00) | 0.98 (0.00) | 0.98 (0.00) | 0.98 (0.00) |
| Person | 0.97 (0.00) | 0.98 (0.01) | 0.97 (0.00) | 0.98 (0.01) | 0.99 (0.02) |
| Poss | 0.99 (0.00) | 1.00 (0.00) | 0.99 (0.00) | 1.00 (0.00) | 1.00 (0.00) |
| Definite | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) |
| Degree | 0.81 (0.00) | 0.92 (0.11) | 0.81 (-0.00) | 0.91 (0.10) | 0.96 (0.15) |
| Case | 0.96 (0.00) | 0.96 (0.00) | 0.96 (-0.00) | 0.96 (0.01) | 0.97 (0.01) |
| NumType | 0.90 (0.00) | 0.90 (-0.00) | 0.89 (-0.01) | 0.90 (-0.00) | 0.94 (0.04) |
| Voice | 0.64 (0.00) | 0.70 (0.05) | 0.62 (-0.03) | 0.70 (0.05) | 0.74 (0.10) |
| Typo | 0.00 (0.00) | 0.05 (0.05) | 0.00 (0.00) | 0.05 (0.05) | 0.34 (0.34) |
| Abbr | 0.13 (0.00) | 0.00 (-0.13) | 0.13 (0.00) | 0.35 (0.22) | 0.57 (0.44) |
| Foreign | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) |
| Reflex | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) |
| speed | 38805.62 (0.00) | 40341.50 (1535.88) | 43037.18 (4231.56) | 40930.17 (2124.55) | 6519.16 (-32286.46) |

*(figure: UD_English-EWT training graph)*

@thomashacker
Contributor Author

Additionally, I've fixed some bugs and adjusted the config settings. However, as you mentioned, there's still a problem with the pretrained weights variable in the project.yml. I've written a script that retrieves the model filename with the highest number and writes it to the variable; however, when the modified project.yml is saved, all formatting is lost. There are possible solutions such as ruamel.yaml, but I'm not sure we should pursue this any further. I'm also not sure whether it's possible to do any mathematical operations inside a .yaml file to fix the off-by-one naming.
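
The filename-picking part can be sketched with a small stdlib-only helper (the function name is hypothetical; the formatting-loss problem when rewriting project.yml is a separate issue that would still need something like ruamel.yaml's round-trip mode):

```python
import re
from pathlib import Path

def latest_pretrained_model(pretrain_dir):
    """Return the pretrained weights file with the highest epoch number,
    e.g. model9.bin when the directory holds model0.bin ... model9.bin.
    Sorts numerically, so model10.bin beats model9.bin."""
    candidates = []
    for path in Path(pretrain_dir).glob("model*.bin"):
        match = re.fullmatch(r"model(\d+)\.bin", path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        raise FileNotFoundError(f"no model*.bin files in {pretrain_dir}")
    return max(candidates)[1]
```

The result could then be passed as a config override (e.g. `--paths.init_tok2vec`) instead of being written back into project.yml, which sidesteps the formatting problem entirely.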

@adrianeboyd
Contributor

Interesting results! This makes it look like the pretraining doesn't help much, but English morphology is kind of boring, so I think we need to expand the evaluation to other languages/tasks before coming to much of a conclusion. In most of my quick (not careful) tests with floret vectors there did seem to be more differences in the final performance. We'd probably also want to test with vectors that are not trained on the same texts as are used for the pretraining.

@thomashacker
Contributor Author

I agree; more experiments in the future will be helpful for seeing how much pretraining actually influences training. However, I think this project is still a great reference for users who want to see how to get the different pretraining objectives working, with the workflows as a good starting point. I hope the explosion/spaCy#12459 PR will make using the project a bit easier by removing the need to manually adjust the model.bin name in the project.yml.

@adrianeboyd
Contributor

This needs to be updated to require a newer version of spaCy that has the model-last.bin option for `spacy pretrain`.
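
With model-last.bin, the project.yml variable can point at a stable filename instead of an epoch-numbered one. A minimal sketch of what that could look like (variable name hypothetical):

```yaml
vars:
  # model-last.bin is always the final epoch, so no off-by-one
  # filename math is needed in the training workflow.
  pretrained_weights: "pretraining/model-last.bin"
```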

@adrianeboyd adrianeboyd force-pushed the feature/pretraining_oscar_morphologizer branch from a3f0d29 to c25ba33 Compare July 31, 2023 09:34
@adrianeboyd
Contributor

Thanks again for your work @thomashacker!

@adrianeboyd adrianeboyd merged commit 393e79f into explosion:v3 Jul 31, 2023
2 checks passed