
Enhancing Morphological Analysis with spaCy Pretraining #188

Merged

Conversation

thomashacker
Contributor

Description

This project explores the effectiveness of pretraining techniques for morphological analysis (the morphologizer) by conducting experiments on multiple languages. The objective is to demonstrate the benefits that pretraining word vectors on domain-specific data brings to morphological analysis performance. We leverage the OSCAR dataset to pretrain vectors for tok2vec and use the UD_Treebanks dataset to train a morphologizer component. We evaluate and compare the performance of different pretraining techniques against models without any pretraining.

Types of change

New project

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • I ran the update scripts in the .github folder, and all the configs and docs are up-to-date.

@thomashacker thomashacker added the enhancement New feature or request label Mar 15, 2023
@thomashacker
Contributor Author

thomashacker commented Mar 15, 2023

This project was finished some time ago, but we wanted to publish it with a post about pretraining. However, we never had time to do the post due to other priorities. It would make more sense to publish the project first and work on the blog post later.

For the experiments, I think it would be good to set up some GPU VMs to get some good statistics on how much pretraining can boost performance. My local experiments already indicated such a boost.

@adrianeboyd
Contributor

Very nice! I'm testing it out a bit while looking at supporting floret vectors.

Because I was mainly looking at vectors, I was expecting to see a comparison for vectors vs. pretraining+vectors, but training with just vectors doesn't seem to be included?

@adrianeboyd
Contributor

Just a note that the pretraining+training workflows could use a few config adjustments. I think having the number of pretraining epochs for each type of pretraining as a setting would make it possible to run the pretraining+training workflows without having to specify the pretraining model name. (It's possible there's some annoying off-by-1 math/naming involved, or maybe you could add a script that picks the largest filename from the directory for the config override.)

@adrianeboyd
Contributor

There is annoying off-by-one naming for sure, argh.

@thomashacker
Contributor Author

I've added a workflow for training the morphologizer with only static vectors, which makes five different combinations (for each of the three languages):

  • No pretraining, No static vectors
  • Only static vectors
  • Character Objective Pretraining
  • Vector Objective Pretraining
  • Transformer
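
For reference, the character and vector objectives correspond to different `[pretraining.objective]` blocks in the spaCy config. A sketch along these lines (the parameter values shown are illustrative defaults, not necessarily this project's exact settings):

```ini
# Character objective: predict surrounding characters of each token
[pretraining.objective]
@architectures = "spacy.PretrainCharacters.v1"
maxout_pieces = 3
hidden_size = 300
n_characters = 4

# Vector objective: predict the token's static vector (requires vectors)
# [pretraining.objective]
# @architectures = "spacy.PretrainVectors.v1"
# maxout_pieces = 3
# hidden_size = 300
# loss = "cosine"
```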

I've run all experiments for English on my GPU. Here are the results for future reference:

| Label | no_pretraining | static | character_objective | vector_objective | transformer |
| --- | --- | --- | --- | --- | --- |
| pos_acc | 0.90 (0.00) | 0.94 (0.04) | 0.90 (0.00) | 0.94 (0.04) | 0.97 (0.07) |
| morph_micro_p | 0.92 (0.00) | 0.96 (0.04) | 0.92 (0.00) | 0.96 (0.04) | 0.98 (0.06) |
| morph_micro_f | 0.91 (0.00) | 0.95 (0.04) | 0.91 (-0.00) | 0.95 (0.04) | 0.98 (0.06) |
| morph_micro_r | 0.90 (0.00) | 0.95 (0.04) | 0.90 (-0.00) | 0.95 (0.04) | 0.97 (0.07) |
| PronType | 0.98 (0.00) | 0.99 (0.00) | 0.98 (0.00) | 0.99 (0.00) | 0.99 (0.01) |
| Number | 0.89 (0.00) | 0.95 (0.06) | 0.89 (-0.00) | 0.96 (0.06) | 0.98 (0.09) |
| Mood | 0.90 (0.00) | 0.94 (0.04) | 0.90 (0.00) | 0.93 (0.04) | 0.97 (0.08) |
| Tense | 0.88 (0.00) | 0.95 (0.07) | 0.88 (0.01) | 0.95 (0.07) | 0.97 (0.10) |
| VerbForm | 0.88 (0.00) | 0.93 (0.05) | 0.88 (0.00) | 0.93 (0.05) | 0.97 (0.09) |
| Gender | 0.98 (0.00) | 0.99 (0.00) | 0.98 (0.00) | 0.98 (0.00) | 0.98 (0.00) |
| Person | 0.97 (0.00) | 0.98 (0.01) | 0.97 (0.00) | 0.98 (0.01) | 0.99 (0.02) |
| Poss | 0.99 (0.00) | 1.00 (0.00) | 0.99 (0.00) | 1.00 (0.00) | 1.00 (0.00) |
| Definite | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) |
| Degree | 0.81 (0.00) | 0.92 (0.11) | 0.81 (-0.00) | 0.91 (0.10) | 0.96 (0.15) |
| Case | 0.96 (0.00) | 0.96 (0.00) | 0.96 (-0.00) | 0.96 (0.01) | 0.97 (0.01) |
| NumType | 0.90 (0.00) | 0.90 (-0.00) | 0.89 (-0.01) | 0.90 (-0.00) | 0.94 (0.04) |
| Voice | 0.64 (0.00) | 0.70 (0.05) | 0.62 (-0.03) | 0.70 (0.05) | 0.74 (0.10) |
| Typo | 0.00 (0.00) | 0.05 (0.05) | 0.00 (0.00) | 0.05 (0.05) | 0.34 (0.34) |
| Abbr | 0.13 (0.00) | 0.00 (-0.13) | 0.13 (0.00) | 0.35 (0.22) | 0.57 (0.44) |
| Foreign | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) |
| Reflex | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) |
| speed | 38805.62 (0.00) | 40341.50 (1535.88) | 43037.18 (4231.56) | 40930.17 (2124.55) | 6519.16 (-32286.46) |

*(figure: UD_English-EWT training graph)*

@thomashacker
Contributor Author

Additionally, I've fixed some bugs and adjusted the config settings. However, as you mentioned, there's still a problem with the pretrained weights variable in the project.yml. I've written a script that retrieves the model filename with the highest number and writes it to the variable; however, when the modified project.yml is saved, all formatting is lost. There are possible solutions such as ruamel.yaml, but I'm not sure we should pursue this any further. I'm also not sure whether it's possible to do any mathematical operations inside a .yaml file to fix the off-by-one naming.
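
The filename-picking part can be sketched with a small stdlib-only helper (the function name is hypothetical; the formatting-loss problem when rewriting project.yml is a separate issue that would still need something like ruamel.yaml's round-trip mode):

```python
import re
from pathlib import Path

def latest_pretrained_model(pretrain_dir):
    """Return the pretrained weights file with the highest epoch number,
    e.g. model9.bin when the directory holds model0.bin ... model9.bin.
    Sorts numerically, so model10.bin beats model9.bin."""
    candidates = []
    for path in Path(pretrain_dir).glob("model*.bin"):
        match = re.fullmatch(r"model(\d+)\.bin", path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        raise FileNotFoundError(f"no model*.bin files in {pretrain_dir}")
    return max(candidates)[1]
```

The result could then be passed as a config override (e.g. `--paths.init_tok2vec`) instead of being written back into project.yml, which sidesteps the formatting problem entirely.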

@adrianeboyd
Contributor

Interesting results! This makes it look like the pretraining doesn't help much, but English morphology is kind of boring, so I think we need to expand the evaluation to other languages/tasks before coming to much of a conclusion. In most of my quick (not careful) tests with floret vectors there did seem to be more differences in the final performance. We'd probably also want to test with vectors that are not trained on the same texts as are used for the pretraining.

@thomashacker
Contributor Author

I agree; more experiments in the future will be helpful for seeing how much pretraining actually influences training. However, I think this project is still a great reference for users who want to see how to get the different pretraining objectives working, with the workflows as a good starting point. I hope the explosion/spaCy#12459 PR will make using the project a bit easier by removing the need to manually adjust the model.bin name in the project.yml.

@adrianeboyd
Contributor

This needs to be updated to require a newer version of spaCy that has the model-last.bin option for `spacy pretrain`.
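
With model-last.bin, the project.yml variable can point at a stable filename instead of an epoch-numbered one. A minimal sketch of what that could look like (variable name hypothetical):

```yaml
vars:
  # model-last.bin is always the final epoch, so no off-by-one
  # filename math is needed in the training workflow.
  pretrained_weights: "pretraining/model-last.bin"
```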

@adrianeboyd adrianeboyd force-pushed the feature/pretraining_oscar_morphologizer branch from a3f0d29 to c25ba33 Compare July 31, 2023 09:34
@adrianeboyd
Contributor

Thanks again for your work @thomashacker!

@adrianeboyd adrianeboyd merged commit 393e79f into explosion:v3 Jul 31, 2023
2 checks passed