Enhancing Morphological Analysis with spaCy Pretraining #188
Conversation
This project was finished some time ago, but we wanted to publish it with a post about pretraining. However, we never had time to do the post due to other priorities. It would make more sense to publish the project first and work on the blog post later. For the experiments, I think it would be good to set up some GPU VMs to get some good statistics on how much pretraining can boost performance. My local experiments already indicated such a boost.
Very nice! I'm testing it out a bit while looking into supporting floret vectors. Since I was mainly looking at vectors, I was expecting to see a comparison of vectors vs. pretraining+vectors, but training with just vectors doesn't seem to be included?
Just a note that the pretraining+training workflows could use a few config adjustments. I think having the number of pretraining epochs for each type of pretraining as a setting would make it possible to run the pretraining+training workflows without having to specify the pretraining model name. (It's possible there's some annoying off-by-1 math/naming involved, or maybe you could add a script that picks the largest filename from the directory for the config override.)
There is annoying off-by-one naming for sure, argh.
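As a minimal sketch of the "pick the largest filename" idea mentioned above (the helper name `latest_checkpoint` is hypothetical; it assumes `spacy pretrain` wrote per-epoch weights into the output directory as `model0.bin`, `model1.bin`, …, which is its default naming scheme):

```python
import re
from pathlib import Path


def latest_checkpoint(pretrain_dir: str) -> Path:
    """Return the pretraining checkpoint with the highest epoch number.

    Assumes `spacy pretrain` wrote per-epoch weights into `pretrain_dir`
    as model0.bin, model1.bin, ... (its default naming scheme).
    """
    best = None
    for path in Path(pretrain_dir).glob("model*.bin"):
        match = re.fullmatch(r"model(\d+)\.bin", path.name)
        if match:
            epoch = int(match.group(1))
            if best is None or epoch > best[0]:
                best = (epoch, path)
    if best is None:
        raise FileNotFoundError(f"no model*.bin checkpoints in {pretrain_dir}")
    return best[1]
```

The returned path could then be passed as a config override when launching training (e.g. `--paths.init_tok2vec`, in the default config layout where `[initialize] init_tok2vec` reads from `${paths.init_tok2vec}`), which sidesteps the off-by-one naming entirely.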
I've added a workflow for training the morphologizer with only static vectors, which brings the total to five different training combinations for each of the three languages.
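For anyone following along, here's a minimal sketch of how one of these workflow combinations would be launched from the project root; the workflow name `train-vectors-only` is hypothetical, and the real names are defined in the project's `project.yml`:

```python
import subprocess

# Run a single spaCy project workflow from the project root.
# "train-vectors-only" is a hypothetical name; check project.yml for
# the actual workflow names covering the five combinations.
subprocess.run(
    ["python", "-m", "spacy", "project", "run", "train-vectors-only", "."],
    check=True,
)
```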
I ran all of the experiments for English on my GPU and recorded the results for future reference.
Additionally, I've fixed some bugs and adjusted the config settings. However, as you mentioned, there's still a problem with the pretrained weights variable in the config.
Interesting results! This makes it look like the pretraining doesn't help much, but English morphology is kind of boring, so I think we need to expand the evaluation to other languages/tasks before coming to much of a conclusion. In most of my quick (not careful) tests with floret vectors there did seem to be more differences in the final performance. We'd probably also want to test with vectors that are not trained on the same texts as are used for the pretraining.
I agree; more experiments will be needed to see how much pretraining actually influences training. However, I think this project is still a great reference for users who want to get the different pretraining objectives working, with the workflows being a good starting point. I hope explosion/spaCy#12459 helps with that.
This needs to be updated to require a newer version of spaCy with the fix.
Thanks again for your work @thomashacker!
Description
This project explores the effectiveness of pretraining techniques on morphological analysis (morphologizer) by conducting experiments on multiple languages. The objective is to demonstrate the benefits of pretraining the tok2vec weights on domain-specific data for morphological analysis performance. We leverage the OSCAR dataset to pretrain the tok2vec layer and utilize the UD Treebanks dataset to train a morphologizer component. We then evaluate and compare the performance of the different pretraining techniques against models without any pretraining.
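As a rough, illustrative sketch of that pipeline (not the project's actual commands: the config and data paths below are assumptions, and it presumes the default config layout where the pretraining corpus reads from `${paths.raw_text}` and `[initialize] init_tok2vec` reads from `${paths.init_tok2vec}`):

```python
import subprocess

# 1. Pretrain tok2vec weights on raw OSCAR text (paths illustrative).
subprocess.run(
    [
        "python", "-m", "spacy", "pretrain",
        "configs/config.cfg", "pretrain_output",
        "--paths.raw_text", "assets/oscar_sample.jsonl",
    ],
    check=True,
)

# 2. Train the morphologizer on UD treebank data, initializing the
#    tok2vec layer from a pretraining checkpoint (filename illustrative).
subprocess.run(
    [
        "python", "-m", "spacy", "train",
        "configs/config.cfg",
        "--output", "training/pretrained",
        "--paths.train", "corpus/train.spacy",
        "--paths.dev", "corpus/dev.spacy",
        "--paths.init_tok2vec", "pretrain_output/model99.bin",
    ],
    check=True,
)
```

Running the same `spacy train` step without the `--paths.init_tok2vec` override gives the no-pretraining baseline for comparison.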
Types of change
New project
Checklist
I ran the `update` scripts in the `.github` folder, and all the configs and docs are up-to-date.