
Wrapper for Varembed Models #1067

Merged: 23 commits into piskvorky:develop on Feb 6, 2017

Conversation

@anmolgulati (Contributor) commented Dec 30, 2016

This is an effort to integrate the VarEmbed word-embedding model into gensim.
To train VarEmbed models, one can use the code put up by @rguthrie3 here.
For now, the goal is to provide a wrapper for loading VarEmbed models with gensim, not for training the word embeddings.

The first draft is based on the Wordrank wrapper by @parulsethi.

TODOs

  • Add tests for varembed wrapper.
  • Add model files for testing loading of wrapper.
  • Add tutorial in documentation.
  • Add RST files for Sphinx APIREF autogeneration.

@anmolgulati (Contributor, Author):

I've added a wrapper to load a VarEmbed model into gensim, exposing the KeyedVectors functionality.
Since the wrapper depends on the morfessor package, I've factored that dependency out into a separate method: the morpheme embeddings are ensembled only if the user asks for it and the morfessor package is present on the system.
@tmylk @piskvorky Please review.
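The optional-dependency pattern described above could be sketched roughly as follows. This is a hypothetical simplification, not gensim's actual implementation: `load_varembed`, the file path, and the stand-in vectors are all illustrative.

```python
import numpy as np

def load_varembed(vectors_path, morfessor_model_path=None):
    """Hypothetical sketch: load word embeddings; ensemble morpheme
    embeddings only when a morfessor model is given AND the optional
    morfessor package is importable."""
    # Stand-in for the real vector loading (real code would read vectors_path).
    word_vectors = {"dog": np.ones(3)}
    if morfessor_model_path is not None:
        try:
            import morfessor  # optional dependency, imported only here
        except ImportError:
            raise ImportError("morfessor is required for morpheme ensembling")
        # ... segment words and add morpheme embeddings (omitted) ...
    return word_vectors

vectors = load_varembed("vectors.pkl")  # no morpheme ensembling requested
```

The point of the pattern is that users who never ask for morpheme ensembling never pay for (or need) the morfessor dependency.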

@anmolgulati (Contributor, Author) commented Jan 29, 2017

There seems to be a small issue specifically in the morfessor package. I've submitted a PR (aalto-speech/morfessor#6) to fix it. We'll probably have to wait for it to be merged, or find some other fix to get the code running on Python 2.6. Any suggestions?

@anmolgulati (Contributor, Author) commented Jan 29, 2017

@tmylk Apart from the Python 2.6 issue I've already discussed, the code currently fails on other versions as well, which looks to me like a circular-import problem in morfessor. The tests pass on my personal machine but fail on Travis, which I don't understand. Any ideas what's going wrong?

@tmylk (Contributor) left a comment:

More tests, KeyedVectors and other small improvements needed.

"""
Load the input-hidden weight matrix from the fast text output files.

Note that due to limitations in the FastText API, you cannot continue training
Reviewer (Contributor):

Please correct Docstring to be about varembed

Author reply:

Done.

logger = logging.getLogger(__name__)


class VarEmbed(Word2Vec):
Reviewer (Contributor):

Please subclass KeyedVectors

Author reply:

Done.

# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Automated tests for checking transformation algorithms (the models package).
Reviewer (Contributor):

Please change to VarEmbed

Author reply:

Oh! Had missed this. Thanks. Done.

"""Test ensembling of Morpheme Embeddings"""
model = varembed.VarEmbed.load_varembed_format(
    vectors=varembed_model_vector_file,
    morfessor_model=varembed_model_morfessor_file, use_morphemes=True)
self.model_sanity(model)
Reviewer (Contributor):

Please test that syn0 is different compared to the non-morpheme model.

Author reply:

Yes, added now.
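A minimal version of such a check might look like the following, with toy numpy arrays standing in for the two models' syn0 matrices. This is an illustration of the requested assertion, not the actual test added in the PR.

```python
import numpy as np

# Toy stand-ins for model.syn0 without and with morpheme ensembling.
plain_syn0 = np.zeros((4, 5))
ensembled_syn0 = plain_syn0 + 0.5  # ensembling shifts the word vectors

# The ensembled matrix should differ from the plain one.
assert not np.allclose(plain_syn0, ensembled_syn0)
```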

if use_morphemes:
    try:
        import morfessor
        morfessor_model = morfessor.MorfessorIO().read_binary_model_file(morfessor_model)
Reviewer (Contributor):

Could you raise an issue in the varembed GitHub repo as a heads-up that read_binary_model_file will be deprecated by morfessor?

Author reply:

Yes, sounds good. I've put up an issue (rguthrie3/MorphologicalPriorsForWordEmbeddings#3) to notify them about the new morfessor release as well.

…support is only provided in python 2.7 and above. Also added more comments
word_embeddings = D['word_embeddings']
morpho_embeddings = D['morpheme_embeddings']
result.load_word_embeddings(word_embeddings, word_to_ix)
if use_morphemes:
Reviewer (Contributor):

just if morfessor_model is enough

Author reply:

Yes, sounds good. Done.

logger.info("Loaded matrix of %d size and %d dimensions", self.vocab_size, self.vector_size)


def ensemble_morpheme_embeddings(self, morfessor_model, morpho_embeddings, morpho_to_ix):
Reviewer (Contributor):

maybe add_morphemes_to_word_embeddings?

Author reply:

Changed method name now.

@anmolgulati force-pushed the varembed-worker branch 2 times, most recently from 4f5e359 to 3095620 on February 2, 2017, 14:35
@tmylk (Contributor) commented Feb 2, 2017

Will there be a tutorial ipynb?

@anmolgulati (Contributor, Author):

Yes, I'll just add one.

@anmolgulati (Contributor, Author) commented Feb 2, 2017

@tmylk I've now added a tutorial on the VarEmbed model as well. Please review it; I think we can merge after that.

@anmolgulati anmolgulati changed the title [WIP] Wrapper for Varembed Models Wrapper for Varembed Models Feb 2, 2017
Test only in Python 2.7 and above. Adding morphemes is not supported in earlier versions.
"""
model = varembed.VarEmbed.load_varembed_format(vectors=varembed_model_vector_file)
model_with_morphemes = varembed.VarEmbed.load_varembed_format(
    vectors=varembed_model_vector_file,
Reviewer (Owner):

Code style: hanging indent please (not vertical indent).

Author reply:

Done.


@unittest.skipUnless(sys.version_info < (2, 7), 'Test to check throwing exception in Python 2.6 and earlier')
def testAddMorphemesThrowsExceptionInPython26(self):
    self.assertRaises(
        Exception, varembed.VarEmbed.load_varembed_format,
        vectors=varembed_model_vector_file,
Reviewer (Owner):

Hanging indent.

Author reply (Feb 5, 2017):

Fixed now.


This module provides the ability to obtain word vectors for out-of-vocabulary words, using the VarEmbed model [2].

The wrapped model can NOT be updated with new documents for online training -- use gensim's `Word2Vec` for that.
Reviewer (Contributor):

You mean that the VarEmbed gensim wrapper doesn't support it? Also, someone might be confused into thinking you are suggesting to load a VarEmbed model and then train it with Word2Vec on new words, which is incorrect.

Author reply:

Oh yes, you are right. It could have been a bit confusing earlier. Updated now.
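The out-of-vocabulary idea the docstring refers to can be illustrated with a toy example: a vector for an unseen word is composed from embeddings of its known morphemes. The data and helper below are hypothetical, for illustration only; this is not gensim's API.

```python
import numpy as np

# Hypothetical morpheme embeddings.
morpheme_embeddings = {
    "walk": np.array([1.0, 0.0]),
    "ed": np.array([0.0, 1.0]),
}

def oov_vector(morphemes):
    """Build a vector for an unseen word by summing its morpheme embeddings."""
    return np.sum([morpheme_embeddings[m] for m in morphemes], axis=0)

vec = oov_vector(["walk", "ed"])  # vector for the unseen word "walked"
```

Because the composition only needs the morphemes to be known, words never seen in training can still receive a vector, which is the property the module docstring advertises.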

@tmylk tmylk merged commit e1c3a0b into piskvorky:develop Feb 6, 2017
@tmylk (Contributor) commented Feb 6, 2017

Thanks for adding the wrapper! It will be part of this week's release.

Let's add a benchmark notebook and a blog post in another PR. We'll publicize it when it's ready.

@anmolgulati (Contributor, Author):

Cool, that's awesome! :D
Yes, sure, I'll open a new PR for the benchmark and blog documentation.
