
Wrapper for Varembed Models #1067

Merged: 23 commits into piskvorky:develop on Feb 6, 2017

Conversation

@anmolgulati (Contributor) commented Dec 30, 2016

This is an effort to integrate the VarEmbed word-embedding model into gensim.
To train VarEmbed models, one can use the code put up by @rguthrie3 here.
For now, the goal is to provide a wrapper for loading VarEmbed models with gensim, not for training the word embeddings.

The first draft is based on the Wordrank wrapper by @parulsethi.

TODOs

  • Add tests for varembed wrapper.
  • Add model files for testing loading of wrapper.
  • Add tutorial in documentation.
  • Add RST files for Sphinx APIREF autogeneration.

@anmolgulati (Contributor, Author):

I've added a wrapper to load a VarEmbed model into gensim, exposing the KeyedVectors functionality.
Since the wrapper depends on the morfessor package, I've factored that dependency out into a separate method: the morpheme embeddings are ensembled only if the user asks for it and the morfessor package is present on the system.
@tmylk @piskvorky Please review.
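The optional-dependency pattern described above could be sketched roughly as follows. This is a hypothetical simplification, not gensim's actual implementation: `load_varembed`, the file path, and the stand-in vectors are all illustrative.

```python
import numpy as np

def load_varembed(vectors_path, morfessor_model_path=None):
    """Hypothetical sketch: load word embeddings; ensemble morpheme
    embeddings only when a morfessor model is given AND the optional
    morfessor package is importable."""
    # Stand-in for the real vector loading (real code would read vectors_path).
    word_vectors = {"dog": np.ones(3)}
    if morfessor_model_path is not None:
        try:
            import morfessor  # optional dependency, imported only here
        except ImportError:
            raise ImportError("morfessor is required for morpheme ensembling")
        # ... segment words and add morpheme embeddings (omitted) ...
    return word_vectors

vectors = load_varembed("vectors.pkl")  # no morpheme ensembling requested
```

The point of the pattern is that users who never ask for morpheme ensembling never pay for (or need) the morfessor dependency.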

@anmolgulati (Contributor, Author) commented Jan 29, 2017

There seems to be a small issue specifically in the morfessor package. I've submitted a PR (aalto-speech/morfessor#6) to fix it. We'll probably have to wait for it to be merged, or find some other fix to get the code running on Python 2.6. Any suggestions?

@anmolgulati (Contributor, Author) commented Jan 29, 2017

@tmylk Apart from the Python 2.6 issue I've already discussed, the code currently fails on other versions as well, which looks to me like a circular-import problem in morfessor. The tests pass on my personal machine but fail on Travis, which I don't understand. Any ideas what's going wrong?

@tmylk (Contributor) left a comment:

More tests, KeyedVectors and other small improvements needed.

"""
Load the input-hidden weight matrix from the fast text output files.

Note that due to limitations in the FastText API, you cannot continue training
Reviewer (Contributor):

Please correct Docstring to be about varembed

Author reply:

Done.

logger = logging.getLogger(__name__)


class VarEmbed(Word2Vec):
Reviewer (Contributor):

Please subclass KeyedVectors

Author reply:

Done.

# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Automated tests for checking transformation algorithms (the models package).
Reviewer (Contributor):

Please change to VarEmbed

Author reply:

Oh! Had missed this. Thanks. Done.

"""Test ensembling of Morpheme Embeddings"""
model = varembed.VarEmbed.load_varembed_format(
    vectors=varembed_model_vector_file,
    morfessor_model=varembed_model_morfessor_file, use_morphemes=True)
self.model_sanity(model)
Reviewer (Contributor):

Please test that syn0 is different compared to the non-morpheme model.

Author reply:

Yes, added now.
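A minimal version of such a check might look like the following, with toy numpy arrays standing in for the two models' syn0 matrices. This is an illustration of the requested assertion, not the actual test added in the PR.

```python
import numpy as np

# Toy stand-ins for model.syn0 without and with morpheme ensembling.
plain_syn0 = np.zeros((4, 5))
ensembled_syn0 = plain_syn0 + 0.5  # ensembling shifts the word vectors

# The ensembled matrix should differ from the plain one.
assert not np.allclose(plain_syn0, ensembled_syn0)
```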

if use_morphemes:
    try:
        import morfessor
        morfessor_model = morfessor.MorfessorIO().read_binary_model_file(morfessor_model)
Reviewer (Contributor):

Could you raise an issue in the varembed GitHub repo as a heads-up that read_binary_model_file will be deprecated by morfessor?

Author reply:

Yes, sounds good. I've put up an issue (rguthrie3/MorphologicalPriorsForWordEmbeddings#3) to notify them about the new morfessor release as well.

…support is only provided in python 2.7 and above. Also added more comments
word_embeddings = D['word_embeddings']
morpho_embeddings = D['morpheme_embeddings']
result.load_word_embeddings(word_embeddings, word_to_ix)
if use_morphemes:
Reviewer (Contributor):

just if morfessor_model is enough

Author reply:

Yes, sounds good. Done.

logger.info("Loaded matrix of %d size and %d dimensions", self.vocab_size, self.vector_size)


def ensemble_morpheme_embeddings(self, morfessor_model, morpho_embeddings, morpho_to_ix):
Reviewer (Contributor):

maybe add_morphemes_to_word_embeddings?

Author reply:

Changed method name now.

@anmolgulati force-pushed the varembed-worker branch 2 times, most recently from 4f5e359 to 3095620 on February 2, 2017, 14:35
@tmylk (Contributor) commented Feb 2, 2017

Will there be a tutorial ipynb?

@anmolgulati (Contributor, Author):

Yes, I'll just add one.

@anmolgulati (Contributor, Author) commented Feb 2, 2017

@tmylk I've now added a tutorial on the VarEmbed model as well. Please review it; I think we can merge after that.

@anmolgulati anmolgulati changed the title [WIP] Wrapper for Varembed Models Wrapper for Varembed Models Feb 2, 2017
Test only in Python 2.7 and above. Adding morphemes is not supported in earlier versions.
"""
model = varembed.VarEmbed.load_varembed_format(vectors=varembed_model_vector_file)
model_with_morphemes = varembed.VarEmbed.load_varembed_format(
    vectors=varembed_model_vector_file,
Reviewer (Owner):

Code style: hanging indent please (not vertical indent).

Author reply:

Done.


@unittest.skipUnless(sys.version_info < (2, 7), 'Test to check throwing exception in Python 2.6 and earlier')
def testAddMorphemesThrowsExceptionInPython26(self):
    self.assertRaises(
        Exception, varembed.VarEmbed.load_varembed_format,
        vectors=varembed_model_vector_file,
Reviewer (Owner):

Hanging indent.

Author reply (Feb 5, 2017):

Fixed now.


This module provides the ability to obtain word vectors for out-of-vocabulary words, using the VarEmbed model [2].

The wrapped model can NOT be updated with new documents for online training -- use gensim's `Word2Vec` for that.
Reviewer (Contributor):

You mean that the VarEmbed gensim wrapper doesn't support it? Also, someone might be confused into thinking you are suggesting to load a VarEmbed model and then train it with Word2Vec on new words, which is incorrect.

Author reply:

Oh yes, you are right. It could have been a bit confusing earlier. Updated now.
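The out-of-vocabulary idea the docstring refers to can be illustrated with a toy example: a vector for an unseen word is composed from embeddings of its known morphemes. The data and helper below are hypothetical, for illustration only; this is not gensim's API.

```python
import numpy as np

# Hypothetical morpheme embeddings.
morpheme_embeddings = {
    "walk": np.array([1.0, 0.0]),
    "ed": np.array([0.0, 1.0]),
}

def oov_vector(morphemes):
    """Build a vector for an unseen word by summing its morpheme embeddings."""
    return np.sum([morpheme_embeddings[m] for m in morphemes], axis=0)

vec = oov_vector(["walk", "ed"])  # vector for the unseen word "walked"
```

Because the composition only needs the morphemes to be known, words never seen in training can still receive a vector, which is the property the module docstring advertises.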

@tmylk tmylk merged commit e1c3a0b into piskvorky:develop Feb 6, 2017
@tmylk (Contributor) commented Feb 6, 2017

Thanks for adding the wrapper! It will be part of this week's release.

Let's add a benchmark notebook and a blog post in another PR. We'll publicize it when it's ready.

@anmolgulati (Contributor, Author):

Cool, that's awesome! :D
Yes, sure, I'll open a new PR for the benchmark and blog documentation.
