
[WIP] Added sklearn wrapper for w2v model #1437

Merged (9 commits) on Jun 29, 2017

Conversation

chinmayapancholi13 (Contributor)

This PR adds a scikit-learn wrapper for Gensim's Word2Vec model.

from sklearn.base import TransformerMixin, BaseEstimator

from gensim.sklearn_integration import BaseSklearnWrapper


class SklW2VModel(BaseSklearnWrapper, TransformerMixin, BaseEstimator):
Collaborator

As this is already inside the sklearn_integration package and will usually serve (in sklearn pipelines) as a Transformer, I'd suggest W2VTransformer as a simpler, more direct, non-abbreviation-reliant class name.

Contributor Author

I agree that W2VTransformer would be a simpler, non-abbreviation-reliant class name; however, SklW2VModel is consistent with the naming of the scikit-learn wrappers added for other Gensim models, such as SklLdaModel, SklLsiModel, SklRpModel, and SklLdaSeqModel.

Collaborator @gojomo (Jun 29, 2017)

Hmm, are those classes old & already-widely relied upon, or also new/recent? It may make sense to fix all their names to avoid an idiosyncratic abbreviation (Skl) and match sklearn's role-terminology! cc @piskvorky

Contributor Author

All these classes are relatively recent, and I don't think they are currently widely relied upon in the codebase. So yes, it might be a good idea to change the names of all the existing wrapper classes.

Owner @piskvorky (Jun 29, 2017)

I'm not aware of the history of those classes but agree with @gojomo we don't need the (sub)package repeated in class names.

The only exception would be if they clash with the "original" class name (no sklearn), which could be confusing and a potential support headache. But if you always slap on Transformer or something, that should not happen?

def __init__(self, size=100, alpha=0.025, window=5, min_count=5,
             max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
             sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
             trim_rule=None, sorted_vocab=1, batch_words=10000):
Collaborator

To ensure these values are always kept in sync with the underlying Word2Vec class, it might be appropriate to use inspect.getargspec() on the Word2Vec.__init__() method. Perhaps this method could then just take **kwargs, with the defaults dict from getargspec() merged together. (Keeping these as a dict might further make many of the explicit attributes unnecessary, and the implementation of get_params() trivial.)
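The introspection idea above can be sketched as follows. This is a minimal, hypothetical illustration, not the PR's actual code: the class names FakeWord2Vec and W2VTransformerSketch are stand-ins (FakeWord2Vec plays the role of gensim's Word2Vec so the sketch stays self-contained), and inspect.signature() is used in place of the suggested inspect.getargspec(), which is deprecated in modern Python.

```python
import inspect


class FakeWord2Vec:
    # Hypothetical stand-in for gensim.models.Word2Vec, just for this sketch.
    def __init__(self, size=100, alpha=0.025, window=5, min_count=5, sg=0):
        pass


def init_defaults(cls):
    """Collect {parameter: default} from cls.__init__ via introspection,
    skipping 'self' and any parameters without defaults."""
    sig = inspect.signature(cls.__init__)
    return {name: p.default
            for name, p in sig.parameters.items()
            if p.default is not inspect.Parameter.empty}


class W2VTransformerSketch:
    def __init__(self, **kwargs):
        # Merge user overrides onto the introspected defaults, so the wrapper
        # never drifts out of sync with the underlying class's signature.
        params = dict(init_defaults(FakeWord2Vec), **kwargs)
        for name, value in params.items():
            setattr(self, name, value)
        self._param_names = sorted(params)

    def get_params(self, deep=True):
        # get_params() becomes trivial: echo back the stored parameters.
        return {name: getattr(self, name) for name in self._param_names}
```

With this approach, adding a new keyword argument to the wrapped class is automatically reflected in the wrapper's defaults and in get_params(), with no manual duplication of the parameter list.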

        Update model using newly added sentences.
        """
        if self.gensim_model is None:
            self.gensim_model = models.Word2Vec(size=self.size, alpha=self.alpha,
Collaborator

Since train() (below) requires additional model vocab initialization (as by build_vocab()), I can't see this path, where the model doesn't already exist, possibly working. More generally, I'm not sure a partial_fit() method makes sense for Word2Vec - incremental training is at best experimental, and not at all the usual mode of operation.

Contributor Author

Yes, this did come up during some earlier discussions about the implementation as well. :)
I have now modified the partial_fit function to raise a NotImplementedError on being invoked.
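The resolution described above can be sketched like this. The class name W2VNoPartialFit and the message wording are hypothetical illustrations, not the PR's actual code; the point is simply that the method exists (so the base-class interface is satisfied) but refuses incremental training explicitly rather than failing in an obscure way.

```python
class W2VNoPartialFit:
    """Hypothetical sketch of a wrapper that opts out of incremental training."""

    def partial_fit(self, X):
        # Incremental training is at best experimental for Word2Vec, so the
        # wrapper raises immediately instead of producing a half-trained model.
        raise NotImplementedError(
            "'partial_fit' is not supported for the Word2Vec wrapper; "
            "use fit() to train on the full corpus instead."
        )
```

Raising NotImplementedError (rather than silently ignoring the call) makes the limitation visible to users who expect the usual sklearn partial_fit contract.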

@menshikh-iv menshikh-iv merged commit 2484eb0 into piskvorky:develop Jun 29, 2017
saparina pushed a commit to saparina/gensim that referenced this pull request Jul 9, 2017
* added skl wrapper for w2v model

* added unit tests for sklearn word2vec wrapper

* added 'testPipeline' test for w2v skl wrapper

* PEP8 fix

* fixed 'map' issue for Python3

* removed 'partial_fit' function

* Update __init__.py
4 participants