
[WIP] Added sklearn wrapper for w2v model #1437

Merged (9 commits) on Jun 29, 2017

Conversation

chinmayapancholi13 (Contributor)

This PR adds a scikit-learn wrapper for Gensim's Word2Vec model.

from sklearn.base import TransformerMixin, BaseEstimator

from gensim.sklearn_integration import BaseSklearnWrapper


class SklW2VModel(BaseSklearnWrapper, TransformerMixin, BaseEstimator):
Collaborator

As this is already inside the sklearn_integration package and will usually serve (in sklearn pipelines) as a Transformer, I'd suggest W2VTransformer as a simpler, more direct, non-abbreviation-reliant class name.

Contributor Author

I agree that W2VTransformer would be a simpler, non-abbreviation-reliant class name; however, SklW2VModel is consistent with the naming of the scikit-learn wrappers added for other Gensim models, such as SklLdaModel, SklLsiModel, SklRpModel, and SklLdaSeqModel.

Collaborator @gojomo (Jun 29, 2017)

Hmm, are those classes old & already-widely relied upon, or also new/recent? It may make sense to fix all their names to avoid an idiosyncratic abbreviation (Skl) and match sklearn's role-terminology! cc @piskvorky

Contributor Author

All these classes are relatively recent, and I don't think they are currently widely relied upon in the codebase. So yes, it might be a good idea to change the names of all the existing wrapper classes.

Owner @piskvorky (Jun 29, 2017)

I'm not aware of the history of those classes but agree with @gojomo we don't need the (sub)package repeated in class names.

The only exception would be if they clash with the "original" class name (no sklearn), which could be confusing and a potential support headache. But if you always slap on Transformer or something, that should not happen?

def __init__(self, size=100, alpha=0.025, window=5, min_count=5,
             max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
             sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
             trim_rule=None, sorted_vocab=1, batch_words=10000):
Collaborator

To ensure these values are always kept in sync with the underlying Word2Vec class, it might be appropriate to use inspect.getargspec() on the Word2Vec.__init__() method. Perhaps this method could then just take **kwargs, with the defaults dict from getargspec() merged together. (Keeping these as a dict might further make many of the explicit attributes unnecessary, and the implementation of get_params() trivial.)
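The introspection idea above can be sketched as follows. This is a minimal, hypothetical illustration, not the PR's actual code: the class names FakeWord2Vec and W2VTransformerSketch are stand-ins (FakeWord2Vec plays the role of gensim's Word2Vec so the sketch stays self-contained), and inspect.signature() is used in place of the suggested inspect.getargspec(), which is deprecated in modern Python.

```python
import inspect


class FakeWord2Vec:
    # Hypothetical stand-in for gensim.models.Word2Vec, just for this sketch.
    def __init__(self, size=100, alpha=0.025, window=5, min_count=5, sg=0):
        pass


def init_defaults(cls):
    """Collect {parameter: default} from cls.__init__ via introspection,
    skipping 'self' and any parameters without defaults."""
    sig = inspect.signature(cls.__init__)
    return {name: p.default
            for name, p in sig.parameters.items()
            if p.default is not inspect.Parameter.empty}


class W2VTransformerSketch:
    def __init__(self, **kwargs):
        # Merge user overrides onto the introspected defaults, so the wrapper
        # never drifts out of sync with the underlying class's signature.
        params = dict(init_defaults(FakeWord2Vec), **kwargs)
        for name, value in params.items():
            setattr(self, name, value)
        self._param_names = sorted(params)

    def get_params(self, deep=True):
        # get_params() becomes trivial: echo back the stored parameters.
        return {name: getattr(self, name) for name in self._param_names}
```

With this approach, adding a new keyword argument to the wrapped class is automatically reflected in the wrapper's defaults and in get_params(), with no manual duplication of the parameter list.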

        Update model using newly added sentences.
        """
        if self.gensim_model is None:
            self.gensim_model = models.Word2Vec(size=self.size, alpha=self.alpha,
Collaborator

Since train() (below) requires additional model vocab initialization (as by build_vocab()), I can't see this path, where the model doesn't already exist, possibly working. More generally, I'm not sure a partial_fit() method makes sense for Word2Vec - incremental training is at best experimental, and not at all the usual mode of operation.

Contributor Author

Yes, this did come up during some earlier discussions about the implementation as well. :)
I have now modified the partial_fit function to raise a NotImplementedError on being invoked.
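The resolution described above can be sketched like this. The class name W2VNoPartialFit and the message wording are hypothetical illustrations, not the PR's actual code; the point is simply that the method exists (so the base-class interface is satisfied) but refuses incremental training explicitly rather than failing in an obscure way.

```python
class W2VNoPartialFit:
    """Hypothetical sketch of a wrapper that opts out of incremental training."""

    def partial_fit(self, X):
        # Incremental training is at best experimental for Word2Vec, so the
        # wrapper raises immediately instead of producing a half-trained model.
        raise NotImplementedError(
            "'partial_fit' is not supported for the Word2Vec wrapper; "
            "use fit() to train on the full corpus instead."
        )
```

Raising NotImplementedError (rather than silently ignoring the call) makes the limitation visible to users who expect the usual sklearn partial_fit contract.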

@menshikh-iv menshikh-iv merged commit 2484eb0 into piskvorky:develop Jun 29, 2017
saparina pushed a commit to saparina/gensim that referenced this pull request Jul 9, 2017
* added skl wrapper for w2v model

* added unit tests for sklearn word2vec wrapper

* added 'testPipeline' test for w2v skl wrapper

* PEP8 fix

* fixed 'map' issue for Python3

* removed 'partial_fit' function

* Update __init__.py
4 participants