[WIP] Added sklearn wrapper for w2v model #1437
Conversation
from gensim.sklearn_integration import BaseSklearnWrapper
...
class SklW2VModel(BaseSklearnWrapper, TransformerMixin, BaseEstimator):
As this is already inside the sklearn_integration package, and will usually serve (in sklearn pipelines) as a Transformer, I'd suggest the name W2VTransformer as a simpler, more direct, and non-abbreviation-reliant class name.
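For illustration only, a minimal sketch of the pipeline role being described, assuming the suggested W2VTransformer name and a hypothetical import path; the LogisticRegression step stands in for any downstream sklearn estimator:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from gensim.sklearn_integration import W2VTransformer  # hypothetical: the suggested name

# The wrapper occupies the transformer slot of the pipeline (mapping words
# to their learned vectors), which is what motivates the Transformer suffix.
pipe = Pipeline([
    ("w2v", W2VTransformer(size=10, min_count=1)),
    ("clf", LogisticRegression()),
])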
I agree that W2VTransformer would surely be a simpler and non-abbreviation-reliant class name; however, SklW2VModel is consistent with the naming of the scikit-learn wrappers already added for other Gensim models, like SklLdaModel, SklLsiModel, SklRpModel, SklLdaSeqModel, etc.
Hmm, are those classes old and already widely relied upon, or also new/recent? It may make sense to fix all their names, to avoid an idiosyncratic abbreviation (Skl) and match sklearn's role-terminology! cc @piskvorky
All these classes are relatively recent, and I don't think they are currently widely relied upon in the codebase. So yes, it might be a good idea to change the names of all the existing wrapper classes.
I'm not aware of the history of those classes, but I agree with @gojomo that we don't need the (sub)package repeated in class names. The only exception would be if they clash with the "original" class name (no sklearn), which could be confusing and a potential support headache. But if you always slap on Transformer or something, that should not happen?
def __init__(self, size=100, alpha=0.025, window=5, min_count=5,
             max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
             sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
             trim_rule=None, sorted_vocab=1, batch_words=10000):
To ensure these values are always kept in sync with the underlying Word2Vec class, it might be appropriate to use inspect.getargspec() on the Word2Vec.__init__() method. Perhaps this method could then just take **kwargs, with the defaults dict from getargspec() merged together. (Keeping these as a dict might further make many of the explicit attributes unnecessary, and the implementation of get_params() trivial.)
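A rough sketch of this suggestion (hedged: this is not what the PR does, and inspect.getargspec() has since been superseded by inspect.getfullargspec()/inspect.signature() in Python 3):

import inspect
from gensim.models import Word2Vec

# Collect Word2Vec.__init__'s keyword arguments and their default values.
argspec = inspect.getargspec(Word2Vec.__init__)
defaults = dict(zip(argspec.args[-len(argspec.defaults):], argspec.defaults))
defaults.pop('sentences', None)  # the training corpus is not a hyperparameter

class SklW2VModel(object):  # simplified stand-in for the wrapper class
    def __init__(self, **kwargs):
        # User-supplied values override the Word2Vec defaults, so the wrapper
        # cannot drift out of sync with the underlying class.
        self.params = dict(defaults, **kwargs)

    def get_params(self, deep=True):
        # Trivial, as the comment suggests: the merged dict is the parameter set.
        return dict(self.params)

One caveat: sklearn's BaseEstimator introspects the __init__ signature, so a pure **kwargs constructor would need get_params()/set_params() overridden by hand, as above.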
    Update model using newly added sentences.
    """
    if self.gensim_model is None:
        self.gensim_model = models.Word2Vec(size=self.size, alpha=self.alpha,
Since train() (below) requires additional model vocab initialization (as by build_vocab()), I can't see this path, where the model doesn't already exist, possibly working. More generally, I'm not sure a partial_fit() method makes sense for Word2Vec - incremental training is at best experimental, and not at all the usual mode of operation.
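For context, a minimal example of the usual (non-incremental) flow being referred to, using the gensim API of that era (size and iter were later renamed vector_size and epochs in gensim 4):

from gensim.models import Word2Vec

sentences = [["hello", "world"], ["machine", "learning"]]

model = Word2Vec(size=10, min_count=1)  # no corpus passed: vocab not yet built
model.build_vocab(sentences)            # vocabulary initialization is mandatory
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)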
Yes, this did come up during some earlier discussions about the implementation as well. :) I have now modified the partial_fit function to raise a NotImplementedError on being invoked.
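A minimal sketch of the described change (illustrative, not the PR diff verbatim):

def partial_fit(self, X):
    raise NotImplementedError(
        "'partial_fit' has not been implemented: incremental training "
        "is not a supported mode of operation for Word2Vec"
    )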
* added skl wrapper for w2v model
* added unit tests for sklearn word2vec wrapper
* added 'testPipeline' test for w2v skl wrapper
* PEP8 fix
* fixed 'map' issue for Python3
* removed 'partial_fit' function
* Update __init__.py
This PR adds a scikit-learn wrapper for Gensim's Word2Vec model.
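For a sense of the intended usage, a hedged sketch based on the diff excerpts above (the import path and the exact transform() semantics are assumptions, not confirmed by this excerpt):

from gensim.sklearn_integration import SklW2VModel

texts = [["human", "interface", "computer"], ["survey", "user", "computer"]]

model = SklW2VModel(size=10, min_count=1, seed=1)
model.fit(texts)                         # trains the underlying Word2Vec model
vectors = model.transform(["computer"])  # look up learned vector(s) for words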