-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add common_terms
parameter to sklearn_api.PhrasesTransformer
#2074
Conversation
This parameter is being propagated to the underlying models.Phrases class.
this avoids the following warning: "UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pmlk, in general, looks good, only several comments, please fix it and I'll merge PR
gensim/test/test_sklearn_api.py
Outdated
@@ -1151,6 +1159,58 @@ def testModelNotFitted(self): | |||
self.assertRaises(NotFittedError, phrases_transformer.transform, phrases_sentences[0]) | |||
|
|||
|
|||
class TestPhrasesTransformerCommonTerms(unittest.TestCase): | |||
def setUp(self): | |||
numpy.random.seed(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not good, because this freeze "global" seed (effect to all test cases, not only to your class). Why you need this here? Probably you can simply remove this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite honestly, I was just looking at the other test cases. Almost all of them include that line in setUp
. But I'll remove it. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, you are right, that's looks like a bug for me
ivan@P50:~/release/gensim$ find gensim/ -name "*.py" | xargs grep "numpy.random.seed"
gensim/test/test_sklearn_api.py: numpy.random.seed(0) # set fixed seed to get similar values everytime
gensim/test/test_sklearn_api.py: id2word=dictionary, num_topics=2, passes=100, minimum_probability=0, random_state=numpy.random.seed(0)
gensim/test/test_sklearn_api.py: id2word=dictionary, num_topics=10, passes=100, minimum_probability=0, random_state=numpy.random.seed(0)
gensim/test/test_sklearn_api.py: minimum_probability=0, random_state=numpy.random.seed(0)
gensim/test/test_sklearn_api.py: numpy.random.seed(0) # set fixed seed to get similar values everytime
gensim/test/test_sklearn_api.py: model = LdaTransformer(num_topics=2, passes=10, minimum_probability=0, random_state=numpy.random.seed(0))
gensim/test/test_sklearn_api.py: minimum_probability=0, random_state=numpy.random.seed(0)
gensim/test/test_sklearn_api.py: numpy.random.seed(0) # set fixed seed to get similar values everytime
gensim/test/test_sklearn_api.py: numpy.random.seed(0) # set fixed seed to get similar values everytime
gensim/test/test_sklearn_api.py: numpy.random.seed(13)
gensim/test/test_sklearn_api.py: numpy.random.seed(0) # set fixed seed to get similar values everytime
gensim/test/test_sklearn_api.py: numpy.random.seed(0)
gensim/test/test_sklearn_api.py: numpy.random.seed(0) # set fixed seed to get similar values everytime
gensim/test/test_sklearn_api.py: numpy.random.seed(0)
gensim/test/test_sklearn_api.py: numpy.random.seed(0) # set fixed seed to get similar values everytime
gensim/test/test_sklearn_api.py: numpy.random.seed(0)
gensim/test/test_sklearn_api.py: lda_model = LdaTransformer(num_topics=2, passes=10, minimum_probability=0, random_state=numpy.random.seed(0))
gensim/test/test_sklearn_api.py: numpy.random.seed(0)
gensim/test/test_sklearn_api.py: lda_model = LdaTransformer(num_topics=2, passes=10, minimum_probability=0, random_state=numpy.random.seed(0))
gensim/test/test_sklearn_api.py: numpy.random.seed(0)
gensim/test/test_sklearn_api.py: numpy.random.seed(0)
gensim/test/test_sklearn_api.py: numpy.random.seed(0)
that's not your fault, please simply remove this line from your code, that's enough.
gensim/test/test_sklearn_api.py
Outdated
|
||
def testCompareToOld(self): | ||
# Phrases-model extracted from PhrasesTransformer fitted same way as in above test class TestPhrasesTransformer | ||
phrases_model = models.phrases.Phrases.load("gensim/test/test_data/phrases_for_phrases_transformer.model") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of gensim/test/test_data/phrases_for_phrases_transformer.model
from gensim.test.utils import datapath
datapath("phrases_for_phrases_transformer.model")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, you need to load old PhrasesTransformer
, not models.phrases.Phrases
, you shouldn't construct distinct objects and assign it to attributes manually in test.
gensim/test/test_sklearn_api.py
Outdated
# Phrases-model extracted from PhrasesTransformer fitted same way as in above test class TestPhrasesTransformer | ||
phrases_model = models.phrases.Phrases.load("gensim/test/test_data/phrases_for_phrases_transformer.model") | ||
old_phrases_transformer = PhrasesTransformer(min_count=1, threshold=1) | ||
# manually set models instead of using fit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
gensim/sklearn_api/phrases.py
Outdated
@@ -131,15 +138,16 @@ def transform(self, docs): | |||
Phrase representation for each of the input sentences. | |||
|
|||
""" | |||
if self.gensim_model is None: | |||
if self.gensim_model is None or self.phraser is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not true too: if you load old model - self.gensim_model
exist, but self.phraser
isn't (through model was fitted early).
A pre-trained Phrases model (self.gensim_model) may be set to avoid using the fit() method. In transform(), the also necessary Phraser model (self.phraser) will be instantiated if it hasn't been before.
@@ -1170,7 +1170,8 @@ def setUp(self): | |||
] | |||
|
|||
def testCompareToOld(self): | |||
old_phrases_transformer = pickle.load(datapath("phrases_transformer.pkl")) | |||
with open(datapath("phrases_transformer.pkl"), "rb") as old_phrases_transformer_pkl: | |||
old_phrases_transformer = pickle.load(old_phrases_transformer_pkl) | |||
doc = phrases_sentences[-1] | |||
phrase_tokens = old_phrases_transformer.transform(doc)[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old PhrasesTransformer
didn't have the self.phraser
attribute which is checked in the new transform()
method. This causes this test to fail (at least on my local machine). It seems that the new transform()
method is being called here instead of the old one. Do I need to take extra care of object methods being pickled correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
" old one" no more exists, you have an old model (like a key-value store of data) and new code.
Do I need to take extra care of object methods being pickled correctly?
Of course, see https://docs.python.org/2/library/pickle.html#pickling-and-unpickling-normal-class-instances
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to write custom __setstate__
method for fix backward compatibility issue, see https://docs.python.org/2/library/pickle.html#pickling-and-unpickling-normal-class-instances
ping @pmlk, are you planning to finish PR? |
I would like to. However, I am still not sure how to pickle instance methods (the referenced docs aren't really helpful to me). |
@pmlk why you need to pickle method? You just need to add workaround for |
ping @menshikh-iv, any more changes necessary? :-) |
Sorry for waiting @pmlk, need to add the test with loading old model, see #2074 (comment) and test for new model serialized / deserialized correctly (i.e save & load with pickle) |
[u'the', u'bank_of_america', u'offices', u'are', u'closed'] | ||
] | ||
|
||
def testCompareToOld(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to add the test with loading old model
just to clarify, @menshikh-iv: this test loads an old model, so I would assume I would only need to add another test for de-/serializing the new model as per your comment
test for new model serialized / deserialized correctly (i.e save & load with pickle)
correct me if I'm wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, you are correct.
common_terms
parameter to sklearn_api.PhrasesTransformer
Thanks @pmlk, congratz with first contribution 🥇 |
common_terms
parameter added to underlyingmodels.phrases.Phrases
class in commit b4515e0models.phrases.Phraser
to avoid warning: