Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add common_terms parameter to sklearn_api.PhrasesTransformer #2074

Merged
merged 10 commits into from
Oct 4, 2018

Conversation

pmlk
Copy link
Contributor

@pmlk pmlk commented May 30, 2018

  • make use of common_terms parameter added to underlying models.phrases.Phrases class in commit b4515e0
  • utilize models.phrases.Phraser to avoid warning:

UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class

pmlk added 3 commits May 29, 2018 22:42
This parameter is being propagated to the underlying models.Phrases class.
this avoids the following  warning:
"UserWarning: For a faster implementation, use the
gensim.models.phrases.Phraser class"
Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pmlk, in general, looks good, only several comments, please fix it and I'll merge PR

gensim/test/test_sklearn_api.py Show resolved Hide resolved
@@ -1151,6 +1159,58 @@ def testModelNotFitted(self):
self.assertRaises(NotFittedError, phrases_transformer.transform, phrases_sentences[0])


class TestPhrasesTransformerCommonTerms(unittest.TestCase):
def setUp(self):
numpy.random.seed(0)
Copy link
Contributor

@menshikh-iv menshikh-iv Jul 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not good, because this freeze "global" seed (effect to all test cases, not only to your class). Why you need this here? Probably you can simply remove this line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite honestly, I was just looking at the other test cases. Almost all of them include that line in setUp. But I'll remove it. :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, you are right, that's looks like a bug for me

ivan@P50:~/release/gensim$ find gensim/ -name "*.py" | xargs grep "numpy.random.seed"
gensim/test/test_sklearn_api.py:        numpy.random.seed(0)  # set fixed seed to get similar values everytime
gensim/test/test_sklearn_api.py:            id2word=dictionary, num_topics=2, passes=100, minimum_probability=0, random_state=numpy.random.seed(0)
gensim/test/test_sklearn_api.py:            id2word=dictionary, num_topics=10, passes=100, minimum_probability=0, random_state=numpy.random.seed(0)
gensim/test/test_sklearn_api.py:            minimum_probability=0, random_state=numpy.random.seed(0)
gensim/test/test_sklearn_api.py:        numpy.random.seed(0)  # set fixed seed to get similar values everytime
gensim/test/test_sklearn_api.py:        model = LdaTransformer(num_topics=2, passes=10, minimum_probability=0, random_state=numpy.random.seed(0))
gensim/test/test_sklearn_api.py:            minimum_probability=0, random_state=numpy.random.seed(0)
gensim/test/test_sklearn_api.py:        numpy.random.seed(0)  # set fixed seed to get similar values everytime
gensim/test/test_sklearn_api.py:        numpy.random.seed(0)  # set fixed seed to get similar values everytime
gensim/test/test_sklearn_api.py:        numpy.random.seed(13)
gensim/test/test_sklearn_api.py:        numpy.random.seed(0)  # set fixed seed to get similar values everytime
gensim/test/test_sklearn_api.py:        numpy.random.seed(0)
gensim/test/test_sklearn_api.py:        numpy.random.seed(0)  # set fixed seed to get similar values everytime
gensim/test/test_sklearn_api.py:        numpy.random.seed(0)
gensim/test/test_sklearn_api.py:        numpy.random.seed(0)  # set fixed seed to get similar values everytime
gensim/test/test_sklearn_api.py:        numpy.random.seed(0)
gensim/test/test_sklearn_api.py:        lda_model = LdaTransformer(num_topics=2, passes=10, minimum_probability=0, random_state=numpy.random.seed(0))
gensim/test/test_sklearn_api.py:        numpy.random.seed(0)
gensim/test/test_sklearn_api.py:        lda_model = LdaTransformer(num_topics=2, passes=10, minimum_probability=0, random_state=numpy.random.seed(0))
gensim/test/test_sklearn_api.py:        numpy.random.seed(0)
gensim/test/test_sklearn_api.py:        numpy.random.seed(0)
gensim/test/test_sklearn_api.py:        numpy.random.seed(0)

that's not your fault, please simply remove this line from your code, that's enough.


def testCompareToOld(self):
# Phrases-model extracted from PhrasesTransformer fitted same way as in above test class TestPhrasesTransformer
phrases_model = models.phrases.Phrases.load("gensim/test/test_data/phrases_for_phrases_transformer.model")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of gensim/test/test_data/phrases_for_phrases_transformer.model

from gensim.test.utils import datapath
datapath("phrases_for_phrases_transformer.model")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, you need to load old PhrasesTransformer, not models.phrases.Phrases, you shouldn't construct distinct objects and assign it to attributes manually in test.

# Phrases-model extracted from PhrasesTransformer fitted same way as in above test class TestPhrasesTransformer
phrases_model = models.phrases.Phrases.load("gensim/test/test_data/phrases_for_phrases_transformer.model")
old_phrases_transformer = PhrasesTransformer(min_count=1, threshold=1)
# manually set models instead of using fit()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

@@ -131,15 +138,16 @@ def transform(self, docs):
Phrase representation for each of the input sentences.

"""
if self.gensim_model is None:
if self.gensim_model is None or self.phraser is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not true too: if you load old model - self.gensim_model exist, but self.phraser isn't (through model was fitted early).

gensim/sklearn_api/phrases.py Show resolved Hide resolved
pmlk added 3 commits August 14, 2018 10:51
A pre-trained Phrases model (self.gensim_model) may be set to avoid
using the fit() method. In transform(), the also necessary Phraser model
(self.phraser) will be instantiated if it hasn't been before.
@@ -1170,7 +1170,8 @@ def setUp(self):
]

def testCompareToOld(self):
old_phrases_transformer = pickle.load(datapath("phrases_transformer.pkl"))
with open(datapath("phrases_transformer.pkl"), "rb") as old_phrases_transformer_pkl:
old_phrases_transformer = pickle.load(old_phrases_transformer_pkl)
doc = phrases_sentences[-1]
phrase_tokens = old_phrases_transformer.transform(doc)[0]
Copy link
Contributor Author

@pmlk pmlk Aug 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old PhrasesTransformer didn't have the self.phraser attribute which is checked in the new transform() method. This causes this test to fail (at least on my local machine). It seems that the new transform() method is being called here instead of the old one. Do I need to take extra care of object methods being pickled correctly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

" old one" no more exists, you have an old model (like a key-value store of data) and new code.

Do I need to take extra care of object methods being pickled correctly?

Of course, see https://docs.python.org/2/library/pickle.html#pickling-and-unpickling-normal-class-instances

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to write custom __setstate__ method for fix backward compatibility issue, see https://docs.python.org/2/library/pickle.html#pickling-and-unpickling-normal-class-instances

gensim/test/test_sklearn_api.py Outdated Show resolved Hide resolved
@menshikh-iv
Copy link
Contributor

ping @pmlk, are you planning to finish PR?

@pmlk
Copy link
Contributor Author

pmlk commented Aug 27, 2018

I would like to. However, I am still not sure how to pickle instance methods (the referenced docs aren't really helpful to me).

@menshikh-iv
Copy link
Contributor

@pmlk why you need to pickle method? You just need to add workaround for phraser and common_terms: If old version of object loaded (without this parameters - fill this parameter in __setstate__). Also see good examples of __setstate__ & __getstate__ from https://stackoverflow.com/questions/1939058/simple-example-of-use-of-setstate-and-getstate

@pmlk
Copy link
Contributor Author

pmlk commented Sep 15, 2018

ping @menshikh-iv, any more changes necessary? :-)

@menshikh-iv
Copy link
Contributor

Sorry for waiting @pmlk,

need to add the test with loading old model, see #2074 (comment) and test for new model serialized / deserialized correctly (i.e save & load with pickle)

[u'the', u'bank_of_america', u'offices', u'are', u'closed']
]

def testCompareToOld(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add the test with loading old model

just to clarify, @menshikh-iv: this test loads an old model, so I would assume I would only need to add another test for de-/serializing the new model as per your comment

test for new model serialized / deserialized correctly (i.e save & load with pickle)

correct me if I'm wrong

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you are correct.

@menshikh-iv menshikh-iv changed the title Propagate common_terms parameter to sklearn_api PhrasesTransformer Add common_terms parameter to sklearn_api.PhrasesTransformer Oct 4, 2018
@menshikh-iv
Copy link
Contributor

Thanks @pmlk, congratz with first contribution 🥇

@menshikh-iv menshikh-iv merged commit 367bdbd into piskvorky:develop Oct 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants