Scikit-learn wrapper for FastText model #2178
Conversation
gensim/sklearn_api/ftmodel.py
Outdated
>>>
>>> # What is the vector representation of the word 'graph'?
>>> wordvecs = model.fit(common_texts).transform(['graph', 'system'])
>>> assert wordvecs.shape == (2, 10)
More examples are needed here, especially about how to work with out-of-vocab words; this is the main use case of FastText.
gensim/sklearn_api/ftmodel.py
Outdated

Parameters
----------

No need for an empty line here.
batch_words : int, optional
    Target size (in words) for batches of examples passed to worker threads (and
    thus cython routines). (Larger batches will be passed if individual
    texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
Missing empty line at the end of the docstring.
gensim/test/test_sklearn_api.py
Outdated
def testConsistencyWithGensimModel(self):
    # training a FTTransformer
    self.model = FTTransformer(size=10, min_count=0, seed=42)
To check this, you also need to pin workers=1 (for both models).
gensim/test/test_sklearn_api.py
Outdated
word = texts[0][0]
vec_transformer_api = self.model.transform(word)  # vector returned by FTTransformer
vec_gensim_model = gensim_ftmodel[word]  # vector returned by FastText
passed = numpy.allclose(vec_transformer_api, vec_gensim_model, atol=1e-1)
atol=1e-1 looks too large; why is this needed?
I saw it in other consistency tests involving word vectors and felt it was needed. Actually, the test passes without any tolerance parameter.
In that case, please remove it.
gensim/test/test_sklearn_api.py
Outdated
model_dump = pickle.dumps(self.model)
model_load = pickle.loads(model_dump)

word = texts[0][0]
Pass the whole corpus that you have, and also check with out-of-vocab words.
@mcemilg this is still here, please don't forget to fix it.
Just fixed it, thanks.
Hi, apologies to anyone I bothered. A mistaken git operation happened here because of GitKraken; I will undo my last changes. I am sorry.
I reset the mistaken git commits and pushed my latest changes. Sorry again for the trouble.
gensim/test/test_sklearn_api.py
Outdated
@@ -1213,5 +1214,111 @@ def testModelNotFitted(self):
        self.assertRaises(NotFittedError, phrases_transformer.transform, phrases_sentences[0])


class TestFastTextWrapper(unittest.TestCase):
    def setUp(self):
        numpy.random.seed(0)
numpy.random.seed(0) affects the whole interpreter (not only the current test), which is bad practice (I think we have similar mistakes in existing tests). Can you please remove all calls of numpy.random.seed in your code @mcemilg?
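A short sketch of the alternative the reviewer implies: a local NumPy `Generator` keeps the test's randomness isolated instead of mutating the interpreter-wide global RNG state.

```python
import numpy

# Instead of numpy.random.seed(0), create a local, seeded Generator.
# Other tests sharing the interpreter are unaffected, and the local
# stream is still fully reproducible.
rng = numpy.random.default_rng(0)
sample = rng.standard_normal(5)

assert sample.shape == (5,)
# Re-seeding a fresh local Generator reproduces the same draw.
assert numpy.allclose(sample, numpy.random.default_rng(0).standard_normal(5))
```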
Okay, I removed the numpy.random.seed calls.
Thanks @mcemilg, congrats on your first contribution 🥇
Hi @menshikh-iv, thank you for your help. Do you want me to fix this issue? If so, I can work on it.
@mcemilg If you have time for it, of course, I will be very grateful 🔥
Okay, I will look into it as soon as possible. 👍
@mcemilg note: global seeding should never happen (i.e. have a look through the whole library, not …
Fixes #2138
Added a wrapper for the FastText model to use in scikit-learn pipelines. I used the word2vec and doc2vec wrappers as guidance for the implementation.