Scikit-learn wrapper for FastText model #2178
Conversation
gensim/sklearn_api/ftmodel.py
Outdated
>>>
>>> # What is the vector representation of the word 'graph'?
>>> wordvecs = model.fit(common_texts).transform(['graph', 'system'])
>>> assert wordvecs.shape == (2, 10)
More examples are needed here, especially about how to work with out-of-vocab words; this is the main use case of FastText.
gensim/sklearn_api/ftmodel.py
Outdated

Parameters
----------

No need for an empty line here.
batch_words : int, optional
    Target size (in words) for batches of examples passed to worker threads (and
    thus cython routines). (Larger batches will be passed if individual
    texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
Missing empty line at the end of the docstring.
gensim/test/test_sklearn_api.py
Outdated
def testConsistencyWithGensimModel(self):
    # training a FTTransformer
    self.model = FTTransformer(size=10, min_count=0, seed=42)
To check this, you also need to pin workers=1 (for both models).
gensim/test/test_sklearn_api.py
Outdated
word = texts[0][0]
vec_transformer_api = self.model.transform(word)  # vector returned by FTTransformer
vec_gensim_model = gensim_ftmodel[word]  # vector returned by FastText
passed = numpy.allclose(vec_transformer_api, vec_gensim_model, atol=1e-1)
atol=1e-1 looks too large; why is this needed?
I saw it in other consistency tests involving word vectors and felt it was needed. Actually, the test passes without any tolerance parameter.
In that case, please remove it.
gensim/test/test_sklearn_api.py
Outdated
model_dump = pickle.dumps(self.model)
model_load = pickle.loads(model_dump)

word = texts[0][0]
Pass the whole corpus that you have, and also check with out-of-vocab words.
@mcemilg this is still here, please don't forget to fix it.
Just fixed it, thanks.
Hi, apologies to anyone I bothered. A mistaken git operation happened here because of GitKraken; I will undo my last changes. I am sorry.
I reset the mistaken git commits and pushed my latest changes. Sorry again for the trouble.
gensim/test/test_sklearn_api.py
Outdated
@@ -1213,5 +1214,111 @@ def testModelNotFitted(self):
        self.assertRaises(NotFittedError, phrases_transformer.transform, phrases_sentences[0])


class TestFastTextWrapper(unittest.TestCase):
    def setUp(self):
        numpy.random.seed(0)
numpy.random.seed(0) affects the whole interpreter (not only the current test), which is bad practice (I think we have similar mistakes in existing tests). Can you please remove all calls of numpy.random.seed in your code @mcemilg?
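A short sketch of the alternative the reviewer implies: a local NumPy `Generator` keeps the test's randomness isolated instead of mutating the interpreter-wide global RNG state.

```python
import numpy

# Instead of numpy.random.seed(0), create a local, seeded Generator.
# Other tests sharing the interpreter are unaffected, and the local
# stream is still fully reproducible.
rng = numpy.random.default_rng(0)
sample = rng.standard_normal(5)

assert sample.shape == (5,)
# Re-seeding a fresh local Generator reproduces the same draw.
assert numpy.allclose(sample, numpy.random.default_rng(0).standard_normal(5))
```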
Okay, I removed the numpy.random.seed calls.
Thanks @mcemilg, congrats on your first contribution 🥇
Hi @menshikh-iv, thank you for your help. Do you want me to fix this issue? If so, I can work on it.
@mcemilg If you have time for it, of course, I will be very grateful 🔥
Okay, I will look into it as soon as possible. 👍
@mcemilg note: global seeding should never happen (i.e. have a look through the whole library, not …
Fixes #2138
Added a wrapper for the FastText model to use in scikit-learn pipelines. I used the word2vec and doc2vec wrappers as guidance for the implementation.