Scikit-learn wrapper for FastText model #2178

Merged: 7 commits merged into piskvorky:develop on Sep 19, 2018

Conversation

@mcemilg (Contributor) commented Sep 11, 2018

Fixes #2138

Added a wrapper for the FastText model for use in scikit-learn pipelines. I used the word2vec and doc2vec wrappers as guidance for the implementation.

@menshikh-iv (Contributor) left a comment:

Good work @mcemilg! Please continue.

BTW, I see that CI failed for an unrelated reason. This will be fixed when we merge #2127 (after that, you will also need to merge the fresh develop branch into yours).

>>>
>>> # What are the vector representations of the words 'graph' and 'system'?
>>> wordvecs = model.fit(common_texts).transform(['graph', 'system'])
>>> assert wordvecs.shape == (2, 10)
@menshikh-iv (Contributor):

We need more examples here (especially on how to work with out-of-vocab words, since this is the main use case of FastText).
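
For instance, a sketch of the kind of example being requested (hypothetical doctest; it assumes the FTTransformer fitted on common_texts above, and that 'graphs' does not occur in that corpus while sharing character n-grams with 'graph'):

>>> # FastText can build vectors for out-of-vocab words from the
>>> # character n-grams learned during training
>>> oov_vecs = model.transform(['graphs'])  # 'graphs' is out-of-vocab
>>> assert oov_vecs.shape == (1, 10)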


Parameters
----------

@menshikh-iv (Contributor): No need for an empty line here.

batch_words : int, optional
    Target size (in words) for batches of examples passed to worker threads
    (and thus cython routines). (Larger batches will be passed if individual
    texts are longer than 10000 words, but the standard cython code truncates
    to that maximum.)
@menshikh-iv (Contributor): Missing empty line at the end of the docstring.


def testConsistencyWithGensimModel(self):
# train an FTTransformer
self.model = FTTransformer(size=10, min_count=0, seed=42)
@menshikh-iv (Contributor): To check this, you also need to pin workers=1 (for both models).
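
A minimal sketch of that change (hypothetical test code, assuming the FTTransformer wrapper from this PR and gensim's FastText trained on the same texts):

# pin workers=1 and the same seed on both models so training is
# single-threaded and deterministic, which makes the vectors comparable
self.model = FTTransformer(size=10, min_count=0, seed=42, workers=1)
self.model.fit(texts)
gensim_ftmodel = FastText(texts, size=10, min_count=0, seed=42, workers=1)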

word = texts[0][0]
vec_transformer_api = self.model.transform(word) # vector returned by FTTransformer
vec_gensim_model = gensim_ftmodel[word] # vector returned by FastText
passed = numpy.allclose(vec_transformer_api, vec_gensim_model, atol=1e-1)
@menshikh-iv (Contributor): 1e-1 looks too large; why is this needed?

@mcemilg (Contributor, Author): I saw it in other consistency tests close to word vectors and felt it was needed. Actually, it passes without any tolerance parameter.

@menshikh-iv (Contributor): In that case, please remove it.
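
In other words, the comparison can rely on numpy.allclose's default tolerances (rtol=1e-05, atol=1e-08):

passed = numpy.allclose(vec_transformer_api, vec_gensim_model)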

model_dump = pickle.dumps(self.model)
model_load = pickle.loads(model_dump)

word = texts[0][0]
@menshikh-iv (Contributor): Pass the whole corpus that you have, and also check with out-of-vocab words.
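
A sketch of the requested check (hypothetical test code; 'unseenword' stands in for any out-of-vocab token):

# transform every word from the corpus plus an out-of-vocab word, and
# verify the unpickled model returns the same vectors as the original
words = [w for text in texts for w in text] + ['unseenword']
original_vecs = self.model.transform(words)
loaded_vecs = model_load.transform(words)
self.assertTrue(numpy.allclose(original_vecs, loaded_vecs))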

@menshikh-iv (Contributor): @mcemilg still here, please don't forget to fix this.

@mcemilg (Contributor, Author): Just fixed it, thanks.

@mcemilg (Contributor, Author) commented Sep 14, 2018

Hi to anyone I have bothered: a wrong git operation happened here because of GitKraken. I will undo the last changes I made. I am sorry.

@mcemilg (Contributor, Author) commented Sep 16, 2018

I reset the mistaken git commits and pushed my latest changes. Sorry again for the trouble.

@@ -1213,5 +1214,111 @@ def testModelNotFitted(self):
self.assertRaises(NotFittedError, phrases_transformer.transform, phrases_sentences[0])


class TestFastTextWrapper(unittest.TestCase):
def setUp(self):
numpy.random.seed(0)
@menshikh-iv (Contributor): numpy.random.seed(0) affects the whole interpreter (not only the current test); that's bad practice (I think we have similar mistakes in existing tests). Can you please remove all calls of numpy.random.seed in your code, @mcemilg?
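
Where a test genuinely needs reproducible randomness, a local generator avoids mutating the interpreter-wide state; a minimal sketch:

import numpy
# a local RandomState is reproducible without affecting other tests
rng = numpy.random.RandomState(0)
sample = rng.rand(10)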

@mcemilg (Contributor, Author): Okay, I removed the numpy.random.seed calls.

@menshikh-iv (Contributor): Thanks @mcemilg, congrats on your first contribution 🥇

@menshikh-iv merged commit 97783a4 into piskvorky:develop on Sep 19, 2018
@mcemilg deleted the ft-wrapper-sklearn branch on September 20, 2018
@mcemilg (Contributor, Author) commented Sep 20, 2018

Hi @menshikh-iv, thank you for your help. Do you want me to fix this issue? If so, I can work on it.

@menshikh-iv (Contributor) commented Sep 20, 2018

@mcemilg If you have time for it, then of course, I will be very grateful 🔥

@mcemilg (Contributor, Author) commented Sep 20, 2018

Okay, I will look into it as soon as possible. 👍

@menshikh-iv (Contributor): @mcemilg note: global seeding should never happen (i.e. look through the whole library, not only the tests).
