
[WIP] Adding unsupervised FastText to Gensim #1525

Merged
merged 35 commits into piskvorky:develop on Sep 19, 2017

Conversation

chinmayapancholi13
Contributor

This PR implements the FastText model (unsupervised version) in Gensim.

@souravsingh
Contributor

There is already a PR open here: #1482


for indices in word2_indices:
    word2_subwords += ['<' + model.wv.index2word[indices] + '>']
    word2_subwords += Ft_Wrapper.compute_ngrams(model.wv.index2word[indices], model.min_n, model.max_n)
Contributor

@jayantj jayantj Aug 11, 2017

This works for now, but ideally we'd like a cleaner solution to this later on. In general, I think the FastText wrapper (to load .bin files) and the FastText training code implemented here shares a lot of common ground (both conceptually and code-wise). Once we have the correctness of the models verified, we'd be looking to refactor it somehow (maybe just inheriting from the wrapper? Completely removing train functionality from the wrapper and replacing it with native train functionality?) Any thoughts on this?

Contributor Author

Agreed. The current implementation shares code with Gensim's FastText wrapper, so inheriting from the wrapper seems like a good way to avoid this redundancy.
I think it would also be helpful to refactor the current Word2Vec implementation: apart from fastText using ngram vectors rather than word vectors at the time of backpropagation, the logic and code of the two models overlap significantly. Having one common parent class, with the two models as its children, could be a useful way to tackle this.

Contributor

I agree that refactoring to avoid redundancy would be good. I'm not sure a common parent class is the way to go, though, since most of the redundant code is in train_batch_cbow and train_batch_skipgram, which are both independently defined functions, not methods of the Word2Vec class.

Contributor Author

Apart from these training functions, there is overlap between the two models in some other tasks as well. For instance, in our fastText implementation, we first construct the vocabulary in the same way as Word2Vec (i.e. by calling the scan_vocab, scale_vocab and finalize_vocab functions) and then handle all the "fastText-specific" things (like constructing the dictionary of ngrams, and precomputing and storing the ngrams for each word in the vocab). These "fastText-specific" steps could be handled at an earlier stage (e.g. within scale_vocab or finalize_vocab), which would also let us optimize things, e.g. by avoiding unnecessary extra iterations over the vocabulary.

Contributor

The super method is useful in such situations - where the parent class implementation of the method needs to be run along with whatever code is specific to the child class.
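As an illustration, the pattern could look like this (a toy sketch, not gensim's actual class layout; the method bodies are placeholders standing in for the real vocabulary pipeline):

```python
class Word2Vec(object):
    def build_vocab(self, sentences):
        # shared pipeline: scan_vocab / scale_vocab / finalize_vocab
        # (collapsed to a placeholder here)
        self.vocab = {word for sentence in sentences for word in sentence}

class FastText(Word2Vec):
    def build_vocab(self, sentences):
        # run the shared Word2Vec vocabulary pipeline first ...
        super(FastText, self).build_vocab(sentences)
        # ... then do the fastText-specific ngram bookkeeping
        self.ngrams = {word: len(word) for word in self.vocab}  # placeholder
```

The child class reuses the parent's logic and only appends its own specific steps afterwards.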

for indices in word2_indices:
    word2_subwords += ['<' + model.wv.index2word[indices] + '>']
    word2_subwords += Ft_Wrapper.compute_ngrams(model.wv.index2word[indices], model.min_n, model.max_n)
word2_subwords = list(set(word2_subwords))
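For reference, the subword extraction under discussion can be sketched as a standalone function (an illustrative re-implementation of the character n-gram scheme, not the wrapper's actual compute_ngrams code):

```python
def compute_ngrams(word, min_n, max_n):
    """Return the character n-grams of `word` wrapped in angle brackets,
    the scheme fastText uses to build subword features."""
    extended = '<' + word + '>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams
```

With fastText's default min_n=3, max_n=6, compute_ngrams('night', 3, 6) yields 14 n-grams, from '<ni' up to 'night>'.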
Contributor

I thought we changed this to no longer be a set.

Contributor Author

That's correct. I have pushed those changes now.

if context_locks is None:
    context_locks = model.syn0_all_lockf

if word not in model.wv.vocab:
Contributor

Why is this necessary? Shouldn't word be necessarily present in vocab anyway?

Contributor Author

Yes, that's correct. This was in the Word2Vec code as well, so it slipped through, I guess. Thanks for pointing this out!

@tmylk
Contributor

tmylk commented Aug 24, 2017

Please mention the slow runtime of Gensim's pure python version in the notebook. The exact time on Lee corpus for example.

@chinmayapancholi13
Contributor Author

chinmayapancholi13 commented Aug 25, 2017

Comparison of Gensim's native Python implementation with Facebook's original C++ code

Note:

  • The results here have been obtained after training models on the first 10 MB of text8 corpus with iter=2.
  • sg stands for Skipgram, hs stands for Hierarchical Softmax, neg stands for Negative Sampling and cbow stands for Continuous Bag of Words.
  • Implementation type Gensim refers to the Python code (to be added by this PR) and Wrapper refers to the wrapper (present in Gensim) for fastText's original C++ code.

We used mainly 3 functions to compare the 2 implementations:

  1. accuracy()

| Training mode | Semantic accuracy (Facebook) | Semantic accuracy (Gensim) | Syntactic accuracy (Facebook) | Syntactic accuracy (Gensim) |
|---|---|---|---|---|
| sg, neg | 3.98% (83/2086) | 4.41% (92/2086) | 32.05% (2053/6405) | 36.80% (2357/6405) |
| sg, hs | 9.11% (190/2086) | 7.81% (163/2086) | 50.98% (3265/6405) | 49.99% (3202/6405) |
| cbow, neg | 1.49% (31/2086) | 2.25% (47/2086) | 22.53% (1443/6405) | 28.17% (1804/6405) |
| cbow, hs | 4.60% (96/2086) | 2.78% (58/2086) | 51.40% (3292/6405) | 47.84% (3064/6405) |
  2. evaluate_word_pairs()

| Training mode | Implementation | Pearson correlation coefficient | Spearman rank-order correlation coefficient |
|---|---|---|---|
| sg, neg | Wrapper | (0.33571938305084625, 2.7735449357626718e-09) | (correlation=0.3319501426263417, pvalue=4.2630631662495745e-09) |
| sg, neg | Gensim | (0.37160584013854336, 3.4314357501294739e-11) | (correlation=0.37255638854484907, pvalue=3.0313995113129397e-11) |
| sg, hs | Wrapper | (0.43164118498657683, 5.9191607832804485e-15) | (correlation=0.43202508957275548, pvalue=5.5678855620952807e-15) |
| sg, hs | Gensim | (0.44120623358358979, 1.2593563529334038e-15) | (correlation=0.43432666642888956, pvalue=3.8520501516355618e-15) |
| cbow, neg | Wrapper | (0.30456273736238976, 8.1657874241078728e-08) | (correlation=0.31730267261747791, pvalue=2.1454825237853816e-08) |
| cbow, neg | Gensim | (0.30577996094983406, 7.2064020538484688e-08) | (correlation=0.32652507815616916, pvalue=7.8345327821526968e-09) |
| cbow, hs | Wrapper | (0.43941016642738401, 1.690337016383758e-15) | (correlation=0.45312926376171292, pvalue=1.706860830188868e-16) |
| cbow, hs | Gensim | (0.37971140770563433, 1.1769482912504567e-11) | (correlation=0.37195803751309486, pvalue=3.2775570056585581e-11) |
  3. most_similar()

For the mode (cbow , neg), on retrieving the top-10 most similar words for the word night we get:

  • Gensim output:
    [(u'midnight', 0.9369428753852844), (u'knight', 0.906793475151062), (u'dwight', 0.8935667276382446), (u'tight', 0.8830252885818481), (u'bright', 0.8659111261367798), (u'tonight', 0.8634316325187683), (u'wight', 0.8603839874267578), (u'tamazight', 0.8542954921722412), (u'nightclubs', 0.8528885245323181), (u'deck', 0.850151002407074)]

  • Wrapper output:
    [(u'midnight', 0.9334157109260559), (u'knight', 0.9281861782073975), (u'dwight', 0.9247788190841675), (u'upright', 0.900949239730835), (u'tight', 0.896233081817627), (u'bright', 0.8925803899765015), (u'wight', 0.8893818259239197), (u'nightingale', 0.8800072073936462), (u'nightclubs', 0.8782978653907776), (u'tamazight', 0.8743830919265747)]

Overlapping words: [ u'midnight', u'knight', u'dwight', u'tight', u'bright', u'wight', u'tamazight', u'nightclubs']
(8 out of 10)

Similar results (an overlap of around 7 or 8 words) were obtained for the other modes as well.
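The overlap counts above can be computed mechanically from the two most_similar()-style result lists (a small sketch; the sample lists below are abbreviated stand-ins, not the full outputs above):

```python
def top_n_overlap(results_a, results_b):
    """Given two lists of (word, score) pairs, as returned by
    most_similar(), return the set of words present in both."""
    return {w for w, _ in results_a} & {w for w, _ in results_b}

gensim_top = [(u'midnight', 0.94), (u'knight', 0.91), (u'deck', 0.85)]
wrapper_top = [(u'midnight', 0.93), (u'knight', 0.93), (u'upright', 0.90)]
shared = top_n_overlap(gensim_top, wrapper_top)  # {'midnight', 'knight'}
```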

cc: @piskvorky @gojomo

else:
    new_vocab_len = len(self.wv.vocab)
    for ngram, idx in self.wv.hash2index.items():
        self.wv.hash2index[ngram] = idx + new_vocab_len - self.old_vocab_len
Contributor

I think it might be a good idea to have two separate matrices, one for storing the vectors for the <word> tokens, and one for the subwords. Along with renaming the variables to more intuitive names (we don't really need to follow the syn0/syn1 nomenclature here), that should make the code much cleaner.

It also makes resizing much easier.
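The suggestion could look roughly like this (illustrative names and shapes only, not the final gensim API):

```python
import numpy as np

class FastTextVectors(object):
    """Sketch: keep word vectors and subword (ngram) vectors in two
    separate matrices so each can be resized independently."""
    def __init__(self, vocab_size, num_buckets, dim):
        self.dim = dim
        self.vectors_vocab = np.zeros((vocab_size, dim))    # one row per <word> token
        self.vectors_ngrams = np.zeros((num_buckets, dim))  # one row per subword bucket

    def add_words(self, n_new):
        # growing the vocab no longer shifts any ngram indices
        new_rows = np.zeros((n_new, self.dim))
        self.vectors_vocab = np.vstack([self.vectors_vocab, new_rows])
```

Because the ngram matrix is untouched when the vocabulary grows, no hash2index re-indexing of the kind shown above is needed.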

Contributor Author

Done

@piskvorky
Owner

piskvorky commented Aug 30, 2017

@chinmayapancholi13 thanks. What are the differences due to? Why is upright missing from the results, why are there such large swings in the accuracy?

@chinmayapancholi13
Contributor Author

@piskvorky There still remain some differences in the "randomness" at training time between the C++ and Python implementations. These include the initialization of the ngram-vector matrices, choosing which words to downsample, choosing a reduced window size, randomization in the negative-sampling (randomly choosing negative words) and hierarchical-softmax (tree) specific segments of code, and multithreading (worker threads > 1 for the results shown above).

So as far as the difference in accuracy values is concerned, it could be due to the relatively small corpus (10 MB) used here combined with the sources of randomness listed above.

The values become closer when a 100 MB corpus is used for training, as can be seen below:

| Training mode | Implementation | Semantic accuracy | Syntactic accuracy |
|---|---|---|---|
| sg, neg | Wrapper | 4.82% | 57.86% |
| sg, neg | Gensim | 5.95% | 59.83% |
| sg, hs | Wrapper | 12.99% | 60.89% |
| sg, hs | Gensim | 13.16% | 60.18% |
| cbow, neg | Wrapper | 3.73% | 62.82% |
| cbow, neg | Gensim | 4.19% | 64.61% |
| cbow, hs | Wrapper | 10.14% | 63.92% |
| cbow, hs | Gensim | 7.99% | 64.97% |

@chinmayapancholi13
Contributor Author

@piskvorky And about the point regarding not all top-10 words matching: that too seems to be due to the reasons mentioned above. When the model is trained on the 100 MB corpus, the top-10 words for the (cbow, neg) model become the same, as can be seen here:

Gensim:
[(u'midnight', 0.9214520454406738), (u'nightjar', 0.8952612280845642), (u'tonight', 0.8734667897224426), (u'nighthawk', 0.8727679252624512), (u'nightbreed', 0.8692173361778259), (u'nightfall', 0.8459283709526062), (u'nightmare', 0.8459077477455139), (u'nighttime', 0.8353838920593262), (u'mcknight', 0.8227508068084717), (u'nightjars', 0.8224337697029114)]

Wrapper:
[(u'midnight', 0.9323179721832275), (u'nightjar', 0.9195586442947388), (u'nighthawk', 0.8968080282211304), (u'nightfall', 0.8818791508674622), (u'mcknight', 0.8758728504180908), (u'nightbreed', 0.8738420009613037), (u'tonight', 0.8719567656517029), (u'nightmare', 0.857421875), (u'nightjars', 0.8562690019607544), (u'nighttime', 0.8551853895187378)]

Overlap:
set([u'tonight', u'nightjar', u'nighttime', u'nightmare', u'midnight', u'nighthawk', u'mcknight', u'nightbreed', u'nightfall', u'nightjars'])

It is very likely that even for a word which appears in the top-10 list of one implementation but not the other, the word is still in the top-15 or top-20 (say) of the one it is missing from. :)

@menshikh-iv
Contributor

LGTM, very nice job @chinmayapancholi13 🔥 👍
What do you think @piskvorky?

@piskvorky
Owner

piskvorky commented Aug 31, 2017

Thanks @chinmayapancholi13. Most of these items seem RNG-related -- what RNG does the original code use? Is there any way we can simply replicate it, use the same seed and thus get the same random numbers? (At least for testing, so performance is irrelevant here.)

@chinmayapancholi13
Contributor Author

@piskvorky The original C++ code uses minstd_rand along with uniform_real_distribution to generate the random values used at various points in the code (like the reduced window size). If we want the results to be even closer, we could try to emulate this RNG and use it in our code too.
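minstd_rand is a small Lehmer generator (state ← 48271 · state mod 2^31 − 1), so mirroring it in Python for testing is straightforward. A sketch (the uniform_real mapping below is only an approximation of std::uniform_real_distribution, whose exact output is implementation-defined):

```python
class MinstdRand(object):
    """Python mirror of C++ std::minstd_rand."""
    MULTIPLIER = 48271
    MODULUS = 2 ** 31 - 1  # 2147483647

    def __init__(self, seed=1):
        self.state = seed % self.MODULUS or 1

    def next(self):
        # one Lehmer step: state <- 48271 * state mod (2**31 - 1)
        self.state = (self.state * self.MULTIPLIER) % self.MODULUS
        return self.state

    def uniform_real(self, low=0.0, high=1.0):
        # map the raw output range [1, MODULUS - 1] onto [low, high)
        return low + (high - low) * (self.next() - 1) / float(self.MODULUS - 1)
```

As a sanity check, the C++ standard requires the 10000th output of a default-seeded minstd_rand to be 399268537, which this mirror reproduces.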

Also, there was some previous discussion on #1482 about comparing the outputs of our implementation with the original C++ code so that the results are very close in observable quality, rather than numerically identical. I wanted to know whether, in your opinion, the way the comparison has been done so far (using the 3 functions accuracy(), evaluate_word_pairs() and most_similar()) is in the right direction?

@piskvorky
Owner

piskvorky commented Sep 3, 2017

@chinmayapancholi13 thanks for investigating! I appreciate the thoroughness.

We should go for approximation only if there's no other way -- that's why I'm asking about the RNG. Is it difficult to use the exact same RNG directly from Python?

If we can replicate the original RNG (at least for testing), we don't need any "very close in observable quality" (which is always questionable -- is 2 % "very close" or not?). Instead, we can go for identical, ± some numeric rounding error.

If it's not possible to use the original RNG, we could shrug the differences away. But I'd much prefer to start on the right foot, with a verifiable and verified algo, before commencing optimizations.

@chinmayapancholi13
Contributor Author

@piskvorky I agree. We should indeed verify the model's correctness fully before moving on to optimizations, and replicating this RNG in the Python code should help us get closer results. I'll try to do this and post updates if I hit any blockers.

self.wv.syn0_all = self.wv.syn0_all.reshape((num_vectors, dim))
assert self.wv.syn0_all.shape == (self.bucket + len(self.wv.vocab), self.vector_size), \
self.wv.syn0_ngrams = np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)
self.wv.syn0_ngrams = self.wv.syn0_ngrams.reshape((num_vectors, dim))
@luthfianto luthfianto Sep 8, 2017

FastText wrapper training breaks here if I use min_count=2 with my own data, while the default min_count=5 still works.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-08cd9b292aba> in <module>()
----> 1 model.train('/home/rilut/fastText/fasttext', '/datadir/all.csv', model='skipgram', min_count=2)

/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in train(cls, ft_path, corpus_file, output_file, model, size, alpha, window, min_count, word_ngrams, loss, sample, negative, iter, min_n, max_n, sorted_vocab, threads)
    221
    222         output = utils.check_output(args=cmd)
--> 223         model = cls.load_fasttext_format(output_file)
    224         cls.delete_training_files(output_file)
    225         return model

/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in load_fasttext_format(cls, model_file, encoding)
    249             model_file += '.bin'
    250         model.file_name = model_file
--> 251         model.load_binary_data(encoding=encoding)
    252         return model
    253

/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in load_binary_data(self, encoding)
    267             self.load_model_params(f)
    268             self.load_dict(f, encoding=encoding)
--> 269             self.load_vectors(f)
    270
    271     def load_model_params(self, file_handle):

/home/rilut/anaconda2/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/wrappers/fasttext.pyc in load_vectors(self, file_handle)
    349         self.num_original_vectors = num_vectors
    350         self.wv.syn0_ngrams = np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)
--> 351         self.wv.syn0_ngrams = self.wv.syn0_ngrams.reshape((num_vectors, dim))
    352         assert self.wv.syn0_ngrams.shape == (self.bucket + len(self.wv.vocab), self.vector_size), \
    353             'mismatch between actual weight matrix shape {} and expected shape {}'.format(

ValueError: cannot reshape array of size 211096795 into shape (2121425,100)

It seems we need to modify num_vectors in the reshape.
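The failure mode is easy to reproduce at a small scale: numpy refuses a reshape whenever the target shape does not multiply out to the number of elements actually present (toy numbers below, standing in for the num_vectors=2121425, dim=100 from the traceback):

```python
import numpy as np

dim = 10
num_vectors = 5        # shape the .bin header claims
data = np.zeros(47)    # floats actually read from the file (too few: 47 != 5 * 10)

try:
    data.reshape((num_vectors, dim))
    reshape_ok = True
except ValueError:
    # e.g. "cannot reshape array of size 47 into shape (5,10)"
    reshape_ok = False
```

So the ValueError above means the vector count derived from the model header disagrees with the number of floats in the file.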

Contributor Author

Hey @rilut! Using the FastText wrapper with different values of the min_count parameter works fine for me. Could you please share the exact code that causes the problem in your case?
About the num_vectors value: the values of num_vectors and dim are actually read from the files generated by fastText's original C++ code, so the wrapper uses those same values at the time of reshaping.
Apologies for the delayed response; I have been a little occupied recently.

@menshikh-iv
Contributor

@chinmayapancholi13 hey, what's the status here? When will you have time for the full verification?

@piskvorky
Owner

piskvorky commented Sep 14, 2017

Hi guys, I'd like to get this in ASAP, so people can start using it and provide feedback.
Even if the implementation is slow for now -- once we're good with the API and correctness, the optimizations should be straightforward.

@chinmayapancholi13
Contributor Author

Hey @menshikh-iv @piskvorky! My apologies for the hiatus in the ongoing work. I have semester exams going on currently and so haven't been able to devote much time to the PR in the last week. I am planning to resume working and complete the work remaining for verifying correctness fully (by replicating the RNG in Python) in about a week's time. Sorry again for the inconvenience and thanks for your patience.

@piskvorky
Owner

Good luck with your exams @chinmayapancholi13 👍

@menshikh-iv menshikh-iv added the incubator project PR is RaRe incubator project label Sep 19, 2017
@menshikh-iv
Contributor

FYI @chinmayapancholi13, I'll merge this now (because it's done for the current stage and we need feedback from users).
For the next changes (more RNG control, cythonization) please create a new PR.

Very nice job @chinmayapancholi13 🔥

@menshikh-iv menshikh-iv merged commit 6e51156 into piskvorky:develop Sep 19, 2017
@Liebeck

Liebeck commented Sep 21, 2017

I'd like to experiment with Gensim and fastText, but I'm not sure what the current implementation status is.

At this point, the Python implementation mentioned in the notebook is not available via pip, right? This means that, for now, only the C++ wrapper is available in 2.3.0?

@menshikh-iv
Contributor

@Liebeck For 2.3.0, only the C++ wrapper; the current implementation will be available in the next version.
If you want to start your experiments immediately, you can install gensim from the develop branch.

@menshikh-iv
Contributor

@Liebeck this functionality is now available in the latest gensim version (3.0.0).

@menshikh-iv
Contributor

@Liebeck right now @manneshiva is optimizing our pure-Python version, so very soon we'll have a very fast version. You can monitor progress in #1742 (it will be finished in two weeks, maybe faster).
