-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix method estimate_memory
from gensim.models.FastText
& huge performance improvement. Fix #1824
#1916
Conversation
@jbaiter Wow! Please resolve merge-conflicts (this is critical right now). Probably, create new branch (based on fresh When you resolve conflicts - please ping me for review / any help. |
7c6afb2
to
08c464a
Compare
08c464a
to
51a1a6e
Compare
@menshikh-iv So I managed to rebase my changes on the latest develop branch. |
gensim/test/test_fasttext_wrapper.py
Outdated
out_expected_vec = numpy.array([-1.34948218, -0.8686831, -1.51483142, -1.0164026, 0.56272298, | ||
0.66228276, 1.06477463, 1.1355902, -0.80972326, -0.39845538]) | ||
out_expected_vec = numpy.array([-0.33959097, -0.21121596, -0.37212455, -0.25057459, 0.11222091, | ||
0.17517674, 0.26949012, 0.29352987, -0.1930912, -0.09438948]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably not correct, the vector seems to differ between different Python versions (the above was with 3.6)
…tPersistenceForOldVersions
I now reverted all changes to the deprecated On Windows there seems to be a memory-related issue, the allocation for the ngram vectors in |
@jbaiter thanks, for memory - this is known issue for appveyour :( I'll look into later, thanks for your patience. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC: @manneshiva can you review this one, please?
gensim/models/fasttext_inner.pyx
Outdated
@@ -317,7 +317,8 @@ def train_batch_sg(model, sentences, alpha, _work, _l1): | |||
continue | |||
indexes[effective_words] = word.index | |||
|
|||
subwords = [model.wv.ngrams[subword_i] for subword_i in model.wv.ngrams_word[model.wv.index2word[word.index]]] | |||
subwords = [model.wv.hash2index[ft_hash(subword) % model.bucket] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use only hanging indents (no vertical).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is, remove the newline and keep the comprehension on a single line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
single line (if line <= 120 characters) or something like
subwords = [
....
....
]
gensim/models/utils_any2vec_fast.pyx
Outdated
@@ -0,0 +1,21 @@ | |||
#!/usr/bin/env cython |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This related only with n-grams model, I think it's better to move it to fasttext_inner.pyx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this would work, since both functions are used by both models.fasttext
and models.keyedvectors
. Moving the functions into models.fasttext
would break, since this would cause a circular import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, so, in this case, better to name it as _utils_any2vec.pyx
(similar with _mmreader.pyx
and _matutils.pyx
)
gensim/models/utils_any2vec_fast.pyx
Outdated
# cython: cdivision=True | ||
# coding: utf-8 | ||
|
||
def ft_hash(unicode string): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not cdef
with nogil
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both functions need to be called from both Python and Cython, so cdef
won't work, but cpdef
should. Will be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nogil
doesn't work, since the function returns a Python object.
gensim/models/utils_any2vec_fast.pyx
Outdated
return h | ||
|
||
|
||
cpdef compute_ngrams(word, unsigned int min_n, unsigned int max_n): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not nogil
, you can fix type for word
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixing the type for word
doesn't really work, since the token might be a str
on Python 2.7. nogil
won't work either, since the function returns a Python object.
gensim/models/utils_any2vec_fast.pyx
Outdated
|
||
|
||
cpdef compute_ngrams(word, unsigned int min_n, unsigned int max_n): | ||
cdef unicode extended_word = f'<{word}>' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f-strings supports only in 3.6
(and we maintain 2.7
, 3.5
, 3.6
), please use simple concatenation (or any alternative) here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in Cython, which, as far as I understood, since 0.24 automatically generates cross-compatible C-code for it. It works fine under 2.7.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, really? I didn't know about it, thanks for the information!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can read about it here: http://cython.readthedocs.io/en/latest/src/tutorial/strings.html#string-literals
model_neg.build_vocab(new_sentences, update=True) # update vocab | ||
model_neg.train(new_sentences, total_examples=model_neg.corpus_count, epochs=model_neg.iter) | ||
self.assertEqual(len(model_neg.wv.vocab), 14) | ||
self.assertTrue(len(model_neg.wv.ngrams), 271) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why you removed part of tests (I mean all removed lines in tests)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned in the PR, one optimization was to remove the storage of ngrams on the model and solely rely on the hashes. This is why any tests that assert the number of ngrams in the model are no longer neccesary.
With the lack of ngrams, the __contains__
check now also only uses the hashed and bucketed ngrams, which is why a 'real' OOV is a lot rarer (i.e. it will only happen if not all buckets are occupied and none of the ngrams in the token match any occupied bucket).
@jbaiter don't forget to resolve merge-conflict too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbaiter I went through your PR and it looks good to me. Great job!
Just one more small deletion required in /gensim/models/deprecated/fasttext.py
. Please delete,
new_model.wv.ngrams_word = old_model.wv.ngrams_word
new_model.wv.ngrams = old_model.wv.ngrams
and the corresponding asserts in test_fasttext.py
, here.
I also feel we should also evaluate the effect of the changes in this PR on the quality of vectors learnt. Maybe compare the old and new code by training a FastText
model on text8
and looking at the accuracies (using accuracy) of learnt vectors on question-answers.txt
. It would also be interesting to see the memory consumption in both cases.
cc: @menshikh-iv
This removes the expensive calls to `compute_ngrams` and `ft_hash` during training and uses a simple lookup in an int -> int[] mapping instead, resulting in a dramatic increase in training performance.
I ran some benchmarks with my optimized version and the current gensim implementation of FastText. Initially the performance was about 10x slower, but I implemented an optimization that pre-generates the ngram buckets for each word to avoid calling I trained on the
As you can see, the goal of reducing memory consumption was achieved and we additionally almost doubled the training speed, while maintaining evaluation speed. The quality of the vectors seems to suffer a bit, however I think that this might be due to the different random initializations of the two models.
|
@jbaiter first table looks awesome: almost x2 faster + reduce memory usage, fantastic 🔥! About second table & random init: can you train/evaluate several times & average results (to exclude random effect) please? Also, please have a look at Appveyor, I see |
@jbaiter The speedup looks great! Thanks for this contribution.
|
@manneshiva @jbaiter maybe try to compare on bigger corpus (something ~1GB, not |
Will do, I'll also do a run each with a fixed seed.
Do you have any idea what could be causing this?
Can you recommend a suitable dataset that works with the tests in
Will do!
Yes, the values for subwords = [model.wv.ngrams[subword_i] for subword_i in model.wv.ngrams_word[model.wv.index2word[word.index]]] vs subwords = model.wv.buckets_word[word.index] The current version does |
This happens sometimes with appveyour (by memory limit reasons), so, you can try to use the smaller model in this case.
Sample from https://github.com/RaRe-Technologies/gensim-data/releases/tag/wiki-english-20171001 should be a good idea (you should pick first 1M articles). |
I'll make benchmark myself too ( Code: https://gist.github.com/menshikh-iv/ba8cba26744c668e73b59d5972dabbf8 from gensim.parsing.preprocessing import (
preprocess_string, strip_punctuation, strip_multiple_whitespaces, remove_stopwords, strip_short
)
prc = partial(
preprocess_string,
filters=[strip_punctuation, strip_multiple_whitespaces, remove_stopwords, strip_short]
) Model parameters: almost default, only
|
So I ran the As for the causes of the speedup, it seems that my changes somehow result in a significant increase in parallelism, as can be gathered from these performance counters: Perfomance counters for training + evaluation on current
Performance counters for training + evaluation on optimized
Additionally to using a full core more, IPC seems to be higher (1.32 instructions/cycle vs 1.12 before) and the overall number of instructions is lower (could be because of the lower number of lookups?) |
I checked current PR with 500,000 articles from wiki, my results #1916 (comment), this looks exciting 🌟 🔥 impressive work @jbaiter! I also checked, the model from |
@jbaiter last things that missed here
|
This is files from profiler (current relese version + PR version) Script: import gensim.downloader as api
from gensim.models import FastText
data = api.load("text8")
model = FastText(data) @manneshiva this is for you |
estimate_memory
from gensim.models.FastText
& huge performance improvement. Fix #1824
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jbaiter thanks a lot for the PR! This looks really great, and that's some serious speedup. Would really appreciate if you could address the comments in my review.
gensim/models/fasttext.py
Outdated
wv.hash2index[ngram_hash] = new_hash_count | ||
wv.ngrams[ngram] = wv.hash2index[ngram_hash] | ||
new_hash_count = new_hash_count + 1 | ||
wv.num_ngram_vectors = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could probably reduce some variables here - there seems to be some redundancy, if I understand correctly. wv.num_ngram_vectors
, new_hash_count
and len(ngram_indices)
serve effectively the same purpose.
Maybe we could use len(ngram_indices)
within the loop and set wv.num_ngram_vectors
at the end of the loop?
gensim/models/fasttext.py
Outdated
new_ngrams = list(set(new_ngrams)) | ||
wv.num_ngram_vectors += len(new_ngrams) | ||
logger.info("Number of new ngrams is %d", len(new_ngrams)) | ||
if not wv.buckets_word: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this?
gensim/models/fasttext.py
Outdated
new_hash_count = new_hash_count + 1 | ||
else: | ||
wv.ngrams[ngram] = wv.hash2index[ngram_hash] | ||
num_new_ngrams = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be some redundancy again with new_hash_count
, num_new_ngrams
.
gensim/models/fasttext.py
Outdated
continue | ||
ngram_indices.append(len(wv.vocab) + ngram_hash) | ||
wv.hash2index[ngram_hash] = wv.num_ngram_vectors | ||
wv.num_ngram_vectors += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be set to len(ngram_indices)
at the end instead (sorry for nitpicking, but we already have very long code for some of these methods)
gensim/models/keyedvectors.py
Outdated
word_vec += ngram_weights[self.ngrams[ngram]] | ||
ngram_hash = _ft_hash(ngram) % self.bucket | ||
if ngram_hash in self.hash2index: | ||
word_vec += ngram_weights[self.hash2index[ngram_hash]] | ||
if word_vec.any(): | ||
return word_vec / len(ngrams) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably needs to be updated to only take into account the ngrams for which hashes were present in self.hash2index
0.58818, | ||
0.57828, | ||
0.75801 | ||
-0.21929, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for this change? Could it because of the len(ngrams)
issue mentioned in a comment above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's really a crucial question, but it's not because of the len(ngrams)
issue.
The change was honestly simply made because the numbers were pretty similar and I thought that the vectors just changed a bit since the new code is far more lenient in assigning a vector to unknown ngrams (i.e. once all buckets are occupied, any ngram will result in a vector, even if it was not in the original corpus).
But it looks like there might a bug in the old code that has something to do with this: There are a lot more ngram vectors in the loaded model (17004) than there are in the model on disk (2762). This is probably because wv.vectors_ngram = wv.vectors_ngrams.take(ngram_indices, axis=0)
in init_ngrams_post_load
will result in a (num_ngrams_total, ngram_vec_len)
matrix. Shouldn't vectors_ngram
have a shape of (num_buckets, ngram_vec_len)
? At least that's the case in the new code, and it follows from my (not necessarily correct) understanding of how the bucketing in this implementation works.
This sound similar to what was reported in #1779
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...since the new code is far more lenient in assigning a vector to unknown ngrams (i.e. once all buckets are occupied, any ngram will result in a vector, even if it was not in the original corpus).
Ahh right, that makes sense, thanks for explaining.
Re: the number of ngram vectors being greater than num_buckets
(or the number of vectors on disk) - I see why that might have been happening. With a ngram vocab larger than the number of buckets, a lot of ngrams will be mapped to the same indices. And when .take
is passed a list that contains multiple occurrences of the same index, the vector at that index is "taken" multiple times.
For example -
all_vectors = np.array([[0.1, 0.3], [0.3, 0.1]])
taken_vectors = all_vectors.take([0, 1, 0], axis=0)
taken_vectors.shape
>>> (3, 2)
So it wouldn't result in incorrect results, but yeah, it'd result in unexpectedly high memory usage (and kind of blowing out the whole idea of keeping memory usage constant even with increasing ngram vocabs, out of the water). Thanks for investigating this and explaining it!
@menshikh-iv we're good to merge from my side
0.18025, | ||
-0.14128, | ||
0.22508 | ||
-0.49111, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto:
What is the reason for this change? Could it because of the len(ngrams)
issue mentioned in a comment above?
Some missed stuff
|
This PR is an attempt to optimize the memory usage of the FastText model and to provide a more accurate
FastText.estimate_memory
method.Specifically, it implements the following improvements:
ft_hash
functioncompute_ngrams
functionIn its current state this PR does not merge and does not pass the test suite, due to these issues:
The improvements were done before the refactoring of the word embedding models, likely some code will have to be moved aroundThere are some Python2/3 issues with the cythonizedcompute_ngrams
functionngrams
attribute of the model, but in the optimized model I use the ngram hash. Since these hashes are bucketed, an OOV is most often no longer possible (i.e. if all buckets are occupied), i.e. these tests would have to be removed. Is this okay?I think I can do 1) and 2) on my own when I find the time, but for 3) I'd need some help, since I'm not that familiar with the intentions behind the old code.