-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support both old and new fastText model #1319
Conversation
test failing as data used Note :- As mentioned in this comment, facebook made changes last week to introduce two variable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor code style nitpick; thanks for the PR!
gensim/models/wrappers/fasttext.py
Outdated
@@ -256,7 +256,7 @@ def load_binary_data(self, model_binary_file): | |||
self.load_vectors(f) | |||
|
|||
def load_model_params(self, file_handle): | |||
(dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t) = self.struct_unpack(file_handle, '@12i1d') | |||
(_,_,dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t) = self.struct_unpack(file_handle, '@14i1d') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: space after comma. Outer brackets not needed.
gensim/models/wrappers/fasttext.py
Outdated
@@ -275,7 +275,7 @@ def load_dict(self, file_handle): | |||
# Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc) | |||
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes' | |||
assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes' | |||
ntokens, = self.struct_unpack(file_handle, '@q') | |||
ntokens,pruneidx_size = self.struct_unpack(file_handle, '@2q') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space after comma.
gensim/models/wrappers/fasttext.py
Outdated
@@ -289,6 +289,11 @@ def load_dict(self, file_handle): | |||
assert self.wv.vocab[word].index == i, 'mismatch between gensim word index and fastText word index' | |||
self.wv.vocab[word].count = count | |||
|
|||
for j in range(pruneidx_size): | |||
_,_ = self.struct_unpack(file_handle,'@2i') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaces after commas.
gensim/models/wrappers/fasttext.py
Outdated
for j in range(pruneidx_size): | ||
_,_ = self.struct_unpack(file_handle,'@2i') | ||
|
||
_ = self.struct_unpack(file_handle,'@?') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dtto
And do we continue to support the old format too? Or will everyone's code (and models) break after this change? Are the FastText datasets that people commonly use still load-able (FB released models in many languages some time ago IIRC). |
@piskvorky the fastText datasets (pretrained vectors) are updated according to the recent changes (see this update 12 days ago). The issue #1301 (about Hebrew pretrained vector) was because of these changes only. Issue 1236 (about French pretrained vector) is a glitch from facebook's side, not in gensim's code. |
@prakhar2b I think we should definitely be supporting loading both the old and new FastText models with the gensim fasttext wrapper. FastText seem to have updated the links on their page with the newer models, however there might still be a number of people using the old models. Also, adding tests for loading both older and newer format models would be good. |
@tmylk @menshikh-iv please look into the branch conflict |
gensim/test/test_fasttext_wrapper.py
Outdated
#Test model successfully loaded from fastText (new format) .vec and .bin files | ||
new_model = fasttext.FastText.load_fasttext_format(self.test_new_model_file) | ||
vocab_size, model_size = 1763, 10 | ||
self.assertEqual(self.test_new_model.wv.syn0.shape, (vocab_size, model_size)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this and the subsequent assert statements make use of new_model
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.assertEqual(self.`test_new_model`.wv.syn0.shape, (vocab_size, model_size))
new_model
is used here after assert statements
self.model_sanity(new_model)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, so why not for the other assert statements? If I understand correctly, the whole point of the test is to make sure the model we loaded (new_model
) has been loaded correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_new_model
is defined here in line 38. Only new model is being checked in this function testLoadFastTextNewFormat
. Please compare it with function testLoadFastTextFormat
to see the context.
gensim/models/wrappers/fasttext.py
Outdated
|
||
def load_vectors(self, file_handle): | ||
if self.new_format: | ||
_ = self.struct_unpack(file_handle,'@?') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a comment clarifying what this is for?
gensim/models/wrappers/fasttext.py
Outdated
logger.warnings("Please report to gensim or fastText.") | ||
ntokens= self.struct_unpack(file_handle, '@1q') | ||
if self.new_format: | ||
pruneidx_size = self.struct_unpack(file_handle, '@q') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pruneidx_size
seems to be used in the original FastText code to load dictionaries -
void Dictionary::load(std::istream& in) { words_.clear(); std::fill(word2int_.begin(), word2int_.end(), -1); in.read((char*) &size_, sizeof(int32_t)); in.read((char*) &nwords_, sizeof(int32_t)); in.read((char*) &nlabels_, sizeof(int32_t)); in.read((char*) &ntokens_, sizeof(int64_t)); in.read((char*) &pruneidx_size_, sizeof(int64_t)); for (int32_t i = 0; i < size_; i++) { char c; entry e; while ((c = in.get()) != 0) { e.word.push_back(c); } in.read((char*) &e.count, sizeof(int64_t)); in.read((char*) &e.type, sizeof(entry_type)); words_.push_back(e); word2int_[find(e.word)] = i; } pruneidx_.clear(); for (int32_t i = 0; i < pruneidx_size_; i++) { int32_t first; int32_t second; in.read((char*) &first, sizeof(int32_t)); in.read((char*) &second, sizeof(int32_t)); pruneidx_[first] = second; } initTableDiscard(); initNgrams(); }
I'm not sure if it's present in the models we're loading or not though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for j in range(pruneidx_size):
_,_ = self.struct_unpack(file_handle,'@2i')
Presence (or absence) of these two lines doesn't affect the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see those two lines in the code. Do you mean pruneidx_size
is always zero, and it doesn't matter whether we keep those lines or not?
If yes, we should add an assert statement, since if pruneidx_size
isn't zero and we don't read the corresponding bytes, the subsequent bytes are not going to be parsed correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's -1 actually. Added these two lines in the new commit. I think we should not use assert statement here as some models might have non-negative values, so adding these two lines should be sufficient.
gensim/models/wrappers/fasttext.py
Outdated
def load_dict(self, file_handle): | ||
(vocab_size, nwords, _) = self.struct_unpack(file_handle, '@3i') | ||
def load_dict(self, file_handle, encoding='utf8'): | ||
vocab_size, nwords, _ = self.struct_unpack(file_handle, '@3i') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer keeping the changes related to the issue with the french wiki in a separate PR. We don't want those changes to block this PR from being merged.
In general, I think it's good practice to keep fixes for different issues in different branches/PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separated the PRs. Thanks :)
gensim/models/wrappers/fasttext.py
Outdated
if magic == 793712314: # newer format | ||
self.new_format = True | ||
dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t = self.struct_unpack(file_handle, '@12i1d') | ||
else: # older format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to set self.new_format = False
inside this else
, rather than inside initialize_word_vectors
gensim/models/wrappers/fasttext.py
Outdated
@@ -256,7 +257,14 @@ def load_binary_data(self, model_binary_file): | |||
self.load_vectors(f) | |||
|
|||
def load_model_params(self, file_handle): | |||
(dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t) = self.struct_unpack(file_handle, '@12i1d') | |||
magic, v= self.struct_unpack(file_handle, '@2i') | |||
if magic == 793712314: # newer format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be good to store this value in a global variable. FASTTEXT_MAGIC_HEADER
or something similar.
gensim/models/wrappers/fasttext.py
Outdated
@@ -256,7 +257,14 @@ def load_binary_data(self, model_binary_file): | |||
self.load_vectors(f) | |||
|
|||
def load_model_params(self, file_handle): | |||
(dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t) = self.struct_unpack(file_handle, '@12i1d') | |||
magic, v= self.struct_unpack(file_handle, '@2i') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code style - keep 1 space before and after =
Also, version
would be preferable to v
.
Left some comments, but apart from that, looks good! What issues are you facing with the branch conflict? |
The old fasttext tests should still pass - strange to see them broken. |
I pulled 'mismatch' branch, and re-install
|
ohh, the last PR was only a one word change, so in excitement I entered True instead of False, and committed without testing. The old fasttext tests are not failing now. But, the new fasttext models should not have thrown error anyway. The only error is about assigned variable never being used. How should I handle it, it is required @tmylk @xiaobaozi34 I checked, it's running perfectly with this PR on my system. Could you please re-run it with logging and show me the logs please. Thank you |
gensim/models/wrappers/fasttext.py
Outdated
# Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc) | ||
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes' | ||
assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes' | ||
ntokens, = self.struct_unpack(file_handle, '@q') | ||
ntokens = self.struct_unpack(file_handle, '@1q') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could change this to simply self.struct_unpack(file_handle, '@1q') # number of tokens - unused
to remove the unused variable.
gensim/models/wrappers/fasttext.py
Outdated
assert self.wv.vocab[word].index == i, 'mismatch between gensim word index and fastText word index' | ||
self.wv.vocab[word].count = count | ||
|
||
if self.new_format: | ||
for j in range(pruneidx_size): | ||
_, _ = self.struct_unpack(file_handle, '@2i') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
gensim/models/wrappers/fasttext.py
Outdated
def load_vectors(self, file_handle): | ||
if self.new_format: | ||
_ = self.struct_unpack(file_handle, '@?') # bool quant_input in fasttext.cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
@prakhar2b Left a couple of minor comments re failing build, we should be good to go after that. Thanks for taking care of this. A couple of tips for the future -
|
gensim/models/wrappers/fasttext.py
Outdated
@@ -286,7 +286,7 @@ def load_dict(self, file_handle, encoding='utf8'): | |||
# Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc) | |||
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes' | |||
assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes' | |||
ntokens = self.struct_unpack(file_handle, '@1q') | |||
self.struct_unpack(file_handle, '@1q') # number of tokens |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: Inline comments should be separated by at least two spaces from the statement
Nice work @prakhar2b 🥇 |
Looking at the recent release train, would this probably go into some newer release within a month or so? |
@matanster yes, we plan the release on the next week. |
Fix #1301 . Support both old and new fastText model.