mismatch in vec and bin files in french pretrained vector #218
Comments
I looked into this further - the fact that the additional word was So I'm assuming the French wiki has the term The actual term I'm surprised the term

We've written a Python wrapper as part of Gensim to allow users to load FastText models and use word vector functionality already present in gensim, and this bug was affecting some of our users - piskvorky/gensim#1236. Is this likely to be fixed in the near future, or is it too niche? If so, we don't mind adding a workaround in our wrapper.
Hello @prakhar2b, This issue has recently been resolved. We updated the models and .vec files to resolve this mismatch. Now all the vectors in both the bin+text and text versions should match. Please feel free to reopen at any time if that is not the case. Thanks,
For all other (pretrained) vector models, vocab_size obtained from the vec file is equal to the size and nwords obtained from the bin file (this line).

But for wiki.fr, vocab_size is 1152449, size is 1152450, and nwords is 1152449. On further analysis, this additional vocab_word is u'__label__', which is not present in the vec file or in any other pretrained vector model.

This doesn't cause any bug in the fastText code, but I find it a little unusual. It would be really helpful if somebody could provide an insight or explanation behind this.
Note: This is important because it is often more convenient to load vectors from the vec file and additional parameters from the bin file. This sort of mismatch causes unnecessary complexity in the code.
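The check described above can be sketched in Python. This is only an illustration, not code from fastText or gensim: `read_vec_header` and `check_counts` are hypothetical helper names, the sketch assumes the vec file starts with a header line of the form `<vocab_size> <dim>`, and the bin-side values (size, nwords) are passed in as plain integers instead of being parsed from the bin file.

```python
import os
import tempfile


def read_vec_header(path):
    """Return (vocab_size, dim) from the first line of a text-format vec file."""
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = f.readline().split()
    return int(vocab_size), int(dim)


def check_counts(vocab_size, size, nwords):
    """Return a list of mismatch messages; the list is empty when all counts agree."""
    problems = []
    if vocab_size != nwords:
        problems.append(f"vec vocab_size {vocab_size} != bin nwords {nwords}")
    if size != nwords:
        problems.append(
            f"bin size {size} != bin nwords {nwords} "
            f"({size - nwords} extra dictionary entries)"
        )
    return problems


# Tiny stand-in for a real vec file: a header line, then one vector per line.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.vec")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("2 3\nle 0.1 0.2 0.3\nla 0.4 0.5 0.6\n")
    demo_header = read_vec_header(demo_path)

# Counts reported in this issue for wiki.fr.
wiki_fr_problems = check_counts(vocab_size=1152449, size=1152450, nwords=1152449)
for message in wiki_fr_problems:
    print(message)
```

With the wiki.fr numbers from this issue, only the size versus nwords comparison is flagged, which matches the single extra u'__label__' entry.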