FastText wrapper returns inconsistent dtypes #1637

Closed
mcobzarenco opened this issue Oct 19, 2017 · 1 comment
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix

Comments

@mcobzarenco
Contributor

mcobzarenco commented Oct 19, 2017

Description

gensim.models.wrappers.FastText returns inconsistent dtypes.

Steps/Code/Corpus to Reproduce

from gensim.models.wrappers import FastText
embeds = FastText.load_fasttext_format(...)

For an existing word:

embeds['the'].dtype == dtype('float32')

For an "imputed" word (one missing from the vocabulary), whose embedding is computed as the sum of the embeddings of its n-grams:

embeds['ttttt'].dtype == dtype('float64')

The problem is in models/wrappers/fasttext.py::FastTextKeyedVectors.word_vec. For a missing word, the zero vector is initialised as a 64-bit float array (NumPy's default dtype), and the 32-bit n-gram embeddings are then added to it, so the result is float64.
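The upcast described above can be reproduced with NumPy alone. The following is a minimal sketch, not gensim's actual code: `ngram_vectors` and `sum_ngrams` are hypothetical stand-ins for the n-gram embeddings and the accumulation loop, and passing an explicit float32 dtype is one possible fix, assumed here for illustration.

```python
import numpy as np

# Hypothetical stand-ins for float32 n-gram embeddings (not gensim data).
ngram_vectors = [np.ones(4, dtype=np.float32) for _ in range(3)]


def sum_ngrams(vectors, dtype=None):
    """Sum vectors into a zero accumulator.

    dtype=None lets np.zeros fall back to its default, float64.
    """
    acc = np.zeros(4, dtype=dtype)
    for vec in vectors:
        acc += vec  # in-place add keeps the accumulator's dtype
    return acc


upcast = sum_ngrams(ngram_vectors)                   # default accumulator
fixed = sum_ngrams(ngram_vectors, dtype=np.float32)  # explicit float32

print(upcast.dtype, fixed.dtype)  # float64 float32
```

Creating the accumulator with the same dtype as the stored embeddings keeps the result consistent with vectors looked up directly from the vocabulary.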

Versions

Linux-4.4.0-97-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609]
NumPy 1.13.3
SciPy 0.19.1
gensim 3.0.1
FAST_VERSION 1

@piskvorky
Owner

Nice catch @mcobzarenco ! Thanks.

@menshikh-iv menshikh-iv added bug Issue described a bug difficulty easy Easy issue: required small fix labels Oct 19, 2017
horpto pushed a commit to horpto/gensim that referenced this issue Oct 28, 2017