Doc2Vec.infer_vector: AttributeError: 'Doc2Vec' object has no attribute 'syn1' #483

Closed
codingluke opened this issue Oct 16, 2015 · 19 comments
Labels: bug, conda, difficulty easy, documentation

Comments

@codingluke

Hi all,
I successfully trained a Doc2Vec model with the data from the Kaggle tutorial "Bag of Words Meets Bags of Popcorn" https://www.kaggle.com/c/word2vec-nlp-tutorial/data. The methods most_similar and doesnt_match work as expected.

However, when I use the infer_vector method, I get the error AttributeError: 'Doc2Vec' object has no attribute 'syn1'. When I inspect the model, only model.syn0 is available.

System info

MacOSX 10.10.5,
Python 2.7.10

Packages (I don't use Cyclone at the moment...)

boto (2.38.0)
bz2file (0.98)
gensim (0.12.2)
httpretty (0.8.6)
numpy (1.10.1)
pip (7.1.2)
requests (2.8.1)
scipy (0.16.0)
setuptools (18.2)
six (1.10.0)
smart-open (1.3.0)
wheel (0.24.0)

Example in IPython

In [5]: from gensim.models import Doc2Vec
In [6]: model = Doc2Vec.load('./Doc2Vec300features_40minwords_10context')
In [7]: model.infer_vector("hallo ich bin ein text".split())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-c4a827fd56d1> in <module>()
----> 1 model.infer_vector("hallo ich bin ein text".split())

/Users/{myuser}/Documents/Dev/virt_env2/lib/python2.7/site-packages/gensim/models/doc2vec.pyc in infer_vector(self, doc_words, alpha, min_alpha, steps)
    694                 train_document_dm(self, doc_words, doctag_indexes, alpha, work, neu1,
    695                                   learn_words=False, learn_hidden=False,
--> 696                                   doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
    697             alpha = ((alpha - min_alpha) / (steps - i)) + min_alpha
    698

/Users/{myuser}/Documents/Dev/virt_env2/lib/python2.7/site-packages/gensim/models/doc2vec_inner.pyx in gensim.models.doc2vec_inner.train_document_dm (./gensim/models/doc2vec_inner.c:4736)()
    419
    420     if hs:
--> 421         syn1 = <REAL_t *>(np.PyArray_DATA(model.syn1))
    422
    423     if negative:

AttributeError: 'Doc2Vec' object has no attribute 'syn1'

Thanks for your help! :)

@codingluke
Author

I could work around the error in two ways:

  1. setting the parameter hs=0 when initializing the model, or
  2. not calling model.init_sims()

I'm no expert on this topic, but hs (hierarchical softmax for training) seems important to me, doesn't it? Also, init_sims, which "freezes" the model so that it's faster and smaller, is a good thing.

@codingluke
Author

I just figured it out: it also works with init_sims(replace=False).

In [9]: from gensim.models import Doc2Vec
In [10]: model = Doc2Vec.load('./Doc2VecMini300features_40minwords_10context')
In [11]: type(model.syn1)
Out[11]: numpy.ndarray
In [12]: model.init_sims(replace=True)
In [13]: type(model.syn1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-5f166d4ec376> in <module>()
----> 1 type(model.syn1)

AttributeError: 'Doc2Vec' object has no attribute 'syn1'

init_sims(replace=True) seems to delete the attribute syn1 from the model, which infer_vector needs.
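The session above can be mimicked without gensim. This is a minimal toy sketch, not gensim's implementation: `ToyModel`, its `init_sims`, and its `infer_vector` are hypothetical stand-ins that only borrow the attribute names seen above, to show why deleting `syn1` makes a later inference call fail.

```python
class ToyModel:
    """Hypothetical stand-in for a trained model; NOT gensim's Doc2Vec."""

    def __init__(self):
        self.syn0 = [[0.1, 0.2], [0.3, 0.4]]  # input-side vectors
        self.syn1 = [[0.5, 0.6], [0.7, 0.8]]  # hidden-layer weights (hs=1)

    def init_sims(self, replace=False):
        # Assumption taken from the session above: replace=True drops syn1.
        if replace and hasattr(self, "syn1"):
            del self.syn1

    def infer_vector(self, doc_words):
        # doc_words is ignored in this toy; the point is only that
        # inference reads the hidden-layer weights.
        hidden = self.syn1  # raises AttributeError once syn1 is gone
        return [sum(col) / len(hidden) for col in zip(*hidden)]


kept = ToyModel()
kept.init_sims(replace=False)
vector = kept.infer_vector(["hallo", "ich"])  # works: syn1 still present

trimmed = ToyModel()
trimmed.init_sims(replace=True)
try:
    trimmed.infer_vector(["hallo", "ich"])
    failed = False
except AttributeError:
    failed = True  # same kind of failure as in the traceback above
```

In this toy, keeping the weights (replace=False) leaves inference usable, while replace=True trades that ability for a smaller object.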

@codingluke
Author

I think it's clear now. model.infer_vector trains the new documents with the neural weights of the current model (https://github.com/piskvorky/gensim/blob/develop/gensim/models/doc2vec.py#L684).
Since model.init_sims(replace=True) deletes those weights to save memory, model.infer_vector cannot work. It's the same reason why model.train does not work after model.init_sims(replace=True).

If I'm right, it might be good to raise an appropriate error/warning and/or add a comment in the docs.
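A clearer error could look roughly like the sketch below. `check_can_infer` is a hypothetical helper, not part of gensim; it only assumes the attribute names mentioned in this thread (`hs`, `negative`, `syn1`, `syn1neg`) and uses stand-in objects so that it runs without gensim installed.

```python
from types import SimpleNamespace


def check_can_infer(model):
    """Raise a descriptive error if the weights needed for inference are gone.

    Hypothetical helper; attribute names follow the ones used in this thread.
    """
    missing = []
    if getattr(model, "hs", 0) and not hasattr(model, "syn1"):
        missing.append("syn1")
    if getattr(model, "negative", 0) and not hasattr(model, "syn1neg"):
        missing.append("syn1neg")
    if missing:
        raise RuntimeError(
            "cannot infer: missing %s; was init_sims(replace=True) called?"
            % ", ".join(missing)
        )


# Stand-in objects instead of real models, so the sketch runs without gensim.
intact = SimpleNamespace(hs=1, negative=0, syn1=[[0.0]])
trimmed = SimpleNamespace(hs=1, negative=0)

check_can_infer(intact)  # passes silently
try:
    check_can_infer(trimmed)
    message = ""
except RuntimeError as err:
    message = str(err)
```

Calling such a check before inference would turn the bare AttributeError into a message that names the likely cause.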

@gojomo
Collaborator

gojomo commented Oct 16, 2015

Thanks for your report. Yes, inference works almost exactly like training, so a model with training-state discarded won't be able to reasonably infer either. The comment for init_sims(replace=True) could be a bit clearer.

This area might benefit from a bit more renaming/commenting/refactoring for full clarity, for a few reasons related to what...

  • The fact that init_sims(replace=True) will clear syn1 (which is only used when hs=1) but not syn1neg (which serves the same role when negative>0) is a bit inconsistent.
  • There is one variant of Doc2Vec training – pure DBOW – that doesn't initialize or need the words syn0 at all, but would still need the syn1 or syn1neg. So if for some reason you did bother to call init_sims(replace=True) on it, its ability to infer would survive if it were based on negative sampling (since syn1neg isn't discarded)... but would break if using hierarchical-softmax (since syn1 is discarded). So it's unclear if init_sims(replace=True) should imply 'minimize my model in all ways', or if that should become a different explicit step (as suggested in #446, "Word2Vec/Doc2Vec offer model-minimization method").
  • In a few projects/papers it has been mentioned that word vectors created by concatenating the syn0 ('context') and syn1/syn1neg ('prediction') vectors can outperform the plain syn0 context vectors. Supporting experiments with that would further change the situations in which syn1/syn1neg should be consulted or discarded.
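The concatenation idea in the last point can be illustrated with plain lists. This is a rough sketch, not an existing gensim method: the function name and the toy numbers are made up, and it assumes both matrices hold one row per vocabulary word in the same order, which may not hold for every training mode.

```python
def concat_word_vectors(syn0, syn1neg):
    """Join each word's 'context' row with its 'prediction' row."""
    if len(syn0) != len(syn1neg):
        raise ValueError("both matrices need one row per word")
    return [list(ctx) + list(pred) for ctx, pred in zip(syn0, syn1neg)]


# Toy 2-word vocabulary with 3-dimensional vectors (made-up numbers).
syn0 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
syn1neg = [[1.0, 1.1, 1.2], [1.3, 1.4, 1.5]]

combined = concat_word_vectors(syn0, syn1neg)
```

Vectors like these can only be built while the prediction weights are still on the model, which is one more case where discarding them would matter.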

@gojomo gojomo self-assigned this Oct 16, 2015
@gojomo gojomo added bug Issue described a bug documentation Current issue related to documentation labels Oct 16, 2015
@tmylk
Contributor

tmylk commented Jan 9, 2016

@gojomo Should this be marked as easy?

@gojomo
Collaborator

gojomo commented Jan 12, 2016

@tmylk these are ease-of-use/least-surprise/ease-of-understanding that overlap a bit with the expressed interest of #446... none of the edits are hard but deciding what makes the most sense to the average user would take some familiarity with the code/uses.

@Cumberbatch08

Cumberbatch08 commented Oct 9, 2017

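import codecs
import gensim.models as g

m = g.Doc2Vec.load(saved_path)  # load the trained model
# one whitespace-tokenized document per line
test_docs = [x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines()]
with open(output_file, "w") as output:
    for d in test_docs:
        # write each inferred vector as space-separated values
        output.write(" ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n")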

When I run this code, I get the error: AttributeError: 'Doc2Vec' object has no attribute 'neg_labels'
So, how should I set the parameters? I'm a beginner. Thank you!

@gojomo
Collaborator

gojomo commented Oct 9, 2017

@StevenChen1993 - Are you receiving a "slow version" warning in logs when you use Doc2Vec? neg_labels is a part of the model only needed/created when the optimized code is unavailable. So you could see this message if the model were created in an environment where gensim was fully installed (training had access to the optimized code), but then re-loaded to an environment where installation of the optimized variants failed. The best fix would be to make sure your deployment installation has the optimized paths (isn't getting the "slow version" message), perhaps by uninstalling and reinstalling gensim and watching for any errors. Otherwise, training/inference could be 100x slower for that environment. (Alternatively, you could patch a neg_labels into your loaded model like down here in the slow path and use the slow inference.)
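One way to check which path an environment uses, besides watching the logs, is the FAST_VERSION constant in gensim's word2vec module, which is -1 when the compiled routines could not be imported (worth verifying against your installed version). The sketch below is hedged accordingly: `describe_fast_version` and `check_environment` are hypothetical helper names, and the lookup is wrapped so it also runs where gensim is absent.

```python
def describe_fast_version(value):
    """Turn a FAST_VERSION-style value into a readable status string."""
    if value is None:
        return "unknown"
    return "slow pure-Python path" if value < 0 else "optimized path"


def check_environment():
    """Report which path this environment would use, if gensim is present."""
    try:
        from gensim.models import word2vec
    except ImportError:
        return "gensim not installed"
    # getattr with a default, in case a release does not define the constant.
    return describe_fast_version(getattr(word2vec, "FAST_VERSION", None))


status = check_environment()
print(status)
```

Running this in both the training and the deployment environment should show whether the two differ in the way described above.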

@Cumberbatch08

thanks for your answer!

@felixsmueller

Hi

I had the same exception. The problem was that the model was trained using the fast version but when I installed gensim (3.8.0) on Windows I did not get a warning that the slow version was used.

I followed the instruction on https://radimrehurek.com/gensim/install.html which then successfully installed the fast version of Gensim (3.8.0) on Windows:
conda install -c conda-forge gensim

PS:
The following did NOT install the fast version on Windows, nor did it print a warning that the slow version was used:
conda install gensim

@piskvorky
Owner

piskvorky commented Sep 17, 2019

Thanks @felixsmueller; cross-linking to #2600.

@mpenkov do we instruct people to use the conda-forge instead? I always forget what does what, I'm not familiar / a fan of that ecosystem.

@mpenkov
Collaborator

mpenkov commented Sep 28, 2019

@piskvorky I'm totally unfamiliar with conda myself. Do any of the gensim developers actually use it? If yes, it'd be good for that person to handle it. If no, then I suppose "one of us" could dedicate some time towards learning more about it, and then come back to solving this problem, although I must admit it isn't a particularly tempting endeavor.

I'm also struggling to understand whether we're dealing with a problem in gensim proper, or if it's a problem with the feedstock (https://github.com/conda-forge/gensim-feedstock/).

@piskvorky
Owner

piskvorky commented Sep 28, 2019

I believe @gojomo has used it.

I guess having binary wheels for Windows fixes most of such issues – we can now just tell people to do pip install. And forget about debugging and updating the proprietary conda ecosystem.

@mpenkov
Collaborator

mpenkov commented Sep 29, 2019

+1

@gojomo @menshikh-iv Any thoughts?

@mpenkov mpenkov added the conda label Sep 29, 2019
@menshikh-iv
Contributor

@mpenkov up to you, but conda is widely used by the data-science community; for this reason, I'm -1 on dropping it.

@mpenkov
Collaborator

mpenkov commented Sep 29, 2019

I wonder if we can find a conda zealot who is willing to maintain the feedstock officially. Essentially, a new maintainer for https://github.com/conda-forge/gensim-feedstock/ and a go-to person for conda issues...

@piskvorky
Owner

piskvorky commented Sep 29, 2019

I think there are 3 ways to install Gensim in the conda ecosystem:

  1. Using Anaconda (some sort of pre-packaged platform, they charge money for some versions)
    • packages inside, incl. Gensim, are updated by the Continuum Analytics team. I don't think we can do upgrades ourselves.
  2. Using an "external" conda-forge channel, which is open source.
    • We could upgrade ourselves, if there's someone to do the maintenance and support. I have zero interest myself.
  3. Using normal pip.
    • Easiest option, no extra work for us.

Though I may have messed that up completely! Someone correct me.

@gojomo
Collaborator

gojomo commented Sep 29, 2019

I usually like to use (mini)conda to manage my dev environment. (The 'mini' version because I don't want the installation-overhead/complexity/etc of the full 'anaconda' package set.) I tend to install jupyter, numpy, scipy via the native conda installation – to be sure to get their well-maintained/well-optimized versions of those central packages from their repo – but then just pip install things like gensim. That's worked well enough for me, on MacOS & Linux OSes, and handles the same whether using Python 2 or Python 3 (without needing different virtual-environment helpers).

So from my perspective: We don't have to do any extra conda-work, or worry about other 'conda-forge' repos, or whatever – just encourage people to use pip install, no matter their environment.

(But also: this all seems a digression from what I see as the real reason for this bug-report: tiny behavioral differences between the 'optimized' and 'pure-python' paths, plus other recurring issues where the optimized code isn't available. Dropping the pure-python paths entirely will simplify maintenance immensely, though the code would then be less useful as a teaching tool.)

@gojomo
Collaborator

gojomo commented Mar 1, 2022

As we no longer have the possibly-divergent pure-Python paths, I don't think this should recur. If that assumption is wrong, feel free to re-open w/ details.

@gojomo gojomo closed this as completed Mar 1, 2022

8 participants