Distributed LDA: checking the length of docs instead of the boolean v… #1191

Merged
merged 1 commit into from
Mar 8, 2017

saparina
Contributor

@saparina saparina commented Mar 7, 2017

…alue
Possibly solves issue #911.

@tmylk
Contributor

tmylk commented Mar 7, 2017

Could you please set up the distributed workers on your box and check whether it actually solves #911? Have you been able to reproduce #911?

@saparina
Contributor Author

saparina commented Mar 8, 2017

@tmylk Yes, I reproduced #911 the way it is described there:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim import models, corpora
corpus = corpora.MmCorpus('deerwester.mm') # load a corpus of nine documents, from the Tutorials
id2word = corpora.Dictionary.load('deerwester.dict')

lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=100, distributed=True)

I got the same error:

2017-03-08 10:40:51,078 : INFO : loaded corpus index from deerwester.mm.index
2017-03-08 10:40:51,078 : INFO : initializing corpus reader from deerwester.mm
2017-03-08 10:40:51,078 : INFO : accepted corpus with 9 documents, 12 features, 28 non-zero entries
2017-03-08 10:40:51,078 : INFO : loading Dictionary object from deerwester.dict
2017-03-08 10:40:51,078 : INFO : loaded deerwester.dict
2017-03-08 10:40:51,079 : INFO : using symmetric alpha at 0.01
2017-03-08 10:40:51,079 : INFO : using symmetric eta at 0.08333333333333333
2017-03-08 10:40:51,147 : INFO : using distributed version with 2 workers
2017-03-08 10:40:51,163 : INFO : running online LDA training, 100 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 50x with a convergence threshold of 0.001000
2017-03-08 10:40:51,163 : WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
2017-03-08 10:40:51,163 : INFO : initializing 2 workers
Traceback (most recent call last):
  File "LDA+issue.py", line 9, in <module>
    lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=100, distributed=True)
  File "/home/irina/GSoC/gensim/gensim/models/ldamodel.py", line 334, in __init__
    self.update(corpus, chunks_as_numpy=use_numpy)
  File "/home/irina/GSoC/gensim/gensim/models/ldamodel.py", line 635, in update
    self.log_perplexity(chunk, total_docs=lencorpus)
  File "/home/irina/GSoC/gensim/gensim/models/ldamodel.py", line 526, in log_perplexity
    perwordbound = self.bound(chunk, subsample_ratio=subsample_ratio) / (subsample_ratio * corpus_words)
  File "/home/irina/GSoC/gensim/gensim/models/ldamodel.py", line 727, in bound
    gammad, _ = self.inference([doc])
  File "/home/irina/GSoC/gensim/gensim/models/ldamodel.py", line 428, in inference
    if doc and not isinstance(doc[0][0], six.integer_types):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In distributed mode, chunks are kept as NumPy arrays, and testing the truth value of doc (as in `if doc and ...`) is invalid when doc is a NumPy array with more than one element.
I checked the fix on one machine with two workers, and it now works in distributed mode.
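
To illustrate the difference, here is a minimal sketch that does not depend on gensim. The helper names `is_nonempty_by_truth_value` and `is_nonempty_by_length` are hypothetical and are not part of the gensim API; the exact form of the merged change may differ from the `len(doc) > 0` check assumed here.

```python
import numpy as np


def is_nonempty_by_truth_value(doc):
    # Hypothetical helper mirroring the `if doc and ...` pattern from the
    # traceback: it asks Python for the truth value of the whole object.
    return bool(doc)


def is_nonempty_by_length(doc):
    # Hypothetical helper: asks only how many entries the document has,
    # which is well defined for both lists and NumPy arrays.
    return len(doc) > 0


# A bag-of-words document stored as an array of (token_id, count) rows,
# assumed here to resemble what a numpy chunk holds in distributed mode.
doc = np.array([[0, 1], [3, 2], [7, 1]])

try:
    is_nonempty_by_truth_value(doc)
    raised = False
except ValueError:
    # "The truth value of an array with more than one element is ambiguous."
    raised = True

print("truth-value check raised ValueError:", raised)
print("length check on array:", is_nonempty_by_length(doc))
print("length check on list:", is_nonempty_by_length([(0, 1), (3, 2)]))
```

The length check behaves the same for plain Python lists and for arrays, so the same code path can serve both the local and the distributed case.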

@tmylk tmylk merged commit ed757df into piskvorky:develop Mar 8, 2017
@tmylk
Contributor

tmylk commented Mar 8, 2017

Thanks for the PR!
