Distributed LDA "ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()" #911

Closed
tmylk opened this issue Oct 3, 2016 · 15 comments
Labels: bug, difficulty medium

Comments

@tmylk
Contributor

tmylk commented Oct 3, 2016

Raised on the mailing list

In [16]: model = models.LdaModel(corpus, id2word=dictionary, num_topics=2, distributed=True)
2016-10-03 09:21:04,056 : INFO : using symmetric alpha at 0.5
2016-10-03 09:21:04,056 : INFO : using symmetric eta at 0.5
2016-10-03 09:21:04,057 : DEBUG : looking for dispatcher at PYRO:gensim.lda_dispatcher@127.0.0.1:41212
2016-10-03 09:21:04,089 : INFO : using distributed version with 3 workers
2016-10-03 09:21:04,090 : INFO : running online LDA training, 2 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 50x with a convergence threshold of 0.001000
2016-10-03 09:21:04,090 : WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
2016-10-03 09:21:04,090 : INFO : initializing 3 workers
2016-10-03 09:21:04,097 : DEBUG : bound: at document #0
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-46e2552e21a7> in <module>()
----> 1 model = models.LdaModel(corpus, id2word=dictionary, num_topics=2, distributed=True)

/home/lev/Dropbox/raretech/os/release/gensim/gensim/models/ldamodel.py in __init__(self, corpus, num_topics, id2word, distributed, chunksize, passes, update_every, alpha, eta, decay, offset, eval_every, iterations, gamma_threshold, minimum_probability, random_state, ns_conf)
    344         if corpus is not None:
    345             use_numpy = self.dispatcher is not None
--> 346             self.update(corpus, chunks_as_numpy=use_numpy)
    347 
    348     def init_dir_prior(self, prior, name):

/home/lev/Dropbox/raretech/os/release/gensim/gensim/models/ldamodel.py in update(self, corpus, chunksize, decay, offset, passes, update_every, eval_every, iterations, gamma_threshold, chunks_as_numpy)
    648 
    649                 if eval_every and ((reallen == lencorpus) or ((chunk_no + 1) % (eval_every * self.numworkers) == 0)):
--> 650                     self.log_perplexity(chunk, total_docs=lencorpus)
    651 
    652                 if self.dispatcher:

/home/lev/Dropbox/raretech/os/release/gensim/gensim/models/ldamodel.py in log_perplexity(self, chunk, total_docs)
    539         corpus_words = sum(cnt for document in chunk for _, cnt in document)
    540         subsample_ratio = 1.0 * total_docs / len(chunk)
--> 541         perwordbound = self.bound(chunk, subsample_ratio=subsample_ratio) / (subsample_ratio * corpus_words)
    542         logger.info("%.3f per-word bound, %.1f perplexity estimate based on a held-out corpus of %i documents with %i words" %
    543                     (perwordbound, numpy.exp2(-perwordbound), len(chunk), corpus_words))

/home/lev/Dropbox/raretech/os/release/gensim/gensim/models/ldamodel.py in bound(self, corpus, gamma, subsample_ratio)
    740                 logger.debug("bound: at document #%i", d)
    741             if gamma is None:
--> 742                 gammad, _ = self.inference([doc])
    743             else:
    744                 gammad = gamma[d]

/home/lev/Dropbox/raretech/os/release/gensim/gensim/models/ldamodel.py in inference(self, chunk, collect_sstats)
    438         # to Blei's original LDA-C code, cool!).
    439         for d, doc in enumerate(chunk):
--> 440             if doc and not isinstance(doc[0][0], six.integer_types):
    441                 # make sure the term IDs are ints, otherwise numpy will get upset
    442                 ids = [int(id) for id, _ in doc]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

A potential fix suggested on the mailing list:

What I was able to find out so far is that, in LDA_dispatcher.py, printing the value of distributed in **modelparams shows False, although I passed True as the argument.

Setting distributed to True manually in the dispatcher code makes the error message go away, but then I face other problems: only one worker actually starts, and I can't tell what I am doing wrong.


tmylk added the bug and difficulty easy labels Oct 3, 2016
@harshuljain13

@tmylk I am ready to take this up. How should I proceed?

@tmylk
Contributor Author

tmylk commented Oct 5, 2016

There's a potential fix in the text above. Investigating whether it actually works would be very useful.

@harshuljain13

@tmylk I am unable to replicate the error

from gensim import models
raw_corpus = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))
# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in raw_corpus]

# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]

from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)

lda = models.LdaModel(processed_corpus, id2word=dictionary, num_topics=100, chunksize=1, distributed=True)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-2-744b0207c042> in <module>()
     29 dictionary = corpora.Dictionary(processed_corpus)
     30 
---> 31 lda = models.LdaModel(processed_corpus, id2word=dictionary, num_topics=6, chunksize=1, distributed=True)

/usr/local/lib/python2.7/dist-packages/gensim-0.13.2-py2.7-linux-x86_64.egg/gensim/models/ldamodel.pyc in __init__(self, corpus, num_topics, id2word, distributed, chunksize, passes, update_every, alpha, eta, decay, offset, eval_every, iterations, gamma_threshold, minimum_probability, random_state, ns_conf)
    334             except Exception as err:
    335                 logger.error("failed to initialize distributed LDA (%s)", err)
--> 336                 raise RuntimeError("failed to initialize distributed LDA (%s)" % err)
    337 
    338         # Initialize the variational distribution q(beta|lambda)

RuntimeError: failed to initialize distributed LDA (Pyro name server not found)

@markroxor
Contributor

@harshul1610 first start the Pyro name server via
python -m Pyro4.naming -n 0.0.0.0 & python -m gensim.models.lda_worker & python -m gensim.models.lda_dispatcher &

@harshuljain13

@tmylk @markroxor I am running Pyro4 (4.47) but am getting the following:

from gensim import models
raw_corpus = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))
# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in raw_corpus]

# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]

from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)

lda = models.LdaModel(processed_corpus, id2word=dictionary, num_topics=100, chunksize=1, distributed=True)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-7afff73717bf> in <module>()
     29 dictionary = corpora.Dictionary(processed_corpus)
     30 
---> 31 lda = models.LdaModel(processed_corpus, id2word=dictionary, num_topics=100, chunksize=1, distributed=True)

/usr/local/lib/python2.7/dist-packages/gensim-0.13.2-py2.7-linux-x86_64.egg/gensim/models/ldamodel.pyc in __init__(self, corpus, num_topics, id2word, distributed, chunksize, passes, update_every, alpha, eta, decay, offset, eval_every, iterations, gamma_threshold, minimum_probability, random_state, ns_conf)
    334             except Exception as err:
    335                 logger.error("failed to initialize distributed LDA (%s)", err)
--> 336                 raise RuntimeError("failed to initialize distributed LDA (%s)" % err)
    337 
    338         # Initialize the variational distribution q(beta|lambda)

RuntimeError: failed to initialize distributed LDA (unsupported serialized class: gensim.corpora.dictionary.Dictionary)

Any suggestions?

@markroxor
Contributor

@harshul1610 please take a look at #924 and this.

@hiral2cool

@tmylk
How do I solve this error?
Distributed LDA "ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"

Please give a proper solution.

You said "There's a potential fix in the text above. Investigating if it actually works would be very useful",
but which text do you mean?

I have been stuck on this for two days and have searched a lot without finding anything. Please help me out.

@tmylk
Contributor Author

tmylk commented Oct 13, 2016

@hiral2cool Would be grateful for your feedback on this potential fix suggested on the mailing list (also in the main description of this issue):

What I was able to find out so far is that, in LDA_dispatcher.py, printing the value of distributed in **modelparams shows False, although I passed True as the argument.

Setting distributed to True manually in the dispatcher code makes the error message go away, but then I face other problems: only one worker actually starts, and I can't tell what I am doing wrong.


@arjun180

arjun180 commented Oct 27, 2016

@tmylk @markroxor
Hello,

I have tried to reproduce the issue and I used the following steps:

export PYRO_SERIALIZERS_ACCEPTED=pickle
export PYRO_SERIALIZER=pickle
python -m Pyro4.naming -n 0.0.0.0 & python -m gensim.models.lda_worker & python -m gensim.models.lda_dispatcher &

from gensim import models
raw_corpus = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))
# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in raw_corpus]

# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]

from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)

lda = models.LdaModel(processed_corpus, id2word=dictionary, num_topics=100, chunksize=1, distributed=True)
-----------------------------------------------------------------------------------------------------
2016-10-26 20:55:17,388 : INFO : 'pattern' package not found; tag filters are not available for English
2016-10-26 20:55:17,623 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2016-10-26 20:55:17,623 : INFO : built Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...) from 9 documents (total 29 corpus positions)
2016-10-26 20:55:17,623 : INFO : using symmetric alpha at 0.01
2016-10-26 20:55:17,623 : INFO : using symmetric eta at 0.01
2016-10-26 20:55:17,670 : INFO : using distributed version with 1 workers
2016-10-26 20:55:17,677 : INFO : running online LDA training, 100 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 1 documents, evaluating perplexity every 9 documents, iterating 50x with a convergence threshold of 0.001000
2016-10-26 20:55:17,677 : WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
2016-10-26 20:55:17,678 : INFO : initializing 1 workers
2016-10-26 20:55:17,685 : INFO : PROGRESS: pass 0, dispatching documents up to #1/9
2016-10-26 20:55:17,686 : INFO : reached the end of input; now waiting for all remaining jobs to finish
Exception in thread oneway-call:
Traceback (most recent call last):
  File "/Users/arjunchakraborty/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/Users/arjunchakraborty/anaconda2/lib/python2.7/site-packages/Pyro4-4.49-py2.7.egg/Pyro4/core.py", line 1616, in run
    super(_OnewayCallThread, self).run()
  File "/Users/arjunchakraborty/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Users/arjunchakraborty/anaconda2/lib/python2.7/site-packages/gensim/models/lda_worker.py", line 74, in requestjob
    self.processjob(job)
  File "/Users/arjunchakraborty/anaconda2/lib/python2.7/site-packages/gensim/utils.py", line 98, in _synchronizer
    result = func(self, *args, **kwargs)
  File "/Users/arjunchakraborty/anaconda2/lib/python2.7/site-packages/gensim/models/lda_worker.py", line 83, in processjob
    self.model.do_estep(job)
  File "/Users/arjunchakraborty/anaconda2/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 499, in do_estep
    gamma, sstats = self.inference(chunk, collect_sstats=True)
  File "/Users/arjunchakraborty/anaconda2/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 442, in inference
    if doc and not isinstance(doc[0][0], six.integer_types):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The potential fix on the mailing list suggested that distributed in **model_params in lda_dispatcher.py be set to True. Since **model_params refers to a set of keyword arguments, I am assuming this means distributed == True.

Would I be correct in assuming that the author of the mailing list post suggested this as a fix in the initialize function in lda_dispatcher.py:

for a in model_params:
    if a == "distributed":
        distributed = True

This did not seem to have an effect on the error. I am wondering if I am misinterpreting the bug fix suggested by the author.

@tmylk
Contributor Author

tmylk commented Oct 28, 2016

The mailing list fix is that if distributed == True in lda_model, then it should also be True in lda_dispatcher.py.
In your fix it is always True, which is not correct.

Anyway, it would be good to get rid of the error message, maybe with some other fix.
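One possible reading of why the loop quoted above had no visible effect, sketched with hypothetical functions (initialize_local, initialize_forced and initialize_passthrough are illustrations only, not gensim's dispatcher code): assigning to a local name never changes the keyword-argument dict, while forcing the key ignores what the caller asked for.

```python
def initialize_local(**model_params):
    """Mimics the quoted loop: only a local variable is assigned."""
    for name in model_params:
        if name == "distributed":
            distributed = True  # local name only; model_params is untouched
    return model_params


def initialize_forced(**model_params):
    """Always overrides the caller's choice (the 'always true' variant)."""
    model_params["distributed"] = True
    return model_params


def initialize_passthrough(**model_params):
    """Keeps whatever value the caller supplied."""
    return dict(model_params)


print(initialize_local(distributed=False))
print(initialize_forced(distributed=False))
print(initialize_passthrough(distributed=False))
```

Whether the dispatcher should see True or False at all is a separate question; the sketch only shows how the three variants treat the argument.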

tmylk added the difficulty medium label and removed the difficulty easy label Nov 10, 2016
@piskvorky
Owner

piskvorky commented Dec 13, 2016

@tmylk Isn't it enough to replace if doc with if len(doc) > 0? That should work for both lists and numpy arrays.

I remember there were some changes around the as_numpy parameter in chunksize, and this bug may be related.

I don't see how it's related to the distributed parameter in any way.
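A minimal sketch of why the bare truth test fails, assuming the document reaches inference() as a NumPy array rather than a list of (id, count) tuples. The helper name is_nonempty is hypothetical and not part of gensim.

```python
import numpy as np


def is_nonempty(doc):
    """Length-based emptiness test that works for lists and NumPy arrays."""
    return len(doc) > 0


list_doc = [(0, 1.0), (3, 2.0)]              # usual bag-of-words document
array_doc = np.array([[0, 1.0], [3, 2.0]])   # the same data as a 2-D array

# A bare truth test is fine for a list ...
assert bool(list_doc) is True

# ... but NumPy refuses to reduce a multi-element array to a single bool.
try:
    bool(array_doc)
    raised = False
except ValueError:
    raised = True

print(raised, is_nonempty(list_doc), is_nonempty(array_doc), is_nonempty([]))
```

The length check only answers "is the document empty?"; it does not make an array a valid document, so later steps can still fail if the input is not in bag-of-words form.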

@akcom

akcom commented Feb 3, 2017

@piskvorky - agreed, this is unrelated to distributed. I just came across the same error trying to infer topic distributions using the LdaMulticore class which suffers from the same bug. The problem is in ldamodel.py.

By replacing if doc with if len(doc) > 0, we've only kicked the can a bit further down the road:

[int(id) for id, _ in doc]
now throws:
ValueError: too many values to unpack (expected 2)

I'm not terribly familiar with how list comprehensions behave over NumPy arrays; I'll have to look into it a bit more.

edit: my mistake, I just realized the model expects a list of 2-tuples, not the NumPy array I was trying to pass. Makes sense.
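To make the "list of 2-tuples" expectation concrete, here is a small sketch of a validating helper. The name to_bow is hypothetical (gensim does not ship it); it coerces one document into (int, float) pairs and fails with a clearer message when an item is not a pair.

```python
def to_bow(doc):
    """Coerce one document into a list of (int term_id, float weight) pairs."""
    bow = []
    for item in doc:
        try:
            term_id, weight = item
        except (TypeError, ValueError):
            # TypeError: item is not iterable; ValueError: wrong number of fields.
            raise ValueError("expected (term_id, weight) pairs, got %r" % (item,))
        bow.append((int(term_id), float(weight)))
    return bow


print(to_bow([(0, 1), (3, 2)]))
```

Calling such a helper before inference would turn the unpacking failure into an explicit complaint about the input format.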

@tmylk
Contributor Author

tmylk commented Feb 22, 2017

@akcom What is your setup? I'm struggling to understand how ValueError: The truth value of an array... can occur. The tests that infer a vector pass in continuous integration with the if doc code.

@akcom

akcom commented Feb 22, 2017

Hi tmylk - My input is in a COO/triplet sparse matrix format. I load it in using SciPy:

from scipy.sparse import coo_matrix
from gensim.matutils import Sparse2Corpus

sm = coo_matrix((entry_val, (row_num, col_num)), shape=(len(doc2id_map), len(word2id_map))).tocsr()
corp = Sparse2Corpus(sm, False)

If memory serves, I made the mistake of passing my sparse matrix (sm) to the LdaMulticore inference method instead of the gensim corpus (corp). By passing in the SciPy sparse matrix, I arrived at the error above.
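The difference between the sparse matrix and the corpus can be sketched without any library. The function coo_to_corpus below is hypothetical and is not how Sparse2Corpus is implemented; it only shows the shape of data the model expects, assuming rows are documents and columns are term IDs.

```python
from collections import defaultdict


def coo_to_corpus(values, rows, cols, num_docs):
    """Turn COO triplets into one sorted list of (term_id, weight) per document."""
    per_doc = defaultdict(dict)
    for value, row, col in zip(values, rows, cols):
        # Duplicate (row, col) entries are summed, as COO -> CSR conversion does.
        per_doc[row][col] = per_doc[row].get(col, 0.0) + float(value)
    # Documents with no entries become empty lists.
    return [sorted(per_doc[d].items()) for d in range(num_docs)]


corpus = coo_to_corpus(
    values=[1, 2, 3, 4],
    rows=[0, 0, 2, 0],
    cols=[1, 3, 0, 1],
    num_docs=3,
)
print(corpus)
```

Each element of the result is one document in bag-of-words form, which is what the inference code iterates over; the matrix itself is not.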

@tmylk
Contributor Author

tmylk commented Mar 8, 2017

Original issue fixed in #1191
The issue by @akcom is also closed "as an incorrect usage issue."

@tmylk tmylk closed this as completed Mar 8, 2017
7 participants