Zero probabilities in LDA model #2418

Open
piskvorky opened this issue Mar 15, 2019 · 16 comments


piskvorky commented Mar 15, 2019

Problem description

A user reported "empty" topics (all probabilities zero) during LdaModel training:
https://groups.google.com/forum/#!topic/gensim/LuPD2VSouSQ

Apparently some of the recent optimizations in #1656 (and maybe elsewhere?) introduced numeric instabilities.

Steps/code/corpus to reproduce

Unknown. Probably related to large data size: a large vocabulary in combination with a large number of topics, leading to float32 under/overflows.

User reported that changing the dtype back to float64 helped and the "empty topics" problem went away.
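
A minimal sketch of the suspected failure mode (an assumption, not a confirmed diagnosis): per-word weights that are still representable in float64 underflow to exactly zero in float32, whose smallest denormal is about 1.4e-45. The dtype keyword in the comment is LdaModel's real parameter (added in #1656); everything else here is illustrative.

    import numpy as np

    # Log-space values that survive in float64 underflow to exactly 0.0
    # once exponentiated in float32.
    logs = np.array([-80.0, -105.0])
    print(np.exp(logs))                     # [1.8e-35  2.3e-46] in float64
    print(np.exp(logs.astype(np.float32)))  # [1.8e-35  0.0] in float32

    # The user's reported workaround, made explicit:
    # lda = LdaModel(corpus, num_topics=1000, dtype=np.float64)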

Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)
piskvorky added the bug label on Mar 15, 2019
enys commented Mar 15, 2019

Hi @piskvorky,

Apparently it was one of my team members.
Please find the output below:

Python 3.6.8 (default, Mar 15 2019, 14:14:12)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform; print(platform.platform())
Linux-4.4.0-130-generic-x86_64-with-debian-stretch-sid
>>> import sys; print("Python", sys.version)
Python 3.6.8 (default, Mar 15 2019, 14:14:12)
[GCC 5.4.0 20160609]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.16.2
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.2.1
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.7.1

enys commented Mar 15, 2019

It is indeed a combination of large vocabulary and many topics. Runs with 500 and 1000 topics suffer from the problem. Our dictionary size is > 300K. We also use online updates in chunks of 100K documents, with a target total corpus size of 50M.


horpto commented Mar 15, 2019

Hi @enys,
Can you share a minimal dataset that reproduces the problem?

enys commented Mar 15, 2019

Quick answer: no.
The dictionary contains 519K entries, and the corpus is built from precalculated bags-of-words. I will paste my parameters.
I could try to build a random corpus/dictionary if there is a high probability that it is due to cardinality.
I have a run computing over the weekend.
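
For reference, a sketch of the randomized reproduction enys describes, scaled down from the reported sizes; the corpus generator and all constants here are hypothetical:

    import numpy as np
    from gensim.models import LdaModel

    rng = np.random.default_rng(0)
    vocab_size = 300_000   # the reported dictionary has >300K entries
    num_docs = 100_000     # one online-update chunk, per the report

    def random_bow(doc_len=150):
        # A random bag-of-words document: unique term ids with small counts.
        ids = rng.choice(vocab_size, size=doc_len, replace=False)
        counts = rng.integers(1, 5, size=doc_len)
        return list(zip(ids.tolist(), counts.tolist()))

    corpus = [random_bow() for _ in range(num_docs)]
    id2word = {i: "w%d" % i for i in range(vocab_size)}
    lda = LdaModel(corpus, id2word=id2word, num_topics=500, chunksize=100_000)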


horpto commented Mar 16, 2019

A possibly stupid question: how can a topic's probabilities be all zeros if show_topics normalizes the topic row? https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/ldamodel.py#L1163
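
For context, a minimal sketch of why a normalized row can still render as all zeros, assuming the linked line divides each topic row by its sum and the default formatter rounds to three decimals:

    import numpy as np

    # A near-uniform topic over a 300K-term vocabulary: normalization works,
    # but every weight comes out around 3.3e-06...
    topic = np.full(300_000, 1.0, dtype=np.float32)
    topic /= topic.sum()
    print(topic[0])           # ~3.3333e-06
    # ...which three-decimal formatting renders as 0.000.
    print("%.3f" % topic[0])  # 0.000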

enys commented Mar 19, 2019

Hi @horpto, sorry for the late reply.
"Fully" might be a slight overstatement; however, it renders:

[(178, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (299, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (281, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (208, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (485, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (72, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (65, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (332, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (267, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (75, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),


horpto commented Jun 9, 2019

@enys, sorry for the late response. LdaModel.show_topics shows only the first 10 words of each topic by default and, moreover, it rounds topic probabilities to 3 digits after the decimal point. Can you show the topics with formatted=False and a fairly large num_words parameter?
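
A sketch of the call horpto is asking for (the num_words value is arbitrary):

    # Unformatted topics come back as (topic_id, [(word, probability), ...])
    # with full float precision instead of rounded strings.
    topics = lda.show_topics(num_topics=-1, num_words=50, formatted=False)
    for topic_id, words in topics[:3]:
        print(topic_id, words[:5])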

piskvorky added the need info and impact MEDIUM labels on Oct 8, 2019
@piskvorky

Ping @enys, are you able to share a reproducible example? We'd like to get to the bottom of this.

piskvorky added the reach LOW label on Oct 8, 2019
@davidalbertonogueira

I have this issue (zero probabilities for words in show_topics) only when using gensim.models.LdaMulticore. Output of gensim.models.ldamodel.LdaModel is as expected.

@piskvorky

@davidalbertonogueira Same comments as above apply.

@davidalbertonogueira

I'm sorry, but my current dataset is proprietary. I reckon I could try to create a small example that generates the same error, but I would have to do it with publicly available online data, and therefore there's no point in doing that myself.

I share the dimensions in case it helps someone trying to replicate the error:
len(gensim_corpus) = 109,000
len(gensim_dictionary) = 13,989
n_topics = 10

piskvorky commented Aug 4, 2020

@davidalbertonogueira that seems different from the issue reported here, which had a huge (500k) vocabulary and lots of topics (1000). In your case, you have only 14k vocab + 10 topics. Likely unrelated, a separate issue.

@davidalbertonogueira

Should I open a new issue then? @piskvorky

@piskvorky

Only if you're able to include a reproducing example :) Otherwise there isn't much we'll be able to do anyway. Thanks.


SphtKr commented May 26, 2021

I have far less experience than the other reporters (i.e. it could be something I'm doing wrong), but I'm seeing the same thing: one or more topics with near-zero probabilities, where the terms are usually alphabetically contiguous. My corpus is derived from the Yelp Dataset Challenge, licensed for academic use. I may be able to share the contents, but I'm unsure; I'll have to read the license closely. However, my corpus is also very small and I'm using a small number of topics (10-100), so again, it could be something naive I'm doing.

My code looks like this. The very low max_df is on purpose, as I was trying to cheaply get rid of lots of irrelevant features. If nothing else looks stupid, I can try to contribute a reproduction.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from gensim import matutils, models

    # `text` is an iterable of documents and `numfeatures` a feature cap,
    # both defined elsewhere. The very low max_df is deliberate (see above).
    vectorizer = TfidfVectorizer(max_df=0.1, max_features=numfeatures,
                                 min_df=2, stop_words='english',
                                 use_idf=True, ngram_range=(1, 3))

    X = vectorizer.fit_transform(text)

    # Map feature indices to words for gensim.
    id2words = {}
    for i, word in enumerate(vectorizer.get_feature_names()):
        id2words[i] = word

    # Convert the scipy sparse matrix (documents as rows) into a gensim corpus.
    corpus = matutils.Sparse2Corpus(X, documents_columns=False)

    lda = models.ldamodel.LdaModel(corpus=corpus,
                                   id2word=id2words,
                                   num_topics=10,
                                   update_every=1,
                                   chunksize=100,
                                   passes=5,
                                   alpha='auto',
                                   per_word_topics=True)

The latest run covers 74,310 documents with 100,000 features.

Then I dump the topics to a text file (among other things) and my "empty" topic looks like this:

Topic: 4
chino : 1e-05
pei wei : 1e-05
taguara : 1e-05
gelato spot : 1e-05
jade : 1e-05
la taguara : 1e-05
wei : 1e-05
place week : 9.999999e-06
place way priced : 9.999999e-06
place welcoming : 9.999999e-06
place welcome : 9.999999e-06
place weird : 9.999999e-06
place weeks : 9.999999e-06
place weekend : 9.999999e-06
place went dinner : 9.999999e-06
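
As an aside, 1e-05 is exactly 1/100000, i.e. a uniform weight over the 100,000 features, so this "empty" topic may simply never have moved away from a uniform state; that reading is an inference, not something confirmed in the thread. A quick check against the model above:

    import numpy as np

    # Compare the suspect topic row against a uniform distribution.
    topic = lda.get_topics()[4]       # topic 4 from the dump above
    uniform = 1.0 / topic.shape[0]    # 1e-05 for 100,000 features
    print(np.allclose(topic, uniform, rtol=0.1))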


SphtKr commented May 26, 2021

Here's the requested version output; sorry, I missed that:

macOS-10.16-x86_64-i386-64bit
Python 3.8.6 (default, Nov 11 2020, 13:20:43) 
[Clang 12.0.0 (clang-1200.0.32.21)]
NumPy 1.19.4
SciPy 1.6.3
gensim 4.0.1
FAST_VERSION 0
