Zero probabilities in LDA model #2418

Open
piskvorky opened this issue Mar 15, 2019 · 16 comments


piskvorky commented Mar 15, 2019

Problem description

A user reported "empty" topics (all probabilities zero) during LdaModel training:
https://groups.google.com/forum/#!topic/gensim/LuPD2VSouSQ

Apparently some of the recent optimizations in #1656 (and maybe elsewhere?) introduced numeric instabilities.

Steps/code/corpus to reproduce

Unknown. Probably related to large data size: a large vocabulary in combination with a large number of topics, leading to float32 under/overflows.

User reported that changing the dtype back to float64 helped and the "empty topics" problem went away.
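
A minimal sketch of the suspected failure mode (an assumption, not a confirmed diagnosis): per-word weights that are still representable in float64 underflow to exactly zero in float32, whose smallest denormal is about 1.4e-45. The dtype keyword in the comment is LdaModel's real parameter (added in #1656); everything else here is illustrative.

    import numpy as np

    # Log-space values that survive in float64 underflow to exactly 0.0
    # once exponentiated in float32.
    logs = np.array([-80.0, -105.0])
    print(np.exp(logs))                     # [1.8e-35  2.3e-46] in float64
    print(np.exp(logs.astype(np.float32)))  # [1.8e-35  0.0] in float32

    # The user's reported workaround, made explicit:
    # lda = LdaModel(corpus, num_topics=1000, dtype=np.float64)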

Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)
piskvorky added the bug label on Mar 15, 2019
enys commented Mar 15, 2019

Hi @piskvorky,

Apparently it was one of my team members.
Please find the output below:

Python 3.6.8 (default, Mar 15 2019, 14:14:12)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform; print(platform.platform())
Linux-4.4.0-130-generic-x86_64-with-debian-stretch-sid
>>> import sys; print("Python", sys.version)
Python 3.6.8 (default, Mar 15 2019, 14:14:12)
[GCC 5.4.0 20160609]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.16.2
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.2.1
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.7.1

enys commented Mar 15, 2019

It is indeed a combination of large vocabulary and many topics. Runs with 500 and 1000 topics suffer from the problem. Our dictionary size is > 300K. We also use online updates in chunks of 100K documents, with a target total corpus size of 50M.


horpto commented Mar 15, 2019

Hi @enys,
Can you share a minimal dataset that reproduces the problem?

enys commented Mar 15, 2019

Quick answer: no.
The dictionary contains 519K entries, and the corpus is built from precalculated bags-of-words. I will paste my parameters.
I could try to build a random corpus/dictionary if there is a high probability that it is due to cardinality.
I have a run computing over the weekend.
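
For reference, a sketch of the randomized reproduction enys describes, scaled down from the reported sizes; the corpus generator and all constants here are hypothetical:

    import numpy as np
    from gensim.models import LdaModel

    rng = np.random.default_rng(0)
    vocab_size = 300_000   # the reported dictionary has >300K entries
    num_docs = 100_000     # one online-update chunk, per the report

    def random_bow(doc_len=150):
        # A random bag-of-words document: unique term ids with small counts.
        ids = rng.choice(vocab_size, size=doc_len, replace=False)
        counts = rng.integers(1, 5, size=doc_len)
        return list(zip(ids.tolist(), counts.tolist()))

    corpus = [random_bow() for _ in range(num_docs)]
    id2word = {i: "w%d" % i for i in range(vocab_size)}
    lda = LdaModel(corpus, id2word=id2word, num_topics=500, chunksize=100_000)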


horpto commented Mar 16, 2019

A possibly stupid question: how can a topic's probabilities be all zeros if show_topics normalizes the topic row? https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/ldamodel.py#L1163
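
For context, a minimal sketch of why a normalized row can still render as all zeros, assuming the linked line divides each topic row by its sum and the default formatter rounds to three decimals:

    import numpy as np

    # A near-uniform topic over a 300K-term vocabulary: normalization works,
    # but every weight comes out around 3.3e-06...
    topic = np.full(300_000, 1.0, dtype=np.float32)
    topic /= topic.sum()
    print(topic[0])           # ~3.3333e-06
    # ...which three-decimal formatting renders as 0.000.
    print("%.3f" % topic[0])  # 0.000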

enys commented Mar 19, 2019

Hi @horpto, sorry for the late reply.
"Fully" might be a slight overstatement; however, it renders:

[(178, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (299, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (281, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (208, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (485, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (72, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (65, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (332, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (267, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),
 (75, '0.000*"giitane" + 0.000*"giljan" + 0.000*"gianyar" + 0.000*"gibiloba" + 0.000*"giesteira" + 0.000*"giitana" + 0.000*"ghuysen" + 0.000*"gileras" + 0.000*"giocarsela" + 0.000*"giltmagazine"'),


horpto commented Jun 9, 2019

@enys, sorry for the late response. LdaModel.show_topics shows only the first 10 words of each topic by default and, moreover, it rounds topic probabilities to 3 digits after the decimal point. Can you show the topics with formatted=False and a fairly large num_words parameter?
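
A sketch of the call horpto is asking for (the num_words value is arbitrary):

    # Unformatted topics come back as (topic_id, [(word, probability), ...])
    # with full float precision instead of rounded strings.
    topics = lda.show_topics(num_topics=-1, num_words=50, formatted=False)
    for topic_id, words in topics[:3]:
        print(topic_id, words[:5])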

piskvorky added the need info and impact MEDIUM labels on Oct 8, 2019
@piskvorky

Ping @enys, are you able to share a reproducible example? We'd like to get to the bottom of this.

piskvorky added the reach LOW label on Oct 8, 2019
@davidalbertonogueira

I have this issue (zero probabilities for words in show_topics) only when using gensim.models.LdaMulticore. Output of gensim.models.ldamodel.LdaModel is as expected.

@piskvorky

@davidalbertonogueira Same comments as above apply.

@davidalbertonogueira

I'm sorry, but my current dataset is proprietary. I reckon I could try to create a small example that generates the same error, but I would have to do it with publicly available online data, and therefore there's no point in doing that myself.

I share the dimensions in case it helps someone trying to replicate the error:
len(gensim_corpus) = 109,000
len(gensim_dictionary) = 13,989
n_topics = 10

piskvorky commented Aug 4, 2020

@davidalbertonogueira that seems different from the issue reported here, which had a huge (500k) vocabulary and lots of topics (1000). In your case, you have only 14k vocab + 10 topics. Likely unrelated, a separate issue.

@davidalbertonogueira

Should I open a new issue then? @piskvorky

@piskvorky

Only if you're able to include a reproducing example :) Otherwise there isn't much we'll be able to do anyway. Thanks.


SphtKr commented May 26, 2021

I have far less experience than the other reporters (i.e. it could be something I'm doing wrong), but I'm seeing the same thing: one or more topics with near-zero probabilities, where the terms are usually alphabetically contiguous. My corpus is derived from the Yelp Dataset Challenge, licensed for academic use. I may be able to share the contents, but I'm unsure; I'll have to read the license closely. However, my corpus is also very small and I'm using a small number of topics (10-100), so again, it could be something naive I'm doing.

My code looks like this. The very low max_df is on purpose, as I was trying to cheaply get rid of lots of irrelevant features. If nothing else looks stupid, I can try to contribute a reproduction.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from gensim import matutils, models

    # `text` is an iterable of documents and `numfeatures` a feature cap,
    # both defined elsewhere. The very low max_df is deliberate (see above).
    vectorizer = TfidfVectorizer(max_df=0.1, max_features=numfeatures,
                                 min_df=2, stop_words='english',
                                 use_idf=True, ngram_range=(1, 3))

    X = vectorizer.fit_transform(text)

    # Map feature indices to words for gensim.
    id2words = {}
    for i, word in enumerate(vectorizer.get_feature_names()):
        id2words[i] = word

    # Convert the scipy sparse matrix (documents as rows) into a gensim corpus.
    corpus = matutils.Sparse2Corpus(X, documents_columns=False)

    lda = models.ldamodel.LdaModel(corpus=corpus,
                                   id2word=id2words,
                                   num_topics=10,
                                   update_every=1,
                                   chunksize=100,
                                   passes=5,
                                   alpha='auto',
                                   per_word_topics=True)

The latest run covers 74,310 documents with 100,000 features.

Then I dump the topics to a text file (among other things) and my "empty" topic looks like this:

Topic: 4
chino : 1e-05
pei wei : 1e-05
taguara : 1e-05
gelato spot : 1e-05
jade : 1e-05
la taguara : 1e-05
wei : 1e-05
place week : 9.999999e-06
place way priced : 9.999999e-06
place welcoming : 9.999999e-06
place welcome : 9.999999e-06
place weird : 9.999999e-06
place weeks : 9.999999e-06
place weekend : 9.999999e-06
place went dinner : 9.999999e-06
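
As an aside, 1e-05 is exactly 1/100000, i.e. a uniform weight over the 100,000 features, so this "empty" topic may simply never have moved away from a uniform state; that reading is an inference, not something confirmed in the thread. A quick check against the model above:

    import numpy as np

    # Compare the suspect topic row against a uniform distribution.
    topic = lda.get_topics()[4]       # topic 4 from the dump above
    uniform = 1.0 / topic.shape[0]    # 1e-05 for 100,000 features
    print(np.allclose(topic, uniform, rtol=0.1))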


SphtKr commented May 26, 2021

Here's the requested version output; sorry, I missed that:

macOS-10.16-x86_64-i386-64bit
Python 3.8.6 (default, Nov 11 2020, 13:20:43) 
[Clang 12.0.0 (clang-1200.0.32.21)]
NumPy 1.19.4
SciPy 1.6.3
gensim 4.0.1
FAST_VERSION 0
