LdaMulticore livelock when documents converge? #244

Closed
danwiesenthal opened this issue Oct 13, 2014 · 11 comments

@danwiesenthal

Hi,
I'm seeing unreliable behavior in LdaMulticore when I tweak parameters like the number of iterations or passes. Sometimes the LDA run goes fine and all cores seem reasonably well utilized; other times, notably when iterations/passes are higher, it hangs without output for a very long time (2+ days, when a usual run takes 1.5 hours), with one core constantly at or near 100%. One trend I've noticed that may be indicative is that the system always gets stuck with a worker waiting for a new job. Another is that when the system gets stuck, the debug logs usually say something about documents having converged within X iterations. Is it possible a livelock is occurring? The output below shows a 'getting stuck' instance that includes both trends.
Cheers,
Dan

call:
lda = gensim.models.ldamulticore.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=reduced_dimensionality, passes=100, batch=True, iterations=100, workers=8)

output:
[20141013-14:33PM] [gensim.models.ldamodel] [INFO] using symmetric alpha at 0.02
[20141013-14:33PM] [gensim.models.ldamodel] [INFO] using serial LDA version on this node
[20141013-14:33PM] [gensim.models.ldamulticore] [INFO] running batch LDA training, 50 topics, 100 passes over the supplied corpus of 15234 documents, updating every 15234 documents, evaluating every ~15234 documents, iterating 100x with a convergence threshold of 0.001000
[20141013-14:33PM] [gensim.models.ldamulticore] [INFO] training LDA model using 8 processes
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] worker process entering E-step loop
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] worker process entering E-step loop
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] worker process entering E-step loop
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] worker process entering E-step loop
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] worker process entering E-step loop
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] worker process entering E-step loop
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] worker process entering E-step loop
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] worker process entering E-step loop
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
[20141013-14:33PM] [gensim.models.ldamulticore] [INFO] PROGRESS: pass 0, dispatched chunk #0 = documents up to #2000/15234, outstanding queue size 1
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] processing chunk #0 of 2000 documents
[20141013-14:33PM] [gensim.models.ldamodel] [DEBUG] performing inference on a chunk of 2000 documents
[20141013-14:33PM] [gensim.models.ldamulticore] [INFO] PROGRESS: pass 0, dispatched chunk #1 = documents up to #4000/15234, outstanding queue size 2
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] processing chunk #1 of 2000 documents
[20141013-14:33PM] [gensim.models.ldamodel] [DEBUG] performing inference on a chunk of 2000 documents
[20141013-14:33PM] [gensim.models.ldamulticore] [INFO] PROGRESS: pass 0, dispatched chunk #2 = documents up to #6000/15234, outstanding queue size 3
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] processing chunk #2 of 2000 documents
[20141013-14:33PM] [gensim.models.ldamodel] [DEBUG] performing inference on a chunk of 2000 documents
[20141013-14:33PM] [gensim.models.ldamulticore] [INFO] PROGRESS: pass 0, dispatched chunk #3 = documents up to #8000/15234, outstanding queue size 4
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] processing chunk #3 of 2000 documents
[20141013-14:33PM] [gensim.models.ldamodel] [DEBUG] performing inference on a chunk of 2000 documents
[20141013-14:33PM] [gensim.models.ldamulticore] [INFO] PROGRESS: pass 0, dispatched chunk #4 = documents up to #10000/15234, outstanding queue size 5
[20141013-14:33PM] [gensim.models.ldamulticore] [INFO] PROGRESS: pass 0, dispatched chunk #5 = documents up to #12000/15234, outstanding queue size 6
[20141013-14:33PM] [gensim.models.ldamulticore] [INFO] PROGRESS: pass 0, dispatched chunk #6 = documents up to #14000/15234, outstanding queue size 7
[20141013-14:33PM] [gensim.models.ldamulticore] [INFO] PROGRESS: pass 0, dispatched chunk #7 = documents up to #15234/15234, outstanding queue size 8
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] processing chunk #4 of 2000 documents
[20141013-14:33PM] [gensim.models.ldamodel] [DEBUG] performing inference on a chunk of 2000 documents
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] worker process entering E-step loop
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] worker process entering E-step loop
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] worker process entering E-step loop
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] processing chunk #5 of 2000 documents
[20141013-14:33PM] [gensim.models.ldamodel] [DEBUG] performing inference on a chunk of 2000 documents
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] processing chunk #6 of 2000 documents
[20141013-14:33PM] [gensim.models.ldamodel] [DEBUG] performing inference on a chunk of 2000 documents
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] processing chunk #7 of 1234 documents
[20141013-14:33PM] [gensim.models.ldamodel] [DEBUG] performing inference on a chunk of 1234 documents
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] worker process entering E-step loop
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] worker process entering E-step loop
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] worker process entering E-step loop
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
[20141013-14:33PM] [gensim.models.ldamodel] [DEBUG] 299/1234 documents converged within 100 iterations
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] processed chunk, queuing the result
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] result put
[20141013-14:33PM] [gensim.models.ldamulticore] [DEBUG] getting a new job
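
For anyone trying to reproduce this, here is a minimal sketch of a logging setup that would surface DEBUG lines shaped like the ones above. It uses only the standard `logging` module; the helper name `configure_debug_logging` and the exact format strings are guesses based on the shape of the log lines, not the reporter's actual configuration.

```python
import logging


def configure_debug_logging(stream=None):
    """Attach a DEBUG-level handler to the 'gensim' logger hierarchy.

    Hypothetical helper: the format below is inferred from the log
    excerpt in this issue, not copied from any real configuration.
    """
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    handler.setFormatter(logging.Formatter(
        fmt="[%(asctime)s] [%(name)s] [%(levelname)s] %(message)s",
        datefmt="%Y%m%d-%H:%M%p",  # e.g. 20141013-14:33PM
    ))
    logger = logging.getLogger("gensim")
    logger.setLevel(logging.DEBUG)  # child loggers inherit this level
    logger.addHandler(handler)
    return handler
```

Calling `configure_debug_logging()` before constructing the model should make messages such as "getting a new job" visible, at least on platforms where worker processes inherit the parent's logging configuration.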

@piskvorky
Owner

Hmm, I've received a very similar report on the mailing list here.

I think it may have something to do with the fact that workers * chunksize > len(corpus), both here and there, but I haven't had time to investigate yet.
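
The suspected condition can be checked with plain arithmetic before training. The sketch below is only an illustration: `chunk_layout` is a hypothetical helper, not part of gensim, and its defaults (2000-document chunks, 8 workers) are taken from the log and the call in the original report.

```python
import math


def chunk_layout(num_docs, chunksize=2000, workers=8):
    """Summarize how a corpus splits into chunks for a given worker count.

    Hypothetical helper for illustration; it does not call gensim.
    """
    if num_docs <= 0 or chunksize <= 0 or workers <= 0:
        raise ValueError("num_docs, chunksize and workers must be positive")
    num_chunks = math.ceil(num_docs / chunksize)
    # The final chunk holds whatever is left after the full-size chunks.
    last_chunk_size = num_docs - (num_chunks - 1) * chunksize
    return {
        "num_chunks": num_chunks,
        "last_chunk_size": last_chunk_size,
        # The condition suspected above: workers * chunksize > len(corpus)
        "suspect_condition": workers * chunksize > num_docs,
    }
```

For the reported run, `chunk_layout(15234)` gives 8 chunks with a final chunk of 1234 documents, which matches the log, and flags the condition because 8 * 2000 = 16000 > 15234. Whether that condition is what actually triggers the hang is still unconfirmed.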

Dan, can you share your corpus + dictionary for debugging?

And thanks for reporting.
CC @ziky90 .

@danwiesenthal
Author

Unfortunately I can't share the corpus/dictionary since it's proprietary data, but are there aggregate statistics that might be helpful? I'd like to help as much as I can, but my hands are a little tied with respect to sharing data.

@piskvorky
Owner

FYI, I'm on it, just busy days :)

@alif

alif commented Apr 1, 2015

Hey - @danwiesenthal and I were working with some data which we can share and have run into this issue again. Would it be helpful to get this data to you to work with?

@piskvorky
Owner

Sure, thanks!

Let me assign @ziky90 to this, who will assist you.

@ziky90
Contributor

ziky90 commented Apr 5, 2015

Hi @alif,
Thank you, that would be great and would probably help a lot.
Could you please send me the data by email, Google Docs, or whatever way you prefer?

@alif

alif commented Apr 6, 2015

Email is probably best. What email address should I send the data to?

@ziky90
Contributor

ziky90 commented Apr 6, 2015

You can send the data to my email: ziky90@gmail.com, thanks.

@tmylk
Contributor

tmylk commented Jan 23, 2016

@ziky90 Is this resolved?

@ziky90
Contributor

ziky90 commented Jan 25, 2016

I was not able to replicate the bug, so I guess we can close this and see whether someone else reopens it?

@tmylk closed this as completed Jan 25, 2016

@ArchyLau

ArchyLau commented Jun 9, 2017

Hi,
I have found the same problem today.
My code:
ldares = gensim.models.ldamulticore.LdaMulticore(corpus=a_corpus, num_topics=10, id2word=a_id2word)

but it takes a very long time and never outputs anything.
When I use LdaModel instead, it works.
