multicore LDA #232

Merged: 31 commits into piskvorky:develop on Sep 16, 2014

Conversation

piskvorky (Owner)

This PR parallelizes LDA training using multiprocessing. By default it uses all available cores, to train the LDA model faster.

This functionality is implemented as a new class, gensim.models.ldamulticore.LdaMulticore, which inherits from the existing gensim.models.ldamodel.LdaModel. The original class is not affected.

LdaMulticore supports batch training, online training and most other parameters of the old implementation. It does not support distributed computing, and it does not support hyperparameter auto-optimization with alpha='auto'.
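
For illustration, a minimal, self-contained usage sketch based on the constructor calls that appear later in this thread; the toy documents and all parameter values here are made up:

```python
from gensim.corpora import Dictionary
from gensim.models.ldamulticore import LdaMulticore

# Toy corpus; documents and parameter values are hypothetical, for illustration only.
texts = [["human", "computer", "interaction"],
         ["graph", "minors", "trees"],
         ["graph", "trees", "computer"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Omit `workers` to use all available cores, as described above.
lda = LdaMulticore(corpus=corpus, id2word=dictionary,
                   num_topics=2, chunksize=2000, passes=1, workers=2)
print(lda.show_topics())
```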

ziky90 and others added 30 commits September 10, 2014 14:15
fix bugs in state reset and state init
…og to see when is performed batch version queue merging. This version was tested both in terms of quality and time performance.
py3k compatibility fix in LdaMulticore

@ziky90 (Contributor) commented Sep 16, 2014

Results of time performance experiments on the English Wikipedia, 3.5m documents, 100k vocabulary. Using http://www.hetzner.de/en/hosting/produkte_rootserver/ex40ssd (i7 with 4 real cores, 8 "fake" hyperthread cores).

| configuration | real | user | sys |
| --- | --- | --- | --- |
| iterating over input data only, no LDA training | 20m21.720s | 20m17.126s | 0m1.515s |
| 1 worker | 150m5.235s | 267m30.608s | 33m56.005s |
| 2 workers | 84m35.688s | 224m1.428s | 25m29.380s |
| 3 workers | 66m8.102s | 220m4.559s | 22m53.731s |
| 4 workers | 63m42.413s | 231m39.043s | 22m30.636s |
| 5 workers | 62m21.117s | 247m50.718s | 22m16.507s |
| old LdaModel (for comparison) | 222m52.331s | 205m22.386s | 16m54.866s |
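
For context, the wall-clock ("real") speedup over the old single-process LdaModel, computed from the numbers above:

```python
# Wall-clock ("real") times from the table above, converted to seconds.
old_lda = 222 * 60 + 52.331
runs = {"1 worker": 150 * 60 + 5.235,
        "2 workers": 84 * 60 + 35.688,
        "3 workers": 66 * 60 + 8.102,
        "4 workers": 63 * 60 + 42.413}
for label, seconds in sorted(runs.items()):
    print("{}: {:.1f}x faster than the old LdaModel".format(label, old_lda / seconds))
# 1 worker: 1.5x, 2 workers: 2.6x, 3 workers: 3.4x, 4 workers: 3.5x
```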

piskvorky added a commit that referenced this pull request Sep 16, 2014
@piskvorky piskvorky merged commit 0c2535d into piskvorky:develop Sep 16, 2014

@lerela commented Oct 8, 2014

Getting the following exception with LdaMulticore:

2014-10-08 21:04:24,903 : INFO : accepted corpus with 682440 documents, 200000 features, 77570757 non-zero entries
2014-10-08 21:04:24,958 : INFO : using symmetric alpha at 0.00125
2014-10-08 21:04:24,958 : INFO : using serial LDA version on this node
2014-10-08 21:04:49,624 : INFO : running online LDA training, 800 topics, 20 passes over the supplied corpus of 682440 documents, updating every 150000 documents, evaluating every ~450000 documents, iterating 100x with a convergence threshold of 0.001000
2014-10-08 21:04:49,634 : INFO : training LDA model using 6 processes
2014-10-08 21:05:05,330 : INFO : PROGRESS: pass 0, dispatched chunk #0 = documents up to #25000/682440, outstanding queue size 1
Traceback (most recent call last):
  File "/usr/lib/python3.3/multiprocessing/queues.py", line 249, in _feed
    send(obj)
  File "/usr/lib/python3.3/multiprocessing/connection.py", line 207, in send
    self._send_bytes(buf.getbuffer())
  File "/usr/lib/python3.3/multiprocessing/connection.py", line 400, in _send_bytes
    self._send(struct.pack("!i", n))
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Yet the processing goes on. Not sure if the results are going to be okay; it's still running, as you can imagine. But any exception is a problem, right? :)

@lerela commented Oct 8, 2014

The main process has been the only one doing any work for 50 minutes now (the children use 0% CPU), stuck here: 2014-10-08 21:06:42,249 : INFO : PROGRESS: pass 0, dispatched chunk #11 = documents up to #300000/682440, outstanding queue size 12

@piskvorky (Owner, Author)

This looks like a limitation of Python's multiprocessing library, which cannot send objects larger than ~2 GB between processes: http://stackoverflow.com/questions/16576386/byte-limit-when-transferring-python-objects-between-processes-using-a-pipe
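
The ~2 GB ceiling comes from the 32-bit length header that multiprocessing's connection layer writes before each pickled payload; a minimal sketch of the failing call shown in the traceback above:

```python
import struct

# multiprocessing frames each pickled payload with a signed 32-bit length ("!i"),
# so any object whose pickle exceeds 2**31 - 1 bytes (~2 GiB) cannot be sent.
struct.pack("!i", 2**31 - 1)       # the largest payload size that still fits
try:
    struct.pack("!i", 2**31)       # one byte too many
except struct.error as err:
    print(err)   # 'i' format requires -2147483648 <= number <= 2147483647
```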

What chunksize are you using? Try lowering it, to reduce the memory footprint.

Failing that, you'll probably have to use either a smaller dictionary or fewer topics (or both)... or patch multiprocessing manually.

I know that's unfortunate, and it's a silly limitation, but there's not much I can do about it :(

Thanks for reporting, though. I'll give it more thought; maybe there's some way around it.
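
To see why lowering chunksize, the dictionary size or the topic count helps: the traceback above shows each job is pickled as (chunk_no, chunk, self), so the model state travels with every chunk. A rough back-of-the-envelope sketch, assuming the topic-word statistics are held as a dense num_topics x num_terms array of 8-byte floats:

```python
num_topics = 800      # from the log above
num_terms = 200000    # 200,000 features in the accepted corpus
bytes_per_float = 8   # assuming float64

one_array = num_topics * num_terms * bytes_per_float
print(one_array / 2.0 ** 30)   # ~1.19 GiB for a single dense array of that shape
# A model holding two arrays of this shape (plus the document chunk and pickling
# overhead) already exceeds the ~2 GiB pipe limit from the traceback above.
```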

@lerela commented Oct 11, 2014

Sorry for the late response. Indeed, this exception vanishes with smaller parameters (smaller dictionary, smaller chunksize). The bottleneck is memory: even 4 workers is too much for my setup (I have 16 GB).
But with 3 workers, no exception and enough RAM, the computation was still stuck for 6 hours until I decided to stop it (the 3 worker processes and the main process were each using 100% CPU, but there was no output for 6 hours):

2014-10-11 17:02:45,551 : INFO : accepted corpus with 682440 documents, 100000 features, 76197320 non-zero entries
2014-10-11 17:02:45,576 : INFO : using symmetric alpha at 0.00125
2014-10-11 17:02:45,576 : INFO : using serial LDA version on this node
2014-10-11 17:02:58,302 : INFO : running online LDA training, 800 topics, 20 passes over the supplied corpus of 682440 documents, updating every 4000 documents, evaluating every ~12000 documents, iterating 100x with a convergence threshold of 0.001000
2014-10-11 17:02:58,310 : INFO : training LDA model using 2 processes
2014-10-11 17:03:00,224 : INFO : PROGRESS: pass 0, dispatched chunk #0 = documents up to #2000/682440, outstanding queue size 1
2014-10-11 17:03:03,499 : INFO : PROGRESS: pass 0, dispatched chunk #1 = documents up to #4000/682440, outstanding queue size 2
2014-10-11 17:03:06,472 : INFO : PROGRESS: pass 0, dispatched chunk #2 = documents up to #6000/682440, outstanding queue size 3
2014-10-11 17:03:09,114 : INFO : PROGRESS: pass 0, dispatched chunk #3 = documents up to #8000/682440, outstanding queue size 4
2014-10-11 17:03:10,269 : INFO : PROGRESS: pass 0, dispatched chunk #4 = documents up to #10000/682440, outstanding queue size 5
2014-10-11 17:03:11,465 : INFO : PROGRESS: pass 0, dispatched chunk #5 = documents up to #12000/682440, outstanding queue size 6
^CTraceback (most recent call last):
  File "/usr/local/lib/python3.3/dist-packages/gensim/models/ldamulticore.py", line 243, in update
    job_queue.put((chunk_no, chunk, self), block=False, timeout=0.1)
  File "/usr/lib/python3.3/multiprocessing/queues.py", line 79, in put
    raise Full
queue.Full

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "model.py", line 54, in <module>
    prepareGensimLda(args.corpus, args.ntopic, args.l)
  File "model.py", line 24, in prepareGensimLda
    lda = gensim.models.ldamulticore.LdaMulticore(corpus=tfidf_corpus, id2word=id2word, num_topics=ntopic, chunksize=2000, passes=20, workers=2, iterations=100, eval_every=3)
  File "/usr/local/lib/python3.3/dist-packages/gensim/models/ldamulticore.py", line 136, in __init__
    gamma_threshold=gamma_threshold)
  File "/usr/local/lib/python3.3/dist-packages/gensim/models/ldamodel.py", line 313, in __init__
    self.update(corpus)
  File "/usr/local/lib/python3.3/dist-packages/gensim/models/ldamulticore.py", line 243, in update
    job_queue.put((chunk_no, chunk, self), block=False, timeout=0.1)
KeyboardInterrupt
^C

I guess that must come from this specific dataset. I had trained the regular LDA model on it 3 months ago and it worked fine though (even if it was slow, of course)... I'll try to run it again to make sure the issue does not come from the multicore implementation. Thank you Radim for your answer.

@lerela commented Oct 16, 2014

Well, I do think there is a problem here. I've launched the multicore LDA on a much, much smaller corpus, and it's been stuck for more than 7 hours on the same perplexity estimate as before (i.e. the first one). When I ^C the job, it's again stuck in a queue.Full loop. That doesn't seem right to me.

2014-10-16 01:43:27,845 : INFO : accepted corpus with 35360 documents, 70313 features, 8148306 non-zero entries
2014-10-16 01:43:27,863 : INFO : using symmetric alpha at 0.00125
2014-10-16 01:43:27,863 : INFO : using serial LDA version on this node
2014-10-16 01:43:36,513 : INFO : running online LDA training, 800 topics, 20 passes over the supplied corpus of 35360 documents, updating every 8000 documents, evaluating every ~24000 documents, iterating 100x with a convergence threshold of 0.001000
2014-10-16 01:43:36,516 : INFO : training LDA model using 4 processes
2014-10-16 01:43:38,234 : INFO : PROGRESS: pass 0, dispatched chunk #0 = documents up to #2000/35360, outstanding queue size 1
2014-10-16 01:43:41,200 : INFO : PROGRESS: pass 0, dispatched chunk #1 = documents up to #4000/35360, outstanding queue size 2
2014-10-16 01:43:44,558 : INFO : PROGRESS: pass 0, dispatched chunk #2 = documents up to #6000/35360, outstanding queue size 3
2014-10-16 01:43:48,692 : INFO : PROGRESS: pass 0, dispatched chunk #3 = documents up to #8000/35360, outstanding queue size 4
2014-10-16 01:43:52,333 : INFO : PROGRESS: pass 0, dispatched chunk #4 = documents up to #10000/35360, outstanding queue size 5
2014-10-16 01:43:55,636 : INFO : PROGRESS: pass 0, dispatched chunk #5 = documents up to #12000/35360, outstanding queue size 6
2014-10-16 01:43:57,597 : INFO : PROGRESS: pass 0, dispatched chunk #6 = documents up to #14000/35360, outstanding queue size 7
2014-10-16 01:43:59,589 : INFO : PROGRESS: pass 0, dispatched chunk #7 = documents up to #16000/35360, outstanding queue size 8
2014-10-16 01:44:00,993 : INFO : PROGRESS: pass 0, dispatched chunk #8 = documents up to #18000/35360, outstanding queue size 9
2014-10-16 01:44:02,122 : INFO : PROGRESS: pass 0, dispatched chunk #9 = documents up to #20000/35360, outstanding queue size 10
2014-10-16 01:44:03,280 : INFO : PROGRESS: pass 0, dispatched chunk #10 = documents up to #22000/35360, outstanding queue size 11
2014-10-16 01:44:04,401 : INFO : PROGRESS: pass 0, dispatched chunk #11 = documents up to #24000/35360, outstanding queue size 12

^CTraceback (most recent call last):
  File "/usr/local/lib/python3.3/dist-packages/gensim/models/ldamulticore.py", line 243, in update
    job_queue.put((chunk_no, chunk, self), block=False, timeout=0.1)
  File "/usr/lib/python3.3/multiprocessing/queues.py", line 79, in put
    raise Full
queue.Full

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "model.py", line 54, in <module>
    prepareGensimLda(args.corpus, args.ntopic, args.l)
  File "model.py", line 24, in prepareGensimLda
    lda = gensim.models.ldamulticore.LdaMulticore(corpus=tfidf_corpus, id2word=id2word, num_topics=ntopic, chunksize=2000, passes=20, workers=4, iterations=100, eval_every=3)
  File "/usr/local/lib/python3.3/dist-packages/gensim/models/ldamulticore.py", line 136, in __init__
    gamma_threshold=gamma_threshold)
  File "/usr/local/lib/python3.3/dist-packages/gensim/models/ldamodel.py", line 313, in __init__
    self.update(corpus)
  File "/usr/local/lib/python3.3/dist-packages/gensim/models/ldamulticore.py", line 252, in update
    process_result_queue()
  File "/usr/local/lib/python3.3/dist-packages/gensim/models/ldamulticore.py", line 225, in process_result_queue
    while not result_queue.empty():
  File "/usr/lib/python3.3/multiprocessing/queues.py", line 123, in empty
    return not self._poll()
  File "/usr/lib/python3.3/multiprocessing/connection.py", line 254, in poll
    def poll(self, timeout=0.0):
KeyboardInterrupt
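
From the tracebacks, the dispatch loop in ldamulticore.update puts each job with a non-blocking put and, on queue.Full, drains the result queue before retrying. A simplified sketch of that pattern (a hypothetical stand-in, not the actual gensim code), which also shows why the main process can spin silently on queue.Full while the workers grind through heavy jobs:

```python
import queue  # only for the Full exception; the real queues are multiprocessing queues

def dispatch_all(chunks, job_queue, result_queue, merge_result):
    """Feed chunks to workers, merging finished results whenever the job queue is full."""
    for chunk_no, chunk in enumerate(chunks):
        while True:
            try:
                # Non-blocking put, as in the traceback (the real code also ships
                # the model itself with each job: (chunk_no, chunk, self)).
                job_queue.put((chunk_no, chunk), block=False, timeout=0.1)
                break
            except queue.Full:
                # Job queue full: merge any finished results, then retry the put.
                # If every worker is busy (or wedged) on a large job, this loop
                # spins here without logging anything, which looks like a hang.
                while not result_queue.empty():
                    merge_result(result_queue.get())
```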
