AuthorTopicModel memory issue #1947

Open
menshikh-iv opened this issue Mar 2, 2018 · 1 comment
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills performance Issue related to performance (in HW meaning)


menshikh-iv commented Mar 2, 2018

Intro

Recently, I have often received negative feedback about ATM (AuthorTopicModel).
The main reason is memory issues (excessive memory consumption); related mailing list threads (latest):

I decided to figure out what was going on.

Investigation

I ran ATM on the data provided by the author of https://groups.google.com/forum/#!searchin/gensim/author|sort:date/gensim/gG7aiNI1v-Y/SWPMuP8BAwAJ (I can't publish the data yet; I'm waiting for permission from its owner).

Basic stats of data:

  • Size of dictionary: 12211
  • Size of author2doc mapping: 106133
  • Size of author2doc mapping: 73248

I ran it under a debugger and found that the largest memory consumption happens here:
https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/models/atmodel.py#L680-L684

I stopped the process when it had already consumed 8GB of RAM. Some useful statistics are presented in the table below:

| expr | value | comment |
|------|-------|---------|
| `len(author2doc.keys())` | 106133 | |
| `author2doc.keys().index(_)` | 3649 | Index of the element currently being processed, i.e. 3649 of 106133 (~3% of the total volume) |
| `len(train_corpus_idx)` | 1119955735 | `train_corpus_idx` is the largest memory consumer. Here, we essentially load the whole corpus into memory (and this isn't "online" or "batch" processing) |
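As a rough sanity check on the table above, the sketch below estimates how much memory the slots of a Python list of that length need. It assumes a 64-bit CPython build, where each list slot is an 8-byte pointer, and it ignores the integer objects themselves and any over-allocation; `list_slot_bytes` is a hypothetical helper, not part of gensim.

```python
import struct

# Size of one object pointer on this interpreter (8 bytes on 64-bit CPython).
POINTER_SIZE = struct.calcsize("P")


def list_slot_bytes(n_items, pointer_size=8):
    """Lower bound on the memory used by the slots of a list with n_items entries.

    Hypothetical helper for illustration: counts only the pointers stored in
    the list, not the objects they refer to.
    """
    return n_items * pointer_size


n_items = 1119955735  # len(train_corpus_idx) from the table
total_bytes = list_slot_bytes(n_items)

print("pointer size on this build:", POINTER_SIZE)
print("list slots: %.2f GB" % (total_bytes / 1e9))
```

Under these assumptions, the list slots alone come to roughly 9 GB, which is on the same order as the 8GB observed when the process was stopped.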

By simple linear extrapolation, the process will consume ~232GB of RAM by the time the loop finishes.
This is definitely unacceptable and makes the model unusable even for learning tasks (not to mention "real" tasks).
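The arithmetic behind this estimate can be sketched as follows. It assumes memory grows linearly with the number of authors processed, which is a simplification; `extrapolate_memory_gb` is a hypothetical helper used only for illustration.

```python
def extrapolate_memory_gb(observed_gb, processed, total):
    """Linearly scale the observed memory usage to the full number of items."""
    return observed_gb * total / processed


observed_gb = 8     # RAM consumed when the process was stopped
processed = 3649    # authors processed so far
total = 106133      # len(author2doc.keys())

estimate = extrapolate_memory_gb(observed_gb, processed, total)
print("estimated peak memory: %.1f GB" % estimate)
```

If per-author document counts vary widely, the real figure could differ noticeably from this linear estimate.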

@olavurmortensen, can you look into this problem? It is super critical.

Related PR - #893.


menshikh-iv commented Mar 5, 2018

Updates:

  1. Unfortunately, Olavur has no time to resolve this issue.
  2. More information (about ATM) can be found here: http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6971/pdf/imm6971.pdf
