Check what's the reason to use double-precision in topic models #1576

Open
menshikh-iv opened this issue Sep 8, 2017 · 6 comments
Labels
bug (issue describes a bug) · difficulty easy (easy issue: requires a small fix) · good first issue (issue for new contributors; does not require gensim understanding, very simple) · Hacktoberfest (issues marked for Hacktoberfest) · performance (issue related to performance, in the HW sense)

Comments

@menshikh-iv
Contributor

menshikh-iv commented Sep 8, 2017

Our topic models return vectors in double-precision float64, which looks suspicious, because float32 should be enough everywhere. We need to check what causes this behavior and in which concrete methods it happens.

The first step is to look at this line in the test; after that, collect all topic models that depend on this test and check where and why float64 appears.

The expected result is a detailed description (where and why float64 appears), followed by a fix for this behavior after discussion (if needed).
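A minimal sketch to reproduce the observation (the toy corpus and training parameters here are illustrative only, not part of the issue):

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# tiny toy corpus, just enough to train a model and inspect the output dtype
texts = [["human", "computer", "interface"], ["graph", "trees", "minors"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)
print(lda.get_topics().dtype)  # float64 at the time of this issue
```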

@menshikh-iv menshikh-iv added the bug Issue described a bug label Sep 8, 2017
@gojomo
Collaborator

gojomo commented Sep 8, 2017

FWIW, the Word2Vec/Doc2Vec classes use float32 in vectors/calculations. (I even did experiments converting models to float16 as an optimization; unfortunately it seems bulk array calculations still coerce back to/from float32, meaning no time savings and space savings only at rest.)

@piskvorky piskvorky changed the title Check what's a reason to use double-precision in TMs Check what's the reason to use double-precision in TMs Sep 8, 2017
@menshikh-iv menshikh-iv added difficulty easy Easy issue: required small fix test before incubator performance Issue related to performance (in HW meaning) labels Oct 2, 2017
@menshikh-iv menshikh-iv added good first issue Issue for new contributors (not required gensim understanding + very simple) and removed test before incubator labels Oct 16, 2017
@xelez
Contributor

xelez commented Oct 21, 2017

Started investigation.

Test classes that use TestBaseTopicModel:

  • TestAuthorTopicModel
  • TestHdpModel
  • TestLdaMallet
  • TestLdaModel
  • TestLsiModel

Then I went on to LdaModel.

Basically, numpy uses float64 by default, and the dtype is never specified explicitly in LdaModel:

  • LdaState.__init__() uses np.zeros() for sstats (which is then used for computing topics)
  • LdaModel.__init__() does self.state.sstats = self.random_state.gamma(100., 1. / 100., (self.num_topics, self.num_terms))
  • np.asarray is used in many places, and it infers the dtype from its argument by default
  • ndarrays can be passed into __init__(), and they are used as-is (float64, float32, or anything else)

So, as a result, everything simply defaults to float64, and that is what get_topics() returns.
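A small numpy-only illustration of the defaults described above (this is not gensim code, just the behavior the bullets refer to):

```python
import numpy as np

sstats = np.zeros((2, 5))                    # np.zeros defaults to float64
print(sstats.dtype)                          # float64

rs = np.random.RandomState(0)
gamma = rs.gamma(100., 1. / 100., (2, 5))    # gamma sampling also returns float64
print(gamma.dtype)                           # float64

print(np.asarray([0.1, 0.2]).dtype)          # asarray infers float64 from Python floats
```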

@menshikh-iv
Contributor Author

Good investigation, thanks @xelez.
I think the best solution is to add a dtype argument to the constructor and explicitly set the dtype everywhere it is needed (instead of the float64 default). We potentially have the same problem in other methods (including training). If that is true, we can easily halve the memory use and speed up the model 🔥
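A hedged sketch of what threading a dtype through the constructor could look like; the class and parameter names below are illustrative, not the actual gensim implementation:

```python
import numpy as np


class LdaStateSketch:
    # mirrors the idea of LdaState: sstats gets an explicit dtype instead of np.zeros' float64 default
    def __init__(self, shape, dtype=np.float32):
        self.dtype = dtype
        self.sstats = np.zeros(shape, dtype=dtype)


class LdaModelSketch:
    def __init__(self, num_topics, num_terms, dtype=np.float32, seed=0):
        self.dtype = dtype
        random_state = np.random.RandomState(seed)
        # gamma() always returns float64, so cast explicitly to the requested dtype
        init_sstats = random_state.gamma(
            100., 1. / 100., (num_topics, num_terms)
        ).astype(dtype)
        self.state = LdaStateSketch(init_sstats.shape, dtype=dtype)
        self.state.sstats = init_sstats
```

With float32 the sufficient-statistics matrix occupies half the memory of the current float64 default.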

@xelez
Contributor

xelez commented Oct 24, 2017

@menshikh-iv Are there any plans for refactoring? If yes, I think this should be done as part of that refactoring.
If not, the only way I can see is to examine every line of code and identify the places where dtype needs to be applied.

I think I can do it for LdaModel, and maybe the other models later. But if anyone else works on it at the same time, it could easily become a nightmare to merge.

@menshikh-iv
Contributor Author

@xelez
Refactoring: yes, but not soon.
We discussed this problem and decided that adding a dtype argument to each model's __init__ would be a good idea (it also gives a new knob for experiments comparing models at high and low precision).

It would be very nice if you could open a PR and start working on it 👍

@piskvorky
Owner

piskvorky commented Oct 24, 2017

I believe the default should be float32 everywhere. This makes no difference to word2vec/doc2vec (which are float32 already), but it would potentially change LSI/LDA etc.
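If the proposed dtype argument is added with float32 as the default, usage could look like this (a hedged sketch; at the time of this thread the parameter does not yet exist in a released gensim):

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["human", "computer", "interface"], ["graph", "trees", "minors"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# dtype here is the parameter proposed in this thread, not (yet) part of the released API
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, dtype=np.float32)
print(lda.get_topics().dtype)  # expected: float32
```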

@piskvorky piskvorky changed the title Check what's the reason to use double-precision in TMs Check what's the reason to use double-precision in topic models Nov 13, 2017
@mpenkov mpenkov added the Hacktoberfest Issues marked for hacktoberfest label Sep 28, 2019