Check what's the reason to use double-precision in topic models #1576

Open
menshikh-iv opened this issue Sep 8, 2017 · 6 comments
Labels
bug (issue describes a bug) · difficulty easy (easy issue: requires a small fix) · good first issue (issue for new contributors; does not require gensim understanding, very simple) · Hacktoberfest (issues marked for Hacktoberfest) · performance (issue related to performance, in the HW sense)

Comments

@menshikh-iv
Contributor

menshikh-iv commented Sep 8, 2017

Our topic models return vectors in double-precision float64, which looks suspicious, because float32 should be enough everywhere. We need to check what causes this behavior and in which concrete methods it happens.

The first step is to look at this line in the test; after that, collect all topic models that depend on this test and check where and why float64 appears.

The expected result is a detailed description (where and why float64 appears), followed by a fix for this behavior after discussion (if needed).
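A minimal sketch to reproduce the observation (the toy corpus and training parameters here are illustrative only, not part of the issue):

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# tiny toy corpus, just enough to train a model and inspect the output dtype
texts = [["human", "computer", "interface"], ["graph", "trees", "minors"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)
print(lda.get_topics().dtype)  # float64 at the time of this issue
```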

@menshikh-iv menshikh-iv added the bug Issue described a bug label Sep 8, 2017
@gojomo
Collaborator

gojomo commented Sep 8, 2017

FWIW, the Word2Vec/Doc2Vec classes use float32 in vectors/calculations. (I even did experiments converting models to float16 as an optimization; unfortunately it seems bulk array calculations still coerce back to/from float32, meaning no time savings and space savings only at rest.)

@piskvorky piskvorky changed the title Check what's a reason to use double-precision in TMs Check what's the reason to use double-precision in TMs Sep 8, 2017
@menshikh-iv menshikh-iv added difficulty easy Easy issue: required small fix test before incubator performance Issue related to performance (in HW meaning) labels Oct 2, 2017
@menshikh-iv menshikh-iv added good first issue Issue for new contributors (not required gensim understanding + very simple) and removed test before incubator labels Oct 16, 2017
@xelez
Contributor

xelez commented Oct 21, 2017

Started investigation.

Test classes that use TestBaseTopicModel:

  • TestAuthorTopicModel
  • TestHdpModel
  • TestLdaMallet
  • TestLdaModel
  • TestLsiModel

Then I went on to LdaModel.

Basically, numpy uses float64 by default, and the dtype is never specified explicitly in LdaModel:

  • LdaState.__init__() uses np.zeros() for sstats (which is then used for computing topics)
  • LdaModel.__init__() does self.state.sstats = self.random_state.gamma(100., 1. / 100., (self.num_topics, self.num_terms))
  • np.asarray is used in many places, and it infers the dtype from its argument by default
  • ndarrays can be passed into __init__(), and they are used as-is (float64, float32, or anything else)

So, as a result, everything simply defaults to float64, and that is what get_topics() returns.
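A small numpy-only illustration of the defaults described above (this is not gensim code, just the behavior the bullets refer to):

```python
import numpy as np

sstats = np.zeros((2, 5))                    # np.zeros defaults to float64
print(sstats.dtype)                          # float64

rs = np.random.RandomState(0)
gamma = rs.gamma(100., 1. / 100., (2, 5))    # gamma sampling also returns float64
print(gamma.dtype)                           # float64

print(np.asarray([0.1, 0.2]).dtype)          # asarray infers float64 from Python floats
```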

@menshikh-iv
Contributor Author

Good investigation, thanks @xelez.
I think the best solution is to add a dtype argument to the constructor and explicitly set the dtype everywhere it is needed (instead of the float64 default). We potentially have the same problem in other methods (including training). If that is true, we can easily halve the memory use and speed up the model 🔥
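A hedged sketch of what threading a dtype through the constructor could look like; the class and parameter names below are illustrative, not the actual gensim implementation:

```python
import numpy as np


class LdaStateSketch:
    # mirrors the idea of LdaState: sstats gets an explicit dtype instead of np.zeros' float64 default
    def __init__(self, shape, dtype=np.float32):
        self.dtype = dtype
        self.sstats = np.zeros(shape, dtype=dtype)


class LdaModelSketch:
    def __init__(self, num_topics, num_terms, dtype=np.float32, seed=0):
        self.dtype = dtype
        random_state = np.random.RandomState(seed)
        # gamma() always returns float64, so cast explicitly to the requested dtype
        init_sstats = random_state.gamma(
            100., 1. / 100., (num_topics, num_terms)
        ).astype(dtype)
        self.state = LdaStateSketch(init_sstats.shape, dtype=dtype)
        self.state.sstats = init_sstats
```

With float32 the sufficient-statistics matrix occupies half the memory of the current float64 default.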

@xelez
Contributor

xelez commented Oct 24, 2017

@menshikh-iv Are there any plans for refactoring? If yes, I think this should be done as part of that refactoring.
If not, the only way I can see is to examine every line of code and identify the places where dtype needs to be applied.

I think I can do it for LdaModel, and maybe the other models later. But if anyone else works on it at the same time, it could easily become a nightmare to merge.

@menshikh-iv
Contributor Author

@xelez
Refactoring: yes, but not soon.
We discussed this problem and decided that adding a dtype argument to each model's __init__ would be a good idea (it also gives a new knob for experiments comparing models at high and low precision).

It would be very nice if you could open a PR and start working on it 👍

@piskvorky
Owner

piskvorky commented Oct 24, 2017

I believe the default should be float32 everywhere. This makes no difference to word2vec/doc2vec (which are float32 already), but it would potentially change LSI/LDA etc.
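If the proposed dtype argument is added with float32 as the default, usage could look like this (a hedged sketch; at the time of this thread the parameter does not yet exist in a released gensim):

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["human", "computer", "interface"], ["graph", "trees", "minors"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# dtype here is the parameter proposed in this thread, not (yet) part of the released API
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, dtype=np.float32)
print(lda.get_topics().dtype)  # expected: float32
```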

@piskvorky piskvorky changed the title Check what's the reason to use double-precision in TMs Check what's the reason to use double-precision in topic models Nov 13, 2017
@mpenkov mpenkov added the Hacktoberfest Issues marked for hacktoberfest label Sep 28, 2019