Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LDAModel depracation warning (issue 494) #513

Merged
merged 6 commits into from
Nov 17, 2015

Conversation

mattilyra
Copy link
Contributor

I decided to not call utils.grouper with as_numpy=False and break the Pyro stuff as I thought we can work around that. So now there's just a check in .inference that makes sure the indices are ints when a numpy array is passed in.

This seemed like a better choice as it doesn't introduce any breaking changes unnecessarily.

@@ -367,7 +367,15 @@ def inference(self, chunk, collect_sstats=False):
# Lee&Seung trick which speeds things up by an order of magnitude, compared
# to Blei's original LDA-C code, cool!).
for d, doc in enumerate(chunk):
ids = [id for id, _ in doc]
if isinstance(doc, numpy.ndarray):
if len(doc) == 0:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len(array) will return the length along the first axis (=number of rows). Is that always the case here, is it what we want?

@piskvorky
Copy link
Owner

Thanks Matti.

We can do these ad-hoc fixes in various places, though I still think it's cleaner to solve this at the place of origin = where the float values are generated (as_numpy=False).

Also, what's the advantage of your if solution vs. a direct ids = [int(id) for id, _ in doc]?

@mattilyra
Copy link
Contributor Author

If the doc is already a numpy array it's faster to cast it to int_ (the default numpy integer, 64bit I guess on most machines) instead of calling int() in the list comprehension - at least on the machine I tried.

I didn't consider the case of 1-d documents, the len(doc) also won't be semantically consistent with them. I'll change the call to utils.grouper in the end that'll end up being much cleaner as you say.

I thought I was being so clever.

@piskvorky
Copy link
Owner

:D

You are clever, just in this case I think simplicity and consistency are more important (it's not the bottleneck).

…for calling .update with chunks_as_numpy=True/False.
@mattilyra
Copy link
Contributor Author

Ok, so I decided to keep the check for the IDs being ints since I assume numpy will complain about it, pyro or no pyro. The default is now for the chunks to be not numpy, and there's an option on .update to make them numpy.

I had an issue with some tests failing because my distro wasn't picking up on the added keyword parameter to .update, but I think that was just a local issue on my machine.

@piskvorky
Copy link
Owner

That looks like a nice solution! Thanks.

If it later turns out this "make sure ids are ints" functionality is more widely used, we could factor it out into a separate function, something like matutils.check_vector().

Travis reports all tests as failing though:

======================================================================
ERROR: testTransform (gensim.test.test_ldamodel.TestLdaMulticore)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/piskvorky/gensim/gensim/test/test_ldamodel.py", line 382, in setUp
    self.model = self.class_(corpus, id2word=dictionary, num_topics=2, passes=100)
  File "/home/travis/build/piskvorky/gensim/gensim/models/ldamulticore.py", line 141, in __init__
    gamma_threshold=gamma_threshold)
  File "/home/travis/build/piskvorky/gensim/gensim/models/ldamodel.py", line 331, in __init__
    self.update(corpus, chunks_as_numpy=use_numpy)
TypeError: update() got an unexpected keyword argument 'chunks_as_numpy'

@piskvorky
Copy link
Owner

Ah I see, the fails are coming from LdaMulticore, which inherits from LdaModel class but probably overrides its update(). Can you add the same parameter there as well?

@piskvorky
Copy link
Owner

Thank you very much Matti, ready to merge!

Can you add yourself + a short description of this change to CHANGELOG?

@mattilyra
Copy link
Contributor Author

I always forget the CHANGELOG

piskvorky added a commit that referenced this pull request Nov 17, 2015
LDAModel depracation warning (issue 494)
@piskvorky piskvorky merged commit d25ae02 into piskvorky:develop Nov 17, 2015
@piskvorky
Copy link
Owner

Great, thanks!

@tmylk
Copy link
Contributor

tmylk commented Nov 17, 2015

Unfortunately LDA doesn't converge on Windows.

====================================================================


FAIL: testTransform (gensim.test.test_ldamodel.TestLdaMulticore)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python35-x64\lib\site-packages\gensim\test\test_ldamodel.py", line 79, in testTransform
    self.assertTrue(passed)
AssertionError: False is not true
-------------------- >> begin captured logging << --------------------

@piskvorky
Copy link
Owner

That's probably unrelated to this PR -- or do you see a connection?

In general, on these "Windows failures": maybe the RNG works differently on Windows, which causes tests that pass on *NIX to fail on Windows?

@tmylk
Copy link
Contributor

tmylk commented Nov 18, 2015

It was passing on Sunday and it's the only change I can find around it.
Numpy and scipy versions remained the same.
On 18 Nov 2015 05:30, "Radim Řehůřek" notifications@github.com wrote:

That's probably unrelated to this PR -- or do you see a connection?


Reply to this email directly or view it on GitHub
#513 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants