Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make show_topics more consistent across models #448

Merged
merged 5 commits into from
Oct 5, 2015
Merged

Make show_topics more consistent across models #448

merged 5 commits into from
Oct 5, 2015

Conversation

cscorley
Copy link
Contributor

This PR contains changes to address issues #354 and #389. Note that this change breaks compatibility for anyone that has come to rely on the format for a particular model.

Additional changes include splitting up the increasingly growing test_models.py file into individual files :)

@cscorley
Copy link
Contributor Author

FYI, I think the latest numpy will make these tests failing (at least on my local Python 3 environment). Seems that, at least for LdaModel, that the index of numpy arrays will be numpy.int64 and no longer int. There's also some deprecation warnings being thrown from numpy related to how LdaModel indexes into arrays.

@piskvorky
Copy link
Owner

Great, thanks! Can you also add a short summary to README, explaining whom this affects and how?

Re. failing tests: what's the problem with int64?

Re. indexing: I know this one. It's due to this line: that as_numpy=True option forces utils.grouper to return documents as floats:

doc = [(1, 0.5), (2, 0.3)]
print numpy.asarray(doc)

array([[ 1. ,  0.5],
       [ 2. ,  0.3]])

(the feature indexes become floats).

This was an optimization for sending data chunks via Pyro in distributed mode, so that they serialize to smaller chunks + 6x speed up in transfer. I think it's no longer important => let's as_numpy=False.

@cscorley
Copy link
Contributor Author

w.r.t., int64: It should be handled now in the test explicitly, but numpy was returning indexes as int in Python 2 while numpy.int64 in Python 3. Very strange.

@piskvorky
Copy link
Owner

@cscorley does this need anything else, or are we ready to merge? (after resolving the conflict)

@piskvorky
Copy link
Owner

@cscorley we'll make a new release again this week. Any chance we could get this in?

@cscorley
Copy link
Contributor Author

Sorry @piskvorky, only really have the chance to work on weekends :) Luckily, the merge conflict was only in CHANGELOG.txt, so this should be ready to go.

@piskvorky
Copy link
Owner

Perfect, thanks.

@tmylk please review & merge. This is a major, breaking change, so we have to be very explicit about this change in the release notes / announcement!

Probably warrants a major version bump too.

tmylk added a commit that referenced this pull request Oct 5, 2015
Make show_topics more consistent across models
@tmylk tmylk merged commit 28dfe05 into develop Oct 5, 2015
@tmylk tmylk deleted the pr_354_389 branch October 5, 2015 19:37
@piskvorky
Copy link
Owner

@cscorley I see as_numpy is still True. Is there a reason we don't want the indices to be integer? Or how did you handle this (maybe I misunderstood your comment above).

Possibly related to #494 .

@cscorley
Copy link
Contributor Author

@piskvorky more or less, I didn't change it here because I don't know how that would affect the Pyro stuff, so I left it as is for this PR. No idea how to make a test for that, not familiar with Pyro at all.

@piskvorky
Copy link
Owner

Alright! Let's change it in a separate PR. It will affect Pyro performance (=will need more memory in the distributed computing), but that use case is so marginal I can't say I care much. Let's just mention it visibly in the CHANGELOG, for people who do use the distributed functionality.

Maybe we can even add this is an LDA parameter (default as_numpy=False), rather than hardwiring True/False there. That way, users can choose the value themselves.

@mattilyra are you up for this PR?

@mattilyra
Copy link
Contributor

@piskvorky Yes I can take it. Having a parameter as_numpy=False/True for LDA isn't particularly clear though and needs a lot of explaining. Besides wouldn't as_numpy=True break (or generate a lot of warnings) for Pyro as well under Py3?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants