-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Per Word Topic in a Document + Notebook Tutorial #704
Conversation
self.assertTrue(isinstance(t, int)) | ||
|
||
# first word in the doc belongs to topic 2 | ||
self.assertEqual(word_topics[0], (0, 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it a tuple (0,1)
? Expected 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tuple (0,1)
basically means 0th word corresponds to 1st topic.
Should I change it to word_topics[0][1]
equals to 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmylk changed it. Does it make more sense?
@tmylk , @piskvorky , could you please review? |
return [(topicid, topicvalue) for topicid, topicvalue in enumerate(topic_dist) | ||
if topicvalue >= minimum_probability] | ||
|
||
if per_word_topics is False: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not per_word_topics
more Pythonic.
@piskvorky , taken into account all your suggestions and implemented it. The current return format is |
@tmylk , I think this is ready for a final review and merge. Could you have a look? |
Looks good. Just a changelog and an ipython notebook with colored words would be good. An example of static word assignment would help too. |
Oh yes, good idea. A notebook to explain and demo this new functionality (coloured document words; topics for static terms) would be great. Did you review the phi logic with @tmylk ? |
Yup, I have a nice idea for a notebook to accompany this. :) |
@tmylk , @piskvorky , updated the notebook to give a little more illustrations of Document Word coloring. I briefly explain the use case and new method functionalities before jumping into it. Made changes to changelog as well, so if the logic and notebook are fine, it's ready to merge from my end. |
Awesome, thanks! This is a much awaited functionality :) @tmylk how do we get all these nice little notebooks to users, how do they find them? |
@piskvorky , could you have a brief look at the conversation on #683 towards the end and let me know if the idea of adding a The more popular one which will be used for practical purposes is the list of most probable |
I see. It makes sense to me -- since we're implementing this, we might as well offer the "full power" to users. Not sure we need an extra parameter though. Can't we always return the same thing (both mappings), for consistency? |
for word_type, weight in bow: | ||
phi_values = [] # contains phi values for each topic | ||
for topic_id in range(0, self.num_topics): | ||
if phis[topic_id][word_type] >= minimum_probability: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be a different parameter minimium_phi_probability
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... Wouldn't hurt to add it I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note: if the user does not enter a minimum_phi_probability, it ends up taking self.minimum_probability
's value.
@tmylk , @piskvorky I have also added a small portion in the notebook to show phi_values and it's scaling - this should make it clear for people if they go through the notebook, they don't have to hunt for the scaling in the doc strings. Just to clarify, right now if |
@bhargavvader , thanks a lot for making this revision for uses like me and answering my questions raised in the comment to issue #683. I have a minor comment and hope that this is not a nuisance. I understand that you cannot really scale phi_values back using |
@bhargavvader Did you test it on your Mac? The test fails in all pythons in our OS X environment with xcod7.0 and xcode 7.3 too. It also fails in windows build (appveyor) with the same error. https://travis-ci.org/MacPython/gensim-wheels/jobs/136597581 Can the test be made to be less sensitive to multihreading?
|
With reference to #683 .
@piskvorky , @tmylk , as you suggested, if
per_word_topics=True
, it returns a 2-tuple with(per-topic-probability-for-this-document, per-wordtype-best-topic-for-this-document)
.It runs fine for the test cases on my local machine, and if the logic and approach looks fine I can write tests for it.