
[MRG] Per Word Topic in a Document + Notebook Tutorial #704

Merged: 15 commits into piskvorky:develop on Jun 9, 2016

Conversation

bhargavvader (Contributor)

With reference to #683.
@piskvorky, @tmylk, as you suggested, if per_word_topics=True, it returns a 2-tuple with (per-topic-probability-for-this-document, per-word-type-best-topic-for-this-document).

It runs fine for the test cases on my local machine, and if the logic and approach look fine I can write tests for it.
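A minimal sketch of how a call with this flag might look, based on the description above; the corpus and model here are illustrative, not taken from this PR, and the exact return shape may differ from the merged version:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [['bank', 'river', 'shore'], ['bank', 'money', 'finance']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2)

bow = dictionary.doc2bow(['bank', 'river'])
# Per the description above, per_word_topics=True makes the method return
# per-document topic probabilities alongside a per-word-type topic mapping.
result = lda.get_document_topics(bow, per_word_topics=True)
print(result)
```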

Review thread on the test code:

```python
self.assertTrue(isinstance(t, int))

# first word in the doc belongs to topic 2
self.assertEqual(word_topics[0], (0, 1))
```
Contributor:

Why is it a tuple (0,1)? Expected 1

Contributor Author:

The tuple (0, 1) means that the 0th word corresponds to the 1st topic.
Should I change it to assert that word_topics[0][1] equals 1?
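For concreteness, the two assertion styles being discussed; a sketch mirroring the quoted test, not the committed code:

```python
# word_topics[0] is a (word_id, topic_id) pair in the current format.
self.assertEqual(word_topics[0], (0, 1))  # asserts the whole pair
self.assertEqual(word_topics[0][1], 1)    # asserts only the topic id
```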

Contributor Author:

@tmylk changed it. Does it make more sense?

bhargavvader changed the title from "[WIP] Per Word Topic in a Document." to "[MRG] Per Word Topic in a Document." on May 26, 2016
bhargavvader (Contributor Author)

@tmylk, @piskvorky, could you please review?

Review thread on the code:

```python
return [(topicid, topicvalue) for topicid, topicvalue in enumerate(topic_dist)
        if topicvalue >= minimum_probability]

if per_word_topics is False:
```
Owner:

`if not per_word_topics` is more Pythonic.

bhargavvader (Contributor Author)

@piskvorky, I've taken all your suggestions into account and implemented them. The current return format is {word_id => [topic_id_most_probable, topic_id_second_most_probable, ...]}, which seems like a fair middle ground.
I'm still unsure whether the phis are directly comparable, like you said...
@tmylk, could you clear that up, and say whether the new output format is OK?
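To make the proposed format concrete, here is a made-up instance of the mapping; word ids, topic ids, and ordering are all invented for illustration:

```python
# word_id => topic ids sorted from most to least probable for this document
word_topics = {
    0: [1, 0],  # word 0 is most likely under topic 1, then topic 0
    3: [0],     # word 3 only passes the threshold for topic 0
}
for word_id, topic_ids in word_topics.items():
    print('word %d -> topics %s' % (word_id, topic_ids))
```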

bhargavvader (Contributor Author)

@tmylk, I think this is ready for a final review and merge. Could you have a look?

tmylk (Contributor) commented on Jun 1, 2016:

Looks good. Just a changelog entry and an IPython notebook with colored words would be good. An example of static word assignment would help too.

piskvorky (Owner) commented on Jun 1, 2016:

Oh yes, good idea. A notebook to explain and demo this new functionality (coloured document words; topics for static terms) would be great.

Did you review the phi logic with @tmylk?

bhargavvader (Contributor Author)

Yup, I have a nice idea for a notebook to accompany this. :)
@piskvorky, I've spoken to @tmylk about it and it seems fine - since the comparisons are between topics and not between words, the feature length doesn't change the most probable topic(s). I've also tried it out on a few corpora/documents and it gives consistent results.

bhargavvader changed the title from "[MRG] Per Word Topic in a Document." to "[MRG] Per Word Topic in a Document + Notebook Tutorial" on Jun 2, 2016
bhargavvader (Contributor Author)

@tmylk, @piskvorky, I updated the notebook to give a few more illustrations of document word coloring. I briefly explain the use case and the new method's functionality before jumping into it.

I've made changes to the changelog as well, so if the logic and notebook are fine, it's ready to merge from my end.

piskvorky (Owner)

Awesome, thanks! This is much-awaited functionality :)

@tmylk, how do we get all these nice little notebooks to users? How do they find them?

bhargavvader (Contributor Author)

@piskvorky, could you have a brief look at the conversation towards the end of #683 and let me know if the idea of giving the user a per_word_phi_values option is viable?
So, just to confirm, we'll have get_document_topics(per_word_topics=False, per_word_phi_values=False) as the API. If both are true, we return both a word_id => [most_probable_topic, ...] mapping and a word_id => [(topic_0, phi_value), ..., (topic_n, phi_value)] mapping. If only one of them is true, we obviously return only that one.

The more popular one for practical purposes is the list of most probable topic_ids (which I've used to demonstrate document word coloring), but if the user really wishes to look at the phi_values, I think that option should be there. See the sketch below.
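A sketch of the two return structures being discussed, with all values invented; the merged API may differ in shape:

```python
# word_id => topic ids sorted by phi, most probable first
word_topics = {0: [1, 0], 3: [0]}
# word_id => (topic_id, phi_value) pairs for every topic above the threshold
word_phi_values = {0: [(0, 0.3), (1, 0.7)], 3: [(0, 0.9)]}

# Document word coloring only needs the first mapping:
for word_id, topics in word_topics.items():
    color_by = topics[0]  # color the word by its most probable topic
    print('word %d colored by topic %d' % (word_id, color_by))
```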

piskvorky (Owner)

I see. It makes sense to me -- since we're implementing this, we might as well offer the "full power" to users.

Not sure we need an extra parameter though. Can't we always return the same thing (both mappings), for consistency?

Review thread on the code:

```python
for word_type, weight in bow:
    phi_values = []  # contains phi values for each topic
    for topic_id in range(0, self.num_topics):
        if phis[topic_id][word_type] >= minimum_probability:
```
Contributor:

Should it be a different parameter, minimum_phi_probability?

Contributor Author:

Hmm... Wouldn't hurt to add it I guess.

bhargavvader (Contributor Author), Jun 9, 2016:

Just a note: if the user does not enter a minimum_phi_probability, it ends up taking self.minimum_probability's value.
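A sketch of the defaulting behaviour this note describes; the signature follows the parameter names used in this discussion and may not match the merged code exactly:

```python
def get_document_topics(self, bow, minimum_probability=None,
                        minimum_phi_probability=None, per_word_topics=False):
    if minimum_phi_probability is None:
        # Falls back to the model-level threshold, as noted above.
        minimum_phi_probability = self.minimum_probability
    ...
```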

bhargavvader (Contributor Author)

@tmylk, @piskvorky
Just an update: I've made changes in the documentation to reflect the phi_values being scaled, and the returning of both the sorted topics and the phi_values.
I've made changes to the tests to reflect this as well.

I have also added a small section to the notebook to show the phi_values and their scaling - this should make it clear for people who go through the notebook; they won't have to hunt for the scaling in the docstrings.

Just to clarify: right now, if per_word_topics is true, it returns both the sorted list of likely word-topics and the phi_values.

bhargavvader reopened this on Jun 9, 2016
tmylk merged commit 5e0e830 into piskvorky:develop on Jun 9, 2016
graychan commented on Jun 9, 2016:

@bhargavvader, thanks a lot for making this revision for users like me, and for answering the questions I raised in my comment on issue #683.

I have a minor comment and hope that this is not a nuisance. I understand that you cannot really scale the phi_values back using cts if you have multiple documents in bow. However, in the code, I wonder whether minimum_phi_probability should be changed to minimum_phi_value, since the phi_values that are compared against minimum_phi_probability are not probabilities.
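To illustrate the point with invented numbers: the phi value for a word type is scaled by that word's count in the document (cts), so the quantity compared against the threshold can exceed 1 and is not a probability:

```python
phi = 0.8      # topic responsibility for one occurrence of the word (a probability)
count = 3      # occurrences of the word type in the document (cts)
scaled_phi = phi * count
print(scaled_phi)  # 2.4 - a "phi value", no longer a probability
```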

tmylk (Contributor) commented on Jun 10, 2016:

@bhargavvader Did you test it on your Mac?

The test fails on all Python versions in our OS X environment, with both Xcode 7.0 and Xcode 7.3. It also fails in the Windows build (AppVeyor) with the same error.

https://travis-ci.org/MacPython/gensim-wheels/jobs/136597581
https://ci.appveyor.com/project/piskvorky/gensim-4kei4/build/job/5suhvxcbp9v0h1km

Can the test be made less sensitive to multithreading?

```
FAIL: testGetDocumentTopics (gensim.test.test_ldamodel.TestLdaMulticore)
Traceback (most recent call last):
  File "/Users/travis/build/MacPython/gensim-wheels/venv/lib/python3.3/site-packages/gensim/test/test_ldamodel.py", line 287, in testGetDocumentTopics
    self.assertEqual(word_topics[0][1], expected_topiclist)
nose.proxy.AssertionError: Lists differ: [0, 1] != [1, 0]

First differing element 0:
0
1

- [0, 1]
+ [1, 0]
```

bhargavvader (Contributor Author)

@tmylk, that's very strange - it passes fine on my machine, and it should ideally pass on a multithreaded version as well. At any rate, I'll make changes and open a PR.

@graychan, I'm opening a PR to address the test problems; I'll change "probabilities" to "values" while I'm at it. :)
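One possible way to make the assertion less sensitive to topic ordering across platforms; a sketch, not necessarily the fix that was actually merged:

```python
# Compare topic ids as a set, so the test doesn't depend on which of two
# near-equiprobable topics a multicore run happens to rank first.
self.assertEqual(set(word_topics[0][1]), set(expected_topiclist))
```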
