
[MRG] Per Word Topic in a Document + Notebook Tutorial #704

Merged: 15 commits into piskvorky:develop on Jun 9, 2016

Conversation

bhargavvader (Contributor)

With reference to #683.
@piskvorky, @tmylk, as you suggested, if per_word_topics=True, it returns a 2-tuple with (per-topic-probability-for-this-document, per-word-type-best-topic-for-this-document).

It runs fine for the test cases on my local machine, and if the logic and approach look fine I can write tests for it.
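A minimal sketch of how a call with this flag might look, based on the description above; the corpus and model here are illustrative, not taken from this PR, and the exact return shape may differ from the merged version:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [['bank', 'river', 'shore'], ['bank', 'money', 'finance']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2)

bow = dictionary.doc2bow(['bank', 'river'])
# Per the description above, per_word_topics=True makes the method return
# per-document topic probabilities alongside a per-word-type topic mapping.
result = lda.get_document_topics(bow, per_word_topics=True)
print(result)
```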

Review thread on the test code:

```python
self.assertTrue(isinstance(t, int))

# first word in the doc belongs to topic 2
self.assertEqual(word_topics[0], (0, 1))
```
Contributor:

Why is it a tuple (0,1)? Expected 1

Contributor Author:

The tuple (0, 1) means that the 0th word corresponds to the 1st topic.
Should I change it to assert that word_topics[0][1] equals 1?
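For concreteness, the two assertion styles being discussed; a sketch mirroring the quoted test, not the committed code:

```python
# word_topics[0] is a (word_id, topic_id) pair in the current format.
self.assertEqual(word_topics[0], (0, 1))  # asserts the whole pair
self.assertEqual(word_topics[0][1], 1)    # asserts only the topic id
```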

Contributor Author:

@tmylk changed it. Does it make more sense?

bhargavvader changed the title from "[WIP] Per Word Topic in a Document." to "[MRG] Per Word Topic in a Document." on May 26, 2016
bhargavvader (Contributor Author)

@tmylk, @piskvorky, could you please review?

Review thread on the code:

```python
return [(topicid, topicvalue) for topicid, topicvalue in enumerate(topic_dist)
        if topicvalue >= minimum_probability]

if per_word_topics is False:
```
Owner:

`if not per_word_topics` is more Pythonic.

bhargavvader (Contributor Author)

@piskvorky, I've taken all your suggestions into account and implemented them. The current return format is {word_id => [topic_id_most_probable, topic_id_second_most_probable, ...]}, which seems like a fair middle ground.
I'm still unsure whether the phis are directly comparable, like you said...
@tmylk, could you clear that up, and say whether the new output format is OK?
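To make the proposed format concrete, here is a made-up instance of the mapping; word ids, topic ids, and ordering are all invented for illustration:

```python
# word_id => topic ids sorted from most to least probable for this document
word_topics = {
    0: [1, 0],  # word 0 is most likely under topic 1, then topic 0
    3: [0],     # word 3 only passes the threshold for topic 0
}
for word_id, topic_ids in word_topics.items():
    print('word %d -> topics %s' % (word_id, topic_ids))
```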

bhargavvader (Contributor Author)

@tmylk, I think this is ready for a final review and merge. Could you have a look?

tmylk (Contributor) commented on Jun 1, 2016:

Looks good. Just a changelog entry and an IPython notebook with colored words would be good. An example of static word assignment would help too.

piskvorky (Owner) commented on Jun 1, 2016:

Oh yes, good idea. A notebook to explain and demo this new functionality (coloured document words; topics for static terms) would be great.

Did you review the phi logic with @tmylk?

bhargavvader (Contributor Author)

Yup, I have a nice idea for a notebook to accompany this. :)
@piskvorky, I've spoken to @tmylk about it and it seems fine - since the comparisons are between topics and not between words, the feature length doesn't change the most probable topic(s). I've also tried it out on a few corpora/documents and it gives consistent results.

bhargavvader changed the title from "[MRG] Per Word Topic in a Document." to "[MRG] Per Word Topic in a Document + Notebook Tutorial" on Jun 2, 2016
bhargavvader (Contributor Author)

@tmylk, @piskvorky, I updated the notebook to give a few more illustrations of document word coloring. I briefly explain the use case and the new method's functionality before jumping into it.

I've made changes to the changelog as well, so if the logic and notebook are fine, it's ready to merge from my end.

piskvorky (Owner)

Awesome, thanks! This is much-awaited functionality :)

@tmylk, how do we get all these nice little notebooks to users? How do they find them?

bhargavvader (Contributor Author)

@piskvorky, could you have a brief look at the conversation towards the end of #683 and let me know if the idea of giving the user a per_word_phi_values option is viable?
So, just to confirm, we'll have get_document_topics(per_word_topics=False, per_word_phi_values=False) as the API. If both are true, we return both a word_id => [most_probable_topic, ...] mapping and a word_id => [(topic_0, phi_value), ..., (topic_n, phi_value)] mapping. If only one of them is true, we obviously return only that one.

The more popular one for practical purposes is the list of most probable topic_ids (which I've used to demonstrate document word coloring), but if the user really wishes to look at the phi_values, I think that option should be there. See the sketch below.
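A sketch of the two return structures being discussed, with all values invented; the merged API may differ in shape:

```python
# word_id => topic ids sorted by phi, most probable first
word_topics = {0: [1, 0], 3: [0]}
# word_id => (topic_id, phi_value) pairs for every topic above the threshold
word_phi_values = {0: [(0, 0.3), (1, 0.7)], 3: [(0, 0.9)]}

# Document word coloring only needs the first mapping:
for word_id, topics in word_topics.items():
    color_by = topics[0]  # color the word by its most probable topic
    print('word %d colored by topic %d' % (word_id, color_by))
```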

piskvorky (Owner)

I see. It makes sense to me -- since we're implementing this, we might as well offer the "full power" to users.

Not sure we need an extra parameter though. Can't we always return the same thing (both mappings), for consistency?

Review thread on the code:

```python
for word_type, weight in bow:
    phi_values = []  # contains phi values for each topic
    for topic_id in range(0, self.num_topics):
        if phis[topic_id][word_type] >= minimum_probability:
```
Contributor:

Should it be a different parameter, minimum_phi_probability?

Contributor Author:

Hmm... Wouldn't hurt to add it I guess.

bhargavvader (Contributor Author), Jun 9, 2016:

Just a note: if the user does not enter a minimum_phi_probability, it ends up taking self.minimum_probability's value.
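A sketch of the defaulting behaviour this note describes; the signature follows the parameter names used in this discussion and may not match the merged code exactly:

```python
def get_document_topics(self, bow, minimum_probability=None,
                        minimum_phi_probability=None, per_word_topics=False):
    if minimum_phi_probability is None:
        # Falls back to the model-level threshold, as noted above.
        minimum_phi_probability = self.minimum_probability
    ...
```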

bhargavvader (Contributor Author)

@tmylk, @piskvorky
Just an update: I've made changes in the documentation to reflect the phi_values being scaled, and the returning of both the sorted topics and the phi_values.
I've made changes to the tests to reflect this as well.

I have also added a small section to the notebook to show the phi_values and their scaling - this should make it clear for people who go through the notebook; they won't have to hunt for the scaling in the docstrings.

Just to clarify: right now, if per_word_topics is true, it returns both the sorted list of likely word-topics and the phi_values.

bhargavvader reopened this on Jun 9, 2016
tmylk merged commit 5e0e830 into piskvorky:develop on Jun 9, 2016
graychan commented on Jun 9, 2016:

@bhargavvader, thanks a lot for making this revision for users like me, and for answering the questions I raised in my comment on issue #683.

I have a minor comment and hope that this is not a nuisance. I understand that you cannot really scale the phi_values back using cts if you have multiple documents in bow. However, in the code, I wonder whether minimum_phi_probability should be changed to minimum_phi_value, since the phi_values that are compared against minimum_phi_probability are not probabilities.
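To illustrate the point with invented numbers: the phi value for a word type is scaled by that word's count in the document (cts), so the quantity compared against the threshold can exceed 1 and is not a probability:

```python
phi = 0.8      # topic responsibility for one occurrence of the word (a probability)
count = 3      # occurrences of the word type in the document (cts)
scaled_phi = phi * count
print(scaled_phi)  # 2.4 - a "phi value", no longer a probability
```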

tmylk (Contributor) commented on Jun 10, 2016:

@bhargavvader Did you test it on your Mac?

The test fails on all Python versions in our OS X environment, with both Xcode 7.0 and Xcode 7.3. It also fails in the Windows build (AppVeyor) with the same error.

https://travis-ci.org/MacPython/gensim-wheels/jobs/136597581
https://ci.appveyor.com/project/piskvorky/gensim-4kei4/build/job/5suhvxcbp9v0h1km

Can the test be made less sensitive to multithreading?

```
FAIL: testGetDocumentTopics (gensim.test.test_ldamodel.TestLdaMulticore)
Traceback (most recent call last):
  File "/Users/travis/build/MacPython/gensim-wheels/venv/lib/python3.3/site-packages/gensim/test/test_ldamodel.py", line 287, in testGetDocumentTopics
    self.assertEqual(word_topics[0][1], expected_topiclist)
nose.proxy.AssertionError: Lists differ: [0, 1] != [1, 0]

First differing element 0:
0
1

- [0, 1]
+ [1, 0]
```

bhargavvader (Contributor Author)

@tmylk, that's very strange - it passes fine on my machine, and it should ideally pass on a multithreaded version as well. At any rate, I'll make changes and open a PR.

@graychan, I'm opening a PR to address the test problems; I'll change "probabilities" to "values" while I'm at it. :)
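One possible way to make the assertion less sensitive to topic ordering across platforms; a sketch, not necessarily the fix that was actually merged:

```python
# Compare topic ids as a set, so the test doesn't depend on which of two
# near-equiprobable topics a multicore run happens to rank first.
self.assertEqual(set(word_topics[0][1]), set(expected_topiclist))
```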
