Refactor API reference gensim.topic_coherence. Fix #1669 #1714

Merged 42 commits on Jan 10, 2018
Commits
29a8a37
Refactored aggregation
CLearERR Nov 13, 2017
56eda23
Micro-Fix for aggregation.py, partially refactored direct_confirmatio…
CLearERR Nov 14, 2017
edd53d4
Partially refactored indirect_confirmation_measure
CLearERR Nov 15, 2017
cfd6050
Some additions
CLearERR Nov 16, 2017
390b01e
Math attempts
CLearERR Nov 19, 2017
8b1a5ca
add math extension for sphinx
menshikh-iv Nov 20, 2017
8d2c584
Minor refactoring
CLearERR Nov 21, 2017
6eb8335
Some refactoring for probability_estimation
CLearERR Nov 22, 2017
7a47f05
Beta-strings
CLearERR Nov 23, 2017
667cad2
Different additions
CLearERR Nov 25, 2017
d41c5a3
Minor changes
CLearERR Nov 26, 2017
180c1c1
text_analysis left
CLearERR Nov 27, 2017
e3c1e29
Added example for ContextVectorComputer class
CLearERR Nov 28, 2017
da9ca29
probability_estimation 0.9
CLearERR Nov 29, 2017
f54fb0c
beta_version
CLearERR Nov 30, 2017
47ee63e
Added some examples for text_analysis
CLearERR Dec 3, 2017
65211f0
text_analysis: corrected example for class UsesDictionary
CLearERR Dec 4, 2017
c484962
Final additions for text_analysis.py
CLearERR Dec 7, 2017
71bb2bf
Merge branch 'develop' into fix-1669
menshikh-iv Dec 11, 2017
d9237ea
fix cross-reference problem
menshikh-iv Dec 11, 2017
275edd0
fix pep8
menshikh-iv Dec 11, 2017
94bde33
fix aggregation
menshikh-iv Dec 11, 2017
782d5cf
fix direct_confirmation_measure
menshikh-iv Dec 11, 2017
81732ef
fix types in direct_confirmation_measure
menshikh-iv Dec 11, 2017
3c7b401
partial fix indirect_confirmation_measure
menshikh-iv Dec 11, 2017
206784d
HotFix for probability_estimation and segmentation
CLearERR Dec 12, 2017
406ab5c
Merge branch 'fix-1669' of https://github.com/CLearERR/gensim into fi…
CLearERR Dec 12, 2017
67962be
Refactoring for probability_estimation
CLearERR Dec 12, 2017
74c5c86
Changes for indirect_confirmation_measure
CLearERR Dec 14, 2017
ef058df
Fixed segmentation, partly fixed text_analysis
CLearERR Dec 18, 2017
0b06468
Add Notes for text_analysis
CLearERR Dec 18, 2017
e3779d4
fix di/ind
menshikh-iv Dec 19, 2017
482377b
fix doc examples in probability_estimation
menshikh-iv Dec 19, 2017
acdebb1
fix probability_estimation
menshikh-iv Dec 20, 2017
8a07dee
fix segmentation
menshikh-iv Dec 20, 2017
63c35c2
fix docstring in probability_estimation
menshikh-iv Dec 20, 2017
4b63f6c
partial fix test_analysis
menshikh-iv Dec 20, 2017
540021c
add latex stuff for docs build
menshikh-iv Dec 20, 2017
790e07d
merge upstream
menshikh-iv Jan 10, 2018
965587b
doc fix[1]
menshikh-iv Jan 10, 2018
f8f25cb
doc fix[2]
menshikh-iv Jan 10, 2018
f42ad8f
remove apt install from travis (now doc build in circle)
menshikh-iv Jan 10, 2018
14 changes: 7 additions & 7 deletions gensim/topic_coherence/direct_confirmation_measure.py
@@ -82,12 +82,12 @@ def aggregate_segment_sims(segment_sims, with_std, with_support):
 
 Parameters
 ----------
-segment_sims : iterable
-floating point similarity values to aggregate.
-with_std : bool
-Set to True to include standard deviation.
-with_support : bool
-Set to True to include number of elements in `segment_sims` as a statistic in the results returned.
+segment_sims : iterable
Review comment (Contributor): Iterable of ??
+floating point similarity values to aggregate.
Review comment (Contributor): No need to describe the type (floating point) in the description; please move it to the type.
+with_std : bool
+Set to True to include standard deviation.
+with_support : bool
+Set to True to include number of elements in `segment_sims` as a statistic in the results returned.
 
 Returns
 -------
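
For readers following the hunk above, here is a minimal sketch of the behaviour the docstring describes: a mean over the similarity values, optionally accompanied by the standard deviation and the number of values. The name `aggregate_sims_sketch`, the use of the population standard deviation, and the "bare mean or tuple" return shape are assumptions made for illustration; this is not the gensim implementation.

```python
import statistics


def aggregate_sims_sketch(segment_sims, with_std=False, with_support=False):
    """Hypothetical stand-in for aggregate_segment_sims, for illustration only."""
    # Materialize the iterable so generators can be measured more than once.
    sims = [float(s) for s in segment_sims]
    stats = [statistics.mean(sims)]
    if with_std:
        # Assumption: population standard deviation (divides by N, not N - 1).
        stats.append(statistics.pstdev(sims))
    if with_support:
        # "Support" is simply the number of values that were aggregated.
        stats.append(len(sims))
    # Assumption: a bare mean when no extra statistics are requested.
    return stats[0] if len(stats) == 1 else tuple(stats)


mean, std, support = aggregate_sims_sketch(
    [0.2, 0.4, 0.6], with_std=True, with_support=True
)
print(mean, std, support)
```

This also illustrates the reviewer's point: the parameter is really an "iterable of float", so the element type belongs in the type field rather than in the description.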
@@ -124,7 +124,7 @@ def log_ratio_measure(
 segmented_topics : list of (list of tuples)
 Output from the segmentation module of the segmented topics.
 accumulator: list
Review comment (Contributor): list of ?
-word occurrence accumulator from probability_estimation.
+Word occurrence accumulator from probability_estimation.
 with_std : bool
 True to also include standard deviation across topic segment
 sets in addition to the mean coherence for each topic; default is False.
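
As background for `log_ratio_measure`, the per-pair score can be sketched from document counts alone. The helper below assumes the standard log-ratio (PMI) definition, log((P(w1, w2) + eps) / (P(w1) * P(w2))), and its normalized variant, which divides by -log(P(w1, w2) + eps). The function name, its parameters, and the value of `EPSILON` are hypothetical; in the real module the counts would come from the word occurrence accumulator.

```python
import math

EPSILON = 1e-12  # assumed smoothing constant so that log(0) never occurs


def log_ratio_sketch(co_docs, w1_docs, w2_docs, num_docs, normalize=False):
    """Hypothetical helper: (normalized) log ratio for a single word pair.

    co_docs  -- number of documents containing both words
    w1_docs  -- number of documents containing the first word (must be > 0)
    w2_docs  -- number of documents containing the second word (must be > 0)
    num_docs -- total number of documents
    """
    p_joint = co_docs / num_docs
    p_w1 = w1_docs / num_docs
    p_w2 = w2_docs / num_docs
    score = math.log((p_joint + EPSILON) / (p_w1 * p_w2))
    if normalize:
        # Dividing by -log P(w1, w2) bounds the score to roughly [-1, 1].
        score /= -math.log(p_joint + EPSILON)
    return score


# Two words that each appear in half of the documents and always together.
print(log_ratio_sketch(4, 4, 4, 8))
print(log_ratio_sketch(4, 4, 4, 8, normalize=True))
```

Statistically independent words score close to 0, and words that always co-occur reach the top of the normalized range.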
63 changes: 43 additions & 20 deletions gensim/topic_coherence/indirect_confirmation_measure.py
@@ -4,9 +4,10 @@
 # Copyright (C) 2013 Radim Rehurek <radimrehurek@seznam.cz>
 # Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
 
-r"""
-This module contains functions to compute confirmation on a pair of words or word subsets.
+r"""This module contains functions to compute confirmation on a pair of words or word subsets.
 
+Notes
+-----
 The advantage of indirect confirmation measure is that it computes similarity of words in W' and
 W* with respect to direct confirmations to all words. Eg. Suppose x and z are both competing
 brands of cars, which semantically support each other. However, both brands are seldom mentioned
@@ -25,6 +26,7 @@
 \Bigg \{{\sum_{w_{i} \in W'}^{ } m(w_{i}, w_{j})^{\gamma}}\Bigg \}_{j = 1,...,|W|}
Review comment (Contributor): Use :math: for all formulas (please have a look at the Sphinx math/LaTeX docs).

Here 'm' is the direct confirmation measure used.

"""

import itertools
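
The module docstring's idea can be illustrated with a small, self-contained sketch: build a context vector for each word subset by summing the direct measure m(w_i, w_j) raised to gamma over the subset, then compare two subsets by the cosine of their vectors. The toy measure table, the word list, and the helper names below are invented for the illustration and are not taken from gensim.

```python
import math

# Made-up symmetric direct-confirmation scores. 'x' and 'z' (two car brands)
# never confirm each other directly, but both confirm 'road' and 'fuel'.
TOY_SCORES = {
    frozenset(["x"]): 1.0,
    frozenset(["z"]): 1.0,
    frozenset(["x", "road"]): 0.8,
    frozenset(["z", "road"]): 0.8,
    frozenset(["x", "fuel"]): 0.6,
    frozenset(["z", "fuel"]): 0.6,
}


def toy_measure(w_i, w_j):
    """Hypothetical direct measure m(w_i, w_j); unknown pairs score 0."""
    return TOY_SCORES.get(frozenset((w_i, w_j)), 0.0)


def context_vector(subset, vocabulary, measure, gamma=1):
    """Component j is the sum over w_i in subset of measure(w_i, w_j) ** gamma."""
    return [sum(measure(w_i, w_j) ** gamma for w_i in subset) for w_j in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocabulary = ["x", "z", "road", "fuel"]
v_x = context_vector(["x"], vocabulary, toy_measure)
v_z = context_vector(["z"], vocabulary, toy_measure)
similarity = cosine(v_x, v_z)
print(toy_measure("x", "z"), similarity)
```

Although the direct score between `x` and `z` is zero in this toy table, their context vectors overlap on the shared words, so the indirect similarity is clearly positive.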
@@ -126,24 +128,45 @@ def cosine_similarity(
 \vec{V}^{\,}_{m,\gamma}(W') =
 \Bigg \{{\sum_{w_{i} \in W'}^{ } m(w_{i}, w_{j})^{\gamma}}\Bigg \}_{j = 1,...,|W|}
 
-Args:
-segmented_topics: Output from the segmentation module of the
-segmented topics. Is a list of list of tuples.
-accumulator: Output from the probability_estimation module. Is an
-accumulator of word occurrences (see text_analysis module).
-topics: Topics obtained from the trained topic model.
-measure (str): Direct confirmation measure to be used. Supported
-values are "nlr" (normalized log ratio).
-gamma: Gamma value for computing W', W* vectors; default is 1.
-with_std (bool): True to also include standard deviation across topic
-segment sets in addition to the mean coherence for each topic;
-default is False.
-with_support (bool): True to also include support across topic segments.
-The support is defined as the number of pairwise similarity
-comparisons were used to compute the overall topic coherence.
-
-Returns:
-list: of indirect cosine similarity measure for each topic.
+Parameters
+----------
+segmented_topics: list of (list of tuples)
+Output from the segmentation module of the segmented topics.
+accumulator: accumulator of word occurrences (see text_analysis module).
Review comment (Contributor): This isn't a type; you can link to a concrete class (plus links to the module), like :class:`~gensim.topic_coherence...`
+Output from the probability_estimation module. Is an topics: Topics obtained from the trained topic model.
+measure : str
+Direct confirmation measure to be used. Supported values are "nlr" (normalized log ratio).
+gamma:
+Gamma value for computing W', W* vectors; default is 1.
Review comment (Contributor): Use :math:, and there is no need to duplicate the default value if it is already defined in the signature (here and everywhere).
+with_std : bool
+True to also include standard deviation across topic segment sets in addition to the mean coherence
+for each topic; default is False.
+with_support : bool
+True to also include support across topic segments. The support is defined as the number of pairwise similarity
+comparisons were used to compute the overall topic coherence.
+
+Returns
+-------
+list
+List of indirect cosine similarity measure for each topic.
+
+Examples
+--------
+>>> from gensim.corpora.dictionary import Dictionary
Review comment (Contributor): This is a big example; it needs comments explaining what happens at each step.
+>>> from gensim.topic_coherence import indirect_confirmation_measure,text_analysis
+>>> import numpy as np
+>>> dictionary = Dictionary()
+>>> dictionary.id2token = {1: 'fake', 2: 'tokens'}
+>>> accumulator = text_analysis.InvertedIndexAccumulator({1, 2}, dictionary)
+>>> accumulator._inverted_index = {0: {2, 3, 4}, 1: {3, 5}}
+>>> accumulator._num_docs = 5
+>>> topics = [np.array([1, 2])]
+>>> segmentation = [[(1, np.array([1, 2])), (2, np.array([1, 2]))]]
+>>> gamma = 1
+>>> measure = 'nlr'
+>>> obtained = indirect_confirmation_measure.cosine_similarity(segmentation, accumulator, topics, measure, gamma)
+>>> print obtained[0]
+0.623018926945
 
 """
 context_vectors = ContextVectorComputer(measure, topics, accumulator, gamma)
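
One way the review comments on this hunk could be addressed is sketched below as a documentation-only stub. It is not the final merged docstring. The stub name, the signature defaults, the `float` type for `gamma`, the `list of numpy.ndarray` type for `topics`, and the choice of `InvertedIndexAccumulator` as the linked class (taken from the example in the hunk) are all assumptions.

```python
def cosine_similarity_doc_sketch(segmented_topics, accumulator, topics,
                                 measure='nlr', gamma=1,
                                 with_std=False, with_support=False):
    r"""Calculate the indirect cosine measure (documentation sketch only).

    Parameters
    ----------
    segmented_topics : list of (list of tuples)
        Output from the segmentation module of the segmented topics.
    accumulator : :class:`~gensim.topic_coherence.text_analysis.InvertedIndexAccumulator`
        Word occurrence accumulator, output from the probability_estimation module.
    topics : list of numpy.ndarray
        Topics obtained from the trained topic model.
    measure : str
        Direct confirmation measure to be used. Supported values are "nlr" (normalized log ratio).
    gamma : float
        Exponent :math:`\gamma` used when computing the :math:`W'` and :math:`W^{*}` vectors.
    with_std : bool
        True to also include standard deviation across topic segment sets in addition to the
        mean coherence for each topic.
    with_support : bool
        True to also include support across topic segments, i.e. the number of pairwise
        similarity comparisons used to compute the overall topic coherence.

    Returns
    -------
    list
        Indirect cosine similarity measure for each topic.

    """
    # Only the docstring matters here; the body is intentionally not implemented.
    raise NotImplementedError("documentation sketch only")
```

Each parameter now carries a type after the colon, the accumulator links to a concrete class, `gamma` uses the `:math:` role, and no description repeats a value that the signature already states.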