[MRG] Lda training visualization in visdom #1399
Conversation
…into tensorboard_logs
Updated this PR to use visdom and removed the tensorboard code. There are a few parameters (distance, coherence, texts, window_size, topn) in ldamodel required for two different functionalities (one for the visualization in this PR, the other for the logging in #1381). Maybe we can define a dict-based input for the visualization-related parameters, for example `Viz={param1: value1, param2: value2, ...}`. Or otherwise, what alternatives could be used?
Please add the ipynb with instructions on how to set up visdom to show the live stats.
gensim/models/ldamodel.py
Outdated
alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10,
iterations=50, gamma_threshold=0.001, minimum_probability=0.01,
random_state=None, ns_conf={}, minimum_phi_value=0.01,
per_word_topics=False, viz=False, env=None, distance="kulback_leibler",
Move all parameters to `metrics`; by default `metrics={}`, and in the other case `metrics` contains the parameters needed for visualizing things, for example:

metrics = {
    'diff_distance': 'jaccard',
    'coherence_window_size': 10,
    ...
}
And don't forget about args validation (write tests).
Validation of every arg's object type? Or of the valid values for 'diff_distance' and 'coherence' (though there are already tests for those in the respective functions of Diff and CoherenceModel)?
" Now, this graph along with the others explained below, can be used to decide if it's time to stop the training. We can see if the value stops changing after some epochs and that we are able to get the highest possible coherence of our model. \n", | ||
"\n", | ||
"\n", | ||
"2. **Perplexity**\n", |
The numbers will not be rendered correctly, please remove them
gensim/models/ldamodel.py
Outdated
`env` defines the environment to use in the visdom browser
`distance` measure to be used for the Diff plot visualization
Edit the docstring in accordance with the comment about viz.
gensim/models/ldamodel.py
Outdated
gamma_threshold=None, chunks_as_numpy=False):
def update(self, corpus, chunksize=None, decay=None, offset=None, passes=None, update_every=None,
           eval_every=None, iterations=None, gamma_threshold=None, chunks_as_numpy=False,
           viz=None, env=None, distance=None, coherence=None, texts=None, window_size=None, topn=None):
Please add `metrics` in place of all the metric-related args.
gensim/models/ldamodel.py
Outdated
if self.viz:
    # calculate coherence
    cm = gensim.models.CoherenceModel(model=self, corpus=corpus, texts=texts, coherence=coherence, window_size=window_size, topn=topn)
    Coherence = np.array([cm.get_coherence()])
The variable name should be in lowercase (here and everywhere).
gensim/models/ldamodel.py
Outdated
if reallen != lencorpus:
    raise RuntimeError("input corpus size changed during training (don't use generators as input)")

if self.viz:
    # calculate coherence
Move this block into a separate function (together with the stat calculation).
The error for Travis:

Seems like a real error.
There's a typo in "kullback_leibler" in the diff function here. I replaced it with the correct spelling in this PR but forgot to make the replacement in the tests as well. Though this PR might get a bit delayed due to API decisions, so I'll just make another PR to correct this typo.
@tmylk @menshikh-iv updated the API structure as discussed recently
alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10,
iterations=50, gamma_threshold=0.001, minimum_probability=0.01,
random_state=None, ns_conf={}, minimum_phi_value=0.01,
per_word_topics=False, callbacks=None):
Add a description for `callbacks` in the docstring.
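For reference, a minimal sketch of how `callbacks` is used, modeled on the notebook further down in this conversation (treat the exact constructor arguments as assumptions):

from gensim.models.ldamodel import LdaModel
from gensim.models.callbacks import PerplexityMetric, CoherenceMetric

# visdom-backed metrics, assuming a visdom server is already running
pl_holdout = PerplexityMetric(corpus=holdout_corpus, logger="visdom", title="Perplexity (holdout)")
ch_umass = CoherenceMetric(corpus=training_corpus, coherence="u_mass", logger="visdom", title="Coherence (u_mass)")

lda = LdaModel(corpus=training_corpus, id2word=training_dictionary, num_topics=5,
               passes=10, callbacks=[pl_holdout, ch_umass])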
gensim/models/callbacks.py
Outdated
if any(isinstance(metric, (DiffMetric, ConvergenceMetric)) for metric in self.metrics):
    self.previous = copy.deepcopy(model)
    # store diff diagonals of previous epochs
    self.diff_mat = Queue()
What's the reason to use a queue (not a list)?
As there could be any number of diff/convergence metric inputs (e.g. with different distance measures), using a queue means we don't need to keep track of their count and the epoch count; we can simply depend on the sequence order, as in the sketch below.
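A rough sketch of that FIFO idea (hypothetical helper names, not the PR's actual code):

from queue import Queue  # Python 3 stdlib

diff_mat = Queue()  # one diff diagonal per Diff/Convergence metric, in metric order

# first epoch: seed the queue, one entry per metric
for metric in diff_convergence_metrics:     # hypothetical list of such metrics
    diff_mat.put(compute_diagonal(metric))  # hypothetical helper

# every later epoch, iterating the metrics in the same fixed order:
for metric in diff_convergence_metrics:
    previous = diff_mat.get()    # FIFO order lines up with metric order automatically
    current = compute_diagonal(metric)
    diff_mat.put(current)        # re-enqueue for the next epoch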
gensim/models/callbacks.py
Outdated
class CoherenceMetric(Metric):
    def __init__(self, corpus=None, texts=None, dictionary=None, coherence=None, window_size=None, topn=None, logger=None, viz_env=None, title=None):
Add a docstring for all parameters, with descriptions (here and everywhere).
Maybe it's a good idea to use `logger="shell"` by default for all scalar metrics?
@@ -0,0 +1,189 @@
{
Please add an example with `logger="shell"` in the notebook (and show the logging output in the notebook).
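Something along these lines would do it, reusing the metric constructors that appear later in this conversation (a sketch; logging has to be configured for the per-epoch INFO messages to show up in the notebook):

import logging
logging.basicConfig(level=logging.INFO)  # surface the per-epoch INFO messages

ch_umass = CoherenceMetric(corpus=training_corpus, coherence="u_mass",
                           logger="shell", title="Coherence (u_mass)")
lda = LdaModel(corpus=training_corpus, id2word=training_dictionary,
               num_topics=5, passes=5, callbacks=[ch_umass])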
Finally have some time for review, wrapping my head around the classes and callbacks here :) Your help will be appreciated.
gensim/models/callbacks.py
Outdated
def get_value(self, **parameters):
    """
    Set the parameters
Isn't `get_value` a misnomer for setting parameters?
Yes, I'll replace it with `set_parameters`.
"# define other remaining metrics available\n", | ||
"ch_umass = CoherenceMetric(corpus=training_corpus, coherence=\"u_mass\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Coherence (u_mass)\")\n", | ||
"diff_kl = DiffMetric(distance=\"kullback_leibler\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Diff (kullback_leibler)\")\n", | ||
"convergence_jc = ConvergenceMetric(distance=\"hellinger\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Convergence (jaccard)\")\n", |
Docs say jaccard, but distance is hellinger.
"source": [ | ||
"# Training Logs\n", | ||
"\n", | ||
"We can also log the metric values after every epoch to the shell apart from visualizing them in Visdom. The only change is to define `logger=\"shell\"` instead of `\"visdom\"` in the input callbacks." |
Can you add an example of how to get the values out programmatically (no logging or plotting, using them from arbitrary Python code instead)?
It can be done by using a loop to manually iterate over the model, calling the metric classes at the end of each pass to store the values:

model = LdaModel(corpus=training_corpus, id2word=dictionary, num_topics=5, passes=1)
perplexity = []
for epoch in range(epochs):
    model.update(training_corpus, passes=1)
    # compute the metric against the model's current state
    pl = PerplexityMetric(corpus=training_corpus).get_value(model=model)
    perplexity.append(pl)

and sure, I'll add an example for this in the notebook.
Thanks. Is there a way to use the callbacks so that they collect this info?

I'm thinking of a type of "logger" that, instead of logging, appends the value to some internal list, which other parts of the app can read from.

The idea is that the interface would be the same as for `logger="visdom"`/`"shell"`, without the need for an explicit outer loop like in your example.

Is that possible?
Yes, we can store them just after they are calculated in this step. Maybe in a dict which could be an attribute of LdaModel? The structure could be:

metrics = {'PerplexityMetric': [val1, val2, ...], 'DiffMetric': [val1, val2, ...]}

`on_epoch_end()` can be made to return the metric values from the current epoch, which could then be appended to the `metrics` dict after this step.
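Roughly this plumbing (a sketch; `on_epoch_end()` returning a `{metric_label: value}` dict is the assumption here):

# inside LdaModel.update(), once per training run
self.metrics = {}  # e.g. {'PerplexityMetric': [val1, val2, ...], 'DiffMetric': [...]}

# at the end of every epoch
current_metrics = callback.on_epoch_end(epoch)  # assumed to return {label: value}
for label, value in current_metrics.items():
    self.metrics.setdefault(label, []).append(value)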
Used a `metrics` dict to save values as described above.
"INFO:gensim.models.ldamodel:topic #3 (0.200): 0.041*\"the\" + 0.024*\"of\" + 0.021*\"to\" + 0.016*\"a\" + 0.016*\"and\" + 0.012*\"s\" + 0.010*\"in\" + 0.007*\"as\" + 0.006*\"with\" + 0.006*\"party\"\n", | ||
"INFO:gensim.models.ldamodel:topic #4 (0.200): 0.034*\"the\" + 0.023*\"of\" + 0.016*\"a\" + 0.015*\"and\" + 0.012*\"in\" + 0.011*\"to\" + 0.008*\"as\" + 0.007*\"is\" + 0.007*\"with\" + 0.007*\"that\"\n", | ||
"INFO:gensim.models.ldamodel:topic diff=1.878616, rho=1.000000\n", | ||
"INFO:LdaModel:Epoch 0: Perplexity estimate: 495.948922721\n", |
Please use the same logging format as in LdaModel (`INFO:gensim.models.ldamodel`).
Done
"INFO:gensim.models.ldamodel:topic #4 (0.200): 0.034*\"the\" + 0.023*\"of\" + 0.016*\"a\" + 0.015*\"and\" + 0.012*\"in\" + 0.011*\"to\" + 0.008*\"as\" + 0.007*\"is\" + 0.007*\"with\" + 0.007*\"that\"\n", | ||
"INFO:gensim.models.ldamodel:topic diff=1.878616, rho=1.000000\n", | ||
"INFO:LdaModel:Epoch 0: Perplexity estimate: 495.948922721\n", | ||
"INFO:LdaModel:Epoch 0: Perplexity estimate: 609.481073631\n", |
Why is perplexity shown twice (and with different values)?
One is for the holdout corpus and the other for the test corpus. Though, I've now updated the log statement to take the metric name from the input parameter `title`.
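That is, something like the following in the notebook, so the two logs are distinguishable (a sketch using the `title` parameter):

pl_holdout = PerplexityMetric(corpus=holdout_corpus, logger="shell", title="Perplexity (holdout)")
pl_test = PerplexityMetric(corpus=test_corpus, logger="shell", title="Perplexity (test)")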
"INFO:LdaModel:Epoch 0: Perplexity estimate: 609.481073631\n", | ||
"INFO:LdaModel:Epoch 0: Coherence estimate: -22.4407925538\n", | ||
"INFO:LdaModel:Epoch 0: Diff estimate: [ 0.44194712 0.96670853 0.8036907 0.72372737 0.63141336]\n", | ||
"INFO:LdaModel:Epoch 0: Convergence estimate: 0.0\n", |
Why is it always zero?
The previous state of the model (needed to calculate `Convergence`) was being saved in the wrong place. Corrected it.
"INFO:LdaModel:Epoch 0: Coherence estimate: -22.4407925538\n", | ||
"INFO:LdaModel:Epoch 0: Diff estimate: [ 0.44194712 0.96670853 0.8036907 0.72372737 0.63141336]\n", | ||
"INFO:LdaModel:Epoch 0: Convergence estimate: 0.0\n", | ||
"INFO:gensim.models.ldamodel:-7.121 per-word bound, 139.2 perplexity estimate based on a held-out corpus of 25 documents with 2182 words\n", |
Perplexity is calculated twice, wdyt about it @parulsethi? You should disable the evaluation flag in the notebook.
Disabled `eval_every`.
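That is, in the notebook (a sketch; `eval_every=None` skips LdaModel's built-in perplexity estimation so it isn't computed twice):

lda = LdaModel(corpus=training_corpus, id2word=training_dictionary, num_topics=5,
               passes=10, eval_every=None, callbacks=callbacks)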
['graph', 'trees', 'binary', 'widths']]
"""
# only one of the model or topic would be defined
self.model = None
Why do you need this assignment? (only in the current Callback)
Both `model` and `topics` can be used to calculate Coherence, and only one of them will be defined in `**kwargs`. So this assignment is just to avoid a name-not-defined error for whichever variable is not in `**kwargs`.
gensim/models/callbacks.py
Outdated
other_model : second topic model instance to calculate the difference from
"""
super(DiffMetric, self).set_parameters(**kwargs)
diff_matrix, _ = self.model.diff(self.other_model, self.distance, self.num_words, self.n_ann_terms, self.normed)
Now you can use the new version of `diff` (with the `diagonal` and `annotation` flags).
Updated
Minor code style comments.
"test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])\n", | ||
"lee_corpus = test_data_dir + os.sep + 'lee.cor'\n", | ||
"lee_train_file = test_data_dir + os.sep + 'lee_background.cor'\n", |
`os.path.join` is more standard.
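That is, directly equivalent to the two lines above:

import os

lee_corpus = os.path.join(test_data_dir, 'lee.cor')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')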
Updated
gensim/models/ldamodel.py
Outdated
callback = Callback(self.callbacks)
callback.set_model(self)
# initialize metrics dict to store metric values after every epoch
self.metrics = {}
for metric in self.callbacks:
A dict comprehension would be more readable? Also, `defaultdict` might make the logic a little simpler.
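For example (a sketch of the suggestion; `defaultdict(list)` removes the need to pre-create an empty list per metric):

from collections import defaultdict

# metric label -> list of per-epoch values; lists are created lazily on first access
self.metrics = defaultdict(list)
self.metrics[str(metric)].append(value)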
Updated to use `defaultdict`.
gensim/models/callbacks.py
Outdated
# plot all metrics in current epoch
for i, metric in enumerate(self.metrics):
    value = metric.get_value(topics=topics, model=self.model, other_model=self.previous)
    if metric.title is not None:
Please remove this if/else; instead, define a `__str__` or `__repr__` method for each metric class, after which you can use `label = str(metric)`.
gensim/models/callbacks.py
Outdated
""" | ||
Base Metric class for topic model evaluation metrics | ||
""" | ||
def __init__(self): |
No need to define an empty `__init__` in the base class.
LGTM I think; last minor changes and I'll merge it.
gensim/models/callbacks.py
Outdated
@@ -87,6 +93,9 @@ def __init__(self, corpus=None, texts=None, dictionary=None, coherence=None, win
self.viz_env = viz_env
self.title = title

def __str__(self):
By default, if a child class has no such method, the method from the parent class is called, so there's no need to define `__str__` in each callback explicitly.
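A sketch of the combined suggestion (attribute names assumed; `title` falls back to the class name):

class Metric(object):
    """Base Metric class for topic model evaluation metrics."""

    def __str__(self):
        # inherited by every metric subclass unless overridden
        return self.title if self.title is not None else type(self).__name__

# then, in the epoch-end loop:
label = str(metric)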
"\n", | ||
"def read_corpus(fname):\n", | ||
" texts = []\n", | ||
" with open(fname, encoding=\"ISO-8859-1\") as f:\n", |
This doesn't work on Python 2 (because `open` doesn't support the `encoding` argument there); replace it with smart_open.
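A sketch of that replacement (assuming the smart_open package; it yields bytes, so the decode is explicit and works on both Python 2 and 3):

from smart_open import smart_open  # assumption: smart_open is installed

def read_corpus(fname):
    texts = []
    for line in smart_open(fname, 'rb'):
        texts.append(line.decode("ISO-8859-1").split())
    return texts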
"\n", | ||
"\n", | ||
"# Set file names for train and test data\n", | ||
"test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')\n", |
Can you use a larger dataset for this? (downloaded from a link in the notebook)
"test_dictionary = Dictionary(test_texts)\n", | ||
"\n", | ||
"training_corpus = [training_dictionary.doc2bow(text) for text in training_texts]\n", | ||
"holdout_corpus = [holdout_dictionary.doc2bow(text) for text in holdout_texts]\n", |
These are 3 different mappings, which is a mistake. You should fit your `Dictionary` on training_texts and use it for all conversions (for holdout/test too).
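That is, a sketch of the fix:

from gensim.corpora import Dictionary

# fit the mapping once, on the training texts only
training_dictionary = Dictionary(training_texts)

# ...and reuse it for every split
training_corpus = [training_dictionary.doc2bow(text) for text in training_texts]
holdout_corpus = [training_dictionary.doc2bow(text) for text in holdout_texts]
test_corpus = [training_dictionary.doc2bow(text) for text in test_texts]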
Could you please add an actual screenshot of the entire Visdom visualization that your code generates to the ipynb? With both train and hold-out perplexity in particular.
… into tensorboard_logs
Congratulations @parulsethi, this summer you did a lot of viz for gensim.
This PR adds an option to visualize LDA evaluation parameters in real time, during or after training, using visdom.