piskvorky · menshikh-iv · Aug 30, 2017 · Jun 7, 2017 · Jun 7, 2017 · Jun 7, 2017
diff --git a/docs/notebooks/Coherence.gif b/docs/notebooks/Coherence.gif
diff --git a/docs/notebooks/Convergence.gif b/docs/notebooks/Convergence.gif
diff --git a/docs/notebooks/Diff.gif b/docs/notebooks/Diff.gif
diff --git a/docs/notebooks/Perplexity.gif b/docs/notebooks/Perplexity.gif
diff --git a/docs/notebooks/Training_visualizations.ipynb b/docs/notebooks/Training_visualizations.ipynb
@@ -0,0 +1,281 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Setup Visdom\n",
+    "\n",
+    "Install it with:\n",
+    "\n",
+    "`pip install visdom`\n",
+    "\n",
+    "Start the server:\n",
+    "\n",
+    "`python -m visdom.server`\n",
+    "\n",
+    "Visdom can now be accessed at http://localhost:8097 in the browser.\n",
+    "\n",
+    "\n",
+    "# LDA Training Visualization\n",
+    "\n",
+    "To monitor LDA training, pass a list of metric callbacks to the `LdaModel` constructor through its `callbacks` argument; their values are then plotted live as training progresses.  \n",
+    "\n",
+    "Let's plot the training stats for an LDA model trained on the Lee corpus. We will use the four evaluation metrics available for topic models in gensim: Coherence, Perplexity, Topic Diff and Convergence. Perplexity is evaluated on two separate corpora, a hold_out corpus and a test corpus."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Using TensorFlow backend.\n"
+     ]
+    }
+   ],
+   "source": [
+    "import os\n",
+    "import re\n",
+    "import gensim\n",
+    "from gensim.models import ldamodel\n",
+    "from gensim.corpora.dictionary import Dictionary\n",
+    "\n",
+    "# Set file names for train data\n",
+    "test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])\n",
+    "lee_corpus = test_data_dir + os.sep + 'lee.cor'\n",
+    "\n",
+    "def read_corpus(fname):\n",
+    "    texts = []\n",
+    "    with open(fname, encoding=\"ISO-8859-1\") as f:\n",
+    "        for line in f:\n",
+    "            # lower case all words\n",
+    "            lowered = line.lower()\n",
+    "            # remove punctuation and split into separate words\n",
+    "            words = re.findall(r'\\w+', lowered)\n",
+    "            texts.append(words)\n",
+    "    return texts\n",
+    "\n",
+    "texts = read_corpus(lee_corpus)\n",
+    "\n",
+    "# Split the data into training, hold_out and test corpus\n",
+    "training_texts = texts[:25]\n",
+    "holdout_texts = texts[25:40]\n",
+    "test_texts = texts[40:50]\n",
+    "\n",
+    "# Build one dictionary from the training texts and reuse it for every corpus,\n",
+    "# so that word ids in the hold_out and test corpora match the ids the model is trained on\n",
+    "# (doc2bow drops words that were not seen in the training texts)\n",
+    "training_dictionary = Dictionary(training_texts)\n",
+    "training_corpus = [training_dictionary.doc2bow(text) for text in training_texts]\n",
+    "holdout_corpus = [training_dictionary.doc2bow(text) for text in holdout_texts]\n",
+    "test_corpus = [training_dictionary.doc2bow(text) for text in test_texts]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "from gensim.models.callbacks import CoherenceMetric, DiffMetric, PerplexityMetric, ConvergenceMetric\n",
+    "\n",
+    "# define perplexity callback for hold_out and test corpus\n",
+    "pl_holdout = PerplexityMetric(corpus=holdout_corpus, logger=\"visdom\", viz_env=\"LdaModel\", title=\"Perplexity (hold_out)\")\n",
+    "pl_test = PerplexityMetric(corpus=test_corpus, logger=\"visdom\", viz_env=\"LdaModel\", title=\"Perplexity (test)\")\n",
+    "\n",
+    "# define other remaining metrics available\n",
+    "ch_umass = CoherenceMetric(corpus=training_corpus, coherence=\"u_mass\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Coherence (u_mass)\")\n",
+    "diff_kl = DiffMetric(distance=\"kullback_leibler\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Diff (kullback_leibler)\")\n",
+    "convergence_jc = ConvergenceMetric(distance=\"jaccard\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Convergence (jaccard)\")\n",
+    "\n",
+    "callbacks = [pl_holdout, pl_test, ch_umass, diff_kl, convergence_jc]\n",
+    "\n",
+    "# training LDA model\n",
+    "model = ldamodel.LdaModel(corpus=training_corpus, id2word=training_dictionary, passes=5, num_topics=5, callbacks=callbacks)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once training starts, open http://localhost:8097 to watch its progress."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "-22.4298221364\n"
+     ]
+    }
+   ],
+   "source": [
+    "# to get a metric value on a trained model\n",
+    "print(CoherenceMetric(corpus=training_corpus, coherence=\"u_mass\").get_value(model=model))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The four types of graphs which are plotted for LDA:\n",
+    "\n",
+    "**Coherence**\n",
+    "\n",
+    "Coherence is a measure used to evaluate topic models. A good model generates coherent topics, i.e., topics with high topic coherence scores. Good topics can be described by a short label based on their top terms.\n",
+    "\n",
+    "<img src=\"Coherence.gif\">\n",
+    "\n",
+    "This graph, along with the others explained below, can be used to decide when to stop training: once the value stops changing over several epochs, the model has likely reached the highest coherence it can attain.  \n",
+    "\n",
+    "\n",
+    "**Perplexity**\n",
+    "\n",
+    "Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. In LDA, topics are described by a probability distribution over vocabulary words. So, perplexity can be used to compare probabilistic models like LDA.\n",
+    "\n",
+    "<img src=\"Perplexity.gif\">\n",
+    "\n",
+    "For a good model, perplexity should be as low as possible.\n",
+    "\n",
+    "\n",
+    "**Topic Difference**\n",
+    "\n",
+    "Topic Diff calculates the distance between two LDA models. The distance is computed topic by topic, using either the topics' probability distributions over vocabulary words (kullback_leibler, hellinger) or simply the vocabulary words that the topics from both models have in common (jaccard).\n",
+    "\n",
+    "<img src=\"Diff.gif\">\n",
+    "\n",
+    "In the heatmap, the X-axis shows the epoch number and the Y-axis shows the topic; each cell holds the distance between the same topic in consecutive epochs. For example, a cell with values (x=3, y=5, z=0.4) gives the distance (0.4) between topic 5 at the 3rd epoch and topic 5 at the 2nd epoch. As the epochs increase, the distance between identical topics should decrease.\n",
+    "  \n",
+    "  \n",
+    "**Convergence**\n",
+    "\n",
+    "Convergence is the sum of the differences between all identical topics from two consecutive epochs. In other words, it is the sum of the values in one column of the heatmap above.\n",
+    "\n",
+    "<img src=\"Convergence.gif\">\n",
+    "\n",
+    "The model is said to have converged when the convergence value stops decreasing as the epochs increase."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Training Logs\n",
+    "\n",
+    "In addition to visualizing the metric values in Visdom, we can log them to the shell after every epoch. The only change is to set `logger=\"shell\"` instead of `\"visdom\"` in the callbacks."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "scrolled": false
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "INFO:gensim.models.ldamodel:using symmetric alpha at 0.2\n",
+      "INFO:gensim.models.ldamodel:using symmetric eta at 0.0009950248756218905\n",
+      "INFO:gensim.models.ldamodel:using serial LDA version on this node\n",
+      "INFO:gensim.models.ldamodel:running online (multi-pass) LDA training, 5 topics, 3 passes over the supplied corpus of 25 documents, updating model once every 25 documents, evaluating perplexity every 25 documents, iterating 50x with a convergence threshold of 0.001000\n",
+      "WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy\n",
+      "INFO:gensim.models.ldamodel:-9.032 per-word bound, 523.4 perplexity estimate based on a held-out corpus of 25 documents with 2182 words\n",
+      "INFO:gensim.models.ldamodel:PROGRESS: pass 0, at document #25/25\n",
+      "INFO:gensim.models.ldamodel:topic #0 (0.200): 0.025*\"the\" + 0.022*\"to\" + 0.013*\"in\" + 0.011*\"of\" + 0.010*\"and\" + 0.007*\"s\" + 0.007*\"a\" + 0.005*\"said\" + 0.004*\"he\" + 0.004*\"iraqi\"\n",
+      "INFO:gensim.models.ldamodel:topic #1 (0.200): 0.054*\"the\" + 0.028*\"to\" + 0.023*\"of\" + 0.019*\"and\" + 0.017*\"in\" + 0.016*\"a\" + 0.013*\"s\" + 0.009*\"that\" + 0.008*\"has\" + 0.007*\"on\"\n",
+      "INFO:gensim.models.ldamodel:topic #2 (0.200): 0.049*\"the\" + 0.027*\"in\" + 0.024*\"to\" + 0.016*\"of\" + 0.012*\"and\" + 0.007*\"a\" + 0.007*\"that\" + 0.006*\"is\" + 0.006*\"for\" + 0.006*\"work\"\n",
+      "INFO:gensim.models.ldamodel:topic #3 (0.200): 0.041*\"the\" + 0.024*\"of\" + 0.021*\"to\" + 0.016*\"a\" + 0.016*\"and\" + 0.012*\"s\" + 0.010*\"in\" + 0.007*\"as\" + 0.006*\"with\" + 0.006*\"party\"\n",
+      "INFO:gensim.models.ldamodel:topic #4 (0.200): 0.034*\"the\" + 0.023*\"of\" + 0.016*\"a\" + 0.015*\"and\" + 0.012*\"in\" + 0.011*\"to\" + 0.008*\"as\" + 0.007*\"is\" + 0.007*\"with\" + 0.007*\"that\"\n",
+      "INFO:gensim.models.ldamodel:topic diff=1.878616, rho=1.000000\n",
+      "INFO:LdaModel:Epoch 0: Perplexity estimate: 495.948922721\n",
+      "INFO:LdaModel:Epoch 0: Perplexity estimate: 609.481073631\n",
+      "INFO:LdaModel:Epoch 0: Coherence estimate: -22.4407925538\n",
+      "INFO:LdaModel:Epoch 0: Diff estimate: [ 0.44194712  0.96670853  0.8036907   0.72372737  0.63141336]\n",
+      "INFO:LdaModel:Epoch 0: Convergence estimate: 0.0\n",
+      "INFO:gensim.models.ldamodel:-7.121 per-word bound, 139.2 perplexity estimate based on a held-out corpus of 25 documents with 2182 words\n",
+      "INFO:gensim.models.ldamodel:PROGRESS: pass 1, at document #25/25\n",
+      "INFO:gensim.models.ldamodel:topic #0 (0.200): 0.024*\"the\" + 0.015*\"to\" + 0.014*\"in\" + 0.008*\"iraqi\" + 0.008*\"of\" + 0.008*\"s\" + 0.007*\"and\" + 0.007*\"said\" + 0.005*\"u\" + 0.005*\"air\"\n",
+      "INFO:gensim.models.ldamodel:topic #1 (0.200): 0.056*\"the\" + 0.029*\"to\" + 0.025*\"of\" + 0.022*\"and\" + 0.017*\"in\" + 0.015*\"a\" + 0.013*\"s\" + 0.008*\"that\" + 0.008*\"has\" + 0.008*\"on\"\n",
+      "INFO:gensim.models.ldamodel:topic #2 (0.200): 0.050*\"the\" + 0.027*\"in\" + 0.025*\"to\" + 0.017*\"of\" + 0.010*\"and\" + 0.008*\"is\" + 0.007*\"be\" + 0.007*\"work\" + 0.007*\"a\" + 0.007*\"that\"\n",
+      "INFO:gensim.models.ldamodel:topic #3 (0.200): 0.031*\"the\" + 0.020*\"of\" + 0.019*\"to\" + 0.018*\"a\" + 0.014*\"and\" + 0.012*\"s\" + 0.009*\"in\" + 0.008*\"as\" + 0.007*\"party\" + 0.007*\"by\"\n",
+      "INFO:gensim.models.ldamodel:topic #4 (0.200): 0.033*\"the\" + 0.022*\"of\" + 0.014*\"and\" + 0.014*\"a\" + 0.012*\"in\" + 0.010*\"to\" + 0.010*\"iraq\" + 0.008*\"with\" + 0.008*\"that\" + 0.008*\"as\"\n",
+      "INFO:gensim.models.ldamodel:topic diff=0.570396, rho=0.577350\n",
+      "INFO:LdaModel:Epoch 1: Perplexity estimate: 439.092043455\n",
+      "INFO:LdaModel:Epoch 1: Perplexity estimate: 534.74704957\n",
+      "INFO:LdaModel:Epoch 1: Coherence estimate: -22.4302876151\n",
+      "INFO:LdaModel:Epoch 1: Diff estimate: [ 0.08300085  0.0312854   0.06168011  0.06740617  0.08645297]\n",
+      "INFO:LdaModel:Epoch 1: Convergence estimate: 0.0\n",
+      "INFO:gensim.models.ldamodel:-6.842 per-word bound, 114.7 perplexity estimate based on a held-out corpus of 25 documents with 2182 words\n",
+      "INFO:gensim.models.ldamodel:PROGRESS: pass 2, at document #25/25\n",
+      "INFO:gensim.models.ldamodel:topic #0 (0.200): 0.023*\"the\" + 0.014*\"in\" + 0.011*\"to\" + 0.009*\"iraqi\" + 0.008*\"s\" + 0.007*\"said\" + 0.007*\"u\" + 0.007*\"air\" + 0.007*\"coalition\" + 0.007*\"military\"\n",
+      "INFO:gensim.models.ldamodel:topic #1 (0.200): 0.056*\"the\" + 0.030*\"to\" + 0.026*\"of\" + 0.022*\"and\" + 0.017*\"in\" + 0.014*\"a\" + 0.014*\"s\" + 0.008*\"has\" + 0.008*\"on\" + 0.008*\"that\"\n",
+      "INFO:gensim.models.ldamodel:topic #2 (0.200): 0.050*\"the\" + 0.026*\"in\" + 0.025*\"to\" + 0.017*\"of\" + 0.008*\"and\" + 0.008*\"is\" + 0.008*\"be\" + 0.008*\"work\" + 0.007*\"a\" + 0.007*\"that\"\n",
+      "INFO:gensim.models.ldamodel:topic #3 (0.200): 0.027*\"the\" + 0.021*\"a\" + 0.019*\"of\" + 0.019*\"to\" + 0.013*\"s\" + 0.013*\"and\" + 0.009*\"in\" + 0.009*\"as\" + 0.009*\"party\" + 0.008*\"by\"\n",
+      "INFO:gensim.models.ldamodel:topic #4 (0.200): 0.036*\"the\" + 0.022*\"of\" + 0.015*\"and\" + 0.013*\"a\" + 0.012*\"in\" + 0.011*\"iraq\" + 0.011*\"to\" + 0.010*\"with\" + 0.009*\"that\" + 0.008*\"are\"\n",
+      "INFO:gensim.models.ldamodel:topic diff=0.363037, rho=0.500000\n",
+      "INFO:LdaModel:Epoch 2: Perplexity estimate: 413.6047717\n",
+      "INFO:LdaModel:Epoch 2: Perplexity estimate: 515.596865513\n",
+      "INFO:LdaModel:Epoch 2: Coherence estimate: -22.4273737487\n",
+      "INFO:LdaModel:Epoch 2: Diff estimate: [ 0.01187035  0.01098017  0.00861298  0.02232568  0.01953778]\n",
+      "INFO:LdaModel:Epoch 2: Convergence estimate: 0.0\n"
+     ]
+    }
+   ],
+   "source": [
+    "import logging\n",
+    "\n",
+    "logging.basicConfig(level=logging.INFO)\n",
+    "logger = logging.getLogger(__name__)\n",
+    "logger.setLevel(logging.DEBUG)\n",
+    "\n",
+    "# define perplexity callback for hold_out and test corpus\n",
+    "pl_holdout = PerplexityMetric(corpus=holdout_corpus, logger=\"shell\")\n",
+    "pl_test = PerplexityMetric(corpus=test_corpus, logger=\"shell\")\n",
+    "\n",
+    "# define other remaining metrics available\n",
+    "ch_umass = CoherenceMetric(corpus=training_corpus, coherence=\"u_mass\", logger=\"shell\")\n",
+    "diff_kl = DiffMetric(distance=\"kullback_leibler\", logger=\"shell\")\n",
+    "convergence_jc = ConvergenceMetric(distance=\"jaccard\", logger=\"shell\")\n",
+    "\n",
+    "callbacks = [pl_holdout, pl_test, ch_umass, diff_kl, convergence_jc]\n",
+    "\n",
+    "# training LDA model\n",
+    "model = ldamodel.LdaModel(corpus=training_corpus, id2word=training_dictionary, passes=3, num_topics=5, callbacks=callbacks)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.4.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
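
A small sketch for sanity-checking the numbers in the training log of the notebook above. It assumes that the `per-word bound` gensim logs is a base-2 log-likelihood bound, so that perplexity is `2 ** (-bound)`; under that assumption the rounded bounds in the log reproduce the logged perplexity estimates to within rounding. It also sums the per-topic `Diff estimate` values of each epoch, which is how the notebook describes Convergence. The logged Diff values use `kullback_leibler`, while the Convergence callback in that cell uses a different distance, so these sums only illustrate the idea and are not expected to match the logged Convergence values. The function and variable names are illustrative and not part of gensim.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log2 likelihood bound into a perplexity (assumed: 2 ** -bound)."""
    return 2.0 ** (-per_word_bound)


# Per-word bounds copied from the gensim.models.ldamodel log lines (passes 0, 1, 2).
logged_bounds = [-9.032, -7.121, -6.842]
perplexities = [perplexity_from_bound(bound) for bound in logged_bounds]

# Per-topic distances copied from the "Diff estimate" log lines, one row per epoch.
diff_by_epoch = [
    [0.44194712, 0.96670853, 0.8036907, 0.72372737, 0.63141336],
    [0.08300085, 0.0312854, 0.06168011, 0.06740617, 0.08645297],
    [0.01187035, 0.01098017, 0.00861298, 0.02232568, 0.01953778],
]


def column_sums(rows_by_epoch):
    """Sum the per-topic distances of each epoch (one heatmap column per epoch)."""
    return [sum(row) for row in rows_by_epoch]


convergence_per_epoch = column_sums(diff_by_epoch)

for epoch, (ppl, conv) in enumerate(zip(perplexities, convergence_per_epoch)):
    print("epoch %d: perplexity ~ %.1f, summed topic diff = %.4f" % (epoch, ppl, conv))
```

Both series shrink from one epoch to the next, which is the behaviour the Perplexity and Convergence sections describe for a model that is still improving.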