[MRG] Lda training visualization in visdom #1399
Changes from 30 commits
bb65439
9d2e78d
33818ec
281222c
c507bbb
6f75ccc
d9db4e2
cd5f822
f4728e0
40cf092
d4f69f5
fde7d4d
3f18076
546908e
651a61a
13dfddc
1376d90
44c8e58
92949a3
5b22e4d
c369fc5
a32960d
48526d9
adf2a60
a272090
d3389bb
96949f7
7d0f0ec
dcc64a1
47434f9
30c9b64
e55af47
df5e01f
b334c50
c54e6bf
5f3d902
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,319 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Setup Visdom\n", | ||
"\n", | ||
"Install it with:\n", | ||
"\n", | ||
"`pip install visdom`\n", | ||
"\n", | ||
"Start the server:\n", | ||
"\n", | ||
"`python -m visdom.server`\n", | ||
"\n", | ||
"Visdom can now be accessed at http://localhost:8097 in the browser.\n", | ||
"\n", | ||
"\n", | ||
"# LDA Training Visualization\n", | ||
"\n", | ||
"To monitor LDA training, pass a list of metric callbacks to the `LdaModel` constructor through its `callbacks` argument; their values are plotted live as training progresses.\n", | ||
"\n", | ||
"Let's plot the training statistics for an LDA model trained on the Lee corpus. We will use the four evaluation metrics available for topic models in gensim: Coherence, Perplexity, Topic diff and Convergence. Perplexity is evaluated on separate hold_out and test corpora." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import os\n", | ||
"import re\n", | ||
"import gensim\n", | ||
"from gensim.models import ldamodel\n", | ||
"from gensim.corpora.dictionary import Dictionary\n", | ||
"\n", | ||
"\n", | ||
"# Set file names for train and test data\n", | ||
"test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')\n", | ||
Can you use large dataset for this? (download from link in notebook) |
||
"lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')\n", | ||
"lee_test_file = os.path.join(test_data_dir, 'lee.cor')\n", | ||
"\n", | ||
"def read_corpus(fname):\n", | ||
" texts = []\n", | ||
" with open(fname, encoding=\"ISO-8859-1\") as f:\n", | ||
Doesn't work for python2 (because |
||
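A minimal sketch of one portable way to read the corpus, assuming the concern in the comment above is that Python 2's built-in `open` does not accept an `encoding` keyword. `io.open` accepts it on both Python 2 and 3. The two-line sample file is made up for illustration and is not part of the Lee corpus.

```python
import io
import os
import re
import tempfile

# Compile the tokenizer once; \w+ keeps runs of word characters and drops punctuation.
TOKEN_RE = re.compile(r'\w+', re.UNICODE)


def read_corpus(fname, encoding="ISO-8859-1"):
    """Return one list of lowercased tokens per line of `fname`."""
    texts = []
    # io.open takes `encoding` on Python 2 and 3 alike.
    with io.open(fname, encoding=encoding) as f:
        for line in f:
            texts.append(TOKEN_RE.findall(line.lower()))
    return texts


# Hypothetical two-line sample, written as raw ISO-8859-1 bytes.
with tempfile.NamedTemporaryFile(suffix=".cor", delete=False) as tmp:
    tmp.write(b"Hello, World!\nCaf\xe9 au lait.\n")
    sample_path = tmp.name

sample_texts = read_corpus(sample_path)
os.remove(sample_path)
print(sample_texts)
```

Compiling the pattern once at module level also avoids recompiling it for every line, which the notebook version does inside the loop.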
" for line in f:\n", | ||
" # lower case all words\n", | ||
" lowered = line.lower()\n", | ||
" # remove punctuation and split into separate words\n", | ||
" words = re.compile('\\w+').findall(lowered)\n", | ||
" texts.append(words)\n", | ||
" return texts\n", | ||
"\n", | ||
"training_texts = read_corpus(lee_train_file)\n", | ||
"eval_texts = read_corpus(lee_test_file)\n", | ||
"\n", | ||
"# Split test data into hold_out and test corpus\n", | ||
"holdout_texts = eval_texts[:25]\n", | ||
"test_texts = eval_texts[25:]\n", | ||
"\n", | ||
"training_dictionary = Dictionary(training_texts)\n", | ||
"holdout_dictionary = Dictionary(holdout_texts)\n", | ||
"test_dictionary = Dictionary(test_texts)\n", | ||
"\n", | ||
"training_corpus = [training_dictionary.doc2bow(text) for text in training_texts]\n", | ||
"holdout_corpus = [holdout_dictionary.doc2bow(text) for text in holdout_texts]\n", | ||
It's 3 different mappings, mistake. You should fit your |
||
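To illustrate why separately fitted mappings are a problem, here is a small sketch. It uses a plain-Python stand-in for a token-to-id mapping, not gensim's `Dictionary`; `fit_vocab`, `to_bow` and the toy texts are hypothetical names and data introduced only for this example.

```python
from collections import Counter


def fit_vocab(texts):
    """Assign ids to tokens in order of first appearance (hypothetical helper)."""
    vocab = {}
    for text in texts:
        for token in text:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(vocab, text):
    """Encode a document as sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[token] for token in text if token in vocab)
    return sorted(counts.items())


training_texts = [["fire", "crews", "fire"], ["rain", "storm"]]
holdout_texts = [["storm", "fire"]]

training_vocab = fit_vocab(training_texts)
holdout_vocab = fit_vocab(holdout_texts)

# Separately fitted mappings disagree on ids for the same word.
print(training_vocab["storm"], holdout_vocab["storm"])

# Encoding held-out text with the training mapping keeps ids consistent with the model.
shared_bow = to_bow(training_vocab, holdout_texts[0])
# Encoding it with its own mapping yields ids that mean other words to the model.
mismatched_bow = to_bow(holdout_vocab, holdout_texts[0])
print(shared_bow, mismatched_bow)
```

In the notebook's terms, this would presumably mean calling `training_dictionary.doc2bow(text)` for the hold-out and test texts as well, so that every corpus shares the id space the model was trained on.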
"test_corpus = [test_dictionary.doc2bow(text) for text in test_texts]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 18, | ||
"metadata": { | ||
"collapsed": true, | ||
"scrolled": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from gensim.models.callbacks import CoherenceMetric, DiffMetric, PerplexityMetric, ConvergenceMetric\n", | ||
"\n", | ||
"# define perplexity callback for hold_out and test corpus\n", | ||
"pl_holdout = PerplexityMetric(corpus=holdout_corpus, logger=\"visdom\", title=\"Perplexity (hold_out)\")\n", | ||
"pl_test = PerplexityMetric(corpus=test_corpus, logger=\"visdom\", title=\"Perplexity (test)\")\n", | ||
"\n", | ||
"# define other remaining metrics available\n", | ||
"ch_umass = CoherenceMetric(corpus=training_corpus, coherence=\"u_mass\", logger=\"visdom\", title=\"Coherence (u_mass)\")\n", | ||
"ch_cv = CoherenceMetric(corpus=training_corpus, texts=training_texts, coherence=\"c_v\", logger=\"visdom\", title=\"Coherence (c_v)\")\n", | ||
"diff_kl = DiffMetric(distance=\"kullback_leibler\", logger=\"visdom\", title=\"Diff (kullback_leibler)\")\n", | ||
"convergence_kl = ConvergenceMetric(distance=\"kullback_leibler\", logger=\"visdom\", title=\"Convergence (kullback_leibler)\")\n", | ||
"\n", | ||
"callbacks = [pl_holdout, pl_test, ch_umass, ch_cv, diff_kl, convergence_kl]\n", | ||
"\n", | ||
"# training LDA model\n", | ||
"model = ldamodel.LdaModel(corpus=training_corpus, id2word=training_dictionary, passes=20, num_topics=5, eval_every=None, callbacks=callbacks)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Once training has started, you can open http://localhost:8097 to watch the training progress." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"-0.259766196856\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# to get a metric value on a trained model\n", | ||
"print(CoherenceMetric(corpus=training_corpus, coherence=\"u_mass\").get_value(model=model))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The four types of graphs which are plotted for LDA:\n", | ||
"\n", | ||
"**Coherence**\n", | ||
"\n", | ||
"Coherence measures are generally based on the idea of computing the sum of pairwise scores of the top *n* words w<sub>1</sub>, ..., w<sub>n</sub> used to describe the topic. There are four coherence measures available in gensim: `u_mass`, `c_v`, `c_uci` and `c_npmi`. A good model will generate coherent topics, i.e., topics with high topic coherence scores. Good topics can be described by a short label based on their top terms.\n", | ||
"\n", | ||
"<img src=\"Coherence.gif\">\n", | ||
"\n", | ||
"This graph, along with the others explained below, can be used to decide when to stop training. If the value stops changing after some epochs, the model has likely reached the highest coherence it can attain.\n", | ||
"\n", | ||
"\n", | ||
"**Perplexity**\n", | ||
"\n", | ||
"Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. In LDA, topics are described by a probability distribution over vocabulary words. So, perplexity can be used to evaluate the topic-term distribution output by LDA.\n", | ||
"\n", | ||
"<img src=\"Perplexity.gif\">\n", | ||
"\n", | ||
"For a good model, perplexity should be low.\n", | ||
"\n", | ||
"\n", | ||
"**Topic Difference**\n", | ||
"\n", | ||
"Topic Diff calculates the distance between two LDA models. This distance is calculated based on the topics, either by using their probability distributions over vocabulary words (kullback_leibler, hellinger) or by simply using the vocabulary words common to the topics from both models.\n", | ||
"\n", | ||
"<img src=\"Diff.gif\">\n", | ||
"\n", | ||
"In the heatmap, the X-axis gives the epoch number, the Y-axis gives the topic, and the cell value gives the distance between identical topics from consecutive epochs. For example, a cell with values (x=3, y=5, z=0.4) represents the distance (0.4) between topic 5 from the 3rd epoch and topic 5 from the 2nd epoch. With increasing epochs, the distance between identical topics should decrease.\n", | ||
" \n", | ||
" \n", | ||
"**Convergence**\n", | ||
"\n", | ||
"Convergence is the sum of the distances between all identical topics from two consecutive epochs. It is the sum of the column values in the heatmap above.\n", | ||
"\n", | ||
"<img src=\"Convergence.gif\">\n", | ||
"\n", | ||
"The model is considered converged when the convergence value stops decreasing with increasing epochs." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Training Logs\n", | ||
"\n", | ||
"Besides visualizing the metric values in Visdom, we can also log them to the shell after every epoch. The only change is to pass `logger=\"shell\"` instead of `\"visdom\"` when creating the callbacks." | ||
Can you add an example on how to get the values out programmatically (no logging or plotting, using them from arbitrary Python code instead)?
It can be done by using a loop to manually iterate over model and call metric classes at the end to store value:
and sure, I'll add an example for this in notebook.
Thanks. Is there a way to use the callbacks in a way that they collect this info? I'm thinking a type of "logger" that instead of logging, appends the value to some internal list. Which other parts of the app can read from. The idea is the interface would be the same as for Is that possible?
Yes we can store them just after they are calculated in this step. Maybe in a dict which could be an attribute of LdaModel? Structure could be:
Used |
||
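A rough sketch of the idea discussed in this thread: a "logger" that appends each value to an internal list instead of printing it. This is a plain-Python illustration of the pattern only and does not use gensim; `MetricCollector`, `record` and `toy_metric` are hypothetical names, and the values are made up.

```python
from collections import defaultdict


class MetricCollector(object):
    """Hypothetical stand-in for a 'logger' that stores values instead of printing them."""

    def __init__(self):
        # Maps a metric title to the list of values seen so far, one per epoch.
        self.metrics = defaultdict(list)

    def record(self, title, value):
        self.metrics[title].append(value)


def toy_metric(epoch):
    """Placeholder metric: a made-up value that shrinks each epoch."""
    return 1.0 / (epoch + 1)


collector = MetricCollector()
for epoch in range(3):
    # In real training, the value would come from a metric evaluated on the model.
    collector.record("Toy metric", toy_metric(epoch))

# Other parts of the application can now read the history directly.
history = collector.metrics["Toy metric"]
print(history)
```

Keying the store by metric title with one entry per epoch gives the same shape as the `model.metrics` output shown at the end of the notebook.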
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"metadata": { | ||
"scrolled": false | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"INFO:gensim.models.ldamodel:using symmetric alpha at 0.2\n", | ||
"INFO:gensim.models.ldamodel:using symmetric eta at 0.00013900472616068947\n", | ||
"INFO:gensim.models.ldamodel:using serial LDA version on this node\n", | ||
"INFO:gensim.models.ldamodel:running online (multi-pass) LDA training, 5 topics, 3 passes over the supplied corpus of 300 documents, updating model once every 300 documents, evaluating perplexity every 0 documents, iterating 50x with a convergence threshold of 0.001000\n", | ||
"WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy\n", | ||
"INFO:gensim.models.ldamodel:PROGRESS: pass 0, at document #300/300\n", | ||
"INFO:gensim.models.ldamodel:topic #0 (0.200): 0.051*\"the\" + 0.020*\"in\" + 0.019*\"to\" + 0.018*\"of\" + 0.015*\"a\" + 0.013*\"and\" + 0.011*\"is\" + 0.011*\"for\" + 0.009*\"he\" + 0.008*\"says\"\n", | ||
"INFO:gensim.models.ldamodel:topic #1 (0.200): 0.048*\"the\" + 0.030*\"to\" + 0.022*\"in\" + 0.021*\"and\" + 0.020*\"a\" + 0.019*\"of\" + 0.010*\"s\" + 0.009*\"for\" + 0.008*\"that\" + 0.008*\"have\"\n", | ||
"INFO:gensim.models.ldamodel:topic #2 (0.200): 0.023*\"the\" + 0.022*\"to\" + 0.018*\"of\" + 0.015*\"in\" + 0.013*\"and\" + 0.013*\"a\" + 0.007*\"is\" + 0.007*\"on\" + 0.006*\"that\" + 0.006*\"says\"\n", | ||
"INFO:gensim.models.ldamodel:topic #3 (0.200): 0.072*\"the\" + 0.028*\"of\" + 0.024*\"to\" + 0.021*\"in\" + 0.021*\"a\" + 0.020*\"and\" + 0.010*\"he\" + 0.009*\"for\" + 0.009*\"is\" + 0.008*\"on\"\n", | ||
"INFO:gensim.models.ldamodel:topic #4 (0.200): 0.066*\"the\" + 0.024*\"to\" + 0.018*\"of\" + 0.017*\"and\" + 0.015*\"in\" + 0.014*\"a\" + 0.010*\"has\" + 0.008*\"it\" + 0.008*\"s\" + 0.008*\"is\"\n", | ||
"INFO:gensim.models.ldamodel:topic diff=2.058826, rho=1.000000\n", | ||
"INFO:gensim.models.ldamodel:Epoch 0: Perplexity (hold_out) estimate: 400318.441165\n", | ||
"INFO:gensim.models.ldamodel:Epoch 0: Perplexity (test) estimate: 1242745.72103\n", | ||
"INFO:gensim.models.ldamodel:Epoch 0: Coherence estimate: -0.254109275924\n", | ||
"INFO:gensim.models.ldamodel:Epoch 0: Diff estimate: [ 0.79628357 0.90363194 0.55469714 1. 0.86688377]\n", | ||
"INFO:gensim.models.ldamodel:Epoch 0: Convergence estimate: 4.12149642523\n", | ||
"INFO:gensim.models.ldamodel:PROGRESS: pass 1, at document #300/300\n", | ||
"INFO:gensim.models.ldamodel:topic #0 (0.200): 0.046*\"the\" + 0.019*\"in\" + 0.017*\"to\" + 0.017*\"of\" + 0.014*\"a\" + 0.012*\"and\" + 0.012*\"is\" + 0.010*\"for\" + 0.008*\"he\" + 0.008*\"says\"\n", | ||
"INFO:gensim.models.ldamodel:topic #1 (0.200): 0.048*\"the\" + 0.030*\"to\" + 0.022*\"in\" + 0.021*\"and\" + 0.019*\"a\" + 0.018*\"of\" + 0.010*\"s\" + 0.009*\"for\" + 0.008*\"have\" + 0.008*\"on\"\n", | ||
"INFO:gensim.models.ldamodel:topic #2 (0.200): 0.016*\"to\" + 0.016*\"the\" + 0.013*\"of\" + 0.010*\"in\" + 0.010*\"and\" + 0.009*\"a\" + 0.005*\"says\" + 0.005*\"that\" + 0.005*\"on\" + 0.005*\"is\"\n", | ||
"INFO:gensim.models.ldamodel:topic #3 (0.200): 0.071*\"the\" + 0.028*\"of\" + 0.025*\"to\" + 0.021*\"in\" + 0.021*\"a\" + 0.019*\"and\" + 0.010*\"he\" + 0.009*\"for\" + 0.008*\"is\" + 0.008*\"s\"\n", | ||
"INFO:gensim.models.ldamodel:topic #4 (0.200): 0.062*\"the\" + 0.025*\"to\" + 0.017*\"of\" + 0.016*\"and\" + 0.014*\"a\" + 0.014*\"in\" + 0.010*\"has\" + 0.009*\"is\" + 0.008*\"it\" + 0.007*\"s\"\n", | ||
"INFO:gensim.models.ldamodel:topic diff=0.567364, rho=0.577350\n", | ||
"INFO:gensim.models.ldamodel:Epoch 1: Perplexity (hold_out) estimate: 231516.22057\n", | ||
"INFO:gensim.models.ldamodel:Epoch 1: Perplexity (test) estimate: 666335.540876\n", | ||
"INFO:gensim.models.ldamodel:Epoch 1: Coherence estimate: -0.248792041182\n", | ||
"INFO:gensim.models.ldamodel:Epoch 1: Diff estimate: [ 0.83029118 0.72960219 1. 0.22719304 0.75709049]\n", | ||
"INFO:gensim.models.ldamodel:Epoch 1: Convergence estimate: 3.54417690778\n", | ||
"INFO:gensim.models.ldamodel:PROGRESS: pass 2, at document #300/300\n", | ||
"INFO:gensim.models.ldamodel:topic #0 (0.200): 0.043*\"the\" + 0.019*\"in\" + 0.018*\"to\" + 0.016*\"of\" + 0.012*\"a\" + 0.012*\"is\" + 0.012*\"and\" + 0.009*\"for\" + 0.008*\"says\" + 0.007*\"he\"\n", | ||
"INFO:gensim.models.ldamodel:topic #1 (0.200): 0.050*\"the\" + 0.029*\"to\" + 0.023*\"in\" + 0.021*\"and\" + 0.019*\"of\" + 0.018*\"a\" + 0.009*\"s\" + 0.009*\"for\" + 0.009*\"on\" + 0.008*\"he\"\n", | ||
"INFO:gensim.models.ldamodel:topic #2 (0.200): 0.012*\"to\" + 0.012*\"the\" + 0.009*\"of\" + 0.008*\"and\" + 0.007*\"in\" + 0.007*\"a\" + 0.004*\"says\" + 0.004*\"have\" + 0.004*\"that\" + 0.003*\"on\"\n", | ||
"INFO:gensim.models.ldamodel:topic #3 (0.200): 0.071*\"the\" + 0.027*\"of\" + 0.025*\"to\" + 0.021*\"in\" + 0.021*\"a\" + 0.019*\"and\" + 0.010*\"he\" + 0.009*\"for\" + 0.008*\"is\" + 0.008*\"s\"\n", | ||
"INFO:gensim.models.ldamodel:topic #4 (0.200): 0.061*\"the\" + 0.025*\"to\" + 0.018*\"of\" + 0.015*\"and\" + 0.015*\"a\" + 0.014*\"in\" + 0.010*\"has\" + 0.009*\"is\" + 0.008*\"says\" + 0.007*\"it\"\n", | ||
"INFO:gensim.models.ldamodel:topic diff=0.393123, rho=0.500000\n", | ||
"INFO:gensim.models.ldamodel:Epoch 2: Perplexity (hold_out) estimate: 185972.72653\n", | ||
"INFO:gensim.models.ldamodel:Epoch 2: Perplexity (test) estimate: 516819.885154\n", | ||
"INFO:gensim.models.ldamodel:Epoch 2: Coherence estimate: -0.257564279899\n", | ||
"INFO:gensim.models.ldamodel:Epoch 2: Diff estimate: [ 0.82668066 0.50774819 1. 0.19109239 0.50630086]\n", | ||
"INFO:gensim.models.ldamodel:Epoch 2: Convergence estimate: 3.03182210104\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import logging\n", | ||
"from gensim.models.callbacks import CoherenceMetric, DiffMetric, PerplexityMetric, ConvergenceMetric\n", | ||
"\n", | ||
"logging.basicConfig(level=logging.INFO)\n", | ||
"logger = logging.getLogger(__name__)\n", | ||
"logger.setLevel(logging.DEBUG)\n", | ||
"\n", | ||
"# define perplexity callback for hold_out and test corpus\n", | ||
"pl_holdout = PerplexityMetric(corpus=holdout_corpus, logger=\"shell\", title=\"Perplexity (hold_out)\")\n", | ||
"pl_test = PerplexityMetric(corpus=test_corpus, logger=\"shell\", title=\"Perplexity (test)\")\n", | ||
"\n", | ||
"# define other remaining metrics available\n", | ||
"ch_umass = CoherenceMetric(corpus=training_corpus, coherence=\"u_mass\", logger=\"shell\")\n", | ||
"diff_kl = DiffMetric(distance=\"kullback_leibler\", logger=\"shell\")\n", | ||
"convergence_kl = ConvergenceMetric(distance=\"kullback_leibler\", logger=\"shell\")\n", | ||
"\n", | ||
"callbacks = [pl_holdout, pl_test, ch_umass, diff_kl, convergence_kl]\n", | ||
"\n", | ||
"# training LDA model\n", | ||
"model = ldamodel.LdaModel(corpus=training_corpus, id2word=training_dictionary, passes=3, num_topics=5, eval_every=None, callbacks=callbacks)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The metric values are also stored on the model instance, in `model.metrics`, for custom use." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"defaultdict(list,\n", | ||
" {'Coherence': [-0.25410927592387839,\n", | ||
" -0.24879204118159887,\n", | ||
" -0.25756427989868341],\n", | ||
" 'Convergence': [4.1214964252266926,\n", | ||
" 3.5441769077766914,\n", | ||
" 3.031822101038804],\n", | ||
" 'Diff': [array([ 0.79628357, 0.90363194, 0.55469714, 1. , 0.86688377]),\n", | ||
" array([ 0.83029118, 0.72960219, 1. , 0.22719304, 0.75709049]),\n", | ||
" array([ 0.82668066, 0.50774819, 1. , 0.19109239, 0.50630086])],\n", | ||
" 'Perplexity (hold_out)': [400318.44116470998,\n", | ||
" 231516.22056950352,\n", | ||
" 185972.72652968348],\n", | ||
" 'Perplexity (test)': [1242745.7210251174,\n", | ||
" 666335.54087631544,\n", | ||
" 516819.88515415508]})" | ||
] | ||
}, | ||
"execution_count": 10, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"model.metrics" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.4.3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
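As a small worked example of using these values from ordinary Python code, the sketch below copies the `Diff` and `Convergence` numbers from the `model.metrics` output above, checks that each convergence value is the sum of that epoch's per-topic distances, and applies a simple plateau test. `has_plateaued`, its `tol` and `patience` parameters, and their defaults are assumptions made for illustration and are not part of gensim.

```python
# Values copied from the `model.metrics` output shown in the notebook.
diff_per_epoch = [
    [0.79628357, 0.90363194, 0.55469714, 1.0, 0.86688377],
    [0.83029118, 0.72960219, 1.0, 0.22719304, 0.75709049],
    [0.82668066, 0.50774819, 1.0, 0.19109239, 0.50630086],
]
convergence_per_epoch = [4.1214964252266926, 3.5441769077766914, 3.031822101038804]

# Convergence is the per-epoch sum of the per-topic distances.
summed = [sum(row) for row in diff_per_epoch]
matches = all(abs(s - c) < 1e-6 for s, c in zip(summed, convergence_per_epoch))


def has_plateaued(values, tol=0.01, patience=2):
    """True when the last `patience` epoch-to-epoch changes are all below `tol`."""
    if len(values) < patience + 1:
        return False
    recent = values[-(patience + 1):]
    return all(abs(b - a) < tol for a, b in zip(recent, recent[1:]))


print(matches)
print(has_plateaued(convergence_per_epoch))
```

With only three passes the convergence value is still dropping by roughly 0.5 per epoch, which is consistent with the "too few updates" warning in the training log.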
Please add an example with logger="shell" in notebook (and show logging output in notebook)