From 5c00027d1203d835dbda86117fa3ff765fd8fdf4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=C3=93lavur=20Mortensen?= Date: Fri, 8 Jul 2016 15:25:27 +0200 Subject: [PATCH 1/7] Added LDA tutorial. --- docs/notebooks/LDA_tutorial.ipynb | 732 ++++++++++++++++++++++++++++++ 1 file changed, 732 insertions(+) create mode 100644 docs/notebooks/LDA_tutorial.ipynb diff --git a/docs/notebooks/LDA_tutorial.ipynb b/docs/notebooks/LDA_tutorial.ipynb new file mode 100644 index 0000000000..6edafda626 --- /dev/null +++ b/docs/notebooks/LDA_tutorial.ipynb @@ -0,0 +1,732 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**TODO:**\n",
+ "\n",
+ "* Make sure there are no \"TODO\"s in text before I submit it.\n",
+ "* What are the \"prerequisites\" for the tutorial?\n",
+ "* I found my way of choosing number of iterations and passes to be effective, but is the method sound enough to include in this tutorial?\n",
+ "* Perhaps list some resources that people can get started on, for example tutorials on LDA in Gensim.\n",
+ "* iPython notebook posts in the RaRe blog are a bit problematic. It's probably better to have an introduction in the blog and link to a page with the notebook.\n",
+ "* The problem with using pyLDAvis in the tutorial is that it completely messes up the scale of the notebook."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# LDA tutorial\n",
+ "\n",
+ "LDA is a probabilistic hierarchical Bayesian model that is a mixture model as well as a mixed membership model... but we won't be getting into any of that.\n",
+ "\n",
+ "The purpose of this tutorial is to show you some tips and tricks that I have discovered when applying the LDA model to text data. This tutorial will **not** explain the LDA model, how inference is made in the LDA model, and it will not necessarily teach you how to use Gensim's implementation. There are plenty of resources for all of those things, but what is somewhat lacking is a hands-on tutorial that helps you train an LDA model with good results... so here is my contribution towards that.\n",
+ "\n",
+ "I have used a corpus of NIPS papers in this tutorial, but if you're following this tutorial just to learn about LDA I encourage you to consider picking a corpus on a subject that you are familiar with. Qualitatively evaluating the output of an LDA model is challenging and can require you to understand the subject matter of your corpus (depending on your goal with the model).\n",
+ "\n",
+ "I would also encourage you to consider each step when applying the model to your data, instead of just blindly applying my solution. The different steps will depend on your data and possibly your goal with the model.\n",
+ "\n",
+ "In the following sections, we will go through pre-processing the data and training the model.\n",
+ "\n",
+ "> **Note:**\n",
+ ">\n",
+ "> This tutorial uses the scikit-learn and nltk libraries, although you can replace them with others if you want. Python 3 is used, although Python 2.7 can be used as well.\n",
+ "\n",
+ "In this tutorial we will:\n",
+ "\n",
+ "* Load data.\n",
+ "* Pre-process data.\n",
+ "* Transform documents to a vectorized form.\n",
+ "* Train an LDA model.\n",
+ "\n",
+ "## Data\n",
+ "\n",
+ "We will be using some papers from the NIPS (Neural Information Processing Systems) conference. 
NIPS is a machine learning conference so the subject matter should be well suited for most of the target audience of this tutorial.\n",
+ "\n",
+ "You can download the data from [Sam Roweis' website](http://www.cs.nyu.edu/~roweis/data.html).\n",
+ "\n",
+ "Note that the corpus contains 1740 documents, and not particularly long ones. So keep in mind that this tutorial is not geared towards efficiency, and be careful before applying the code to a large dataset.\n",
+ "\n",
+ "Below we are simply reading the data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# Read data.\n",
+ "\n",
+ "import os\n",
+ "\n",
+ "# Folder containing all NIPS papers.\n",
+ "data_dir = 'nipstxt/'\n",
+ "\n",
+ "# Folders containing individual NIPS papers.\n",
+ "yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']\n",
+ "dirs = ['nips' + yr for yr in yrs]\n",
+ "\n",
+ "# Read all texts into a list.\n",
+ "docs = []\n",
+ "for yr_dir in dirs:\n",
+ " files = os.listdir(data_dir + yr_dir)\n",
+ " for filen in files:\n",
+ " # Note: ignoring characters that cause encoding errors.\n",
+ " with open(data_dir + yr_dir + '/' + filen, errors='ignore') as fid:\n",
+ " txt = fid.read()\n",
+ " docs.append(txt)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Pre-process and vectorize the documents\n",
+ "\n",
+ "Among other things, we will:\n",
+ "\n",
+ "* Split the documents into tokens.\n",
+ "* Lemmatize the tokens.\n",
+ "* Compute bigrams.\n",
+ "* Compute a bag-of-words representation of the data.\n",
+ "\n",
+ "First we tokenize the text using a regular expression tokenizer from NLTK. We remove numeric tokens and tokens that are only a single character, as they don't tend to be useful, and the dataset contains a lot of them."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "collapsed": true
+ },
+ "outputs": [],
+ "source": [
+ "# Tokenize the documents.\n",
+ "\n",
+ "from nltk.tokenize import RegexpTokenizer\n",
+ "\n",
+ "# Split the documents into tokens.\n",
+ "tokenizer = RegexpTokenizer(r'\\w+')\n",
+ "for idx in range(len(docs)):\n",
+ " docs[idx] = docs[idx].lower() # Convert to lowercase.\n",
+ " docs[idx] = tokenizer.tokenize(docs[idx]) # Split into words.\n",
+ "\n",
+ "# Remove numbers, but not words that contain numbers.\n",
+ "docs = [[token for token in doc if not token.isnumeric()] for doc in docs]\n",
+ "\n",
+ "# Remove words that are only one character.\n",
+ "docs = [[token for token in doc if len(token) > 1] for doc in docs]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We use the WordNet lemmatizer from NLTK. A lemmatizer is preferred over a stemmer in this case because it produces more readable words. Output that is easy to read is very desirable in topic modelling."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "collapsed": true
+ },
+ "outputs": [],
+ "source": [
+ "# Lemmatize the documents.\n",
+ "\n",
+ "from nltk.stem.wordnet import WordNetLemmatizer\n",
+ "\n",
+ "# Lemmatize all words in documents.\n",
+ "lemmatizer = WordNetLemmatizer()\n",
+ "docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We find bigrams in the documents. Bigrams are pairs of adjacent words. 
Using bigrams we can get phrases like \"machine_learning\" in our output (spaces are replaced with underscores); without bigrams we would only get \"machine\" and \"learning\".\n",
+ "\n",
+ "Note that in the code below, we find bigrams and then add them to the original data, because we would like to keep the words \"machine\" and \"learning\" as well as the bigram \"machine_learning\".\n",
+ "\n",
+ "Note that computing n-grams of a large dataset can be very computationally and memory intensive."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# Compute bigrams.\n",
+ "\n",
+ "from gensim.models import Phrases\n",
+ "\n",
+ "# Add bigrams and trigrams to docs (only ones that appear 20 times or more).\n",
+ "bigram = Phrases(docs, min_count=20)\n",
+ "for idx in range(len(docs)):\n",
+ " for token in bigram[docs[idx]]:\n",
+ " if '_' in token:\n",
+ " # Token is a bigram, add to document.\n",
+ " docs[idx].append(token)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We remove rare words and common words based on their *document frequency*. Below we remove words that appear in less than 20 documents or in more than 50% of the documents. Consider trying to remove words only based on their frequency, or maybe combining that with this approach."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "collapsed": true
+ },
+ "outputs": [],
+ "source": [
+ "# Remove rare and common tokens.\n",
+ "\n",
+ "from gensim.corpora import Dictionary\n",
+ "\n",
+ "# Filter out words that occur in less than 20 documents, or in more than 50% of the documents.\n",
+ "dictionary = Dictionary(docs)\n",
+ "\n",
+ "dictionary.filter_extremes(no_below=20, no_above=0.5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Finally, we transform the documents to a vectorized form. We simply compute the frequency of each word, including the bigrams."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "collapsed": true
+ },
+ "outputs": [],
+ "source": [
+ "# Vectorize data.\n",
+ "\n",
+ "# Bag-of-words representation of the documents.\n",
+ "corpus = [dictionary.doc2bow(doc) for doc in docs]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's see how many tokens and documents we have to train on."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Number of unique tokens: 8640\n",
+ "Number of documents: 1740\n"
+ ]
+ }
+ ],
+ "source": [
+ "print('Number of unique tokens: %d' % len(dictionary))\n",
+ "print('Number of documents: %d' % len(corpus))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Training\n",
+ "\n",
+ "We are ready to train the LDA model. We will first discuss how to set some of the training parameters. Some of the parameters are difficult to explain if someone isn't familiar with the model and its training algorithm, but we'll give it a go. **TODO:** re-write the last sentence.\n",
+ "\n",
+ "First of all, the elephant in the room: how many topics do I need? There is really no easy answer for this; it will depend on both your data and your application. 
I have used 10 topics here because I wanted to have a few topics that I could interpret and \"label\", and because that turned out to give me reasonably good results. You might not need to interpret all your topics, so you could use a large number of topics, for example 100.\n",
+ "\n",
+ "The `chunksize` controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fits into memory. I've set `chunksize = 2000`, which is more than the number of documents, so I process all the data in one go. Chunksize can however influence the quality of the model, as discussed in Hoffman and co-authors [2], but the difference was not substantial in this case.\n",
+ "\n",
+ "`passes` controls how many times we train the model on the entire corpus. Another word for passes might be \"epochs\". `iterations` is somewhat technical, but essentially it controls how many times we repeat a particular loop over each document. It is important to set the number of \"passes\" and \"iterations\" high enough.\n",
+ "\n",
+ "I suggest the following way to choose iterations and passes. First, enable logging (as described in many Gensim tutorials), and set `eval_every = 1`. When training the model look for a line in the log that looks something like this:\n",
+ "\n",
+ "    2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations\n",
+ "\n",
+ "If you set `passes = 20` you will see this line 20 times. Make sure that by the final passes, most of the documents have converged. So you want to choose both passes and iterations to be high enough for this to happen.\n",
+ "\n",
+ "We set `alpha = 'auto'` and `eta = 'auto'`. Again this is somewhat technical, but essentially we are automatically learning two paramters in the model that we usually would have to specify explicitly."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "CPU times: user 1min 43s, sys: 12.1 s, total: 1min 55s\n",
+ "Wall time: 1min 41s\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Train LDA model.\n",
+ "\n",
+ "from gensim.models import LdaModel\n",
+ "\n",
+ "# Set training parameters.\n",
+ "num_topics = 10\n",
+ "chunksize = 2000\n",
+ "passes = 20\n",
+ "iterations = 400\n",
+ "eval_every = None # Don't evaluate model perplexity, takes too much time.\n",
+ "\n",
+ "# Make an index-to-word dictionary.\n",
+ "temp = dictionary[0] # This is only to \"load\" the dictionary.\n",
+ "id2word = dictionary.id2token\n",
+ "\n",
+ "%time model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \\\n",
+ " alpha='auto', eta='auto', \\\n",
+ " iterations=iterations, num_topics=num_topics, \\\n",
+ " passes=passes, eval_every=eval_every)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now what we're going to do is train several models using the approach described above, and choose the model with the lowest *perplexity*. In LDA we use perplexity as a way to measure the fit of the model.\n",
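+ "\n",
+ "As a side note: the convergence check described earlier assumes that logging is actually enabled. The snippet below is a minimal sketch for turning it on with Python's standard `logging` module; the format string simply mirrors the log line shown earlier, and the DEBUG level matters because the convergence message is logged at DEBUG:\n",
+ "\n",
+ "    import logging\n",
+ "    logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', level=logging.DEBUG)"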
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Elapsed time: 123.4005 seconds.\n",
+ "Perplexity: 298.1638\n",
+ "\n",
+ "Elapsed time: 244.3692 seconds.\n",
+ "Perplexity: 294.8107\n",
+ "\n",
+ "Elapsed time: 366.5183 seconds.\n",
+ "Perplexity: 296.9103\n",
+ "\n",
+ "Elapsed time: 485.7781 seconds.\n",
+ "Perplexity: 295.3224\n",
+ "\n",
+ "Elapsed time: 605.9293 seconds.\n",
+ "Perplexity: 293.9899\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Train some models to compare.\n",
+ "\n",
+ "from time import time\n",
+ "from numpy import exp2\n",
+ "\n",
+ "# Set training parameters.\n",
+ "num_topics = 10\n",
+ "chunksize = 2000\n",
+ "passes = 20\n",
+ "iterations = 400\n",
+ "eval_every = None # Don't evaluate model perplexity unless we need it, it takes too much time.\n",
+ "\n",
+ "models = [] # List of model objects and their corresponding perplexity.\n",
+ "start = time()\n",
+ "for _ in range(5):\n",
+ " model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \\\n",
+ " alpha='auto', eta='auto', \\\n",
+ " iterations=iterations, num_topics=num_topics, \\\n",
+ " passes=passes, eval_every=eval_every)\n",
+ " \n",
+ " # Compute perplexity of model.\n",
+ " temp = model.log_perplexity(corpus)\n",
+ " perplexity = exp2(-temp)\n",
+ " models.append((model, perplexity))\n",
+ " print('Elapsed time: %.4f seconds.' % (time() - start))\n",
+ " print('Perplexity: %.4f\\n' % (perplexity))\n",
+ " \n",
+ "# Pickle the models so we don't have to do this again.\n",
+ "import pickle\n",
+ "pickle.dump(models, open('5_models.pickle', 'wb'))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We select the model with the lowest perplexity."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Perplexity: 293.9899\n"
+ ]
+ }
+ ],
+ "source": [
+ "model, perplexity = min(models, key=lambda m: m[1])\n",
+ "print('Perplexity: %.4f' % (perplexity))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can compute the topic coherence of each topic. Below we display the average topic coherence and print the topics in order of topic coherence.\n",
+ "\n",
+ "Note that we use the \"Umass\" topic coherence measure here (see [docs](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.top_topics)). Gensim has recently obtained an implementation of the \"AKSW\" topic coherence measure (see accompanying [blog post](http://rare-technologies.com/what-is-topic-coherence/)).\n",
+ "\n",
+ "If you are familiar with the subject of these articles, you can see that the topics below make a lot of sense. However, they are not without flaws. We can see that there is substantial overlap between some topics, others are hard to interpret, and most of them have at least some terms that seem out of place. If you were able to do better, feel free to share your methods in the comments below!\n",
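+ "\n",
+ "If you want to try the newer coherence measure as well, the following is a minimal sketch using Gensim's `CoherenceModel` (this assumes a recent Gensim version that ships `CoherenceModel`, and reuses `docs` and `dictionary` from the pre-processing steps above; 'c_v' can be slow on large corpora):\n",
+ "\n",
+ "    from gensim.models import CoherenceModel\n",
+ "    cm = CoherenceModel(model=model, texts=docs, dictionary=dictionary, coherence='c_v')\n",
+ "    print(cm.get_coherence())"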
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Average topic coherence: -184.6016.\n", + "[([(0.021633213181384808, 'neuron'),\n", + " (0.010217691903217984, 'cell'),\n", + " (0.0082727364788879199, 'spike'),\n", + " (0.0075648781909228528, 'synaptic'),\n", + " (0.0074708470941849377, 'activity'),\n", + " (0.0069233960886390866, 'firing'),\n", + " (0.0065844402024269394, 'response'),\n", + " (0.0054484705201482764, 'stimulus'),\n", + " (0.0049905216429390244, 'potential'),\n", + " (0.0046648312033754366, 'dynamic'),\n", + " (0.0043872157636236165, 'phase'),\n", + " (0.0043690559219673481, 'connection'),\n", + " (0.0042000448457857166, 'fig'),\n", + " (0.0038936028866239677, 'frequency'),\n", + " (0.0038484030708212068, 'signal'),\n", + " (0.0035619675167517899, 'memory'),\n", + " (0.0035612510979155859, 'simulation'),\n", + " (0.0034699083899677372, 'delay'),\n", + " (0.003222778274768627, 'synapsis'),\n", + " (0.0030867531478369942, 'cortex')],\n", + " -157.65728612754438),\n", + " ([(0.01394508619775889, 'cell'),\n", + " (0.012233559744905821, 'visual'),\n", + " (0.0093611217350869618, 'field'),\n", + " (0.0078593492127281266, 'motion'),\n", + " (0.0075170084698202222, 'direction'),\n", + " (0.007295284912940363, 'response'),\n", + " (0.0071766768409758236, 'stimulus'),\n", + " (0.0071202187344461439, 'map'),\n", + " (0.0063611953263107597, 'orientation'),\n", + " (0.006097607837982996, 'eye'),\n", + " (0.005634328008455554, 'spatial'),\n", + " (0.0052513127782837336, 'neuron'),\n", + " (0.005147725114764439, 'receptive'),\n", + " (0.0050671662023816857, 'image'),\n", + " (0.004940498235678578, 'layer'),\n", + " (0.0047347570766813132, 'receptive_field'),\n", + " (0.0043985045727147933, 'activity'),\n", + " (0.0042108304311004987, 'cortex'),\n", + " (0.0040573413302537366, 'position'),\n", + " (0.0040119330436480831, 'region')],\n", + " -158.59627439539256),\n", + " ([(0.0080219826700949921, 'gaussian'),\n", + " (0.0068533088063575101, 'matrix'),\n", + " (0.006278805725408826, 'density'),\n", + " (0.0061736201860056669, 'mixture'),\n", + " (0.0057961560074873035, 'component'),\n", + " (0.0053104046774581212, 'likelihood'),\n", + " (0.0051456866880601679, 'estimate'),\n", + " (0.0050156142440622347, 'noise'),\n", + " (0.0047317036751591823, 'prior'),\n", + " (0.0043137703063880137, 'approximation'),\n", + " (0.0042321175243358366, 'variance'),\n", + " (0.0041232652708586724, 'bayesian'),\n", + " (0.0039598100973216067, 'em'),\n", + " (0.0038927209273342533, 'gradient'),\n", + " (0.0037779922470037221, 'log'),\n", + " (0.0037386957290049976, 'sample'),\n", + " (0.0035691130791367068, 'posterior'),\n", + " (0.003474281862935657, 'estimation'),\n", + " (0.0033300074873846915, 'regression'),\n", + " (0.0031726595216486336, 'basis')],\n", + " -166.34510865636656),\n", + " ([(0.010886162523755884, 'action'),\n", + " (0.008243305184760007, 'policy'),\n", + " (0.0062984874900367032, 'reinforcement'),\n", + " (0.0053458086160853369, 'optimal'),\n", + " (0.0044610446421877812, 'reinforcement_learning'),\n", + " (0.004307336922983125, 'memory'),\n", + " (0.0039526125909715238, 'machine'),\n", + " (0.0038936655126178108, 'control'),\n", + " (0.0037856559338613582, 'reward'),\n", + " (0.0036236160029445639, 'solution'),\n", + " (0.0036021980234376009, 'environment'),\n", + " (0.0035627734894868954, 'dynamic'),\n", + " (0.0031933344236852192, 'path'),\n", + " 
(0.0031378804551700011, 'graph'),\n", + " (0.003124482676950425, 'goal'),\n", + " (0.0028740266470112042, 'decision'),\n", + " (0.002859261795361839, 'iteration'),\n", + " (0.0027760542163197031, 'robot'),\n", + " (0.0027717460320510244, 'update'),\n", + " (0.0027236741499300676, 'stochastic')],\n", + " -177.26274652276632),\n", + " ([(0.0070532716721267846, 'bound'),\n", + " (0.005913824828369765, 'class'),\n", + " (0.0057276330400420862, 'let'),\n", + " (0.0054929905300656742, 'generalization'),\n", + " (0.0048778662731984385, 'theorem'),\n", + " (0.0037837387505577991, 'xi'),\n", + " (0.0037278683204615536, 'optimal'),\n", + " (0.0034766035774340242, 'threshold'),\n", + " (0.0033970822869913539, 'sample'),\n", + " (0.0032299852615058984, 'approximation'),\n", + " (0.0030406102884501475, 'dimension'),\n", + " (0.0029531703951075142, 'complexity'),\n", + " (0.0029359569608344029, 'machine'),\n", + " (0.0028217302052756486, 'loss'),\n", + " (0.0028088207050856388, 'node'),\n", + " (0.0028070556200296007, 'solution'),\n", + " (0.0027125994302985503, 'proof'),\n", + " (0.0026645852618673704, 'layer'),\n", + " (0.0026575860140346259, 'net'),\n", + " (0.0025041032721266473, 'polynomial')],\n", + " -178.98056091204177),\n", + " ([(0.013415451888852211, 'hidden'),\n", + " (0.0082374552595971748, 'hidden_unit'),\n", + " (0.0072956299376333855, 'node'),\n", + " (0.0067506029444491722, 'layer'),\n", + " (0.0067165005098781408, 'rule'),\n", + " (0.0066727349030650911, 'net'),\n", + " (0.0060021700820120441, 'tree'),\n", + " (0.0037908846621001113, 'trained'),\n", + " (0.0036697021591207916, 'sequence'),\n", + " (0.0034640024736643294, 'back'),\n", + " (0.0034239710201901612, 'table'),\n", + " (0.0033974409035990392, 'propagation'),\n", + " (0.003386765336526936, 'activation'),\n", + " (0.0030185415449297129, 'architecture'),\n", + " (0.0027794277568320121, 'learn'),\n", + " (0.0026850473390497742, 'prediction'),\n", + " (0.0026390573093651717, 'string'),\n", + " (0.0026346821217209816, 'training_set'),\n", + " (0.0025656814659620387, 'back_propagation'),\n", + " (0.0025116033411319671, 'language')],\n", + " -188.53277054717449),\n", + " ([(0.014714312788730109, 'control'),\n", + " (0.0099573280350719901, 'dynamic'),\n", + " (0.0086071341861654986, 'trajectory'),\n", + " (0.0066266453092346201, 'recurrent'),\n", + " (0.00626898358432157, 'controller'),\n", + " (0.0062674586012192932, 'sequence'),\n", + " (0.0059116541933388941, 'signal'),\n", + " (0.0057128873529593647, 'forward'),\n", + " (0.0047394595348510668, 'architecture'),\n", + " (0.0043434943047603583, 'nonlinear'),\n", + " (0.0042890468949420626, 'prediction'),\n", + " (0.0041938066166678015, 'series'),\n", + " (0.003416557849967361, 'attractor'),\n", + " (0.0032604652620072745, 'inverse'),\n", + " (0.0030363915114299377, 'trained'),\n", + " (0.0029902505602050241, 'adaptive'),\n", + " (0.0029839179665883029, 'position'),\n", + " (0.0029629507262957842, 'hidden'),\n", + " (0.0029130200205991445, 'desired'),\n", + " (0.0029018644073891733, 'feedback')],\n", + " -195.03034902242473),\n", + " ([(0.023135483496670099, 'image'),\n", + " (0.0098292320728244516, 'object'),\n", + " (0.0095091650250000437, 'recognition'),\n", + " (0.0062562002518291183, 'distance'),\n", + " (0.0053139256260932065, 'class'),\n", + " (0.0051142997138528537, 'face'),\n", + " (0.0051137601518977914, 'character'),\n", + " (0.0049003905865547294, 'classification'),\n", + " (0.0048368948400557233, 'pixel'),\n", + " (0.0045640208124919906, 'classifier'),\n", + " 
(0.0035795209469952948, 'view'),\n", + " (0.003322795802388204, 'digit'),\n", + " (0.0030209029835183087, 'transformation'),\n", + " (0.0028904227664521684, 'layer'),\n", + " (0.0028535003493623348, 'map'),\n", + " (0.0027609064717726449, 'human'),\n", + " (0.0027537824535155794, 'hand'),\n", + " (0.0026645898310183875, 'scale'),\n", + " (0.0026553222554916082, 'word'),\n", + " (0.0026489429534993759, 'nearest')],\n", + " -199.50365509313696),\n", + " ([(0.014619762720062554, 'circuit'),\n", + " (0.013163620299223806, 'neuron'),\n", + " (0.011173418735156853, 'chip'),\n", + " (0.010838272171877458, 'analog'),\n", + " (0.0096353380726272708, 'signal'),\n", + " (0.0080391211812232306, 'voltage'),\n", + " (0.0055581577899570835, 'channel'),\n", + " (0.0054775778900036506, 'noise'),\n", + " (0.0052134436268982988, 'vlsi'),\n", + " (0.0048225583229723852, 'bit'),\n", + " (0.0044179271415801871, 'implementation'),\n", + " (0.0042542648528188362, 'cell'),\n", + " (0.0039531536498857694, 'frequency'),\n", + " (0.0038684482611307152, 'pulse'),\n", + " (0.0036454420814875429, 'synapse'),\n", + " (0.0036188976174382961, 'threshold'),\n", + " (0.0032522992281379531, 'gate'),\n", + " (0.0032069531663885243, 'fig'),\n", + " (0.0031674367038267578, 'filter'),\n", + " (0.0031642918140801749, 'digital')],\n", + " -203.62039767672215),\n", + " ([(0.016495397244929412, 'speech'),\n", + " (0.014224555986108658, 'word'),\n", + " (0.014198504253159445, 'recognition'),\n", + " (0.0088937679870855438, 'layer'),\n", + " (0.0072520103636913641, 'classifier'),\n", + " (0.0067167934159034458, 'speaker'),\n", + " (0.0063553921838090102, 'context'),\n", + " (0.0062578159284403748, 'class'),\n", + " (0.0059853201329598928, 'hidden'),\n", + " (0.0056887097559640606, 'hmm'),\n", + " (0.0055328441630430568, 'classification'),\n", + " (0.0049471802300725225, 'trained'),\n", + " (0.0047306434471966301, 'frame'),\n", + " (0.0044173332526889391, 'phoneme'),\n", + " (0.0043987903208920357, 'mlp'),\n", + " (0.0041175501101671941, 'net'),\n", + " (0.0038161702513077969, 'acoustic'),\n", + " (0.0037067168884100422, 'speech_recognition'),\n", + " (0.0035800864004930716, 'rbf'),\n", + " (0.0035026237420255455, 'architecture')],\n", + " -220.48682168542942)]\n" + ] + } + ], + "source": [ + "top_topics = model.top_topics(corpus, num_words=20)\n", + "\n", + "# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.\n", + "avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics\n", + "print('Average topic coherence: %.4f.' % avg_topic_coherence)\n", + "\n", + "from pprint import pprint\n", + "pprint(top_topics)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[pyLDAvis](https://pyldavis.readthedocs.io/en/latest/index.html) can be fun and useful. Include the code below in your notebook to visualize your topics with pyLDAvis.\n", + "\n", + " import pyLDAvis\n", + " import pyLDAvis.gensim\n", + " pyLDAvis.enable_notebook()\n", + " prepared_data = pyLDAvis.gensim.prepare(model, corpus, dictionary)\n", + " pyLDAvis.display(prepared_data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Things to experiment with\n", + "\n", + "* `no_above` and `no_below` parameters in `filter_extremes` method.\n", + "* Adding trigrams or even higher order n-grams.\n", + "* Consider whether using a hold-out set or cross-validation is the way to go for you.\n", + "* Try other datasets.\n", + "\n", + "If you have other ideas, feel free to comment." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Where to go from here\n", + "\n", + "* Check out a RaRe [blog post](http://rare-technologies.com/what-is-topic-coherence/) on the AKSW topic coherence measure.\n", + "* [pyLDAvis](https://pyldavis.readthedocs.io/en/latest/index.html).\n", + "* Read some more [Gensim tutorials](https://github.com/RaRe-Technologies/gensim/blob/develop/tutorials.md#tutorials).\n", + "* If you haven't already, read [1] and [2] (see references)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "\n", + "1. \"Latent Dirichlet Allocation\", Blei et al. 2003.\n", + "2. \"Online Learning for Latent Dirichlet Allocation\", Hoffman et al. 2010." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.4.3+" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} From 37b095f8a6069bb88a5f1018d9754d1930b9db37 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=C3=93lavur=20Mortensen?= Date: Thu, 18 Aug 2016 13:22:36 +0100 Subject: [PATCH 2/7] Updated tutorial according to some comments. Added to tutorials.md --- docs/notebooks/LDA_tutorial.ipynb | 45 ++++++++++++++++--------------- tutorials.md | 1 + 2 files changed, 25 insertions(+), 21 deletions(-) diff --git a/docs/notebooks/LDA_tutorial.ipynb b/docs/notebooks/LDA_tutorial.ipynb index 6edafda626..6dcc385b4b 100644 --- a/docs/notebooks/LDA_tutorial.ipynb +++ b/docs/notebooks/LDA_tutorial.ipynb @@ -6,19 +6,14 @@ "source": [ "**TODO:**\n", "\n", - "* Make sure there are no \"TODO\"s in text before I submit it.\n", - "* What are the \"prerequisites\" for the tutorial?\n", - "* I found my way of choosing number of iterations and passes to be effective, but is the method sound enough to include in this tutorial?\n", - "* Perhaps list some resources that people can get started on, for example tutorials on LDA in Gensim.\n", - "* iPython notebook posts in the RaRe blog are a bit problematic. It's probably better to have an introduction in the blog and link to a page with the notebook.\n", - "* The problem with using pyLDAvis in the tutorial is that it completely messes up the scale of the notebook." + "* Make sure there are no \"TODO\"s in text before I submit it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# LDA tutorial\n", + "# LDA: pre-processing and training tips\n", "\n", "LDA is a probabilistic hierarchical Bayesian model that is a mixture model as well as a mixed membership model... but we won't be getting into any of that.\n", "\n", @@ -32,7 +27,7 @@ "\n", "> **Note:**\n", ">\n", - "> This tutorial uses the scikit-learn and nltk libraries, although you can replace them with others if you want. Python 3 is used, although Python 2.7 can be used as well.\n", + "> This tutorial uses the nltk library, although you can replace it with something else if you want. Python 3 is used, although Python 2.7 can be used as well.\n", "\n", "In this tutorial we will:\n", "\n", @@ -41,11 +36,17 @@ "* Transform documents to a vectorized form.\n", "* Train an LDA model.\n", "\n", + "If you are not familiar with the LDA model or how to use it in Gensim, I suggest you read up on that before continuing with this tutorial. 
A basic understanding of the LDA model should suffice. For example:\n",
 "\n",
 "* Gentle introduction to the LDA model: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/\n",
 "* Gensim's LDA API documentation: https://radimrehurek.com/gensim/models/ldamodel.html\n",
 "* Topic modelling in Gensim: http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html\n",
 "\n",
 "## Data\n",
 "\n",
 "We will be using some papers from the NIPS (Neural Information Processing Systems) conference. NIPS is a machine learning conference so the subject matter should be well suited for most of the target audience of this tutorial.\n",
 "\n",
-    "You can download the data from [Sam Roweis' website](http://www.cs.nyu.edu/~roweis/data.html).\n",
+    "You can download the data from Sam Roweis' website (http://www.cs.nyu.edu/~roweis/data.html).\n",
 "\n",
 "Note that the corpus contains 1740 documents, and not particularly long ones. So keep in mind that this tutorial is not geared towards efficiency, and be careful before applying the code to a large dataset.\n",
 "\n",
@@ -260,7 +261,7 @@
 "source": [
 "## Training\n",
 "\n",
-    "We are ready to train the LDA model. We will first discuss how to set some of the training parameters. Some of the parameters are difficult to explain if someone isn't familiar with the model and its training algorithm, but we'll give it a go. **TODO:** re-write the last sentence.\n",
+    "We are ready to train the LDA model. We will first discuss how to set some of the training parameters.\n",
 "\n",
 "First of all, the elephant in the room: how many topics do I need? There is really no easy answer for this; it will depend on both your data and your application. I have used 10 topics here because I wanted to have a few topics that I could interpret and \"label\", and because that turned out to give me reasonably good results. You might not need to interpret all your topics, so you could use a large number of topics, for example 100.\n",
 "\n",
 "The `chunksize` controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fits into memory. I've set `chunksize = 2000`, which is more than the number of documents, so I process all the data in one go. Chunksize can however influence the quality of the model, as discussed in Hoffman and co-authors [2], but the difference was not substantial in this case.\n",
 "\n",
 "`passes` controls how many times we train the model on the entire corpus. Another word for passes might be \"epochs\". `iterations` is somewhat technical, but essentially it controls how many times we repeat a particular loop over each document. It is important to set the number of \"passes\" and \"iterations\" high enough.\n",
 "\n",
-    "I suggest the following way to choose iterations and passes. First, enable logging (as described in many Gensim tutorials), and set `eval_every = 1`. When training the model look for a line in the log that looks something like this:\n",
+    "I suggest the following way to choose iterations and passes. First, enable logging (as described in many Gensim tutorials), and set `eval_every = 1` in `LdaModel`. When training the model look for a line in the log that looks something like this:\n",
 "\n",
 "    2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations\n",
 "\n",
 "If you set `passes = 20` you will see this line 20 times. Make sure that by the final passes, most of the documents have converged. So you want to choose both passes and iterations to be high enough for this to happen.\n",
 "\n",
-    "We set `alpha = 'auto'` and `eta = 'auto'`. Again this is somewhat technical, but essentially we are automatically learning two paramters in the model that we usually would have to specify explicitly."
+    "We set `alpha = 'auto'` and `eta = 'auto'`. Again this is somewhat technical, but essentially we are automatically learning two parameters in the model that we usually would have to specify explicitly."
 ]
 },
 {
@@ -319,7 +320,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-    "Now what we're going to do is train several models using the approach described above, and choose the model with the lowest *perplexity*. In LDA we use perplexity as a way to measure the fit of the model.\n",
+    "Running the code above multiple times will produce different results, because there are some parameters in the model that are randomly initialized. Each of the trained models will have different topics that may differ in quality. So now we are going to train several models and choose the best one.\n",
+    "\n",
+    "In LDA we use perplexity as a way to measure the fit of the model. Therefore, we will choose the model with the lowest perplexity."
 ]
 },
 {
@@ -352,7 +355,7 @@
 }
 ],
 "source": [
-    "# Train some models to compare.\n",
+    "# Train some models and compute their perplexity.\n",
 "\n",
 "from time import time\n",
 "from numpy import exp2\n",
@@ -362,7 +365,7 @@
 "chunksize = 2000\n",
 "passes = 20\n",
 "iterations = 400\n",
-    "eval_every = None # Don't evaluate model perplexity unless we need it, it takes too much time.\n",
+    "eval_every = None # Don't evaluate model perplexity unless we need it, as it takes too much time.\n",
 "\n",
 "models = [] # List of model objects and their corresponding perplexity.\n",
 "start = time()\n",
@@ -417,9 +420,9 @@
 "source": [
 "We can compute the topic coherence of each topic. Below we display the average topic coherence and print the topics in order of topic coherence.\n",
 "\n",
-    "Note that we use the \"Umass\" topic coherence measure here (see [docs](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.top_topics)). Gensim has recently obtained an implementation of the \"AKSW\" topic coherence measure (see accompanying [blog post](http://rare-technologies.com/what-is-topic-coherence/)).\n",
+    "Note that we use the \"Umass\" topic coherence measure here (see docs, https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.top_topics). Gensim has recently obtained an implementation of the \"AKSW\" topic coherence measure (see accompanying blog post, http://rare-technologies.com/what-is-topic-coherence/).\n",
 "\n",
-    "If you are familiar with the subject of these articles, you can see that the topics below make a lot of sense. However, they are not without flaws. We can see that there is substantial overlap between some topics, others are hard to interpret, and most of them have at least some terms that seem out of place. If you were able to do better, feel free to share your methods in the comments below!\n",
+    "If you are familiar with the subject of these articles, you can see that the topics below make a lot of sense. However, they are not without flaws. We can see that there is substantial overlap between some topics, others are hard to interpret, and most of them have at least some terms that seem out of place. If you were able to do better, feel free to share your methods in the comments below! **TODO:** the previous sentence does not make sense unless this post is on the RaRe blog; if it is, then add a hyperlink to the blog."
]
 },
 {
@@ -691,9 +694,9 @@
 "source": [
 "## Where to go from here\n",
 "\n",
-    "* Check out a RaRe [blog post](http://rare-technologies.com/what-is-topic-coherence/) on the AKSW topic coherence measure.\n",
-    "* [pyLDAvis](https://pyldavis.readthedocs.io/en/latest/index.html).\n",
-    "* Read some more [Gensim tutorials](https://github.com/RaRe-Technologies/gensim/blob/develop/tutorials.md#tutorials).\n",
+    "* Check out a RaRe blog post on the AKSW topic coherence measure (http://rare-technologies.com/what-is-topic-coherence/).\n",
+    "* pyLDAvis (https://pyldavis.readthedocs.io/en/latest/index.html).\n",
+    "* Read some more Gensim tutorials (https://github.com/RaRe-Technologies/gensim/blob/develop/tutorials.md#tutorials).\n",
 "* If you haven't already, read [1] and [2] (see references)."
 ]
 },
@@ -724,7 +727,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-    "version": "3.4.3+"
+    "version": "3.5.2"
 }
 },
 "nbformat": 4,
diff --git a/tutorials.md b/tutorials.md
index 80453b6e77..25dede2089 100644
--- a/tutorials.md
+++ b/tutorials.md
@@ -39,6 +39,7 @@
 * [Colouring words by topic in a document, print words in a topic](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_methods.ipynb)
 * [Topic Coherence, a metric that correlates with human judgement on topic quality.](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_coherence_tutorial.ipynb)
 * [Compare topics and documents using Jaccard, Kullback-Leibler and Hellinger similarities](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/similarity_metrics.ipynb)
+* [LDA: pre-processing and training tips](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/LDA_tutorial.ipynb)

#####Query Similarities
* Tool to get the most similar documents for LDA, LSI

From d359d8ab4a5178f8b405e5c49db613d9d4fe16dd Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=C3=93lavur=20Mortensen?= Date: Thu, 18 Aug 2016 13:25:20 +0100 Subject: [PATCH 3/7] Changed filename of tutorial. 
--- docs/notebooks/{LDA_tutorial.ipynb => lda_training_tips.ipynb} | 0 tutorials.md | 2 +- 2 files changed, 1 insertion(+), 1 deletion(-) rename docs/notebooks/{LDA_tutorial.ipynb => lda_training_tips.ipynb} (100%) diff --git a/docs/notebooks/LDA_tutorial.ipynb b/docs/notebooks/lda_training_tips.ipynb similarity index 100% rename from docs/notebooks/LDA_tutorial.ipynb rename to docs/notebooks/lda_training_tips.ipynb diff --git a/tutorials.md b/tutorials.md index 25dede2089..142aba5d9b 100644 --- a/tutorials.md +++ b/tutorials.md @@ -39,7 +39,7 @@
 * [Colouring words by topic in a document, print words in a topic](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_methods.ipynb)
 * [Topic Coherence, a metric that correlates with human judgement on topic quality.](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_coherence_tutorial.ipynb)
 * [Compare topics and documents using Jaccard, Kullback-Leibler and Hellinger similarities](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/similarity_metrics.ipynb)
-* [LDA: pre-processing and training tips](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/LDA_tutorial.ipynb)
+* [LDA: pre-processing and training tips](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/lda_training_tips.ipynb)

#####Query Similarities
* Tool to get the most similar documents for LDA, LSI

From 1aadcd7e98b52db1871d35caeb74794b594499c2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=C3=93lavur=20Mortensen?= Date: Thu, 1 Sep 2016 12:54:25 +0200 Subject: [PATCH 4/7] Minor changes. --- docs/notebooks/lda_training_tips.ipynb | 13 ++----------- 1 file changed, 2 insertions(+), 11 deletions(-) diff --git a/docs/notebooks/lda_training_tips.ipynb b/docs/notebooks/lda_training_tips.ipynb index 6dcc385b4b..3a9779c358 100644 --- a/docs/notebooks/lda_training_tips.ipynb +++ b/docs/notebooks/lda_training_tips.ipynb @@ -4,16 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-    "**TODO:**\n",
-    "\n",
-    "* Make sure there are no \"TODO\"s in text before I submit it."
-  ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "# LDA: pre-processing and training tips\n",
+    "# LDA: training tips\n",
 "\n",
 "LDA is a probabilistic hierarchical Bayesian model that is a mixture model as well as a mixed membership model... but we won't be getting into any of that.\n",
@@ -422,7 +413,7 @@
 "\n",
 "Note that we use the \"Umass\" topic coherence measure here (see docs, https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.top_topics). Gensim has recently obtained an implementation of the \"AKSW\" topic coherence measure (see accompanying blog post, http://rare-technologies.com/what-is-topic-coherence/).\n",
 "\n",
-    "If you are familiar with the subject of these articles, you can see that the topics below make a lot of sense. However, they are not without flaws. We can see that there is substantial overlap between some topics, others are hard to interpret, and most of them have at least some terms that seem out of place. If you were able to do better, feel free to share your methods in the comments below! **TODO:** the previous sentence does not make sense unless this post is on the RaRe blog; if it is, then add a hyperlink to the blog."
+    "If you are familiar with the subject of these articles, you can see that the topics below make a lot of sense. However, they are not without flaws. 
We can see that there is substantial overlap between some topics, others are hard to interpret, and most of them have at least some terms that seem out of place. If you were able to do better, feel free to share your methods on the blog at http://rare-technologies.com/lda-training-tips/ !"
 ]
 },
 {

From cac018d06e1b20a4330ff227f4cec1ba37cff39d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=C3=93lavur=20Mortensen?= Date: Thu, 1 Sep 2016 12:58:36 +0200 Subject: [PATCH 5/7] Added a link. --- docs/notebooks/lda_training_tips.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/notebooks/lda_training_tips.ipynb b/docs/notebooks/lda_training_tips.ipynb index 3a9779c358..9b0bd95af9 100644 --- a/docs/notebooks/lda_training_tips.ipynb +++ b/docs/notebooks/lda_training_tips.ipynb @@ -676,7 +676,7 @@
 "* Consider whether using a hold-out set or cross-validation is the way to go for you.\n",
 "* Try other datasets.\n",
 "\n",
-    "If you have other ideas, feel free to comment."
+    "If you have other ideas, feel free to comment on http://rare-technologies.com/lda-training-tips/."
 ]
 },
 {

From 54f4cba496ddc56cedbdca32486ceb8e74054f21 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=C3=93lavur=20Mortensen?= Date: Wed, 5 Oct 2016 16:33:27 +0200 Subject: [PATCH 6/7] Changes according to tmylk's comments. Changed title, removed first sentence, removed model selection based on perplexity. --- docs/notebooks/lda_training_tips.ipynb | 104 +------------------------ 1 file changed, 2 insertions(+), 102 deletions(-) diff --git a/docs/notebooks/lda_training_tips.ipynb b/docs/notebooks/lda_training_tips.ipynb index 9b0bd95af9..473aa86acf 100644 --- a/docs/notebooks/lda_training_tips.ipynb +++ b/docs/notebooks/lda_training_tips.ipynb @@ -4,9 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-    "# LDA: training tips\n",
-    "\n",
-    "LDA is a probabilistic hierarchical Bayesian model that is a mixture model as well as a mixed membership model... but we won't be getting into any of that.\n",
+    "# Pre-processing and training LDA\n",
 "\n",
 "The purpose of this tutorial is to show you some tips and tricks that I have discovered when applying the LDA model to text data. This tutorial will **not** explain the LDA model, how inference is made in the LDA model, and it will not necessarily teach you how to use Gensim's implementation. There are plenty of resources for all of those things, but what is somewhat lacking is a hands-on tutorial that helps you train an LDA model with good results... so here is my contribution towards that.\n",
@@ -307,104 +305,6 @@
 " passes=passes, eval_every=eval_every)"
 ]
 },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Running the code above multiple times will produce different results, because there are some parameters in the model that are randomly initialized. Each of the trained models will have different topics that may differ in quality. So now we are going to train several models and choose the best one.\n",
- "\n",
- "In LDA we use perplexity as a way to measure the fit of the model. Therefore, we will choose the model with the lowest perplexity."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {
- "collapsed": false
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Elapsed time: 123.4005 seconds.\n",
- "Perplexity: 298.1638\n",
- "\n",
- "Elapsed time: 244.3692 seconds.\n",
- "Perplexity: 294.8107\n",
- "\n",
- "Elapsed time: 366.5183 seconds.\n",
- "Perplexity: 296.9103\n",
- "\n",
- "Elapsed time: 485.7781 seconds.\n",
- "Perplexity: 295.3224\n",
- "\n",
- "Elapsed time: 605.9293 seconds.\n",
- "Perplexity: 293.9899\n",
- "\n"
- ]
- }
- ],
- "source": [
- "# Train some models and compute their perplexity.\n",
- "\n",
- "from time import time\n",
- "from numpy import exp2\n",
- "\n",
- "# Set training parameters.\n",
- "num_topics = 10\n",
- "chunksize = 2000\n",
- "passes = 20\n",
- "iterations = 400\n",
- "eval_every = None # Don't evaluate model perplexity unless we need it, as it takes too much time.\n",
- "\n",
- "models = [] # List of model objects and their corresponding perplexity.\n",
- "start = time()\n",
- "for _ in range(5):\n",
- " model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \\\n",
- " alpha='auto', eta='auto', \\\n",
- " iterations=iterations, num_topics=num_topics, \\\n",
- " passes=passes, eval_every=eval_every)\n",
- " \n",
- " # Compute perplexity of model.\n",
- " temp = model.log_perplexity(corpus)\n",
- " perplexity = exp2(-temp)\n",
- " models.append((model, perplexity))\n",
- " print('Elapsed time: %.4f seconds.' % (time() - start))\n",
- " print('Perplexity: %.4f\\n' % (perplexity))\n",
- " \n",
- "# Pickle the models so we don't have to do this again.\n",
- "import pickle\n",
- "pickle.dump(models, open('5_models.pickle', 'wb'))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We select the model with the lowest perplexity."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {
- "collapsed": false
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Perplexity: 293.9899\n"
- ]
- }
- ],
- "source": [
- "model, perplexity = min(models, key=lambda m: m[1])\n",
- "print('Perplexity: %.4f' % (perplexity))"
- ]
- },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "We can compute the topic coherence of each topic. Below we display the average topic coherence and print the topics in order of topic coherence.\n",
 "\n",
 "Note that we use the \"Umass\" topic coherence measure here (see docs, https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.top_topics). Gensim has recently obtained an implementation of the \"AKSW\" topic coherence measure (see accompanying blog post, http://rare-technologies.com/what-is-topic-coherence/).\n",
 "\n",
-    "If you are familiar with the subject of these articles, you can see that the topics below make a lot of sense. However, they are not without flaws. We can see that there is substantial overlap between some topics, others are hard to interpret, and most of them have at least some terms that seem out of place. If you were able to do better, feel free to share your methods on the blog at http://rare-technologies.com/lda-training-tips/ !"
+    "If you are familiar with the subject of the articles in this dataset, you can see that the topics below make a lot of sense. However, they are not without flaws. We can see that there is substantial overlap between some topics, others are hard to interpret, and most of them have at least some terms that seem out of place. If you were able to do better, feel free to share your methods on the blog at http://rare-technologies.com/lda-training-tips/ !"
]
 },
 {

From c8c1e087eca24e1cace6d4c303183ade65ad6e0a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=C3=93lavur=20Mortensen?= Date: Thu, 6 Oct 2016 12:10:34 +0200 Subject: [PATCH 7/7] Changes according to tmylk's comments. Changed first sentence, removed stuff about pyLDAvis. --- docs/notebooks/lda_training_tips.ipynb | 15 +-------------- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/docs/notebooks/lda_training_tips.ipynb b/docs/notebooks/lda_training_tips.ipynb index 473aa86acf..3987ffe1a2 100644 --- a/docs/notebooks/lda_training_tips.ipynb +++ b/docs/notebooks/lda_training_tips.ipynb @@ -6,7 +6,7 @@
 "source": [
 "# Pre-processing and training LDA\n",
 "\n",
-    "The purpose of this tutorial is to show you some tips and tricks that I have discovered when applying the LDA model to text data. This tutorial will **not** explain the LDA model, how inference is made in the LDA model, and it will not necessarily teach you how to use Gensim's implementation. There are plenty of resources for all of those things, but what is somewhat lacking is a hands-on tutorial that helps you train an LDA model with good results... so here is my contribution towards that.\n",
+    "The purpose of this tutorial is to show you how to pre-process text data, and how to train the LDA model on that data. This tutorial will **not** explain the LDA model, how inference is made in the LDA model, and it will not necessarily teach you how to use Gensim's implementation. There are plenty of resources for all of those things, but what is somewhat lacking is a hands-on tutorial that helps you train an LDA model with good results... so here is my contribution towards that.\n",
 "\n",
 "I have used a corpus of NIPS papers in this tutorial, but if you're following this tutorial just to learn about LDA I encourage you to consider picking a corpus on a subject that you are familiar with. Qualitatively evaluating the output of an LDA model is challenging and can require you to understand the subject matter of your corpus (depending on your goal with the model).\n",
@@ -552,19 +552,6 @@
 "pprint(top_topics)"
 ]
 },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "[pyLDAvis](https://pyldavis.readthedocs.io/en/latest/index.html) can be fun and useful. Include the code below in your notebook to visualize your topics with pyLDAvis.\n",
- "\n",
- "    import pyLDAvis\n",
- "    import pyLDAvis.gensim\n",
- "    pyLDAvis.enable_notebook()\n",
- "    prepared_data = pyLDAvis.gensim.prepare(model, corpus, dictionary)\n",
- "    pyLDAvis.display(prepared_data)"
- ]
- },
 {
 "cell_type": "markdown",
 "metadata": {},