-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
2 changed files
with
4,646 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,179 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# DTM Example\n", | ||
"\n", | ||
"In this example we will present a sample usage of the DTM wrapper. \n", | ||
"\n", | ||
"The example dataset is already processed into list of tokens.\n", | ||
"To see how to get a dataset to this stage please take a look at [Gensim Tutorials](https://radimrehurek.com/gensim/tut1.html#from-strings-to-vectors). The corpus contains 10 documents. The word dictionary is also already generated, but the wrapper will generate one if you supply a `None`.\n", | ||
"\n", | ||
"```python\n", | ||
" corpus = [[\"token1\",\"token3\"],[\"token1\",\"token2\"]]\n", | ||
"```\n", | ||
"\n", | ||
"`time_seq` contains the time slice defentiion. For this case it is a list `[3,7]`, which states that the first time slice contains the first 3 documents of the corpus and the second timeslice contains the next 7 documents. I admit this is a bit weird, but ya well." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import logging\n", | ||
"import gensim\n", | ||
"import os\n", | ||
"from gensim import corpora, utils\n", | ||
"from gensim.models.wrappers import dtmmodel\n", | ||
"import numpy as np\n", | ||
" \n", | ||
" \n", | ||
" # Load an alternate model if you wish\n", | ||
"\n", | ||
"\n", | ||
"# So now we have to generate the path to DTM executable, here I have\n", | ||
"# already set an ENV variable for the DTM_HOME\n", | ||
"\n", | ||
"dtm_home = os.environ.get(\n", | ||
" 'DTM_HOME', \"C:/Users/arti/repos/TopicModels/dtm-master\")\n", | ||
"dtm_path = os.path.join(dtm_home, 'bin', 'dtm') if dtm_home else None\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"First we wil setup logging" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"logger = logging.getLogger()\n", | ||
"logger.setLevel(logging.DEBUG)\n", | ||
"logging.debug(\"test\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"A simple corups wrapper and load a premade corpus. You can exchange this with your own data" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"class DTMcorpus(corpora.textcorpus.TextCorpus):\n", | ||
"\n", | ||
" def get_texts(self):\n", | ||
" return self.input\n", | ||
"\n", | ||
" def __len__(self):\n", | ||
" return len(self.input)\n", | ||
"corpus, time_seq = utils.unpickle('dtm_test')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"So now we have to generate the path to DTM executable, here I have already set an ENV variable for the DTM_HOME" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"dtm_home = os.environ.get(\n", | ||
" 'DTM_HOME', \"C:/Users/arti/repos/TopicModels/dtm-master\")\n", | ||
"dtm_path = os.path.join(dtm_home, 'bin', 'dtm') if dtm_home else None\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"That is basically all we need to be able to invoke the Training. Currently there seems to be a bug, where \n", | ||
"it still wants to initialize from and LDA model even though we tell it not to by default. So as workaround specify\n", | ||
"\n", | ||
"```python\n", | ||
" initialize_lda=True\n", | ||
"```\n", | ||
"I'll see why this is the case in the neart future, I promise.\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false, | ||
"scrolled": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"model = DtmModel(dtm_path,corpus,time_seq,num_topics=2,id2word=corpus.dictionary,initialize_lda=True)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"If everything worked we should be able to print out the topics" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"topics = model.show_topics(topics=2,times=2, topn=10)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 2", | ||
"language": "python", | ||
"name": "python2" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 2 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython2", | ||
"version": "2.7.5" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 0 | ||
} |
Oops, something went wrong.