From f12a277cd5208615b2b2fc2581250f66d0984c99 Mon Sep 17 00:00:00 2001 From: TheFlash10 Date: Tue, 20 Feb 2018 23:33:44 +0530 Subject: [PATCH] Removed the Deprecated parameter warning in the notebook (doc2vec-lee.ipynb) --- docs/notebooks/doc2vec-lee.ipynb | 318 +++++++++---------------------- 1 file changed, 86 insertions(+), 232 deletions(-) diff --git a/docs/notebooks/doc2vec-lee.ipynb b/docs/notebooks/doc2vec-lee.ipynb index aaeca5e224..9865096cdc 100644 --- a/docs/notebooks/doc2vec-lee.ipynb +++ b/docs/notebooks/doc2vec-lee.ipynb @@ -2,10 +2,7 @@ "cells": [ { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "# Doc2Vec Tutorial on the Lee Dataset" ] @@ -13,11 +10,7 @@ { "cell_type": "code", "execution_count": 1, - "metadata": { - "collapsed": true, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "import gensim\n", @@ -29,10 +22,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## What is it?\n", "\n", @@ -41,10 +31,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## Resources\n", "\n", @@ -57,20 +44,14 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## Getting Started" ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "To get going, we'll need to have a set of documents to train our doc2vec model. In theory, a document could be anything from a short 140 character tweet, a single paragraph (i.e., journal article abstract), a news article, or a book. In NLP parlance a collection or set of documents is often referred to as a corpus. \n", "\n", @@ -83,11 +64,7 @@ { "cell_type": "code", "execution_count": 2, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "# Set file names for train and test data\n", @@ -98,20 +75,14 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## Define a Function to Read and Preprocess Text" ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Below, we define a function to open the train/test file (with latin encoding), read the file line-by-line, pre-process each line using a simple gensim pre-processing tool (i.e., tokenize text into individual words, remove punctuation, set to lowercase, etc), and return a list of words. Note that, for a given file (aka corpus), each continuous line constitutes a single document and the length of each line (i.e., document) can vary. Also, to train the model, we'll need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the zero-based line number." ] @@ -119,11 +90,7 @@ { "cell_type": "code", "execution_count": 3, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "def read_corpus(fname, tokens_only=False):\n", @@ -139,11 +106,7 @@ { "cell_type": "code", "execution_count": 4, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [], "source": [ "train_corpus = list(read_corpus(lee_train_file))\n", @@ -152,10 +115,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Let's take a look at the training corpus" ] @@ -163,17 +123,13 @@ { "cell_type": "code", "execution_count": 5, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[TaggedDocument(words=[u'hundreds', u'of', u'people', u'have', u'been', u'forced', u'to', u'vacate', u'their', u'homes', u'in', u'the', u'southern', u'highlands', u'of', u'new', u'south', u'wales', u'as', u'strong', u'winds', u'today', u'pushed', u'huge', u'bushfire', u'towards', u'the', u'town', u'of', u'hill', u'top', u'new', u'blaze', u'near', u'goulburn', u'south', u'west', u'of', u'sydney', u'has', u'forced', u'the', u'closure', u'of', u'the', u'hume', u'highway', u'at', u'about', u'pm', u'aedt', u'marked', u'deterioration', u'in', u'the', u'weather', u'as', u'storm', u'cell', u'moved', u'east', u'across', u'the', u'blue', u'mountains', u'forced', u'authorities', u'to', u'make', u'decision', u'to', u'evacuate', u'people', u'from', u'homes', u'in', u'outlying', u'streets', u'at', u'hill', u'top', u'in', u'the', u'new', u'south', u'wales', u'southern', u'highlands', u'an', u'estimated', u'residents', u'have', u'left', u'their', u'homes', u'for', u'nearby', u'mittagong', u'the', u'new', u'south', u'wales', u'rural', u'fire', u'service', u'says', u'the', u'weather', u'conditions', u'which', u'caused', u'the', u'fire', u'to', u'burn', u'in', u'finger', u'formation', u'have', u'now', u'eased', u'and', u'about', u'fire', u'units', u'in', u'and', u'around', u'hill', u'top', u'are', u'optimistic', u'of', u'defending', u'all', u'properties', u'as', u'more', u'than', u'blazes', u'burn', u'on', u'new', u'year', u'eve', u'in', u'new', u'south', u'wales', u'fire', u'crews', u'have', u'been', u'called', u'to', u'new', u'fire', u'at', u'gunning', u'south', u'of', u'goulburn', u'while', u'few', u'details', u'are', u'available', u'at', u'this', u'stage', u'fire', u'authorities', u'says', u'it', u'has', u'closed', u'the', u'hume', u'highway', u'in', u'both', u'directions', u'meanwhile', u'new', u'fire', u'in', u'sydney', u'west', u'is', u'no', u'longer', u'threatening', u'properties', u'in', u'the', u'cranebrook', u'area', u'rain', u'has', u'fallen', u'in', u'some', u'parts', u'of', u'the', u'illawarra', u'sydney', u'the', u'hunter', u'valley', u'and', u'the', u'north', u'coast', u'but', u'the', u'bureau', u'of', u'meteorology', u'claire', u'richards', u'says', u'the', u'rain', u'has', u'done', u'little', u'to', u'ease', u'any', u'of', u'the', u'hundred', u'fires', u'still', u'burning', u'across', u'the', u'state', u'the', u'falls', u'have', u'been', u'quite', u'isolated', u'in', u'those', u'areas', u'and', u'generally', u'the', u'falls', u'have', u'been', u'less', u'than', u'about', u'five', u'millimetres', u'she', u'said', u'in', u'some', u'places', u'really', u'not', u'significant', u'at', u'all', u'less', u'than', u'millimetre', u'so', u'there', u'hasn', u'been', u'much', u'relief', u'as', u'far', u'as', u'rain', u'is', u'concerned', u'in', u'fact', u'they', u've', u'probably', u'hampered', u'the', u'efforts', u'of', u'the', u'firefighters', u'more', u'because', u'of', u'the', u'wind', u'gusts', u'that', u'are', u'associated', u'with', u'those', u'thunderstorms'], tags=[0]),\n", - " TaggedDocument(words=[u'indian', u'security', u'forces', u'have', u'shot', u'dead', u'eight', u'suspected', u'militants', u'in', u'night', u'long', u'encounter', u'in', u'southern', u'kashmir', u'the', u'shootout', u'took', u'place', u'at', u'dora', u'village', u'some', u'kilometers', u'south', u'of', u'the', u'kashmiri', u'summer', u'capital', u'srinagar', u'the', u'deaths', u'came', u'as', u'pakistani', u'police', u'arrested', u'more', u'than', u'two', u'dozen', u'militants', u'from', u'extremist', u'groups', u'accused', u'of', u'staging', u'an', u'attack', u'on', u'india', u'parliament', u'india', u'has', u'accused', u'pakistan', u'based', u'lashkar', u'taiba', u'and', u'jaish', u'mohammad', u'of', u'carrying', u'out', u'the', u'attack', u'on', u'december', u'at', u'the', u'behest', u'of', u'pakistani', u'military', u'intelligence', u'military', u'tensions', u'have', u'soared', u'since', u'the', u'raid', u'with', u'both', u'sides', u'massing', u'troops', u'along', u'their', u'border', u'and', u'trading', u'tit', u'for', u'tat', u'diplomatic', u'sanctions', u'yesterday', u'pakistan', u'announced', u'it', u'had', u'arrested', u'lashkar', u'taiba', u'chief', u'hafiz', u'mohammed', u'saeed', u'police', u'in', u'karachi', u'say', u'it', u'is', u'likely', u'more', u'raids', u'will', u'be', u'launched', u'against', u'the', u'two', u'groups', u'as', u'well', u'as', u'other', u'militant', u'organisations', u'accused', u'of', u'targetting', u'india', u'military', u'tensions', u'between', u'india', u'and', u'pakistan', u'have', u'escalated', u'to', u'level', u'not', u'seen', u'since', u'their', u'war'], tags=[1])]" + "[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', 'caused', 'the', 'fire', 'to', 'burn', 'in', 'finger', 'formation', 'have', 'now', 'eased', 'and', 'about', 'fire', 'units', 'in', 'and', 'around', 'hill', 'top', 'are', 'optimistic', 'of', 'defending', 'all', 'properties', 'as', 'more', 'than', 'blazes', 'burn', 'on', 'new', 'year', 'eve', 'in', 'new', 'south', 'wales', 'fire', 'crews', 'have', 'been', 'called', 'to', 'new', 'fire', 'at', 'gunning', 'south', 'of', 'goulburn', 'while', 'few', 'details', 'are', 'available', 'at', 'this', 'stage', 'fire', 'authorities', 'says', 'it', 'has', 'closed', 'the', 'hume', 'highway', 'in', 'both', 'directions', 'meanwhile', 'new', 'fire', 'in', 'sydney', 'west', 'is', 'no', 'longer', 'threatening', 'properties', 'in', 'the', 'cranebrook', 'area', 'rain', 'has', 'fallen', 'in', 'some', 'parts', 'of', 'the', 'illawarra', 'sydney', 'the', 'hunter', 'valley', 'and', 'the', 'north', 'coast', 'but', 'the', 'bureau', 'of', 'meteorology', 'claire', 'richards', 'says', 'the', 'rain', 'has', 'done', 'little', 'to', 'ease', 'any', 'of', 'the', 'hundred', 'fires', 'still', 'burning', 'across', 'the', 'state', 'the', 'falls', 'have', 'been', 'quite', 'isolated', 'in', 'those', 'areas', 'and', 'generally', 'the', 'falls', 'have', 'been', 'less', 'than', 'about', 'five', 'millimetres', 'she', 'said', 'in', 'some', 'places', 'really', 'not', 'significant', 'at', 'all', 'less', 'than', 'millimetre', 'so', 'there', 'hasn', 'been', 'much', 'relief', 'as', 'far', 'as', 'rain', 'is', 'concerned', 'in', 'fact', 'they', 've', 'probably', 'hampered', 'the', 'efforts', 'of', 'the', 'firefighters', 'more', 'because', 'of', 'the', 'wind', 'gusts', 'that', 'are', 'associated', 'with', 'those', 'thunderstorms'], tags=[0]),\n", + " TaggedDocument(words=['indian', 'security', 'forces', 'have', 'shot', 'dead', 'eight', 'suspected', 'militants', 'in', 'night', 'long', 'encounter', 'in', 'southern', 'kashmir', 'the', 'shootout', 'took', 'place', 'at', 'dora', 'village', 'some', 'kilometers', 'south', 'of', 'the', 'kashmiri', 'summer', 'capital', 'srinagar', 'the', 'deaths', 'came', 'as', 'pakistani', 'police', 'arrested', 'more', 'than', 'two', 'dozen', 'militants', 'from', 'extremist', 'groups', 'accused', 'of', 'staging', 'an', 'attack', 'on', 'india', 'parliament', 'india', 'has', 'accused', 'pakistan', 'based', 'lashkar', 'taiba', 'and', 'jaish', 'mohammad', 'of', 'carrying', 'out', 'the', 'attack', 'on', 'december', 'at', 'the', 'behest', 'of', 'pakistani', 'military', 'intelligence', 'military', 'tensions', 'have', 'soared', 'since', 'the', 'raid', 'with', 'both', 'sides', 'massing', 'troops', 'along', 'their', 'border', 'and', 'trading', 'tit', 'for', 'tat', 'diplomatic', 'sanctions', 'yesterday', 'pakistan', 'announced', 'it', 'had', 'arrested', 'lashkar', 'taiba', 'chief', 'hafiz', 'mohammed', 'saeed', 'police', 'in', 'karachi', 'say', 'it', 'is', 'likely', 'more', 'raids', 'will', 'be', 'launched', 'against', 'the', 'two', 'groups', 'as', 'well', 'as', 'other', 'militant', 'organisations', 'accused', 'of', 'targetting', 'india', 'military', 'tensions', 'between', 'india', 'and', 'pakistan', 'have', 'escalated', 'to', 'level', 'not', 'seen', 'since', 'their', 'war'], tags=[1])]" ] }, "execution_count": 5, @@ -187,10 +143,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "And the testing corpus looks like this:" ] @@ -198,17 +151,13 @@ { "cell_type": "code", "execution_count": 6, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[[u'the', u'national', u'executive', u'of', u'the', u'strife', u'torn', u'democrats', u'last', u'night', u'appointed', u'little', u'known', u'west', u'australian', u'senator', u'brian', u'greig', u'as', u'interim', u'leader', u'shock', u'move', u'likely', u'to', u'provoke', u'further', u'conflict', u'between', u'the', u'party', u'senators', u'and', u'its', u'organisation', u'in', u'move', u'to', u'reassert', u'control', u'over', u'the', u'party', u'seven', u'senators', u'the', u'national', u'executive', u'last', u'night', u'rejected', u'aden', u'ridgeway', u'bid', u'to', u'become', u'interim', u'leader', u'in', u'favour', u'of', u'senator', u'greig', u'supporter', u'of', u'deposed', u'leader', u'natasha', u'stott', u'despoja', u'and', u'an', u'outspoken', u'gay', u'rights', u'activist'], [u'cash', u'strapped', u'financial', u'services', u'group', u'amp', u'has', u'shelved', u'million', u'plan', u'to', u'buy', u'shares', u'back', u'from', u'investors', u'and', u'will', u'raise', u'million', u'in', u'fresh', u'capital', u'after', u'profits', u'crashed', u'in', u'the', u'six', u'months', u'to', u'june', u'chief', u'executive', u'paul', u'batchelor', u'said', u'the', u'result', u'was', u'solid', u'in', u'what', u'he', u'described', u'as', u'the', u'worst', u'conditions', u'for', u'stock', u'markets', u'in', u'years', u'amp', u'half', u'year', u'profit', u'sank', u'per', u'cent', u'to', u'million', u'or', u'share', u'as', u'australia', u'largest', u'investor', u'and', u'fund', u'manager', u'failed', u'to', u'hit', u'projected', u'per', u'cent', u'earnings', u'growth', u'targets', u'and', u'was', u'battered', u'by', u'falling', u'returns', u'on', u'share', u'markets']]\n" + "[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to', 'june', 'chief', 'executive', 'paul', 'batchelor', 'said', 'the', 'result', 'was', 'solid', 'in', 'what', 'he', 'described', 'as', 'the', 'worst', 'conditions', 'for', 'stock', 'markets', 'in', 'years', 'amp', 'half', 'year', 'profit', 'sank', 'per', 'cent', 'to', 'million', 'or', 'share', 'as', 'australia', 'largest', 'investor', 'and', 'fund', 'manager', 'failed', 'to', 'hit', 'projected', 'per', 'cent', 'earnings', 'growth', 'targets', 'and', 'was', 'battered', 'by', 'falling', 'returns', 'on', 'share', 'markets']]\n" ] } ], @@ -218,75 +167,52 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Notice that the testing corpus is just a list of lists and does not contain any tags." ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## Training the Model" ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "### Instantiate a Doc2Vec Object " ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Now, we'll instantiate a Doc2Vec model with a vector size with 50 words and iterating over the training corpus 55 times. We set the minimum word count to 2 in order to give higher frequency words more weighting. Model accuracy can be improved by increasing the number of iterations but this generally increases the training time. Small datasets with short documents, like this one, can benefit from more training passes." ] }, { "cell_type": "code", - "execution_count": 7, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "execution_count": 8, + "metadata": {}, "outputs": [], "source": [ - "model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)" + "model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=55)" ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "### Build a Vocabulary" ] }, { "cell_type": "code", - "execution_count": 8, - "metadata": { - "collapsed": true, - "deletable": true, - "editable": true - }, + "execution_count": 9, + "metadata": {}, "outputs": [], "source": [ "model.build_vocab(train_corpus)" @@ -294,20 +220,14 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Essentially, the vocabulary is a dictionary (accessible via `model.wv.vocab`) of all of the unique words extracted from the training corpus along with the count (e.g., `model.wv.vocab['penalty'].count` for counts for the word `penalty`)." ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "### Time to Train\n", "\n", @@ -317,117 +237,84 @@ }, { "cell_type": "code", - "execution_count": 9, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "execution_count": 11, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "CPU times: user 6.34 s, sys: 109 ms, total: 6.45 s\n", - "Wall time: 2.58 s\n" + "CPU times: user 4.5 s, sys: 247 ms, total: 4.75 s\n", + "Wall time: 2.04 s\n" ] - }, - { - "data": { - "text/plain": [ - "2347665" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" } ], "source": [ - "%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)" + "%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)" ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "### Inferring a Vector" ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "One important thing to note is that you can now infer a vector for any piece of text without having to re-train the model by passing a list of words to the `model.infer_vector` function. This vector can then be compared with other vectors via cosine similarity." ] }, { "cell_type": "code", - "execution_count": 10, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "execution_count": 12, + "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "array([ 0.02664499, 0.00475204, -0.03981256, 0.03796276, -0.03206162,\n", - " 0.10963056, -0.04897128, 0.00151982, -0.03258783, 0.04711508,\n", - " -0.00667155, -0.08523653, -0.02975186, 0.00166316, 0.01915652,\n", - " -0.03415785, -0.05794788, 0.05110953, 0.01623618, -0.00512495,\n", - " -0.06385455, -0.0151557 , 0.00365376, 0.03015811, 0.0229462 ,\n", - " 0.03176891, 0.01117626, -0.00743352, 0.02030453, -0.05072152,\n", - " -0.00498496, 0.00151227, 0.06122205, -0.01811385, -0.01715777,\n", - " 0.04883198, 0.03925886, -0.03568915, 0.00805744, 0.01654406,\n", - " -0.05160677, 0.0119908 , -0.01527433, 0.02209963, -0.10316766,\n", - " -0.01069367, -0.02432527, 0.00761799, 0.02763799, -0.04288232], dtype=float32)" + "array([ 0.03101196, 0.08118944, 0.10724881, -0.16268663, -0.12030419,\n", + " 0.07530276, -0.05967962, 0.01093007, 0.01722554, -0.16849394,\n", + " -0.09248347, 0.00667514, 0.05426382, -0.0725852 , 0.09535281,\n", + " -0.12534387, 0.08636193, -0.1029434 , -0.07632427, -0.24741814,\n", + " -0.1277334 , -0.09834807, -0.12880586, -0.07720284, -0.12248702,\n", + " -0.15788661, 0.17826575, -0.12920539, 0.02845461, -0.12751418,\n", + " 0.06129557, -0.02319777, 0.11814108, -0.08767211, -0.04094559,\n", + " -0.00681656, 0.00937355, 0.02168806, -0.03686712, 0.14234844,\n", + " -0.01192134, 0.06787674, -0.25467244, -0.22923732, -0.03031967,\n", + " -0.2362234 , 0.1105942 , 0.01180398, 0.01921744, -0.07667527],\n", + " dtype=float32)" ] }, - "execution_count": 10, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "model.infer_vector(['only', 'you', 'can', 'prevent', 'forrest', 'fires'])" + "model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])" ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## Assessing Model" ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "To assess our new model, we'll first infer new vectors for each document of the training corpus, compare the inferred vectors with the training corpus, and then returning the rank of the document based on self-similarity. Basically, we're pretending as if the training corpus is some new unseen data and then seeing how they compare with the trained model. The expectation is that we've likely overfit our model (i.e., all of the ranks will be less than 2) and so we should be able to find similar documents very easily. Additionally, we'll keep track of the second ranks for a comparison of less similar documents. " ] }, { "cell_type": "code", - "execution_count": 11, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "execution_count": 13, + "metadata": {}, "outputs": [], "source": [ "ranks = []\n", @@ -443,31 +330,25 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Let's count how each document ranks with respect to the training corpus " ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 14, "metadata": { - "collapsed": false, - "deletable": true, - "editable": true, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ - "Counter({0: 289, 1: 11})" + "Counter({0: 284, 1: 13, 2: 2, 4: 1})" ] }, - "execution_count": 12, + "execution_count": 14, "metadata": {}, "output_type": "execute_result" } @@ -478,10 +359,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Basically, greater than 95% of the inferred documents are found to be most similar to itself and about 5% of the time it is mistakenly most similar to another document. the checking of an inferred-vector against a training-vector is a sort of 'sanity check' as to whether the model is behaving in a usefully consistent manner, though not a real 'accuracy' value.\n", "\n", @@ -491,11 +369,7 @@ { "cell_type": "code", "execution_count": 15, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -505,11 +379,11 @@ "\n", "SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):\n", "\n", - "MOST (83, 0.9967287182807922): «the opposition leader simon crean says child abuse scandal in brisbane has damaged the office of the governor general and its incumbent dr peter hollingworth child advocates have called on dr hollingworth to step down as governor general saying he did not do enough to prevent abuse of children in an anglican school when he was archbishop of brisbane mr crean says he is not calling on dr hollingworth to resign but he says there are still unanswered questions think it has tarnished the office of the governor general the fact that it took so long for this statement to come out he said many people have been calling for it me included think if we are to avoid further damage to the office we need to clear it up completely brisbane lord mayor says the governor general explanation of his handling of child sex abuse allegations at queensland school raises more questions than it answers jim soorley who is former catholic priest says the explanation does not wash within the christian tradition bishops are regarded as shepherds he said it very clear that he was not good shepherd and there are serious consequences for that think his actions are not the actions of good shepherd and think there are still questions to be answered»\n", + "MOST (299, 0.8637137413024902): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»\n", "\n", - "MEDIAN (252, 0.9926513433456421): «the labor party is set to have wide ranging review of its structures with frontbencher martin ferguson pushing for the process the new labor leader simon crean is taking set of proposals to next thursday national executive meeting mr ferguson wants the meeting to call review he says suggestions for party changes such as the call by frontbencher joel fitzgibbon for the scrapping of new south wales rule forcing labor members to belong to union should be dealt with internally perhaps the time has come for us to actually sign up to federal executive process which actually enables debate to go forward in highly constructive way rather than individual proposals being put out there without an end game in sight mr ferguson said he says he is happy for the process to include looking at abandoning the rule but says scrapping the rule would only be minor factor in the party self examination for long time ve believed there is requirement for the labor party to actually have hard look at whether or not an archaic formula of union representation is the key to our future»\n", + "MEDIAN (178, 0.2800390124320984): «year old middle eastern woman is said to be responding well to treatment after being diagnosed with typhoid in temporary holding centre on remote christmas island it could be hours before tests can confirm whether the disease has spread further two of the woman three children boy aged and year old girl have been quarantined with their mother in the christmas island hospital third child remains at the island sports hall where locals say conditions are crowded and hot all detainees on christmas island are being monitored by health team for signs of fever or abdominal pains the key symptoms of typhoid which is spread by contact with contaminated food or water hygiene measures have also been stepped up the western australian health department is briefing medical staff on infection control procedures but locals have expressed concern the disease could spread to the wider community»\n", "\n", - "LEAST (250, 0.9795532822608948): «israel launched massive air raids across the west bank and gaza tuesday piling pressure on yasser arafat with rocket strike on police post next to his offices after prime minister ariel sharon branded his administration sponsor of terrorism israeli warplanes launched series of strikes on gaza city while apache helicopters fired rockets on palestinian security offices in khan yunis in the southern gaza strip and on the west bank towns of salfit and tulkarem they also fired missiles on security post just metres from mr arafat offices in ramallah but the palestinian leader who was in his office at the time was unhurt but two policemen were slightly wounded officials said israeli army spokesman brigadier general ron kitrey said mr arafat was not targeted two people were killed in the gaza strikes and around injured half of them schoolboys palestinian hospital officials said the attacks came as israel foreign minister shimon peres said he did not believe israeli forces would take direct action against the palestinian leader the strikes also came day after mr sharon furious that mr arafat had not stopped hardline islamic groups who killed two dozen israelis in devastating suicide attacks at the weekend ordered his forces to blast symbols of mr arafat power gunships destroyed mr arafat three helicopters in gaza city while bulldozers ploughed up the runway at gaza international airport used by mr arafat for his frequent travels abroad palestinian officials called mr sharon campaign an attempt to topple mr arafat and destroy his self rule palestinian authority mr arafat told cnn television that mr sharon was trying to torpedo his own crackdown on terrorism with the airstrikes he doesn want me to succeed and for this he is escalating his military activities against our towns our cities our establishments the palestinian leader said french foreign minister hubert vedrine accused israel of conducting deliberate policy aimed at eliminating mr arafat arafat has been weakened by the harassment of the israeli army and as result people are using his weakness as an argument to say that since he can not re establish order in his own camp he should in some way be eliminated however britain prime minister tony blair and us president george bush expressed sympathy with israel and called on all sides to do anything they can to stabilise the situation mr sharon hard words and air strikes opened major divisions in his cross party government with left wing mr peres denouncing what he called bid during monday emergency cabinet meeting to cause the downfall of the palestinian authority the region had been braced for huge israeli retaliation after three palestinian suicide bombers from the hardline islamic movement hamas killed people on saturday and sunday in the suicide attacks in jerusalem and haifa mr sharon made national address after blasting gaza city and jenin in the west bank on monday accusing mr arafat of having chosen the path of terrorism and being the greatest obstacle to peace and stability in the middle east mr peres said the move by mr sharon dominant right wingers in effect means israeli policy is based purely on force with no political hope public radio said mr peres had called all the ministers from his labour party for special meeting wednesday to discuss the fallout of the strikes and mr sharon accusation that mr arafat was responsible for everything that has happened here chief palestinian negotiator saeb erakat speaking after mr sharon speech monday evening said the words amounted to declaration of war he called on the united states and europe to rein in mr sharon and dispatch international observers to oversee the spiralling conflict»\n", + "LEAST (11, 0.01867116428911686): «peru has entered two days of official mourning for the more than people killed in fire that destroyed part of downtown lima police say the fire began when fireworks cache exploded in shop just four blocks from peru congress in heritage listed area famed for its spanish colonial era architecture early evening crowds buying traditional fireworks for new year eve celebrations were trapped by the flames as they raced through surrounding markets and four storey apartment buildings local residents blame vendors of illegal fireworks and say the death toll was exacerbated by poor traffic control in the adjoining narrow street where cars themselves engulfed by fire trapped fleeing victims hospitals have urged the public to donate medicine for the hundreds of burns victims peru president alejandro toledo has cut short his beach holiday to oversee an inquiry»\n", "\n" ] } @@ -523,30 +397,23 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Notice above that the most similar document is has a similarity score of ~80% (or higher). However, the similarity score for the second ranked documents should be significantly lower (assuming the documents are in fact different) and the reasoning becomes obvious when we examine the text itself" ] }, { "cell_type": "code", - "execution_count": 14, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "execution_count": 16, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Train Document (193): «new study shows that nearly one third of the aboriginal and torres strait islander population in australia have been arrested in the past five years the study conducted by the australian national university for the new south wales bureau of crime statistics is the first to compare the arrest rates of the aboriginal and non aboriginal population it finds that unemployment alcohol and assault rates were the main causes study author boyd hunter says policy both on community and government level must deal with these issues if the arrest rate is to be decreased addressing the supply of alcohol in remote communities is seen as the most likely avenue for reducing rates of abuse alcohol abuse and hence reduce arrest rates in those communities he said»\n", + "Train Document (186): «united nationals secretary general kofi annan has accepted the nobel peace prize in the norwegian capital oslo declaring that to save one life is to save humanity itself mr annan told gala audience the world must respect the individual whose fundamental rights he says have been sacrificed too often for the good of the state the year old un chief native of ghana shares this year th nobel peace prize with the united nations as whole his award was for bringing new life to the world body in his fight for human rights and against aids and terrorism»\n", "\n", - "Similar Document (125, 0.5374129414558411): «the united states space shuttle endeavour has touched down at florida kennedy space centre after day mission bringing home crew that had been on the international space station since august the shuttle carrying outgoing space station commander frank culbertson and russian cosmonauts vladimir dezhurov and mikhail tyurin along with four other astronauts landed at pm local time taking over from the trio are russian commander yuri onufrienko and us astronauts carl walz and dan bursch who travelled to the station aboard endeavour on december earlier monday the seven us and russian astronauts on board endeavour woke up to the tune please come home for christmas by the rock group bon jovi on sunday the endeavour crew deployed small satellite called starshine from canister located in the shuttle payload bay more than students from countries will track the satellite as it orbits earth for the next eight months the students will collect information in order to calculate the density of the upper atmosphere nasa said on saturday endeavour undocked from the space station after making last minute maneuver to dodge piece of soviet era space refuse the endeavour mission the th shuttle trip to the international space station brought some three tonnes of equipment and materials for scientific experiments to the station the trip carried out under extremely tight security was the first since the september attacks on the united states»\n", + "Similar Document (207, 0.6752535104751587): «geoff huegill has continued his record breaking ways at the world cup short course swimming in melbourne bettering the australian record in the metres butterfly huegill beat fellow australian michael klim backing up after last night setting world record in the metres butterfly»\n", "\n" ] } @@ -563,46 +430,36 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "## Testing the Model" ] }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "Using the same approach above, we'll infer the vector for a randomly chosen test document, and compare the document to our model by eye." ] }, { "cell_type": "code", - "execution_count": 15, - "metadata": { - "collapsed": false, - "deletable": true, - "editable": true - }, + "execution_count": 17, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Test Document (25): «an islamic high court in northern nigeria rejected an appeal today by single mother sentenced to be stoned to death for having sex out of wedlock clutching her baby daughter amina lawal burst into tears as the judge delivered the ruling lawal was first sentenced in march after giving birth to daughter more than nine months after divorcing»\n", + "Test Document (23): «china said sunday it issued new regulations controlling the export of missile technology taking steps to ease concerns about transferring sensitive equipment to middle east countries particularly iran however the new rules apparently do not ban outright the transfer of specific items something washington long has urged beijing to do»\n", "\n", - "SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/s,d50,hs,w8,mc2):\n", + "SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):\n", "\n", - "MOST (6, 0.44577494263648987): «the united states team of monica seles and jan michael gambill scored decisive victory over unseeded france in their first hopman cup match at burswood dome in perth the pair runners up in the million dollar mixed teams event last year both won their singles encounters to give the us an unbeatable lead the year old seles currently ranked eighth recovered from shaky start to overpower virginie razzano who is ranked nd seles had to fight hard to get home in straight sets winning in minutes then the year old gambill ranked st wore down determined arnaud clement th to win in minutes the americans are aiming to go one better than last year when they were beaten by swiss pair martina hingis and roger federer in the final of the eight nation contest gambill said the win was great way to start the tennis year got little tentative at the end but it was great start to my year he said arnaud is great scrapper and am delighted to beat him even though am frankly bit out of shape that is one of the reasons am here will be in shape by the end of the tournament just aim to keep improving in the new year and if do think have chance to beat anyone when am playing well gambill was pressed hard by clement before taking the first set in minutes but the american gained the ascendancy in the second set breaking in the third and fifth games seles said she had expected her clash with razzano to be tough she was top junior player in the world so it was no surprise that she fought so well she said seles said she still had the hunger to strive to regain her position at the top of her sport this is why you play she said but want to try not to peak too early this season seles slow into her stride slipped to in her opening set against razzano but recovered quickly claiming the set after snatching four games in row in the second set seles broke her opponent in the opening game and completed victory with relative ease despite razzano tenacious efforts»\n", + "MOST (265, 0.42007705569267273): «the federal government is under fire from unions over new departmental report which recommends australia outsource information technology it to india the document says india has low cost skilled workforce the minister for foreign affairs and trade alexander downer has given his support to the document from his department entitled india new economy old economy the report says sectors like it finance and offer attractive direct investment opportunities it also says australian firms could become more competitive by outsourcing to the indian it sector the community and public sector union wendy caird says the government seems to be encouraging local companies to export jobs to india think that quite alarming obviously labour is great deal cheaper in india and that assisted by the indian government removing labour laws and bankruptcy laws ms caird said the union says while the initiative may create jobs in india it will not help australia rising unemployment»\n", "\n", - "MEDIAN (9, 0.07038116455078125): «some roads are closed because of dangerous conditions caused by bushfire smoke motorists are being asked to avoid the hume highway between picton road and the illawarra highway where police have reduced the speed limit from kilometres an hour to in southern sydney picton road is closed between wilton and bulli appin road is closed from appin to bulli tops and all access roads to royal national park are closed motorists are also asked to avoid the illawarra highway between the hume highway and robertson and the great western highway between penrith and springwood because of reduced visibility in north western sydney only local residents are allowed to use wisemans ferry road and upper color road under police escort»\n", + "MEDIAN (257, 0.11956833302974701): «hundreds of fans stood vigil today for the immersion of george harrison ashes into the ganges river at the hindu holy city of benares but officials and sect leaders remained tightlipped on when or where last rites for the former beatle long time devotee of the hindu hare krishna sect would take place he was closely attached to benares where devout hindus come to scatter the ashes of their dead relatives in the ganges in ritual symbolising the journey of the soul towards eternal salvation the beatles former lead guitarist died on thursday of cancer aged amid chants and prayers of hare krishna devotees who were at his bedside according to details of the ceremony released by members of the hare krishna movement yesterday harrison widow olivia accompanied by son dhani were to scatter some of the ashes early this morning in discreet ceremony at hinduism holy river some of harrison ashes could also be immersed in the ganges at allahabad another holy spot for devout hindus about kilometres upstream from benares spokesman for the hare krishna group said tomorrow harrison family members were supposed to take part in special prayer meeting in vrindavan the birthplace of lord krishna km north of the indian capital the news brought hundreds of journalists fans and curious onlookers to benares odd ghats platforms or steps from which the ashes are strewn into the river this morning but as the day wore on local administration officials and hare krishna devotees in benares refused to confirm when and where along the ganges the ceremony would take place»\n", "\n", - "LEAST (180, -0.4328160285949707): «australia has linked million of aid to new agreement with nauru to accept an extra asylum seekers the deal means nauru will take up to asylum seekers under australia pacific solution foreign minister alexander downer signed the understanding today with nauru president rene harris mr downer inspected the nauru camps and says they are are practical and efficient had good look at the sanitation the ablution blocks and thought they were pretty good he said the asylum seekers have various things to do there are volleyball facilities and soccer facilities television is available they can see different channels on tv the catering is good there are three meals day provided»\n", + "LEAST (267, -0.22617124021053314): «israeli prime minister ariel sharon has opened an emergency security cabinet meeting after placing blame for recent suicide attacks squarely on palestinian leader yasser arafat called an urgent meeting of the heads of all the security systems and very shortly the government will hold special session the government will meet in order to make decisions about how to deal further with terrorism he said in national address on public television the government was to discuss its policy on the palestinian authority which mr sharon implied was the enemy of the jewish state and should bear the consequences those who rise up against us to kill us are responsible for their own destruction he said in statement interpreted by palestinian official as call for war arafat has made his strategic choices strategy of terrorism in choosing to try to win political accomplishments through murder and in choosing to allow the ruthless killing of civilians arafat has chosen the path of terrorism mr sharon said the government represents practically the whole of the israel public and we have the paramount goal and need for unity in order to cope with all the brutalities facing us he added tonight we heard declaration of war said chief palestinian negotiator saeb erakat on cnn television sharon has chosen the path of darkness even before his address israeli helicopters and warplanes attacked targets in the west bank and gaza strip including arafat offices and police headquarters in jenin and the palestinian leader three helicopters in gaza city the air strikes were launched on palestinian targets in the wake of weekend suicide attacks by the islamic militant group hamas which left israelis dead meanwhile hamas has defied the palestinian state of emergency and called for more suicide attacks against israel at the funeral of gunman who killed settler more than supporters of the hardline group gathered to bury year old muslim al aarage one of two palestinians who shot the settler dead on sunday in the north of the gaza strip before being killed by israeli soldiers the suicide operations will continue as long as the enemy continues its occupation of palestinian lands in the gaza strip and west bank militant from the group told crowd with loudspeaker when sharon kills women and children our people have the right to defend ourselves then they call us terrorists he said every religion and law in the world gives us the right to defend ourselves he said shortly before the air strikes began security services have arrested some militants from hamas and its smaller rival islamic jihad in the crackdown since sunday human rights group amnesty international has condemned deliberate attacks by the palestinian suicide bombers at the weekend these attacks are horrifying and tragic amnesty said in statement we call on armed groups to end immediately the direct targeting of civilians which contravenes the most fundamental principles of humanity the organisation called on the israeli government and the palestinian authority to remember that no abuses of human rights by armed groups can excuse violations of fundamental human rights and humanitarian law»\n", "\n" ] } @@ -622,10 +479,7 @@ }, { "cell_type": "markdown", - "metadata": { - "deletable": true, - "editable": true - }, + "metadata": {}, "source": [ "### Wrapping Up\n", "\n", @@ -635,23 +489,23 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 2", + "display_name": "Python 3", "language": "python", - "name": "python2" + "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", - "version": 2 + "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" + "pygments_lexer": "ipython3", + "version": "3.6.2" } }, "nbformat": 4, - "nbformat_minor": 0 + "nbformat_minor": 1 }