
Commit

Fix merge conflict
tmylk committed Feb 17, 2017
2 parents 67b1a17 + d692db4 commit df13670
Showing 23 changed files with 7,794 additions and 163 deletions.
4 changes: 1 addition & 3 deletions .travis.yml
@@ -2,10 +2,7 @@ sudo: false
 dist: trusty
 language: python
 python:
-  - "2.6"
   - "2.7"
-  - "3.3"
-  - "3.4"
   - "3.5"
   - "3.6"
 before_install:
@@ -21,5 +18,6 @@ install:
   - pip install annoy
   - pip install testfixtures
   - pip install unittest2
+  - pip install Morfessor==2.0.2a4
   - python setup.py install
 script: python setup.py test
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,13 @@ Changes

 Unreleased:
 
+1.0.0RC2, 2017-02-16
+
+* Add note about Annoy speed depending on numpy BLAS setup in annoytutorial.ipynb (@greninja, [#1137](https://github.com/RaRe-Technologies/gensim/pull/1137))
+* Remove direct access to properties moved to KeyedVectors (@tmylk, [#1147](https://github.com/RaRe-Technologies/gensim/pull/1147))
+* Remove support for Python 2.6, 3.3 and 3.4 (@tmylk, [#1145](https://github.com/RaRe-Technologies/gensim/pull/1145))
+* Write UTF-8 byte strings in tensorboard conversion (@tmylk, [#1144](https://github.com/RaRe-Technologies/gensim/pull/1144))
+* Make top_topics and sparse2full compatible with numpy 1.12 strict integer indexing (@tmylk, [#1146](https://github.com/RaRe-Technologies/gensim/pull/1146))
+
 1.0.0RC1, 2017-01-31

17 changes: 12 additions & 5 deletions appveyor.yml
@@ -14,21 +14,28 @@ environment:

 matrix:
   - PYTHON: "C:\\Python27"
-    PYTHON_VERSION: "2.7.8"
+    PYTHON_VERSION: "2.7.12"
     PYTHON_ARCH: "32"

   - PYTHON: "C:\\Python27-x64"
-    PYTHON_VERSION: "2.7.8"
+    PYTHON_VERSION: "2.7.12"
     PYTHON_ARCH: "64"

   - PYTHON: "C:\\Python35"
-    PYTHON_VERSION: "3.5.0"
+    PYTHON_VERSION: "3.5.2"
     PYTHON_ARCH: "32"

   - PYTHON: "C:\\Python35-x64"
-    PYTHON_VERSION: "3.5.0"
+    PYTHON_VERSION: "3.5.2"
     PYTHON_ARCH: "64"

+  - PYTHON: "C:\\Python36"
+    PYTHON_VERSION: "3.6.0"
+    PYTHON_ARCH: "32"
+
+  - PYTHON: "C:\\Python36-x64"
+    PYTHON_VERSION: "3.6.0"
+    PYTHON_ARCH: "64"


 install:
@@ -59,7 +66,7 @@ test_script:
   # installed library.
   - "mkdir empty_folder"
   - "cd empty_folder"
-  - "pip install pyemd testfixtures unittest2"
+  - "pip install pyemd testfixtures unittest2 Morfessor==2.0.2a4"

   - "python -c \"import nose; nose.main()\" -s -v gensim"
   # Move back to the project folder
163 changes: 163 additions & 0 deletions docs/notebooks/Varembed.ipynb
@@ -0,0 +1,163 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# VarEmbed Tutorial\n",
"\n",
"Varembed is a word embedding model incorporating morphological information, capturing shared sub-word features. Unlike previous work that constructs word embeddings directly from morphemes, varembed combines morphological and distributional information in a unified probabilistic framework. Varembed thus yields improvements on intrinsic word similarity evaluations. Check out the original paper, [arXiv:1608.01056](https://arxiv.org/abs/1608.01056) accepted in [EMNLP 2016](http://www.emnlp2016.net/accepted-papers.html).\n",
"\n",
"Varembed is now integrated into [Gensim](http://radimrehurek.com/gensim/) providing ability to load already trained varembed models into gensim with additional functionalities over word vectors already present in gensim.\n",
"\n",
"# This Tutorial\n",
"\n",
"In this tutorial you will learn how to train, load and evaluate varembed model on your data.\n",
"\n",
"# Train Model\n",
"\n",
"The authors provide their code to train a varembed model. Checkout the repository [MorphologicalPriorsForWordEmbeddings](https://github.com/rguthrie3/MorphologicalPriorsForWordEmbeddings) for to train a varembed model. You'll need to use that code if you want to train a model. \n",
"\n",
"# Load Varembed Model\n",
"\n",
"Now that you have an already trained varembed model, you can easily load the varembed word vectors directly into Gensim. <br>\n",
"For that, you need to provide the path to the word vectors pickle file generated after you train the model and run the script to [package varembed embeddings](https://github.com/rguthrie3/MorphologicalPriorsForWordEmbeddings/blob/master/package_embeddings.py) provided in the [varembed source code repository](https://github.com/rguthrie3/MorphologicalPriorsForWordEmbeddings).\n",
"\n",
"We'll use a varembed model trained on [Lee Corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee.cor) as the vocabulary, which is already available in gensim.\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models.wrappers import varembed\n",
"\n",
"vector_file = '../../gensim/test/test_data/varembed_leecorpus_vectors.pkl'\n",
"model = varembed.VarEmbed.load_varembed_format(vectors=vector_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This loads a varembed model into Gensim. Also if you want to load with morphemes added into the varembed vectors, you just need to also provide the path to the trained morfessor model binary as an argument. This works as an optional parameter, if not provided, it would just load the varembed vectors without morphemes."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"morfessor_file = '../../gensim/test/test_data/varembed_leecorpus_morfessor.bin'\n",
"model_with_morphemes = varembed.VarEmbed.load_varembed_format(vectors=vector_file, morfessor_model=morfessor_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This helps load trained varembed models into Gensim. Now you can use this for any of the Keyed Vector functionalities, like 'most_similar', 'similarity' and so on, already provided in gensim. \n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u'launch', 0.2694973647594452),\n",
" (u'again', 0.2564533054828644),\n",
" (u'gun', 0.2521245777606964),\n",
" (u'response', 0.24817466735839844),\n",
" (u'swimming', 0.23348823189735413),\n",
" (u'bombings', 0.23146548867225647),\n",
" (u'transformed', 0.2289058119058609),\n",
" (u'used', 0.2224646955728531),\n",
" (u'weeks,', 0.21905183792114258),\n",
" (u'scheduled', 0.2170265018939972)]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.most_similar('government')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.022313305789051038"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.similarity('peace', 'grim')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusion\n",
"In this tutorial, we learnt how to load already trained varembed models vectors into gensim and easily use and evaluate it. That's it!\n",
"\n",
"# Resources\n",
"\n",
"* [Varembed Source Code](https://github.com/rguthrie3/MorphologicalPriorsForWordEmbeddings)\n",
"* [Gensim](http://radimrehurek.com/gensim/)\n",
"* [Lee Corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee.cor)\n"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
4 changes: 3 additions & 1 deletion docs/notebooks/annoytutorial.ipynb
@@ -177,7 +177,9 @@
"\n",
"**This speedup factor is by no means constant** and will vary greatly from run to run and is particular to this data set, BLAS setup, Annoy parameters(as tree size increases speedup factor decreases), machine specifications, among other factors.\n",
"\n",
">**Note**: Initialization time for the annoy indexer was not included in the times. The optimal knn algorithm for you to use will depend on how many queries you need to make and the size of the corpus. If you are making very few similarity queries, the time taken to initialize the annoy indexer will be longer than the time it would take the brute force method to retrieve results. If you are making many queries however, the time it takes to initialize the annoy indexer will be made up for by the incredibly fast retrieval times for queries once the indexer has been initialized"
">**Note**: Initialization time for the annoy indexer was not included in the times. The optimal knn algorithm for you to use will depend on how many queries you need to make and the size of the corpus. If you are making very few similarity queries, the time taken to initialize the annoy indexer will be longer than the time it would take the brute force method to retrieve results. If you are making many queries however, the time it takes to initialize the annoy indexer will be made up for by the incredibly fast retrieval times for queries once the indexer has been initialized\n",
"\n",
">**Note** : Gensim's 'most_similar' method is using numpy operations in the form of dot product whereas Annoy's method isnt. If 'numpy' on your machine is using one of the BLAS libraries like ATLAS or LAPACK, it'll run on multiple cores(only if your machine has multicore support ). Check [SciPy Cookbook](http://scipy-cookbook.readthedocs.io/items/ParallelProgramming.html) for more details."
]
},
{
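Aside: the new note above hinges on which BLAS your numpy build is linked against. A minimal way to check is sketched below (the output format varies across numpy versions and installs):

```python
import numpy as np

# Prints the BLAS/LAPACK build configuration numpy was compiled against.
# An optimized, multithreaded BLAS (e.g. OpenBLAS, MKL, ATLAS) is what lets
# most_similar's underlying dot products run on multiple cores.
np.__config__.show()
```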
8 changes: 7 additions & 1 deletion docs/notebooks/doc2vec-IMDB.ipynb
@@ -13,7 +13,13 @@
"source": [
"TODO: section on introduction & motivation\n",
"\n",
"TODO: prerequisites + dependencies (statsmodels, patsy, ?)"
"TODO: prerequisites + dependencies (statsmodels, patsy, ?)\n",
"\n",
"### Requirements\n",
"Following are the dependencies for this tutorial:\n",
" - testfixtures\n",
" - statsmodels\n",
" "
]
},
{
Expand Down
1 change: 1 addition & 0 deletions docs/src/apiref.rst
@@ -45,6 +45,7 @@ Modules:
     models/wrappers/dtmmodel
     models/wrappers/ldavowpalwabbit.rst
     models/wrappers/wordrank
+    models/wrappers/varembed
     similarities/docsim
     similarities/index
     topic_coherence/aggregation
2 changes: 1 addition & 1 deletion docs/src/conf.py
@@ -54,7 +54,7 @@
 # The short X.Y version.
 version = '1.0'
 # The full version, including alpha/beta/rc tags.
-release = '1.0.0rc1'
+release = '1.0.0rc2'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.
9 changes: 9 additions & 0 deletions docs/src/models/wrappers/varembed.rst
@@ -0,0 +1,9 @@
:mod:`models.wrappers.varembed` -- VarEmbed Word Embeddings
================================================================================================

.. automodule:: gensim.models.wrappers.varembed
:synopsis: VarEmbed Word Embeddings
:members:
:inherited-members:
:undoc-members:
:show-inheritance:
3 changes: 3 additions & 0 deletions gensim/matutils.py
@@ -206,6 +206,9 @@ def sparse2full(doc, length):
"""
result = np.zeros(length, dtype=np.float32) # fill with zeroes (default value)
# convert indices to int as numpy 1.12 no longer indexes by floats
doc = ((int(id_), float(val_)) for (id_, val_) in doc)

doc = dict(doc)
# overwrite some of the zeroes with explicit values
result[list(doc)] = list(itervalues(doc))
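Aside: a minimal sketch of the numpy 1.12 behavior this hunk works around (a standalone illustration, not gensim code; the `doc` list here is hypothetical):

```python
import numpy as np

result = np.zeros(5, dtype=np.float32)
doc = [(1.0, 0.5), (3.0, 0.7)]  # ids can arrive as floats from some corpora

# Under numpy >= 1.12, indexing with floats raises IndexError:
# result[[id_ for id_, _ in doc]] = [val_ for _, val_ in doc]

# Casting the ids to int first, as sparse2full now does, avoids the error:
d = {int(id_): float(val_) for id_, val_ in doc}
result[list(d)] = list(d.values())
# result is now [0., 0.5, 0., 0.7, 0.]
```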
2 changes: 1 addition & 1 deletion gensim/models/ldamodel.py
@@ -862,7 +862,7 @@ def top_topics(self, corpus, num_words=20):
         for m in top_words[1:]:
             # m_docs is v_m^(t)
             m_docs = doc_word_list[m]
-            m_index = np.where(top_words == m)[0]
+            m_index = np.where(top_words == m)[0][0]
 
             # Sum of top words l=1..m
             # i.e., all words ranked higher than the current word m
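Aside: a small sketch of why the extra `[0]` matters under numpy 1.12's stricter indexing (a standalone illustration, not gensim code):

```python
import numpy as np

top_words = np.array([17, 4, 23, 8])
m = 23

# np.where returns a tuple of arrays; one [0] selects the first axis,
# giving a 1-element array rather than a scalar.
idx_arr = np.where(top_words == m)[0]     # array([2])
idx_int = np.where(top_words == m)[0][0]  # 2, a plain integer

# Code that needs a scalar index, e.g. slicing all higher-ranked words,
# requires the plain integer under the strict indexing rules:
print(top_words[:idx_int])  # [17  4]
```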
