
Commit

Fix merge conflict
tmylk committed Feb 17, 2017
2 parents 67b1a17 + d692db4 commit df13670
Showing 23 changed files with 7,794 additions and 163 deletions.
4 changes: 1 addition & 3 deletions .travis.yml
@@ -2,10 +2,7 @@ sudo: false
 dist: trusty
 language: python
 python:
-  - "2.6"
   - "2.7"
-  - "3.3"
-  - "3.4"
   - "3.5"
   - "3.6"
 before_install:
@@ -21,5 +18,6 @@ install:
   - pip install annoy
   - pip install testfixtures
   - pip install unittest2
+  - pip install Morfessor==2.0.2a4
   - python setup.py install
 script: python setup.py test
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,13 @@ Changes

 Unreleased:
 
+1.0.0RC2, 2017-02-16
+
+* Add note about Annoy speed depending on numpy BLAS setup in annoytutorial.ipynb (@greninja, [#1137](https://github.com/RaRe-Technologies/gensim/pull/1137))
+* Remove direct access to properties moved to KeyedVectors (@tmylk, [#1147](https://github.com/RaRe-Technologies/gensim/pull/1147))
+* Remove support for Python 2.6, 3.3 and 3.4 (@tmylk, [#1145](https://github.com/RaRe-Technologies/gensim/pull/1145))
+* Write UTF-8 byte strings in tensorboard conversion (@tmylk, [#1144](https://github.com/RaRe-Technologies/gensim/pull/1144))
+* Make top_topics and sparse2full compatible with numpy 1.12 strict integer indexing (@tmylk, [#1146](https://github.com/RaRe-Technologies/gensim/pull/1146))
+
 1.0.0RC1, 2017-01-31

17 changes: 12 additions & 5 deletions appveyor.yml
@@ -14,21 +14,28 @@ environment:

 matrix:
   - PYTHON: "C:\\Python27"
-    PYTHON_VERSION: "2.7.8"
+    PYTHON_VERSION: "2.7.12"
     PYTHON_ARCH: "32"

   - PYTHON: "C:\\Python27-x64"
-    PYTHON_VERSION: "2.7.8"
+    PYTHON_VERSION: "2.7.12"
     PYTHON_ARCH: "64"

   - PYTHON: "C:\\Python35"
-    PYTHON_VERSION: "3.5.0"
+    PYTHON_VERSION: "3.5.2"
     PYTHON_ARCH: "32"

   - PYTHON: "C:\\Python35-x64"
-    PYTHON_VERSION: "3.5.0"
+    PYTHON_VERSION: "3.5.2"
     PYTHON_ARCH: "64"

+  - PYTHON: "C:\\Python36"
+    PYTHON_VERSION: "3.6.0"
+    PYTHON_ARCH: "32"
+
+  - PYTHON: "C:\\Python36-x64"
+    PYTHON_VERSION: "3.6.0"
+    PYTHON_ARCH: "64"


 install:
@@ -59,7 +66,7 @@ test_script:
   # installed library.
   - "mkdir empty_folder"
   - "cd empty_folder"
-  - "pip install pyemd testfixtures unittest2"
+  - "pip install pyemd testfixtures unittest2 Morfessor==2.0.2a4"

   - "python -c \"import nose; nose.main()\" -s -v gensim"
   # Move back to the project folder
163 changes: 163 additions & 0 deletions docs/notebooks/Varembed.ipynb
@@ -0,0 +1,163 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# VarEmbed Tutorial\n",
"\n",
"Varembed is a word embedding model incorporating morphological information, capturing shared sub-word features. Unlike previous work that constructs word embeddings directly from morphemes, varembed combines morphological and distributional information in a unified probabilistic framework. Varembed thus yields improvements on intrinsic word similarity evaluations. Check out the original paper, [arXiv:1608.01056](https://arxiv.org/abs/1608.01056) accepted in [EMNLP 2016](http://www.emnlp2016.net/accepted-papers.html).\n",
"\n",
"Varembed is now integrated into [Gensim](http://radimrehurek.com/gensim/) providing ability to load already trained varembed models into gensim with additional functionalities over word vectors already present in gensim.\n",
"\n",
"# This Tutorial\n",
"\n",
"In this tutorial you will learn how to train, load and evaluate varembed model on your data.\n",
"\n",
"# Train Model\n",
"\n",
"The authors provide their code to train a varembed model. Checkout the repository [MorphologicalPriorsForWordEmbeddings](https://github.com/rguthrie3/MorphologicalPriorsForWordEmbeddings) for to train a varembed model. You'll need to use that code if you want to train a model. \n",
"\n",
"# Load Varembed Model\n",
"\n",
"Now that you have an already trained varembed model, you can easily load the varembed word vectors directly into Gensim. <br>\n",
"For that, you need to provide the path to the word vectors pickle file generated after you train the model and run the script to [package varembed embeddings](https://github.com/rguthrie3/MorphologicalPriorsForWordEmbeddings/blob/master/package_embeddings.py) provided in the [varembed source code repository](https://github.com/rguthrie3/MorphologicalPriorsForWordEmbeddings).\n",
"\n",
"We'll use a varembed model trained on [Lee Corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee.cor) as the vocabulary, which is already available in gensim.\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models.wrappers import varembed\n",
"\n",
"vector_file = '../../gensim/test/test_data/varembed_leecorpus_vectors.pkl'\n",
"model = varembed.VarEmbed.load_varembed_format(vectors=vector_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This loads a varembed model into Gensim. Also if you want to load with morphemes added into the varembed vectors, you just need to also provide the path to the trained morfessor model binary as an argument. This works as an optional parameter, if not provided, it would just load the varembed vectors without morphemes."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"morfessor_file = '../../gensim/test/test_data/varembed_leecorpus_morfessor.bin'\n",
"model_with_morphemes = varembed.VarEmbed.load_varembed_format(vectors=vector_file, morfessor_model=morfessor_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This helps load trained varembed models into Gensim. Now you can use this for any of the Keyed Vector functionalities, like 'most_similar', 'similarity' and so on, already provided in gensim. \n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u'launch', 0.2694973647594452),\n",
" (u'again', 0.2564533054828644),\n",
" (u'gun', 0.2521245777606964),\n",
" (u'response', 0.24817466735839844),\n",
" (u'swimming', 0.23348823189735413),\n",
" (u'bombings', 0.23146548867225647),\n",
" (u'transformed', 0.2289058119058609),\n",
" (u'used', 0.2224646955728531),\n",
" (u'weeks,', 0.21905183792114258),\n",
" (u'scheduled', 0.2170265018939972)]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.most_similar('government')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.022313305789051038"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.similarity('peace', 'grim')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusion\n",
"In this tutorial, we learnt how to load already trained varembed models vectors into gensim and easily use and evaluate it. That's it!\n",
"\n",
"# Resources\n",
"\n",
"* [Varembed Source Code](https://github.com/rguthrie3/MorphologicalPriorsForWordEmbeddings)\n",
"* [Gensim](http://radimrehurek.com/gensim/)\n",
"* [Lee Corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee.cor)\n"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
4 changes: 3 additions & 1 deletion docs/notebooks/annoytutorial.ipynb
@@ -177,7 +177,9 @@
"\n",
"**This speedup factor is by no means constant** and will vary greatly from run to run and is particular to this data set, BLAS setup, Annoy parameters(as tree size increases speedup factor decreases), machine specifications, among other factors.\n",
"\n",
">**Note**: Initialization time for the annoy indexer was not included in the times. The optimal knn algorithm for you to use will depend on how many queries you need to make and the size of the corpus. If you are making very few similarity queries, the time taken to initialize the annoy indexer will be longer than the time it would take the brute force method to retrieve results. If you are making many queries however, the time it takes to initialize the annoy indexer will be made up for by the incredibly fast retrieval times for queries once the indexer has been initialized"
">**Note**: Initialization time for the annoy indexer was not included in the times. The optimal knn algorithm for you to use will depend on how many queries you need to make and the size of the corpus. If you are making very few similarity queries, the time taken to initialize the annoy indexer will be longer than the time it would take the brute force method to retrieve results. If you are making many queries however, the time it takes to initialize the annoy indexer will be made up for by the incredibly fast retrieval times for queries once the indexer has been initialized\n",
"\n",
">**Note** : Gensim's 'most_similar' method is using numpy operations in the form of dot product whereas Annoy's method isnt. If 'numpy' on your machine is using one of the BLAS libraries like ATLAS or LAPACK, it'll run on multiple cores(only if your machine has multicore support ). Check [SciPy Cookbook](http://scipy-cookbook.readthedocs.io/items/ParallelProgramming.html) for more details."
]
},
{
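Aside: the new note above hinges on which BLAS your numpy build is linked against. A minimal way to check is sketched below (the output format varies across numpy versions and installs):

```python
import numpy as np

# Prints the BLAS/LAPACK build configuration numpy was compiled against.
# An optimized, multithreaded BLAS (e.g. OpenBLAS, MKL, ATLAS) is what lets
# most_similar's underlying dot products run on multiple cores.
np.__config__.show()
```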
8 changes: 7 additions & 1 deletion docs/notebooks/doc2vec-IMDB.ipynb
@@ -13,7 +13,13 @@
"source": [
"TODO: section on introduction & motivation\n",
"\n",
"TODO: prerequisites + dependencies (statsmodels, patsy, ?)"
"TODO: prerequisites + dependencies (statsmodels, patsy, ?)\n",
"\n",
"### Requirements\n",
"Following are the dependencies for this tutorial:\n",
" - testfixtures\n",
" - statsmodels\n",
" "
]
},
{
Expand Down
1 change: 1 addition & 0 deletions docs/src/apiref.rst
@@ -45,6 +45,7 @@ Modules:
     models/wrappers/dtmmodel
     models/wrappers/ldavowpalwabbit.rst
     models/wrappers/wordrank
+    models/wrappers/varembed
     similarities/docsim
     similarities/index
     topic_coherence/aggregation
2 changes: 1 addition & 1 deletion docs/src/conf.py
@@ -54,7 +54,7 @@
 # The short X.Y version.
 version = '1.0'
 # The full version, including alpha/beta/rc tags.
-release = '1.0.0rc1'
+release = '1.0.0rc2'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.
9 changes: 9 additions & 0 deletions docs/src/models/wrappers/varembed.rst
@@ -0,0 +1,9 @@
:mod:`models.wrappers.varembed` -- VarEmbed Word Embeddings
================================================================================================

.. automodule:: gensim.models.wrappers.varembed
:synopsis: VarEmbed Word Embeddings
:members:
:inherited-members:
:undoc-members:
:show-inheritance:
3 changes: 3 additions & 0 deletions gensim/matutils.py
@@ -206,6 +206,9 @@ def sparse2full(doc, length):
"""
result = np.zeros(length, dtype=np.float32) # fill with zeroes (default value)
# convert indices to int as numpy 1.12 no longer indexes by floats
doc = ((int(id_), float(val_)) for (id_, val_) in doc)

doc = dict(doc)
# overwrite some of the zeroes with explicit values
result[list(doc)] = list(itervalues(doc))
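Aside: a minimal sketch of the numpy 1.12 behavior this hunk works around (a standalone illustration, not gensim code; the `doc` list here is hypothetical):

```python
import numpy as np

result = np.zeros(5, dtype=np.float32)
doc = [(1.0, 0.5), (3.0, 0.7)]  # ids can arrive as floats from some corpora

# Under numpy >= 1.12, indexing with floats raises IndexError:
# result[[id_ for id_, _ in doc]] = [val_ for _, val_ in doc]

# Casting the ids to int first, as sparse2full now does, avoids the error:
d = {int(id_): float(val_) for id_, val_ in doc}
result[list(d)] = list(d.values())
# result is now [0., 0.5, 0., 0.7, 0.]
```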
2 changes: 1 addition & 1 deletion gensim/models/ldamodel.py
@@ -862,7 +862,7 @@ def top_topics(self, corpus, num_words=20):
         for m in top_words[1:]:
             # m_docs is v_m^(t)
             m_docs = doc_word_list[m]
-            m_index = np.where(top_words == m)[0]
+            m_index = np.where(top_words == m)[0][0]
 
             # Sum of top words l=1..m
             # i.e., all words ranked higher than the current word m
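Aside: a small sketch of why the extra `[0]` matters under numpy 1.12's stricter indexing (a standalone illustration, not gensim code):

```python
import numpy as np

top_words = np.array([17, 4, 23, 8])
m = 23

# np.where returns a tuple of arrays; one [0] selects the first axis,
# giving a 1-element array rather than a scalar.
idx_arr = np.where(top_words == m)[0]     # array([2])
idx_int = np.where(top_words == m)[0][0]  # 2, a plain integer

# Code that needs a scalar index, e.g. slicing all higher-ranked words,
# requires the plain integer under the strict indexing rules:
print(top_words[:idx_int])  # [17  4]
```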
