Merge branch 'master' of https://github.com/piskvorky/gensim into doc…

…2vec-wikipedia
piskvorky · Sep 9, 2016 · 9ec9175
2 parents e4c8622 + 7c118fc
commit 9ec9175
Showing 99 changed files with 15,423 additions and 3,297 deletions.
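
Many of the CHANGELOG hunks in this diff only rewrite release-header dates from DD/MM/YYYY to YYYY-MM-DD. A minimal sketch of that kind of rewrite is shown here; it is not the script used for this commit, and the names `HEADER_RE` and `to_iso_header` are hypothetical, introduced only for illustration.

```python
import re
from datetime import datetime

# Hypothetical pattern: a release header such as "0.12.4, 29/01/2016".
# Everything up to the final ", " is kept as-is; only the date is rewritten.
HEADER_RE = re.compile(r"^(?P<prefix>.+, )(?P<date>\d{2}/\d{2}/\d{4})$")


def to_iso_header(line):
    """Return the header with a YYYY-MM-DD date, or the line unchanged."""
    match = HEADER_RE.match(line)
    if match is None:
        # Not a release header (for example a bullet entry): leave it alone.
        return line
    date = datetime.strptime(match.group("date"), "%d/%m/%Y")
    return match.group("prefix") + date.strftime("%Y-%m-%d")


if __name__ == "__main__":
    for header in (
        "0.12.4, 29/01/2016",
        '0.10.0 (aka "PY3K port"), 04/06/2014',
        "* OSX wheels (#504)",
    ):
        print(to_iso_header(header))
```

Parsing with `strptime` rather than swapping substrings means an impossible date (say, day 31 in a 30-day month) raises an error instead of being silently rewritten.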
diff --git a/.gitignore b/.gitignore
@@ -48,6 +48,7 @@ Thumbs.db
 cython_debug
 docs/src/_build/
 docs/_static
+docs/notebooks/.ipynb_checkpoints
 dedan_gensim.tmproj
 gensim*.egg-info
 *,cover
@@ -66,3 +67,4 @@ data
 *.old
 *.model
 *~
+*_out.txt
diff --git a/.travis.yml b/.travis.yml
@@ -15,5 +15,8 @@ before_install:
 install:
   - conda create --yes -n gensim-test python=$TRAVIS_PYTHON_VERSION pip atlas numpy scipy
   - source activate gensim-test
+  - pip install pyemd
+  - pip install annoy
+  - pip install testfixtures
   - python setup.py install
 script: python setup.py test
diff --git a/CHANGELOG.txt → CHANGELOG.md b/CHANGELOG.txt → CHANGELOG.md
@@ -1,7 +1,63 @@
 Changes
 =======
 
-0.12.4, 29/01/2016
+0.13.2, 2016-08-19
+
+* wordtopics has changed to word_topics in ldamallet, and fixed issue #764. (@bhargavvader, [#771](https://github.com/RaRe-Technologies/gensim/pull/771)) 
+  - assigning wordtopics value of word_topics to keep backward compatibility, for now
+* topics, topn parameters changed to num_topics and num_words in show_topics() and print_topics() (@droudy, [#755](https://github.com/RaRe-Technologies/gensim/pull/755))
+  - In hdpmodel and dtmmodel
+  - NOT BACKWARDS COMPATIBLE!
+* Added random_state parameter to LdaState initializer and check_random_state() (@droudy, [#113](https://github.com/RaRe-Technologies/gensim/pull/113))
+* Topic coherence update with `c_uci`, `c_npmi` measures.  LdaMallet, LdaVowpalWabbit support. Add `topics` parameter to coherencemodel. Can now provide tokenized topics to calculate coherence value. Faster backtracking. (@dsquareindia, [#750](https://github.com/RaRe-Technologies/gensim/pull/750), [#793](https://github.com/RaRe-Technologies/gensim/pull/793))
+* Added a check for empty (no words) documents before running the DTM wrapper when model = "fixed" is used (DIM model), as such documents cause an error when they are reached in training. (@eickho, [#806](https://github.com/RaRe-Technologies/gensim/pull/806))
+* New parameters `limit`, `datatype` for load_word2vec_format(); `lockf` for intersect_word2vec_format (@gojomo, [#817](https://github.com/RaRe-Technologies/gensim/pull/817))
+* Changed `use_lowercase` option in word2vec accuracy to `case_insensitive` to account for case variations in training vocabulary (@jayantj, [#804](https://github.com/RaRe-Technologies/gensim/pull/804))
+* Link to Doc2Vec on airline tweets example in tutorials page (@544895340, [#823](https://github.com/RaRe-Technologies/gensim/pull/823))
+* Small error on Doc2vec notebook tutorial (@charlessutton, [#816](https://github.com/RaRe-Technologies/gensim/pull/816))
+* Bugfix: Full2sparse clipped to use abs value (@tmylk, [#811](https://github.com/RaRe-Technologies/gensim/pull/811))
+* WMD docstring: add tutorial link and query example (@tmylk, [#813](https://github.com/RaRe-Technologies/gensim/pull/813))
+* Annoy integration to speed up word2vec and doc2vec similarity. Tutorial update (@droudy, [#799](https://github.com/RaRe-Technologies/gensim/pull/799), [#792](https://github.com/RaRe-Technologies/gensim/pull/799))
+* Add converter of LDA model between Mallet, Vowpal Wabbit and gensim (@dsquareindia, [#798](https://github.com/RaRe-Technologies/gensim/pull/798), [#766](https://github.com/RaRe-Technologies/gensim/pull/766))
+* Distributed LDA in different network segments without broadcast (@menshikh-iv, [#782](https://github.com/RaRe-Technologies/gensim/pull/782))
+* Update Corpora_and_Vector_Spaces.ipynb (@megansquire, [#772](https://github.com/RaRe-Technologies/gensim/pull/772))
+* DTM wrapper bug fixes caused by renaming num_words in #755 (@bhargavvader,  [#770](https://github.com/RaRe-Technologies/gensim/pull/770))
+* Add LsiModel.docs_processed attribute (@hobson, [#763](https://github.com/RaRe-Technologies/gensim/pull/763))
+* Dynamic Topic Modelling in Python. Google Summer of Code 2016 project. (@bhargavvader, [#739, #831](https://github.com/RaRe-Technologies/gensim/pull/739))
+
+0.13.1, 2016-06-22
+
+* Topic coherence C_v and U_mass (@dsquareindia, #710)
+
+0.13.0, 2016-06-21
+
+* Added Distance Metrics to matutils.py (@bhargavvader, #656)
+* Tutorials migrated from website to ipynb (@j9chan, #721), (@jesford, #733), (@jesford, #725), (@jesford, #716)
+* New doc2vec intro tutorial (@seanlaw, #730)
+* Gensim Quick Start Tutorial (@andrewjlm, #727)
+* Add export_phrases(sentences) to model Phrases (hanabi1224 #588)
+* SparseMatrixSimilarity returns a sparse matrix if `maintain_sparsity` is True (@davechallis, #590)
+* added functionality for Topics of Words in document - i.e., dynamic topics. (@bhargavvader, #704)
+  - also included tutorial which explains new functionalities, and document word-topic coloring.
+* Made normalization an explicit transformation. Added 'l1' norm support (@dsquareindia, #649)
+* added term-topics API for most probable topic for word in vocab. (@bhargavvader, #706)
+* build_vocab takes progress_per parameter for smaller output (@zer0n, #624)
+* Control whether to use lowercase for computing word2vec accuracy. (@alantian, #607)
+* Easy import of GloVe vectors using Gensim (Manas Ranjan Kar, #625)
+  - Allow easy port of GloVe vectors into Gensim
+  - Standalone script with command line arguments, compatible with Python>=2.6 
+  - Usage: python -m gensim.scripts.glove2word2vec -i glove_vectors.txt -o output_word2vec_compatible.txt
+* Add `similar_by_word()` and `similar_by_vector()` to word2vec (@isohyt, #381)
+* Convenience method for similarity of two out of training sentences to doc2vec (@ellolo, #707)
+* Dynamic Topic Modelling Tutorial updated with Dynamic Influence Model (@bhargavvader, #689)
+* Added function to filter 'n' most frequent words from the dictionary (@abhinavchawla, #718)
+* Raise warnings if vocab is single character elements and if alpha is increased in word2vec/doc2vec (@dsquareindia, #705)
+* Tests for wikidump (@jonmcoe, #723)
+* Mallet wrapper sparse format support (@RishabGoel, #664)
+* Doc2vec pre-processing script translated from bash to Python (@andrewjlm, #720)
+
+
+0.12.4, 2016-01-29
 
 * Better internal handling of job batching in word2vec (#535)
   - up to 300% speed up when training on very short documents (~tweets)
@@ -32,7 +88,7 @@ Changes
     chunks_as_numpy=True/False (defaults to False) that allows controlling
     this behaviour
 
-0.12.3, 05/11/2015
+0.12.3, 2015-11-05
 
 * Make show_topics return value consistent across models (Christopher Corley, #448)
   - All models with the `show_topics` method should return a list of
@@ -52,7 +108,7 @@ Changes
 * OSX wheels (#504)
 * Win build (#492)
 
-0.12.2, 19/09/2015
+0.12.2, 2015-09-19
 
 * tutorial on text summarization (Ólavur Mortensen, #436)
 * more flexible vocabulary construction in word2vec & doc2vec (Philipp Dowling, #434)
@@ -63,7 +119,7 @@ Changes
 * Windows fix for setup.py (#428)
 * fix compatibility for scipy 0.16.0 (#415)
 
-0.12.1, 20/07/2015
+0.12.1, 2015-07-20
 
 * improvements to testing, switch to Travis CI containers
 * support for loading old word2vec models (<=0.11.1) in 0.12+ (Gordon Mohr, #405)
@@ -72,7 +128,7 @@ Changes
 * support for word2vec[['word1', 'word2'...]] convenience API calls (Satish Palaniappan, #395)
 * MatrixSimilarity supports indexing generator corpora (single pass)
 
-0.12.0, 06/07/2015
+0.12.0, 2015-07-06
 
 * complete API, performance, memory overhaul of doc2vec (Gordon Mohr, #356, #373, #380, #384)
   - fast infer_vector(); optional memory-mapped doc vectors; memory savings with int doc IDs
@@ -100,7 +156,7 @@ Changes
 * various doc improvements and fixes (Matti Lyra #331, Hongjoo Lee #334)
 * fixes and improvements to LDA (Christopher Corley #323)
 
-0.11.0 = 0.11.1 = 0.11.1-1, 10/04/2015
+0.11.0 = 0.11.1 = 0.11.1-1, 2015-04-10
 
 * added "topic ranking" to sort topics by coherence in LdaModel (jtmcmc, #311)
 * new fast ShardedCorpus out-of-core corpus (Jan Hajic jr., #284)
@@ -114,7 +170,7 @@ Changes
 * lots of small fixes & py3k compatibility improvements (Chyi-Kwei Yau, Daniel Nouri, Timothy Emerick, Juarez Bochi, Christopher Corley, Chirag Nagpal, Jan Hajic jr., Flávio Codeço Coelho)
 * re-released as 0.11.1 and 0.11.1-1 because of a packaging bug
 
-0.10.3, 17/11/2014
+0.10.3, 2014-11-17
 
 * added streamed phrases = collocation detection (Miguel Cabrera, #258)
 * added param for multiple word2vec epochs (sebastienj, #243)
@@ -126,7 +182,7 @@ Changes
 * fixes to setup.py (Maxim Avanov and Christopher Corley, #260, #251)
 * ...and lots of minor fixes & updates all around
 
-0.10.2, 18/09/2014
+0.10.2, 2014-09-18
 
 * new parallelized, LdaMulticore implementation (Jan Zikes, #232)
 * Dynamic Topic Models (DTM) wrapper (Arttii, #205)
@@ -138,7 +194,7 @@ Changes
 * py3k fix to SparseCorpus (Andreas Madsen, #234)
 * fix to LowCorpus when switching dictionaries (Christopher Corley, #237)
 
-0.10.1, 22/07/2014
+0.10.1, 2014-07-22
 
 * word2vec: new n_similarity method for comparing two sets of words (François Scharffe, #219)
 * make LDA print/show topics parameters consistent with LSI (Bram Vandekerckhove, #201)
@@ -153,7 +209,7 @@ Changes
 * ignore non-articles during wiki parsing
 * utils.lemmatize now (optionally) ignores stopwords
 
-0.10.0 (aka "PY3K port"), 04/06/2014
+0.10.0 (aka "PY3K port"), 2014-06-04
 
 * full Python 3 support (targeting 3.3+, #196)
 * all internal methods now expect & store unicode, instead of utf8
@@ -164,7 +220,7 @@ Changes
 * added py3.3 and 3.4 to Travis CI tests
 * fix a cbow word2vec bug (Liang-Chi Hsieh)
 
-0.9.1, 12/04/2014
+0.9.1, 2014-04-12
 
 * MmCorpus fix for Windows
 * LdaMallet support for printing/showing topics
@@ -174,7 +230,7 @@ Changes
 * more py3k fixes (Lars Buitinck)
 * change order of LDA topic printing (Fayimora Femi-Balogun, #188)
 
-0.9.0, 16/03/2014
+0.9.0, 2014-03-16
 
 * save/load automatically single out large arrays + allow mmap
 * allow .gz/.bz2 corpus filenames => transparently (de)compressed I/O
@@ -192,7 +248,7 @@ Changes
 * parametrize LDA constructor (Christopher Corley, #174)
 * steps toward py3k compatibility (Lars Buitinck, #154)
 
-0.8.9, 26/12/2013
+0.8.9, 2013-12-26
 
 * use travis-ci for continuous integration
 * auto-optimize LDA asymmetric prior (Ben Trahan)
@@ -204,7 +260,7 @@ Changes
 * allow compressed input in LineSentence corpus (Eric Moyer)
 * upgrade ez_setup, doc improvements, minor fixes etc.
 
-0.8.8 (aka "word2vec release"), 03/11/2013
+0.8.8 (aka "word2vec release"), 2013-11-03
 
 * python3 port by Parikshit Samant: https://github.com/samantp/gensimPy3
 * massive optimizations to word2vec (cython, BLAS, multithreading): ~20x-300x speedup
@@ -214,7 +270,7 @@ Changes
 * add context manager support for older Python<=2.6 for gzip and bz2
 * added unittests for word2vec
 
-0.8.7, 18/09/2013
+0.8.7, 2013-09-18
 
 * initial version of word2vec, a neural network deep learning algo
 * make distributed gensim compatible with the new Pyro
@@ -232,23 +288,23 @@ Changes
 * fixes for more robust Windows multiprocessing
 * lots of small fixes, data checks and documentation updates
 
-0.8.6, 15/09/2012
+0.8.6, 2012-09-15
 
 * added HashDictionary (by Homer Strong)
 * support for adding target classes in SVMlight format (by Corrado Monti)
 * fixed problems with global lemmatizer object when running in parallel on Windows
 * parallelization of Wikipedia processing + added script version that lemmatizes the input documents
 * added class method to initialize Dictionary from an existing corpus (by Marko Burjek)
 
-0.8.5, 22/07/2012
+0.8.5, 2012-07-22
 
 * improved performance of sharding (similarity queries)
 * better Wikipedia parsing (thx to Alejandro Weinstein and Lars Buitinck)
 * faster Porter stemmer (thx to Lars Buitinck)
 * several minor fixes (in HDP model thx to Greg Ver Steeg)
 * improvements to documentation
 
-0.8.4, 09/03/2012
+0.8.4, 2012-03-09
 
 * better support for Pandas series input (thx to JT Bates)
 * a new corpus format: UCI bag-of-words (thx to Jonathan Esterhazy)
@@ -257,13 +313,13 @@ Changes
 * lemmatizer support for wikipedia parsing (via the `pattern` python package)
 * extended the lemmatizer for multi-core processing, to improve its performance
 
-0.8.3, 02/12/2011
+0.8.3, 2011-12-02
 
 * fixed Similarity sharding bug (issue #65, thx to Paul Rudin)
 * improved LDA code (clarity & memory footprint)
 * optimized efficiency of Similarity sharding
 
-0.8.2, 31/10/2011
+0.8.2, 2011-10-31
 
 * improved gensim landing page
 * improved accuracy of SVD (Latent Semantic Analysis) (thx to Mark Tygert)
@@ -272,7 +328,7 @@ Changes
 * started using `tox` for testing
 * + several smaller fixes and optimizations
 
-0.8.1, 10/10/2011
+0.8.1, 2011-10-10
 
 * transactional similarity server: see docs/simserver.html
 * website moved from university hosting to radimrehurek.com
@@ -283,22 +339,22 @@ Changes
 * model.print_topics() debug fncs now support std output, in addition to logging (thx to Homer Strong)
 * several smaller fixes and improvements
 
-0.8.0 (Armageddon), 28/06/2011
+0.8.0 (Armageddon), 2011-06-28
 
 * changed all variable and function names to comply with PEP8 (numTopics->num_topics): BREAKS BACKWARD COMPATIBILITY!
 * added support for similarity querying more documents at once (index[query_documents] in addition to index[query_document]; much faster)
 * rewrote Similarity so that it is more efficient and scalable (using disk-based mmap'ed shards)
 * simplified directory structure (src/gensim/ is now only gensim/)
 * several small fixes and optimizations
 
-0.7.8, 26/03/2011
+0.7.8, 2011-03-26
 
 * added `corpora.IndexedCorpus`, a base class for corpus serializers (thx to Dieter Plaetinck). This allows corpus formats that inherit from it (MmCorpus, SvmLightCorpus, BleiCorpus etc.) to retrieve individual documents by their id in O(1), e.g. `corpus[14]` returns document #14.
 * merged new code from the LarKC.eu team (`corpora.textcorpus`, `models.logentropy_model`, lots of unit tests etc.)
 * fixed a bug in `lda[bow]` transformation (was returning gamma distribution instead of theta). LDA model generation was not affected, only transforming new vectors.
 * several small fixes and documentation updates
 
-0.7.7, 13/02/2011
+0.7.7, 2011-02-13
 
 * new LDA implementation after Hoffman et al.: Online Learning for Latent Dirichlet Allocation
 * distributed LDA
@@ -307,12 +363,12 @@ Changes
 * moved code to github
 * started gensim Google group
 
-0.7.6, 10/01/2011
+0.7.6, 2011-01-10
 
 * added workaround for a bug in numpy: pickling a fortran-order array (e.g. LSA model) and then loading it back and using it results in segfault (thx to Brian Merrel)
 * bundled a new version of ez_setup.py: old failed with Python2.6 when setuptools were missing (thx to Alan Salmoni).
 
-0.7.5, 03/11/2010
+0.7.5, 2010-11-03
 
 * further optimization to LSA; this is the version used in my NIPS workshop paper
 * got rid of SVDLIBC dependency (one-pass LSA now uses stochastic algo for base-base decompositions)