Releases: piskvorky/gensim
1.0.1 Bug-fix release
1.0.0 Author-Topic modelling
1.0.0, 2017-02-24
Deprecated methods:
In order to share word vector querying code between different training algorithms (Word2Vec, FastText, WordRank, VarEmbed), we have separated storage and querying of word vectors into a separate class, KeyedVectors.
Two methods and several attributes of the Word2Vec class have been deprecated. The methods are load_word2vec_format and save_word2vec_format. The attributes are syn0norm, syn0, vocab and index2word. They have been moved to the KeyedVectors class.
After upgrading to this release you might get exceptions about deprecated methods or missing attributes.
DeprecationWarning: Deprecated. Use model.wv.save_word2vec_format instead.
AttributeError: 'Word2Vec' object has no attribute 'vocab'
To remove the exceptions, you should use:
- KeyedVectors.load_word2vec_format instead of Word2Vec.load_word2vec_format
- word2vec_model.wv.save_word2vec_format instead of word2vec_model.save_word2vec_format
- model.wv.syn0norm instead of model.syn0norm
- model.wv.syn0 instead of model.syn0
- model.wv.vocab instead of model.vocab
- model.wv.index2word instead of model.index2word
Changelog of this release:
New features:
- Add Author-topic modeling (@olavurmortensen,#893)
- Add FastText word embedding wrapper (@jayantj,#847)
- Add WordRank word embedding wrapper (@parulsethi,#1066, #1125)
- Add Varembed word embedding wrapper (@anmol01gulati, #1067)
- Add sklearn wrapper for LDAModel (@AadityaJ,#932)
Deprecated features:
- Move load_word2vec_format and save_word2vec_format out of Word2Vec class to KeyedVectors (@tmylk,#1107)
- Move properties syn0norm, syn0, vocab, index2word from Word2Vec class to KeyedVectors (@tmylk,#1147)
- Remove support for Python 2.6, 3.3 and 3.4 (@tmylk,#1145)
Improvements:
- Python 3.6 support (@tmylk #1077)
- Phrases and Phraser allow a generator corpus (ELind77 #1099)
- Ignore DocvecsArray.doctag_syn0norm in save. Fix #789 (@accraze,#1053)
- Fix bug in LsiModel that occurs when id2word is a Python 3 dictionary. (@cvangysel,#1103)
- Fix broken link to paper in readme (@bhargavvader,#1101)
- Lazy formatting in evaluate_word_pairs (@akutuzov,#1084)
- Deacc option to keywords pre-processing (@bhargavvader,#1076)
- Generate Deprecated exception when using Word2Vec.load_word2vec_format (@tmylk, #1165)
- Fix hdpmodel constructor docstring for print_topics (#1152) (@toliwa, #1152)
- Default to per_word_topics=False in LDA get_item for performance (@menshikh-iv, #1154)
- Fix bound computation in Author Topic models. (@olavurmortensen, #1156)
- Write UTF-8 byte strings in tensorboard conversion (@tmylk,#1144)
- Make top_topics and sparse2full compatible with numpy 1.12 strictly int indexing (@tmylk,#1146)
Tutorial and doc improvements:
- Clarifying comment in is_corpus func in utils.py (@greninja,#1109)
- Tutorial Topics_and_Transformations fix markdown and add references (@lgmoneda,#1120)
- Fix doc2vec-lee.ipynb results to match previous behavior (@bahbbc,#1119)
- Remove Pattern lib dependency in News Classification tutorial (@luizcavalcanti,#1118)
- Corpora_and_Vector_Spaces tutorial text clarification (@lgmoneda,#1116)
- Update Transformation and Topics link from quick start notebook (@mariana393,#1115)
- Quick Start Text clarification and typo correction (@luizcavalcanti,#1114)
- Fix typos in Author-topic tutorial (@Fil,#1102)
- Address benchmark inconsistencies in Annoy tutorial (@droudy,#1113)
- Add note about Annoy speed depending on numpy BLAS setup in annoytutorial.ipynb (@greninja,#1137)
- Add documentation for WikiCorpus metadata. (@kirit93, #1163)
1.0.0RC2
1.0.0RC2, 2017-02-16
Deprecated methods:
In order to share word vector querying code between different training algorithms (Word2Vec, FastText, WordRank, VarEmbed), we have separated storage and querying of word vectors into a separate class, KeyedVectors.
Two methods and several attributes of the Word2Vec class have been deprecated. The methods are load_word2vec_format and save_word2vec_format. The attributes are syn0norm, syn0, vocab and index2word. They have been moved to the KeyedVectors class.
After upgrading to this release you might get exceptions about deprecated methods or missing attributes.
DeprecationWarning: Deprecated. Use model.wv.save_word2vec_format instead.
AttributeError: 'Word2Vec' object has no attribute 'vocab'
To remove the exceptions, you should use:
- KeyedVectors.load_word2vec_format instead of Word2Vec.load_word2vec_format
- word2vec_model.wv.save_word2vec_format instead of word2vec_model.save_word2vec_format
- model.wv.syn0norm instead of model.syn0norm
- model.wv.syn0 instead of model.syn0
- model.wv.vocab instead of model.vocab
- model.wv.index2word instead of model.index2word
- Add note about Annoy speed depending on numpy BLAS setup in annoytutorial.ipynb (@greninja,#1137)
- Remove direct access to properties moved to KeyedVectors (@tmylk,#1147)
- Remove support for Python 2.6, 3.3 and 3.4 (@tmylk,#1145)
- Write UTF-8 byte strings in tensorboard conversion (@tmylk,#1144)
- Make top_topics and sparse2full compatible with numpy 1.12 strictly int indexing (@tmylk,#1146)
Bug-fix for KeyedVector warnings in word2vec/doc2vec
0.13.4.1, 2017-01-04
- Disable direct access warnings on save and load of Word2vec/Doc2vec (@tmylk, #1072)
- Making Default hs error explicit (@accraze, #1054)
- Removed unnecessary numpy imports (@bhargavvader, #1065)
- Utils and Matutils changes (@bhargavvader, #1062)
- Tests for the evaluate_word_pairs function (@akutuzov, #1061)
KeyedVectors
Deprecation warning
After upgrading to this release you might see deprecation warnings like this:
WARNING:gensim.models.word2vec:direct access to syn0norm will not be supported in future gensim releases, please use model.wv.syn0norm
These warnings are correct, and you are encouraged to change your Word2vec/Doc2vec code to use the new model.wv.syn0norm and model.wv.vocab fields instead of the old direct access such as model.syn0norm and model.vocab. The direct access will be deprecated in Feb 2017.
Specifically, you should use:
- model.wv.syn0norm instead of model.syn0norm
- model.wv.syn0 instead of model.syn0
- model.wv.vocab instead of model.vocab
- model.wv.index2word instead of model.index2word
The reason for this deprecation is to separate word vectors from word2vec training. There are now new ways to get word vectors that don't involve training word2vec. We are adding capabilities to use word vectors trained in GloVe, FastText, WordRank, TensorFlow and Deeplearning4j word2vec. In order to have cleaner code and standard APIs for all word embeddings, we extracted a KeyedVectors class and added a word-vectors variable, wv, to the models.
0.13.4, 2016-12-22
Changelog:
- Evaluation of word2vec models against semantic similarity datasets like SimLex-999 (#1047) (@akutuzov, #1047)
- TensorBoard word embedding visualisation of Gensim Word2vec format (@loretoparisi, #1051)
- Throw exception if load() is called on instance rather than the class in word2vec and doc2vec (@dus0x, #889)
- Loading and Saving LDA Models across Python 2 and 3. Fix #853 (@anmolgulati, #913, #1093)
- Fix automatic learning of eta (prior over words) in LDA (@olavurmortensen, #1024).
- eta should have dimensionality V (size of vocab) not K (number of topics). eta with shape K x V is still allowed, as the user may want to impose specific prior information to each topic.
- eta is no longer allowed the "asymmetric" option. Asymmetric priors over words in general are fine (learned or user defined).
- As a result, the eta update (update_eta) was simplified some. It also no longer logs eta when updated, because it is too large for that.
- Unit tests were updated accordingly. The unit tests expect a different shape than before; some unit tests were redundant after the change; eta='asymmetric' now should raise an error.
- Optimise show_topics to only call get_lambda once. Fix #1006. (@bhargavvader, #1028)
- HdpModel doc improvement. Inference and print_topics (@dsquareindia, #1029)
- Removing Doc2Vec defaults so that it won't override Word2Vec defaults. Fix #795 (@markroxor, #929)
- Remove warning on gensim import "pattern not installed". Fix #1009 (@shashankg7, #1018)
- Add delete_temporary_training_data() function to word2vec and doc2vec models. (@deepmipt-VladZhukov, #987)
- New class KeyedVectors to store embedding separate from training code (@anmol01gulati and @droudy, #980)
- Documentation improvements (@IrinaGoloshchapova, #1010, #1011)
- LDA tutorial by Olavur, tips and tricks (@olavurmortensen, #779)
- Add double quote in command line to run on Windows (@akarazeev, #1005)
- Fix directory names in notebooks to be OS-independent (@mamamot, #1004)
- Respect clip_start, clip_end in most_similar. Fix #601. (@parulsethi, #994)
- Replace Python sigmoid function with scipy in word2vec & doc2vec (@markroxor, #989)
- WMD to return 0 instead of inf for sentences that contain a single word (@rbahumi, #986)
- Pass all the params through the apply call in lda.get_document_topics(), test case to use the per_word_topics through the corpus in test_ldamodel (@parthoiiitm, #978)
- Pyro annotations for lsi_worker (@markroxor, #968)
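The eta shape rule from the LDA fix above (#1024) can be illustrated with a small check: a vector with one entry per vocabulary word (length V) or a K x V matrix is accepted, while a length-K vector is not. This is only a sketch on plain Python lists, not gensim's code; check_eta_shape and its arguments are hypothetical names.

```python
def check_eta_shape(eta, num_terms, num_topics):
    """Classify a word prior given as a list.

    Returns "V" for one value per vocabulary word and "KxV" for one row per
    topic with one value per word. Any other shape raises ValueError.
    """
    if eta and all(isinstance(row, (list, tuple)) for row in eta):
        # Matrix form: one row per topic, one column per vocabulary word.
        if len(eta) == num_topics and all(len(row) == num_terms for row in eta):
            return "KxV"
        raise ValueError("eta matrix must have shape num_topics x num_terms")
    if len(eta) == num_terms:
        return "V"
    raise ValueError("eta vector must have length num_terms")


# Vocabulary of 4 words, 2 topics.
print(check_eta_shape([0.01] * 4, num_terms=4, num_topics=2))
print(check_eta_shape([[0.01] * 4, [0.02] * 4], num_terms=4, num_topics=2))
```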
Word2vec vocabulary expansion and documentation improvements
0.13.3, 2016-10-20
- Add vocabulary expansion feature to word2vec. (@isohyt, #900)
- Tutorial: Reproducing Doc2vec paper result on wikipedia. (@isohyt, #654)
- Add Save/Load interface to AnnoyIndexer for index persistence (@fortiema, #845)
- Fixed issue #938, creating a unified base class for all topic models. (@markroxor, #946)
- breaking change in HdpTopicFormatter.show_topics
- Add Phraser for Phrases optimization. ( @gojomo & @anujkhare , #837)
- Fix issue #743, in word2vec's n_similarity method if at least one empty list is passed ZeroDivisionError is raised (@pranay360, #883)
- Change export_phrases in Phrases model. Fix issue #794 (@AadityaJ, #879)
- bigram construction can now support multiple bigrams within one sentence
- Fix issue #838, RuntimeWarning: overflow encountered in exp (@markroxor, #895)
- Change some log messages to warnings as suggested in issue #828. (@rhnvrm, #884)
- Fix issue #851, In summarizer.py, RuntimeError is raised if single sentence input is provided to avoid ZeroDivisionError. (@metalaman, #887)
- Fix issue #791, correct logic for iterating over SimilarityABC interface. (@MridulS, #839)
- Fix RP model loading for large Fortran-order arrays (@piskvorky, #605)
- Remove ShardedCorpus from init because of Theano dependency (@tmylk, #919)
- Documentation improvements ( @dsquareindia & @tmylk, #914, #906 )
- Add Annoy memory-mapping example (@harshul1610, #899)
Dynamic Topic Modelling in Python from Google Summer of Code. Breaking changes in hdp and dtm models. +15 changes.
0.13.2, 2016-08-19
- wordtopics has changed to word_topics in ldamallet, and fixed issue #764. (@bhargavvader, #771)
- assigning wordtopics value of word_topics to keep backward compatibility, for now
- topics, topn parameters changed to num_topics and num_words in show_topics() and print_topics() (@droudy, #755)
- In hdpmodel and dtmmodel
- NOT BACKWARDS COMPATIBLE!
- Added random_state parameter to LdaState initializer and check_random_state() (@droudy, #113)
- Topic coherence update with c_uci, c_npmi measures. LdaMallet, LdaVowpalWabbit support. Add topics parameter to coherencemodel. Can now provide tokenized topics to calculate coherence value. Faster backtracking. (@dsquareindia, #750, #793)
- Added a check for empty (no words) documents before starting to run the DTM wrapper if model = "fixed" is used (DIM model), as this causes an error when such documents are reached in training. (@Eickho, #806)
- New parameters limit, datatype for load_word2vec_format(); lockf for intersect_word2vec_format (@gojomo, #817)
- Changed use_lowercase option in word2vec accuracy to case_insensitive to account for case variations in training vocabulary (@jayantj, #804)
- Link to Doc2Vec on airline tweets example in tutorials page (@544895340, #823)
- Small error on Doc2vec notebook tutorial (@charlessutton, #816)
- Bugfix: Full2sparse clipped to use abs value (@tmylk, #811)
- WMD docstring: add tutorial link and query example (@tmylk, #813)
- Annoy integration to speed word2vec and doc2vec similarity. Tutorial update (@droudy, #799,#792 )
- Add converter of LDA model between Mallet, Vowpal Wabbit and gensim (@dsquareindia, #798, #766)
- Distributed LDA in different network segments without broadcast (@menshikh-iv , #782)
- Update Corpora_and_Vector_Spaces.ipynb (@megansquire, #772)
- DTM wrapper bug fixes caused by renaming num_words in #755 (@bhargavvader, #770)
- Add LsiModel.docs_processed attribute (@hobson, #763)
- Dynamic Topic Modelling in Python. Google Summer of Code 2016 project. (@bhargavvader, #739, #831)
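The limit parameter added to load_word2vec_format() in #817 caps how many vectors are read from the file. The sketch below shows the idea on the plain-text word2vec format, which starts with a header line giving the vector count and dimensionality, followed by one word and its values per line. It is a stdlib-only illustration, not gensim's loader; read_word2vec_text is a hypothetical name.

```python
import io


def read_word2vec_text(stream, limit=None):
    """Parse word2vec text format into a dict of word -> list of floats.

    If `limit` is given, at most that many vectors are read, counted from
    the start of the file.
    """
    header = stream.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    if limit is not None:
        vocab_size = min(vocab_size, limit)

    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != vector_size:
            raise ValueError("unexpected vector length for %r" % word)
        vectors[word] = values
    return vectors


sample = io.StringIO("3 2\nthe 0.1 0.2\ncat 0.3 0.4\nsat 0.5 0.6\n")
print(sorted(read_word2vec_text(sample, limit=2)))
```

Because only the first entries are kept, a limit is most useful when the file lists the most important words first.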
Topic Coherence
Initial release of Topic Coherence C_v and U_mass. More work will be done here but external API will remain the same.
Tutorials reworked, Word Movers Distance
0.12.5, 2016
- Tutorials migrated from website to ipynb (@j9chan, #721), (@jesford, #733, #725, #716)
- New doc2vec intro tutorial (@seanlaw, #730)
- Gensim Quick Start Tutorial (@andrewjlm, #727)
- Add export_phrases(sentences) to model Phrases (hanabi1224 #588)
- SparseMatrixSimilarity returns a sparse matrix if maintain_sparsity is True (@davechallis, #590)
- added functionality for Topics of Words in document - i.e., dynamic topics. (@bhargavvader, #704)
- also included tutorial which explains new functionalities, and document word-topic coloring.
- Made normalization an explicit transformation. Added 'l1' norm support (@sQuareindia, #649)
- added term-topics API for most probable topic for word in vocab. (@bhargavvader, #706)
- build_vocab takes progress_per parameter for smaller output (@zer0n, #624)
- Control whether to use lowercase for computing word2vec accuracy. (@alantian, #607)
- Easy import of GloVe vectors using Gensim (Manas Ranjan Kar, #625)
- Allow easy port of GloVe vectors into Gensim
- Standalone script with command line arguments, compatible with Python>=2.6
- Usage: python -m gensim.scripts.glove2word2vec -i glove_vectors.txt -o output_word2vec_compatible.txt
- Add similar_by_word() and similar_by_vector() to word2vec (@isohyt, #381)
- Convenience method for similarity of two out of training sentences to doc2vec (@ellolo, #707)
- Dynamic Topic Modelling Tutorial updated with Dynamic Influence Model (@bhargavvader, #689)
- Added function to filter 'n' most frequent words from the dictionary (@abhinavchawla, #718)
- Raise warnings if vocab is single character elements and if alpha is increased in word2vec/doc2vec (@dsquareindia, #705)
- Tests for wikidump (@jonmcoe, #723)
- Mallet wrapper sparse format support (@RishabGoel, #664)
- Doc2vec pre-processing script translated from bash to Python (@andrewjlm, #720)
- Added Distance Metrics to matutils.py (@bhargavvader, #656)
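The GloVe import listed above (#625) relies on the fact that the GloVe text format differs from the word2vec text format only in lacking the leading header line with the vector count and dimensionality. Conversion therefore amounts to counting the vectors and prepending that line. A minimal stdlib sketch of the idea follows; it is not the glove2word2vec script itself, and glove_to_word2vec_lines is a hypothetical name.

```python
def glove_to_word2vec_lines(glove_lines):
    """Return word2vec-format lines for GloVe-format input lines.

    The added header is "<number of vectors> <dimensions>"; the vector
    lines themselves are passed through unchanged.
    """
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        return ["0 0"]
    # Every row is "word v1 v2 ... vN", so the dimensionality is the
    # number of space-separated fields minus one.
    dims = len(rows[0].split(" ")) - 1
    return ["%d %d" % (len(rows), dims)] + rows


glove = ["the 0.1 0.2 0.3\n", "cat 0.4 0.5 0.6\n"]
print("\n".join(glove_to_word2vec_lines(glove)))
```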
0.13.0rc1 Tutorials reworked, Word Movers Distance
Changes
0.12.5, 2016
- Tutorials migrated from website to ipynb (@j9chan, #721), (@jesford, #733, #725, #716)
- New doc2vec intro tutorial (@seanlaw, #730)
- Gensim Quick Start Tutorial (@andrewjlm, #727)
- Add export_phrases(sentences) to model Phrases (hanabi1224 #588)
- SparseMatrixSimilarity returns a sparse matrix if maintain_sparsity is True (@davechallis, #590)
- added functionality for Topics of Words in document - i.e., dynamic topics. (@bhargavvader, #704)
- also included tutorial which explains new functionalities, and document word-topic coloring.
- Made normalization an explicit transformation. Added 'l1' norm support (@sQuareindia, #649)
- added term-topics API for most probable topic for word in vocab. (@bhargavvader, #706)
- build_vocab takes progress_per parameter for smaller output (@zer0n, #624)
- Control whether to use lowercase for computing word2vec accuracy. (@alantian, #607)
- Easy import of GloVe vectors using Gensim (Manas Ranjan Kar, #625)
- Allow easy port of GloVe vectors into Gensim
- Standalone script with command line arguments, compatible with Python>=2.6
- Usage: python -m gensim.scripts.glove2word2vec -i glove_vectors.txt -o output_word2vec_compatible.txt
- Add similar_by_word() and similar_by_vector() to word2vec (@isohyt, #381)
- Convenience method for similarity of two out of training sentences to doc2vec (@ellolo, #707)
- Dynamic Topic Modelling Tutorial updated with Dynamic Influence Model (@bhargavvader, #689)
- Added function to filter 'n' most frequent words from the dictionary (@abhinavchawla, #718)
- Raise warnings if vocab is single character elements and if alpha is increased in word2vec/doc2vec (@dsquareindia, #705)
- Tests for wikidump (@jonmcoe, #723)
- Mallet wrapper sparse format support (@RishabGoel, #664)
- Doc2vec pre-processing script translated from bash to Python (@andrewjlm, #720)