Christmas Come Early
3.2.0, 2017-12-09
🌟 New features:
- New download API for corpora and pre-trained models (@chaitaliSaini & @menshikh-iv, #1705 & #1632 & #1492)
  - Download large NLP datasets in one line of Python, then use them with memory-efficient data streaming:

        import gensim.downloader as api

        # Stream the corpus one article at a time instead of loading it all into memory.
        for article in api.load("wiki-english-20171001"):
            print(article)

  - Don’t waste time searching for good word embeddings; use the curated ones:

        import gensim.downloader as api

        model = api.load("glove-twitter-25")
        model.most_similar("engineer")
        # [('specialist', 0.957542896270752),
        #  ('developer', 0.9548177123069763),
        #  ('administrator', 0.9432312846183777),
        #  ('consultant', 0.93915855884552),
        #  ('technician', 0.9368376135826111),
        #  ('analyst', 0.9342101216316223),
        #  ('architect', 0.9257484674453735),
        #  ('engineering', 0.9159940481185913),
        #  ('systems', 0.9123805165290833),
        #  ('consulting', 0.9112802147865295)]

  - Blog post introducing the API and design decisions.
  - Jupyter notebook with examples.
- New model: Poincaré embeddings (@jayantj, #1696 & #1700 & #1757 & #1734)
  - Embed a graph (taxonomy) in the same way as word2vec embeds words:

        from gensim.models.poincare import PoincareRelations, PoincareModel
        from gensim.test.utils import datapath

        data = PoincareRelations(datapath('poincare_hypernyms.tsv'))
        model = PoincareModel(data)
        model.kv.most_similar("cat.n.01")
        # [('kangaroo.n.01', 0.010581353439700418),
        #  ('gib.n.02', 0.011171531439892076),
        #  ('striped_skunk.n.01', 0.012025106076442395),
        #  ('metatherian.n.01', 0.01246679759214648),
        #  ('mammal.n.01', 0.013281303506525968),
        #  ('marsupial.n.01', 0.013941330203709653)]

  - Tutorial on Poincaré embeddings (Jupyter notebook).
  - Model introduction and the journey of its implementation (blog post).
  - Original paper on arXiv.
- Optimized FastText (@manneshiva, #1742)
  - New fast multithreaded implementation of FastText, natively in Python/Cython. Deprecates the existing wrapper for Facebook’s C++ implementation.

        import gensim.downloader as api
        from gensim.models import FastText

        model = FastText(api.load("text8"))
        model.most_similar("cat")
        # [('catnip', 0.8538144826889038),
        #  ('catwalk', 0.8136177062988281),
        #  ('catchy', 0.7828493118286133),
        #  ('caf', 0.7826495170593262),
        #  ('bobcat', 0.7745151519775391),
        #  ('tomcat', 0.7732658386230469),
        #  ('moat', 0.7728310823440552),
        #  ('caye', 0.7666271328926086),
        #  ('catv', 0.7651021480560303),
        #  ('caveat', 0.7643581628799438)]
- Binary pre-compiled wheels for Windows, OSX and Linux (@menshikh-iv, MacPython/gensim-wheels/#7)
  - Users no longer need a C compiler to use the fast (Cythonized) versions of word2vec, doc2vec, fasttext, etc.
  - Faster Gensim pip installation
- Added `DeprecationWarnings` to deprecated methods and parameters, with a clear schedule for removal.
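
Python filters out many `DeprecationWarning`s by default, so the new warnings are easy to miss. Below is a minimal sketch of how to surface them using only the standard `warnings` module. `old_method` is a hypothetical stand-in for a deprecated call, not a Gensim function.

```python
import warnings


def old_method():
    """Hypothetical deprecated function, standing in for a deprecated library call."""
    warnings.warn("old_method is deprecated", DeprecationWarning, stacklevel=2)
    return 42


# Record every warning raised inside the block, including DeprecationWarning,
# which the default filters often suppress.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_method()

deprecations = [w for w in caught if issubclass(w.category, DeprecationWarning)]
for w in deprecations:
    print(f"{w.category.__name__}: {w.message}")
```

To fail fast instead, a script can be run with `python -W error::DeprecationWarning`, which turns these warnings into exceptions.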
👍 Improvements:
- Add Montemurro and Zanette's entropy-based keyword extraction algorithm. Fix #665 (@PeteBleackley, #1738)
- Fix flake8 E731, E402, refactor tests & sklearn API code. Partial fix #1644 (@horpto, #1689)
- Reduce distribution size. Fix #1698 (@menshikh-iv, #1699)
- Improve `scan_vocab` speed, `build_vocab_from_freq` method (@jodevak, #1695)
- Improve `segment_wiki` script (@piskvorky, #1707)
- Add custom `dtype` support for `LdaModel`. Partially fix #1576 (@xelez, #1656)
- Add `doc2idx` method for `gensim.corpora.Dictionary`. Fix #1634 (@roopalgarg, #1720)
- Add tox and pytest to gensim, integration with Travis and Appveyor. Fix #1613, #1644 (@menshikh-iv, #1721)
- Add flag for hiding outdated data for `gensim.downloader.info` (@menshikh-iv, #1736)
- Add reproducible order between Python versions for `gensim.corpora.Dictionary` (@formi23, #1715)
- Update `tox.ini`, `setup.cfg`, `README.md` (@menshikh-iv, #1741)
- Add optimized `logsumexp` for `LdaModel` (@arlenk, #1745)
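
These notes do not describe how the optimized `logsumexp` in #1745 is implemented. For readers unfamiliar with the operation itself, here is a generic pure-Python sketch of the standard log-sum-exp trick: subtract the maximum before exponentiating so the sum cannot overflow. It illustrates the technique only and is not Gensim's code.

```python
import math


def logsumexp(values):
    """Return log(sum(exp(v) for v in values)) without overflowing."""
    m = max(values)
    if math.isinf(m):
        # All values are -inf (result -inf), or some value is +inf (result +inf).
        return m
    # exp(v - m) <= 1 for every v, so the sum stays in a safe range.
    return m + math.log(sum(math.exp(v - m) for v in values))


# The naive formula overflows for large inputs...
naive_fails = False
try:
    math.log(sum(math.exp(v) for v in [1000.0, 1000.0]))
except OverflowError:
    naive_fails = True

# ...while the shifted version returns 1000 + log(2).
stable = logsumexp([1000.0, 1000.0])
print(naive_fails, stable)
```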
🔴 Bug fixes:
- Fix ranking formula in `gensim.summarization.bm25`. Fix #1718 (@souravsingh, #1726)
- Fix incompatibility in persistence for `FastText` wrapper. Fix #1642 (@chinmayapancholi13, #1723)
- Fix `gensim.sklearn_api` bug with `documents_columns` parameter. Fix #1676 (@chinmayapancholi13, #1704)
- Fix slowdown of CI, remove pytest-cov (@menshikh-iv, #1728)
- Replace outdated packages in Dockerfile (@rbahumi, #1730)
- Replace `num_words` with `topn` in `LdaMallet.show_topics`. Fix #1747 (@apoorvaeternity, #1749)
- Fix `os.rename` in `gensim.downloader` when 'src' and 'dst' are on different partitions (@anotherbugmaster, #1733)
- Fix `DeprecationWarning` from `logsumexp` (@dreamgonfly, #1703)
- Fix backward compatibility problem in `Phrases.load`. Fix #1751 (@alexgarel, #1758)
- Fix `load_word2vec_format` from `FastText`. Fix #1743 (@manneshiva, #1755)
- Fix ipython kernel version in `Dockerfile`. Fix #1762 (@rbahumi, #1764)
- Fix writing in `segment_wiki` (@horpto, #1763)
- Fix file write method requiring a bytes-like object in `segment_wiki` (@horpto, #1750)
- Fix incorrect vectors learned during online training for `FastText`. Fix #1752 (@manneshiva, #1756)
- Fix `dtype` of `model.wv.syn0_vocab` on updating `vocab` for `FastText`. Fix #1759 (@manneshiva, #1760)
- Fix hashing-trick from `FastText.build_vocab`. Fix #1765 (@manneshiva, #1768)
- Add explicit `DeprecationWarning` for all outdated stuff. Fix #1753 (@menshikh-iv, #1769)
- Fix epsilon according to `dtype` in `LdaModel` (@menshikh-iv, #1770)
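
As background for the `os.rename` fix above: `os.rename` can raise `OSError` when the source and destination are on different filesystems. A common remedy is `shutil.move`, which falls back to copy-and-delete in that case. The sketch below assumes that approach for illustration; whether #1733 uses it is not stated in these notes, and the file names are hypothetical.

```python
import os
import shutil
import tempfile

# Hypothetical paths: in practice src and dst might live on different partitions.
src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()
src = os.path.join(src_dir, "model.gz")
dst = os.path.join(dst_dir, "model.gz")

with open(src, "wb") as f:
    f.write(b"payload")

# shutil.move uses a rename when the destination is on the same filesystem,
# and otherwise copies the file to dst and removes the source.
shutil.move(src, dst)

moved_ok = os.path.exists(dst) and not os.path.exists(src)
print(moved_ok)

# Clean up the temporary directories.
shutil.rmtree(src_dir)
shutil.rmtree(dst_dir)
```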
📚 Tutorial and doc improvements:
- Update perf numbers of `segment_wiki` (@piskvorky, #1708)
- Update docstring for `gensim.summarization.summarize`. Fix #1575 (@fbarrios, #1702)
- Refactor API Reference for `gensim.parsing`. Fix #1664 (@CLearERR, #1684)
- Fix typos in doc2vec-wikipedia notebook (@youqad, #1727)
- Fix PyPI long description rendering (@edigaryev, #1739)
- Fix twitter badge src (@menshikh-iv)
- Fix maillist badge color (@menshikh-iv)
- Remove
  - `gensim.examples`
  - `gensim.nosy`
  - `gensim.scripts.word2vec_standalone`
  - `gensim.scripts.make_wiki_lemma`
  - `gensim.scripts.make_wiki_online`
  - `gensim.scripts.make_wiki_online_lemma`
  - `gensim.scripts.make_wiki_online_nodebug`
  - `gensim.scripts.make_wiki`
- Move
  - `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py`
  - `gensim.summarization` ➡ `gensim.models.summarization`
  - `gensim.topic_coherence` ➡ `gensim.models._coherence`
  - `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
  - `gensim.parsing.*` ➡ `gensim.utils.text_utils`