Implement Okapi BM25 variants in Gensim (#3304)
* Add and unit-test gensim.models.bm25model.OkapiBM25Model

* Document gensim.models.bm25

* Add and unit-test gensim.models.bm25model.{Lucene,Atire}BM25Model

* Add normalize_{queries,documents} params to gensim.similarities.docsim

* Add example of BM25 to gensim.similarities.docsim.SparseMatrixSimilarity

* Refresh stale gallery cache

* Update gensim/models/bm25model.py

Co-authored-by: Radim Řehůřek <me@radimrehurek.com>

Witiko and piskvorky authored Sep 8, 2022
1 parent ff3531b commit 5dbfb1e
Showing 13 changed files with 907 additions and 43 deletions.
12 changes: 6 additions & 6 deletions docs/src/auto_examples/core/run_topics_and_transformations.ipynb

Large diffs are not rendered by default.

14 changes: 14 additions & 0 deletions docs/src/auto_examples/core/run_topics_and_transformations.py
@@ -188,6 +188,20 @@
 #
 # model = models.TfidfModel(corpus, normalize=True)
 #
+# * `Okapi Best Matching, Okapi BM25 <https://en.wikipedia.org/wiki/Okapi_BM25>`_
+# expects a bag-of-words (integer values) training corpus during initialization.
+# During transformation, it will take a vector and return another vector of the
+# same dimensionality, except that features which were rare in the training corpus
+# will have their value increased. It therefore converts integer-valued
+# vectors into real-valued ones, while leaving the number of dimensions intact.
+#
+# Okapi BM25 is the standard ranking function used by search engines to estimate
+# the relevance of documents to a given search query.
+#
+# .. sourcecode:: pycon
+#
+# model = models.OkapiBM25Model(corpus)
+#
 # * `Latent Semantic Indexing, LSI (or sometimes LSA) <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_
 # transforms documents from either bag-of-words or (preferrably) TfIdf-weighted space into
 # a latent space of a lower dimensionality. For the toy corpus above we used only
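
The weighting described in the added comment can be sketched in plain Python. This is a minimal illustration of the general technique, not the code in ``gensim/models/bm25model.py``: the helper names ``fit_bm25`` and ``bm25_weights`` are hypothetical, the formula is the commonly cited one from the linked Wikipedia page, and ``k1=1.5`` and ``b=0.75`` are assumed defaults.

```python
import math
from collections import Counter


def fit_bm25(corpus):
    """Collect per-term IDF values and the average document length.

    `corpus` is a list of bag-of-words documents, each a list of
    (term_id, integer_count) pairs. Hypothetical helper, not a Gensim API.
    """
    num_docs = len(corpus)
    doc_freqs = Counter()
    total_len = 0
    for doc in corpus:
        doc_freqs.update(term_id for term_id, _ in doc)
        total_len += sum(count for _, count in doc)
    avgdl = total_len / num_docs
    # Non-negative IDF variant: terms that are rare in the corpus get a larger value.
    idfs = {
        term_id: math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        for term_id, df in doc_freqs.items()
    }
    return idfs, avgdl


def bm25_weights(doc, idfs, avgdl, k1=1.5, b=0.75):
    """Map integer counts to real-valued BM25 weights, one per known input term."""
    doc_len = sum(count for _, count in doc)
    length_norm = k1 * (1 - b + b * doc_len / avgdl)
    return [
        (term_id, idfs[term_id] * count * (k1 + 1) / (count + length_norm))
        for term_id, count in doc
        if term_id in idfs  # terms unseen during fitting are dropped
    ]


# Toy corpus: term 0 occurs in every document, term 2 in only one.
corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 1)], [(0, 1), (1, 2)]]
idfs, avgdl = fit_bm25(corpus)
weights = dict(bm25_weights(corpus[1], idfs, avgdl))
print(weights)
```

In this toy corpus, term 2 receives a larger weight than term 0 even though both occur once in the transformed document, because term 2 is rarer across the corpus.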
@@ -1 +1 @@
-f49c3821bbacdeefdf3945d5dcb5ad01
+226db24f9e807e4bbd2a6ef280a75510
150 changes: 132 additions & 18 deletions docs/src/auto_examples/core/run_topics_and_transformations.rst

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions docs/src/auto_examples/core/sg_execution_times.rst
@@ -5,14 +5,14 @@

 Computation times
 =================
-**00:05.212** total execution time for **auto_examples_core** files:
+**00:01.658** total execution time for **auto_examples_core** files:

 +--------------------------------------------------------------------------------------------------------------+-----------+---------+
-| :ref:`sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py` (``run_corpora_and_vector_spaces.py``) | 00:05.212 | 47.2 MB |
+| :ref:`sphx_glr_auto_examples_core_run_topics_and_transformations.py` (``run_topics_and_transformations.py``) | 00:01.658 | 58.1 MB |
 +--------------------------------------------------------------------------------------------------------------+-----------+---------+
 | :ref:`sphx_glr_auto_examples_core_run_core_concepts.py` (``run_core_concepts.py``) | 00:00.000 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------------------+-----------+---------+
-| :ref:`sphx_glr_auto_examples_core_run_similarity_queries.py` (``run_similarity_queries.py``) | 00:00.000 | 0.0 MB |
+| :ref:`sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py` (``run_corpora_and_vector_spaces.py``) | 00:00.000 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------------------+-----------+---------+
-| :ref:`sphx_glr_auto_examples_core_run_topics_and_transformations.py` (``run_topics_and_transformations.py``) | 00:00.000 | 0.0 MB |
+| :ref:`sphx_glr_auto_examples_core_run_similarity_queries.py` (``run_similarity_queries.py``) | 00:00.000 | 0.0 MB |
 +--------------------------------------------------------------------------------------------------------------+-----------+---------+
4 changes: 2 additions & 2 deletions docs/src/auto_examples/index.rst
@@ -220,7 +220,7 @@ Learning-oriented lessons that introduce a particular gensim feature, e.g. a mod

 .. raw:: html

-<div class="sphx-glr-thumbcontainer" tooltip="Demonstrates using Gensim&#x27;s implemenation of the WMD.">
+<div class="sphx-glr-thumbcontainer" tooltip="Demonstrates using Gensim&#x27;s implemenation of the SCM.">

 .. only:: html

@@ -237,7 +237,7 @@ Learning-oriented lessons that introduce a particular gensim feature, e.g. a mod

 .. raw:: html

-<div class="sphx-glr-thumbcontainer" tooltip="Demonstrates using Gensim&#x27;s implemenation of the SCM.">
+<div class="sphx-glr-thumbcontainer" tooltip="Demonstrates using Gensim&#x27;s implemenation of the WMD.">

 .. only:: html

14 changes: 14 additions & 0 deletions docs/src/gallery/core/run_topics_and_transformations.py
@@ -188,6 +188,20 @@
 #
 # model = models.TfidfModel(corpus, normalize=True)
 #
+# * `Okapi Best Matching, Okapi BM25 <https://en.wikipedia.org/wiki/Okapi_BM25>`_
+# expects a bag-of-words (integer values) training corpus during initialization.
+# During transformation, it will take a vector and return another vector of the
+# same dimensionality, except that features which were rare in the training corpus
+# will have their value increased. It therefore converts integer-valued
+# vectors into real-valued ones, while leaving the number of dimensions intact.
+#
+# Okapi BM25 is the standard ranking function used by search engines to estimate
+# the relevance of documents to a given search query.
+#
+# .. sourcecode:: pycon
+#
+# model = models.OkapiBM25Model(corpus)
+#
 # * `Latent Semantic Indexing, LSI (or sometimes LSA) <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_
 # transforms documents from either bag-of-words or (preferrably) TfIdf-weighted space into
 # a latent space of a lower dimensionality. For the toy corpus above we used only
1 change: 1 addition & 0 deletions gensim/models/__init__.py
@@ -9,6 +9,7 @@
 from .ldamodel import LdaModel # noqa:F401
 from .lsimodel import LsiModel # noqa:F401
 from .tfidfmodel import TfidfModel # noqa:F401
+from .bm25model import OkapiBM25Model, LuceneBM25Model, AtireBM25Model # noqa:F401
 from .rpmodel import RpModel # noqa:F401
 from .logentropy_model import LogEntropyModel # noqa:F401
 from .word2vec import Word2Vec, FAST_VERSION # noqa:F401
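
The three classes exported here are named after BM25 variants that differ, among other things, in their inverse document frequency (IDF) term. The sketch below shows the IDF formulas as they are commonly given in the literature for the original Okapi, Lucene, and ATIRE variants. It is an assumption about the general technique rather than a transcription of ``bm25model.py``, and the function names are hypothetical.

```python
import math


def idf_okapi(num_docs, df):
    """Original Okapi IDF; goes negative once a term is in over half of the documents."""
    return math.log((num_docs - df + 0.5) / (df + 0.5))


def idf_lucene(num_docs, df):
    """Lucene-style IDF; adding 1 inside the logarithm keeps it positive."""
    return math.log(1 + (num_docs - df + 0.5) / (df + 0.5))


def idf_atire(num_docs, df):
    """ATIRE-style IDF; the plain log of the inverse document frequency."""
    return math.log(num_docs / df)


# Compare a rare term (1 of 10 documents) with a common one (8 of 10 documents).
for df in (1, 8):
    print(df, idf_okapi(10, df), idf_lucene(10, df), idf_atire(10, df))
```

The common term is where the variants diverge most visibly: the original formula yields a negative value, which implementations typically have to handle in some way, while the other two stay positive.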