Refactor documentation for *2Vec models #1944

Closed · wants to merge 43 commits

Commits (43):
feb3c32
Remove useless methods
steremma Feb 28, 2018
52eb1b3
started working on docstrings
steremma Feb 28, 2018
cb7b71a
more work done
steremma Feb 28, 2018
347cdb0
Finished documentation for the `BaseWordEmbeddingsModel`
steremma Mar 1, 2018
327afc5
PEP-8
steremma Mar 1, 2018
bb8e3a3
Revert "Remove useless methods"
steremma Mar 2, 2018
7e89ca9
added documentation for the class and all its helper methods
steremma Mar 5, 2018
e0fe665
remove duplicated type info
steremma Mar 5, 2018
8aa85bc
Added documentation for `Doc2vec` model and all its helper methods
steremma Mar 5, 2018
7c74a4c
Fixed paper references and added documentation for `Doc2VecVocab`
steremma Mar 6, 2018
e92b9b4
Fixed paper references
steremma Mar 6, 2018
9093eab
minor referencing fixes
steremma Mar 6, 2018
c07afa4
sphinx indentation
steremma Mar 6, 2018
4a14a3e
Added docstrings for the private methods in `BaseAny2Vec`
steremma Mar 8, 2018
a7f3f0e
Applied all code review corrections, example fix still pending
steremma Mar 8, 2018
69d524d
Added missing docstrings
steremma Mar 12, 2018
4707c37
Fixed `int {1, 0}` -> `{1, 0}`
steremma Mar 12, 2018
3a85ac5
Fixed examples and code review corrections
steremma Mar 12, 2018
f041cf1
Fixed examples and applied code review corrections (optional argument…
steremma Mar 14, 2018
8badb81
Applied code review corrections and added top level usage examples
steremma Mar 14, 2018
8a8e1fb
Added high level explanation of the class hierarchy, fixed code revie…
steremma Mar 14, 2018
a5dced2
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
steremma Mar 19, 2018
535dc15
Final indentation fixes
steremma Mar 19, 2018
1cc8889
Documentation fixes
steremma Mar 19, 2018
add686e
Fixed all examples
steremma Mar 19, 2018
7cb408c
delete redundant reference to module
steremma Mar 20, 2018
5b6d815
Added explanation for all important class attributes. These include s…
steremma Mar 21, 2018
f58e9a2
documented public cython functions
steremma Mar 29, 2018
6570cef
documented public cython functions in doc2vec
steremma Mar 29, 2018
0e8d299
Applied code review corrections
steremma Mar 30, 2018
86a6d23
added documentation for public cython methods in `fasttext`
steremma Apr 2, 2018
dc2f93e
added documentation for C functions in the word2vec
steremma Apr 3, 2018
f78348f
fix build issues
menshikh-iv Apr 11, 2018
cec8c44
add missing rst
menshikh-iv Apr 11, 2018
585f81f
fix base_any2vec
menshikh-iv Apr 13, 2018
b5d84ff
fix doc2vec[1]
menshikh-iv Apr 13, 2018
6f32e78
fix doc2vec[2]
menshikh-iv Apr 13, 2018
2e3a0b7
fix doc2vec[3]
menshikh-iv Apr 13, 2018
297b48e
Merge branch 'develop' into document-any2vec
menshikh-iv Apr 13, 2018
2d9616c
fix doc2vec[4]
menshikh-iv Apr 18, 2018
2fcd2f1
fix doc2vec_inner + remove unused imports
menshikh-iv Apr 18, 2018
7cbbac9
fix fasttext[1]
menshikh-iv Apr 18, 2018
0e9e6c5
reformat example sections
menshikh-iv Apr 18, 2018
2 changes: 2 additions & 0 deletions docs/src/apiref.rst
@@ -45,6 +45,7 @@ Modules:
models/word2vec
models/keyedvectors
models/doc2vec
models/doc2vec_inner
models/fasttext
models/phrases
models/poincare
@@ -64,6 +65,7 @@ Modules:
models/deprecated/word2vec
models/deprecated/keyedvectors
models/deprecated/fasttext_wrapper
models/base_any2vec
similarities/docsim
similarities/index
sklearn_api/atmodel
10 changes: 10 additions & 0 deletions docs/src/models/base_any2vec.rst
@@ -0,0 +1,10 @@
:mod:`models.base_any2vec` -- Base classes for any2vec models
=============================================================

.. automodule:: gensim.models.base_any2vec
:synopsis: Base classes for any2vec models
:members:
:inherited-members:
:special-members: __getitem__
:undoc-members:
:show-inheritance:
9 changes: 9 additions & 0 deletions docs/src/models/doc2vec_inner.rst
@@ -0,0 +1,9 @@
:mod:`models.doc2vec_inner` -- Cython job for training Doc2Vec model
====================================================================

.. automodule:: gensim.models.doc2vec_inner
:synopsis: Cython job for training Doc2Vec model
:members:
:inherited-members:
:undoc-members:
:show-inheritance:
691 changes: 620 additions & 71 deletions gensim/models/base_any2vec.py

Large diffs are not rendered by default.

839 changes: 616 additions & 223 deletions gensim/models/doc2vec.py

2,298 changes: 1,132 additions & 1,166 deletions gensim/models/doc2vec_inner.c
146 changes: 137 additions & 9 deletions gensim/models/doc2vec_inner.pyx
@@ -2,17 +2,17 @@
# cython: boundscheck=False
# cython: wraparound=False
# cython: cdivision=True
# cython: embedsignature=True
# coding: utf-8
#
# Copyright (C) 2013 Radim Rehurek <me@radimrehurek.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""Optimized cython functions for training :class:`~gensim.models.doc2vec.Doc2Vec` model."""
import cython
import numpy as np
from numpy import zeros, float32 as REAL
cimport numpy as np

from libc.math cimport exp
from libc.string cimport memset, memcpy

# scipy <= 0.15
@@ -22,13 +22,7 @@ except ImportError:
# in scipy > 0.15, fblas function has been removed
import scipy.linalg.blas as fblas

from word2vec_inner cimport bisect_left, random_int32, \
scopy, saxpy, sdot, dsdot, snrm2, sscal, \
REAL_t, EXP_TABLE, \
our_dot, our_saxpy, \
our_dot_double, our_dot_float, our_dot_noblas, our_saxpy_noblas

from word2vec import FAST_VERSION
from word2vec_inner cimport bisect_left, random_int32, sscal, REAL_t, EXP_TABLE, our_dot, our_saxpy

DEF MAX_DOCUMENT_LEN = 10000

@@ -227,6 +221,50 @@ cdef unsigned long long fast_document_dmc_neg(
def train_document_dbow(model, doc_words, doctag_indexes, alpha, work=None,
train_words=False, learn_doctags=True, learn_words=True, learn_hidden=True,
word_vectors=None, word_locks=None, doctag_vectors=None, doctag_locks=None):
"""Update distributed bag of words model ("PV-DBOW") by training on a single document.

Called internally from :meth:`~gensim.models.doc2vec.Doc2Vec.train` and
:meth:`~gensim.models.doc2vec.Doc2Vec.infer_vector`.

Parameters
----------
model : :class:`~gensim.models.doc2vec.Doc2Vec`
The model to train.
doc_words : list of str
The input document as a list of words to be used for training. Each word will be looked up in
the model's vocabulary.
doctag_indexes : list of int
Indices into `doctag_vectors` used to obtain the tags of the document.
alpha : float
Learning rate.
work : list of float, optional
Updates to be performed on each neuron in the hidden layer of the underlying network.
train_words : bool, optional
Word vectors will be updated exactly as per Word2Vec skip-gram training only if **both** `learn_words`
and `train_words` are set to True.
learn_doctags : bool, optional
Whether the tag vectors should be updated.
learn_words : bool, optional
Word vectors will be updated exactly as per Word2Vec skip-gram training only if **both**
`learn_words` and `train_words` are set to True.
learn_hidden : bool, optional
Whether or not the weights of the hidden layer will be updated.
word_vectors : numpy.ndarray, optional
The vector representation for each word in the vocabulary. If None, these will be retrieved from the model.
word_locks : numpy.ndarray, optional
    A learning lock factor for each word: a value of 0 completely blocks updates, while a value of 1
    allows the word vectors to be updated freely.
doctag_vectors : numpy.ndarray, optional
Vector representations of the tags. If None, these will be retrieved from the model.
doctag_locks : numpy.ndarray, optional
The lock factors for each tag, same as `word_locks`, but for document-vectors.

Returns
-------
int
Number of words in the input document that were actually used for training.

"""
cdef int hs = model.hs
cdef int negative = model.negative
cdef int sample = (model.vocabulary.sample != 0)
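As a rough illustration of what `train_document_dbow` computes per (document, word) pair, here is a simplified pure-NumPy sketch of one PV-DBOW negative-sampling step. The function name, the `sigmoid` helper, and the toy shapes are invented for this example; the real implementation is the optimized Cython above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbow_update(doctag_vec, syn1neg, word_idx, neg_idxs, alpha):
    """One simplified PV-DBOW negative-sampling step: push the document
    vector toward the true word's output weights and away from the
    sampled noise words' weights."""
    idxs = [word_idx] + list(neg_idxs)
    labels = np.zeros(len(idxs))
    labels[0] = 1.0                           # the true word is the positive example
    out = syn1neg[idxs]                       # (1 + k, size) output weights
    f = sigmoid(out @ doctag_vec)             # predicted probabilities
    g = (labels - f) * alpha                  # prediction error, scaled by the learning rate
    grad_doc = g @ out                        # gradient w.r.t. the document vector
    syn1neg[idxs] += np.outer(g, doctag_vec)  # `learn_hidden`: update output weights
    return doctag_vec + grad_doc              # `learn_doctags`: updated document vector
```

A single step already moves the model in the right direction: starting from zero output weights, the predicted probability of the true word rises above chance while the noise words' probabilities fall below it.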
@@ -363,6 +401,51 @@ def train_document_dbow(model, doc_words, doctag_indexes, alpha, work=None,
def train_document_dm(model, doc_words, doctag_indexes, alpha, work=None, neu1=None,
learn_doctags=True, learn_words=True, learn_hidden=True,
word_vectors=None, word_locks=None, doctag_vectors=None, doctag_locks=None):
"""Update distributed memory model ("PV-DM") by training on a single document.
This method implements the DM model with a projection (input) layer that is either the sum or mean of the context
vectors, depending on the model's `dm_mean` configuration field.

Called internally from :meth:`~gensim.models.doc2vec.Doc2Vec.train` and
:meth:`~gensim.models.doc2vec.Doc2Vec.infer_vector`.

Parameters
----------
model : :class:`~gensim.models.doc2vec.Doc2Vec`
The model to train.
doc_words : list of str
The input document as a list of words to be used for training. Each word will be looked up in
the model's vocabulary.
doctag_indexes : list of int
Indices into `doctag_vectors` used to obtain the tags of the document.
alpha : float
Learning rate.
work : np.ndarray, optional
Private working memory for each worker.
neu1 : np.ndarray, optional
Private working memory for each worker.
learn_doctags : bool, optional
Whether the tag vectors should be updated.
learn_words : bool, optional
    Whether the word vectors should be updated.
learn_hidden : bool, optional
Whether or not the weights of the hidden layer will be updated.
word_vectors : numpy.ndarray, optional
The vector representation for each word in the vocabulary. If None, these will be retrieved from the model.
word_locks : numpy.ndarray, optional
    A learning lock factor for each word: a value of 0 completely blocks updates, while a value of 1
    allows the word vectors to be updated freely.
doctag_vectors : numpy.ndarray, optional
Vector representations of the tags. If None, these will be retrieved from the model.
doctag_locks : numpy.ndarray, optional
The lock factors for each tag, same as `word_locks`, but for document-vectors.

Returns
-------
int
Number of words in the input document that were actually used for training.

"""
cdef int hs = model.hs
cdef int negative = model.negative
cdef int sample = (model.vocabulary.sample != 0)
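The projection-layer behaviour described in the docstring above (sum or mean of the context word vectors and tag vectors, selected by `dm_mean`) can be sketched as follows; the function and variable names are invented for illustration only.

```python
import numpy as np

def dm_projection(word_vecs, context_idxs, doctag_vecs, tag_idxs, dm_mean=True):
    """Build the PV-DM input layer by combining the context window's word
    vectors with the document-tag vectors, using either their mean or
    their plain sum."""
    rows = np.vstack([word_vecs[context_idxs], doctag_vecs[tag_idxs]])
    combined = rows.sum(axis=0)
    return combined / len(rows) if dm_mean else combined
```

Either way the input layer stays the same size as a single vector, which is what distinguishes PV-DM from the concatenated variant below it in this file.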
@@ -521,6 +604,51 @@ def train_document_dm(model, doc_words, doctag_indexes, alpha, work=None, neu1=N
def train_document_dm_concat(model, doc_words, doctag_indexes, alpha, work=None, neu1=None,
learn_doctags=True, learn_words=True, learn_hidden=True,
word_vectors=None, word_locks=None, doctag_vectors=None, doctag_locks=None):
"""Update distributed memory model ("PV-DM") by training on a single document, using a concatenation of the context
window word vectors (rather than a sum or average).
This is typically slower, since the input layer for each training example is significantly larger.

Called internally from :meth:`~gensim.models.doc2vec.Doc2Vec.train` and
:meth:`~gensim.models.doc2vec.Doc2Vec.infer_vector`.

Parameters
----------
model : :class:`~gensim.models.doc2vec.Doc2Vec`
The model to train.
doc_words : list of str
The input document as a list of words to be used for training. Each word will be looked up in
the model's vocabulary.
doctag_indexes : list of int
Indices into `doctag_vectors` used to obtain the tags of the document.
alpha : float
Learning rate.
work : np.ndarray, optional
Private working memory for each worker.
neu1 : np.ndarray, optional
Private working memory for each worker.
learn_doctags : bool, optional
Whether the tag vectors should be updated.
learn_words : bool, optional
    Whether the word vectors should be updated.
learn_hidden : bool, optional
Whether or not the weights of the hidden layer will be updated.
word_vectors : numpy.ndarray, optional
The vector representation for each word in the vocabulary. If None, these will be retrieved from the model.
word_locks : numpy.ndarray, optional
    A learning lock factor for each word: a value of 0 completely blocks updates, while a value of 1
    allows the word vectors to be updated freely.
doctag_vectors : numpy.ndarray, optional
Vector representations of the tags. If None, these will be retrieved from the model.
doctag_locks : numpy.ndarray, optional
The lock factors for each tag, same as `word_locks`, but for document-vectors.

Returns
-------
int
Number of words in the input document that were actually used for training.

"""
cdef int hs = model.hs
cdef int negative = model.negative
cdef int sample = (model.vocabulary.sample != 0)
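The concatenated input layer that makes this variant's input significantly larger can be sketched as below; the names are illustrative, and the real code additionally pads to a fixed-size window.

```python
import numpy as np

def dm_concat_input(word_vecs, window_idxs, doctag_vecs, tag_idxs):
    """PV-DM/concat input layer: tag vectors and the context window's word
    vectors are concatenated (not summed), so the input length grows with
    the window size and tag count."""
    parts = [doctag_vecs[i] for i in tag_idxs] + [word_vecs[i] for i in window_idxs]
    return np.concatenate(parts)
```

For example, with size-100 vectors, one tag and a window of 5 words on each side, the input layer holds 1100 floats rather than 100, which is why this variant trains more slowly.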