Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor documentation for *2Vec models #1944

Closed
wants to merge 43 commits into from

Conversation

steremma
Copy link
Contributor

@steremma steremma commented Mar 1, 2018

In the end this PR will fix the documentation for BaseWordEmbeddingsModel, BaseAny2Vec, Word2Vec and possible Doc2Vec. At the moment of opening only the first is done, however its useful to keep it online to get feedback as early as possible

@@ -288,18 +281,67 @@ class BaseWordEmbeddingsModel(BaseAny2VecModel):

"""

def _clear_post_train(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't change the code in current PR, only docstrings.

@menshikh-iv
Copy link
Contributor

Great start @steremma, please continue!

@somnathrakshit
Copy link

Hey @steremma can I help you with the documentation? I have been working on the same.

@steremma
Copy link
Contributor Author

steremma commented Mar 5, 2018

Hello @somnathrakshit

I believe the PR is almost finished, however I am unsure about certain argument in some of the helper methods, specifically those that are conditionally defined in word2vec and doc2vec when the cython implementation is missing.

It would be nice if you could take a look at the type and description I gave for these arguments and let me know if you disagree or if you can improve any of them.

For more specific info feel free to ping me in gitter so that we don't spam this discussion.

Another way to cooperate might be for you to submit PRs to my fork directly instead of telling me what to change.I think in this way, after merging you will get credit for your contribution as well.
@menshikh-iv is that correct?

@menshikh-iv
Copy link
Contributor

@CLearERR @anotherbugmaster @yurkai - guys, please discuss with @steremma how to make docstring guideline better (we need our contribution documentation guide)

"""
Base class containing common methods for training, using & evaluating word embeddings learning models.
For example - `Word2Vec`, `FastText`, etc.
"""Base class containing common methods for training, using & evaluating word embeddings learning models.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget about docstrings for BaseAny2VecModel.

Also, don't forget about _ methods, this is really important for persons, who will add some *2vec implementations in future.

Can be simply a list of lists of tokens, but for larger corpora,
consider an iterable that streams the sentences directly from disk/network.
See :class:`~gensim.models.word2vec.BrownCorpus`, :class:`~gensim.models.word2vec.Text8Corpus`
or :class:`~gensim.models.word2vec.LineSentence` in :mod:`~gensim.models.word2vec` module for such examples.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to mention :mod: here (because you already mentioned concrete class)

consider an iterable that streams the sentences directly from disk/network.
See :class:`~gensim.models.word2vec.BrownCorpus`, :class:`~gensim.models.word2vec.Text8Corpus`
or :class:`~gensim.models.word2vec.LineSentence` in :mod:`~gensim.models.word2vec` module for such examples.
workers : int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

almost all arguments is optional, no?


Parameters
----------
sentences : iterable of iterable of str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iterable of list of str better (here and everywhere)

List of callbacks that need to be executed/run at specific stages during training.
batch_words : int
Number of words to be processed by a single job.
trim_rule : function, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to describe function signature (in the description of the parameter). I see (word, count, min_count), but I need types here + show it more explicitly (maybe new sentence)

keep_raw_vocab : bool
If not true, delete the raw vocabulary after the scaling is done and free up RAM.
corpus_count : int
word_freq : dict of (unicode str, int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict of (str, int)


Returns
-------
dict of (str, int), optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How Returns can be optional?

Examples
--------

#. Initialize a model with e.g.::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to fix example (this should works fine + demonstrate more methods)

Obtain likelihood score for a single sentence in a fitted skip-gram representaion.
The sentence is a list of Vocab objects (or None, when the corresponding
word is not in the vocabulary). Called internally from `Word2Vec.score()`.
Obtain likelihood score for a single sentence in a fitted skip-gram representation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be

"""Obtain ...

here and everywhere

@@ -512,32 +648,49 @@ def train(self, documents, total_examples=None, total_words=None,
queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks)

def _raw_word_count(self, job):
"""Return the number of words in a given job."""
"""Return the number of words in a given job.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to use always Get instead of Return in docstring (here and everywhere)

@menshikh-iv menshikh-iv changed the title [DNM] Document any2vec Refactor documentation for *2Vec models. Mar 14, 2018
@menshikh-iv menshikh-iv changed the title Refactor documentation for *2Vec models. Refactor documentation for *2Vec models Mar 14, 2018
@menshikh-iv menshikh-iv added the incubator project PR is RaRe incubator project label Mar 14, 2018
Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general - looks awesome! Good work 👍

Missed things (for current moment)

  • Examples for d2v/w2v
  • Fasttext
  • Poincare
  • Docstrings for .pyx files (this will be done in last order, we'll discuss it later)

@@ -28,18 +28,40 @@

class BaseAny2VecModel(utils.SaveLoad):
"""Base class for training, using and evaluating any2vec model.
Contains implementation for multi-threaded training.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix description in the head of current file please (make it more detailed, links to several related classes, etc).

A subclass should initialize the following attributes:
- self.kv (instance of concrete implementation of `BaseKeyedVectors` interface)
- self.vocabulary (instance of concrete implementation of `BaseVocabBuilder` abstract class)
- self.trainables (instance of concrete implementation of `BaseTrainables` abstract class)

Parameters
----------
workers : int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional parameters (here and everywhere)

Parameters
----------
workers : int
Number of working threads, used for multiprocessing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multithreading :)

----------
job_queue : Queue of (list of object, dict)
A queue of jobs still to be processed. The worker will take up jobs from this queue.
Each job is represented by a tuple where the first element is the corpus chunk to be processed and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to add "toy" example, how looks element of queue

Parameters
----------
data_iterator : iterable of list of object
The input corpus. This will be split in chunks and these chunks will be pushed to the queue.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not always corpus

----------
data_iterator : iterable of list of object
The input corpus. This will be split in chunks and these chunks will be pushed to the queue.
job_queue : Queue of (list of object, dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict of ? here and everywhere

Multiplier for size of queue -> size = number of workers * queue_factor.
report_delay : float, optional
Number of seconds between two consecutive progress report messages in the logger.
callbacks : list of :class: `~gensim.models.callbacks.CallbackAny2Vec`, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add space between :class: and link to model, please check all in rendered documentation (how this looks), this is important

@@ -526,7 +806,7 @@ def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=No
len(raw_vocab), sum(itervalues(raw_vocab))
)

# Since no sentences are provided, this is to control the corpus_count
# Since no sentences are provided, this is to control the `corpus_count`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this have no sense to use in comments (that presented as# ... `)

Returns
-------
(np.ndarray, np.ndarray)
Each worker threads private work memory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't forget about empty line at the end of each docstring (except one-line docstrings), here and everywhere

----------
model : :class:`~gensim.models.word2Vec.Word2Vec`
The Word2Vec model instance to train.
sentences : iterable of iterable of str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iterable of list of str better (here and everywhere)

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Mar 14, 2018

@steremma If you'll finish with all mentioned stuff, some hints about docstrings for .pyx

  1. Add special flag to head of file like here https://github.com/RaRe-Technologies/gensim/blob/985d552a1ca37b96629cbc1037100fa0a21f5ba5/gensim/_matutils.pyx#L3
  2. Write docstrings for all methods in .pyx files
  3. cython path/to/your/file, please use cython>=0.27 for this proposes
  4. Add needed *.rst file for this file
  5. tox -e docs (also, you need to add it to apiref.rst of course)

steremma and others added 4 commits March 20, 2018 11:44
…ome intuitive information taken from the papers but also references to usage examples for users that do not wish to understand the underlying theory.
Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work @steremma, you are one of the best persons who worked on documentation 🔥

So, what're additional things should be done before a merge

-----
Even though this is the usual case, not all embeddings transform text.
For example :class:`~gensim.models.poincare.PoincareModel` operates on graph representations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add See also section here and mention several concrete implementations as w2v, fasttext, etc.


"""

def __init__(self, workers=3, vector_size=100, epochs=5, callbacks=(), batch_words=10000):
"""Initialize model parameters.

Notes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better place for this notes is below (in class docstring instead of __init__ docstring)

------
IOError
When methods are called on instance (should be called from class).
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing emptyline at the end of docstring (here and everywhere)

Parameters
----------
job_params : dict of (str, obj)
Unused. TODO: Delete this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can write something like UNUSED. (without TODO)


>>> from gensim.test.utils import common_texts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add \t before >>> (here and everywhere)

return 60 * len(self.docvecs.offset2doctag) + 140 * len(self.docvecs.doctags)

def infer_vector(self, doc_words, alpha=0.1, min_alpha=0.0001, steps=5):
"""
Infer a vector for given post-bulk training document.
"""Infer a vector for given post-bulk training document.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a warning about "this infer new vectors each time, it's fine that you retrieve different vectors for the same document, increate steps for more stable representations (and similar but short tip for steps parameter)

@@ -45,6 +81,7 @@
logger = logging.getLogger(__name__)

try:
raise ImportError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

???

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to make sure that doctests work even without numpy installed so I did this hack to force my computer to use the python implementation. And then i forgot to remove it!

@@ -519,20 +601,31 @@ def train(self, sentences, total_examples=None, total_words=None,
self.trainables.get_vocab_word_vecs(self.wv)

def init_sims(self, replace=False):
"""
"""Deletes the keyed vector syn1 structure.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quiet

@@ -73,22 +74,32 @@ class PoincareModel(utils.SaveLoad):
and :meth:`~gensim.models.poincare.PoincareModel.load` methods, or stored/loaded in the word2vec format
via `model.kv.save_word2vec_format` and :meth:`~gensim.models.poincare.PoincareKeyedVectors.load_word2vec_format`.

Note that training cannot be resumed from a model loaded via `load_word2vec_format`, if you wish to train further,
Notes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's about examples here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have example for each method in the PoincareKeyedVectors class.

"""Load model from disk, inherited from :class:`~gensim.utils.SaveLoad`."""
"""Load model from disk, inherited from :class:`~gensim.utils.SaveLoad`.

See also :meth:`~gensim.models.poincare.PoincareModel.save`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace all "See also" to section

See also
------------
...

here and everywhere

@menshikh-iv menshikh-iv added the RFM label Apr 3, 2018
@piskvorky
Copy link
Owner

PR continued in #2087.

@piskvorky piskvorky closed this Jun 14, 2018
menshikh-iv pushed a commit that referenced this pull request Jun 20, 2018
* Remove useless methods

* started working on docstrings

* more work done

* Finished documentation for the `BaseWordEmbeddingsModel

* PEP-8

* Revert "Remove useless methods"

This reverts commit feb3c32.

* added documentation for the  class and all its helper methods

* remove duplicated type info

* Added documentation for `Doc2vec` model and all its helper methods

* Fixed paper references and added documentation for �Doc2VecVocab

* Fixed paper references

* minor referencing fixes

* sphinx identation

* Added docstrings for the private methods in `BaseAny2Vec`

* Applied all code review corrections, example fix still pending

* Added missing docstrings

* Fixed `int {1, 0}` -> `{1, 0}`

* Fixed examples and code review corrections

* Fixed examples and applied code review corrections (optional arguments, correct types, blank lines at end of docstrings

* Applied code review corrections and added top level usage examples

* Added high level explanation of the class hierarchy, fixed code review corrections

* Final identation fixes

* Documentation fixes

* Fixed all examples

* delete redundant reference to module

* Added explanation for all important class attributes. These include some intuitive information taken from the papers but also references to usage examples for users that do not wish to understand the underlying theory.

* documented public cython functions

* documented public cython functions in doc2vec

* Applied code review corrections

* added documentation for public cython methods in `fasttext`

* added documentation for C functions in the word2vec

* fix build issues

* add missing rst

* fix base_any2vec

* fix doc2vec[1]

* fix doc2vec[2]

* fix doc2vec[3ъ

* fix doc2vec[4]

* fix doc2vec_inner + remove unused imports

* fix fasttext[1]

* reformat example sections

* word2vec doc fixes

* more doc fixes

* merging in changes from #1944

* review docs for doc2vec, base_any2vec

* review fasttext docs

* review poincare docs

* minor typo fixes

* simplify word2vec.train() docs

* update alpha & epoch docs for *2vec models

* add *_inner docs

* fixing KeyedVectors docs

* disable sphinx latex and errors (temporary, revert later)

* hyperlink fixes

* Fix build warnings

* fix flake8

* enable strict doc building

* embedsignature for w2v & ft

* yes/no -> ✅/❌

* cleanup base_any2vec

* clenup cython files

* cleanup doc2vec

* improve d2v example

* cleanup fasttext

* clenup utils_any2vec

* clenup poincare

* clenup keyedvectors

* cleanup word2vec

* add newline around module docstrings + re-generate *.c files (for correct doc building)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
incubator project PR is RaRe incubator project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants