Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix docstrings for gensim.models.AuthorTopicModel #1907

Merged
merged 16 commits into from
Apr 3, 2018
Merged

Fix docstrings for gensim.models.AuthorTopicModel #1907

merged 16 commits into from
Apr 3, 2018

Conversation

souravsingh
Copy link
Contributor

The PR aims to fix the docstrings for Author-topic model.

Fixes the documentation of Author-Topic model.

The PR aims to fix the docstrings for Author-topic model.
@souravsingh souravsingh changed the title Add docustrings for Author-topic model Add docstrings for Author-topic model Feb 14, 2018
Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start, please fix current problems & write docstrings for the next part of this file


Parameters
----------
doc2author: dict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict of ??? here and everywhere


Parameters
----------
corpus: list of list of str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe iterable of ...?

----------
eta: float
Dirichlet topic parameter for sparsity.
lambda_shape: float
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

incorrect type

@@ -107,96 +134,69 @@ def construct_author2doc(doc2author):


class AuthorTopicModel(LdaModel):
"""
The constructor estimates the author-topic model parameters based
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need end2end example of usage in docstring (i.e. I can copy-paste it and this should works

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done at the end of the docstring as a separate section.

----------
num_topic: int, optional
Number of topics to be extracted from the training corpus.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No needed empty lines between parameter definition (here and everywhere)

@@ -68,6 +68,18 @@ class AuthorTopicState(LdaState):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing docstring for module (examples of usage, liks to related papers, etc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a usage example in the docstring.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's about related papers/links/etc? add one more example - almost always a good idea

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the paper on Author topic model to the top of the script.

@menshikh-iv
Copy link
Contributor

@souravsingh hi, have you some schedule for this file (some date when you plan to finish)?

@souravsingh
Copy link
Contributor Author

I will be making some final updates by today

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Feb 20, 2018

You have many "TODO" now, good luck, I hope you will succeed today @souravsingh 👍

@souravsingh
Copy link
Contributor Author

I have pushed the changes for the rest of the functions. I would need a review for the changes before the patch can be merged.

Example
-------
>>> import numpy as np
>>> from gensim.models import AuthorTopicModel
>>> model = AuthorTopicModel(corpus, num_topics=100, author2doc=author2doc, id2word=id2word) # train model
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example doesn't work (try to copy-paste it to console and run), this must work.


Parameters
----------
expElogthetad: numpy.ndarray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is parameter description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasnt sure about what to write for the description, which is why I left it blank.

@@ -379,6 +381,26 @@ def inference(self, chunk, author2doc, doc2author, rhot, collect_sstats=False, c
Avoids computing the `phi` variational parameter directly using the
optimization presented in **Lee, Seung: Algorithms for non-negative matrix factorization, NIPS 2001**.

Parameters
----------
chunk: int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chunk: int -> chunk : int (here and everywhere)

----------
chunk: int
The chunk numer of the sparse document vector on which inference needs to be done.
author2doc: dict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict of ???

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a dict of str and str?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict of (type, type), check by tests, what's types should be here

@@ -332,6 +323,11 @@ def extend_corpus(self, corpus):
are added in the process. If serialization is not used, the corpus, as a list
of documents, is simply extended.

Parameters
----------
corpus: list of list of str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure?

@@ -68,6 +68,18 @@ class AuthorTopicState(LdaState):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's about related papers/links/etc? add one more example - almost always a good idea

@menshikh-iv
Copy link
Contributor

@souravsingh ping me when you'll be ready with PR please

@souravsingh
Copy link
Contributor Author

@menshikh-iv The patch is ready for review. I will fix the flake8 issues once I take care of the comments.

@@ -58,7 +55,7 @@
class AuthorTopicState(LdaState):
"""
NOTE: distributed mode not available yet in the author-topic model. This AuthorTopicState
object is kept so that when the time comes to imlement it, it will be easier.
object is kept so that when the time comes to implement it, it will be easier.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to use

Notes
--------
....

Instead of NOTE: and similar things


Do not call this method directly, instead use `model[author_names]`.

Parameters
----------
author_names : str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

incorrect type

@@ -950,10 +1085,22 @@ def __getitem__(self, author_names, eps=None):
Return topic distribution for input author as a list of
(topic_id, topic_probabiity) 2-tuples.

Ingores topics with probaility less than `eps`.
Ignores topics with probaility less than `eps`.

Do not call this method directly, instead use `model[author_names]`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be wrapped in Warnings section


Example
-------
>>> author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example doesn't work, you should define all variable & make all imports first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple test for example: I copy-paste it to REPL, run and this works successfully


Returns
-------
list of 2-tuples
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be list of (some_type_1, some_type_2), here and everywhere

----------
chunk : int
The chunk numer of the sparse document vector on which inference needs to be done.
author2doc : dict of {str: list of ints}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use () instead of {} in type definition, here and everywhere

For other parameter settings, see :class:`AuthorTopicModel` constructor.
Parameters
----------
corpus : iterable of iterable of (int, int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use iterable of list of (int, number), here and everywhere

Parameters
----------
expElogthetad: numpy.ndarray
Value of variational distribution q(theta|gamma).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better use latex for this stuff (this will be rendered in documentation)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the representation for this in LaTex?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@souravsingh look into original paper + :math: in sphinx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be like this- q(\theta|\gamma)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@souravsingh maybe, I'm not sure, check original paper.

>>> model = AuthorTopicModel(corpus, num_topics=100, author2doc=author2doc, id2word=id2word) # train model
>>> model.update(corpus2) # update the author-topic model with additional documents
>>> model.update(corpus, author2doc) # update the author-topic model with additional documents
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example must work, corpus, author2doc, id2word undefined here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, "update with additional", but you use exactly same corpus here, this confused

@menshikh-iv
Copy link
Contributor

@souravsingh please ping me when you'll be ready

@menshikh-iv
Copy link
Contributor

@souravsingh how is going? when you plan to finish PR?

@souravsingh
Copy link
Contributor Author

@menshikh-iv Ready for review

@@ -19,15 +19,12 @@
The model is closely related to Latent Dirichlet Allocation. The AuthorTopicModel class
inherits the LdaModel class, and its usage is thus similar.

Distributed computation and multiprocessing is not implemented at the moment, but may be
coming in the future.
The model was introduced by Rosen-Zvi and co-authors in 2004 and is described in [1]_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't use [1] references (this doesn't works properly with one of our sphinx plugin), intead of, you should use something like, here and everywhere

The model was introduced by Rosen-Zvi and co-authors in 2004 and is described in `The Author-Topic Model for Authors and Documents <https://arxiv.org/abs/1207.4169>`_

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it's a good idea to add example of usage in current docstring (head of file)

----------
eta: float
Dirichlet topic parameter for sparsity.
lambda_shape: int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect types (this is (int, int), 2d arrays)


Parameters
----------
corpus: list of list of str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iterable of list of str

----------
corpus: list of list of str
Corpus of documents.
author2doc: dict of (str: list of int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should looks like dict of (str, list of int), here and everywhere (for all dict of ..)


Model persistency is achieved through its `load`/`save` methods.
"""
"""The constructor estimates the author-topic model parameters based on a training corpus."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to move example here (instead of __init__)

... }
>>> corpus = mmcorpus.MmCorpus(datapath('testcorpus.mm'))
>>> model = AuthorTopicModel(corpus, author2doc=author2doc, id2word=dictionary, num_topics=4, passes=100) # train model
>>> model.update(corpus, author2doc) # update the author-topic model with additional documents
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but how to apply model? This is main usecase.

@@ -306,6 +330,7 @@ def __init__(self, corpus=None, num_topics=100, id2word=None, author2doc=None, d
self.update(corpus, author2doc, doc2author, chunks_as_numpy=use_numpy)

def __str__(self):
"""Return a string representation of AuthorTopicModel class."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, where is Returns section?

Parameters
----------
expElogthetad: numpy.ndarray
Value of variational distribution .. math:: q(theta|gamma).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is incorrect using of math (don't rendered)


Returns
-------
list of (topic_id, topic_probability) as a 2-tuple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list of (int, float)

@menshikh-iv menshikh-iv changed the title Add docstrings for Author-topic model Fix docstrings for gensim.models.AuthorTopicModel Mar 9, 2018
@menshikh-iv
Copy link
Contributor

ping @souravsingh, how is going?

>>> model = AuthorTopicModel(corpus, author2doc=author2doc, id2word=dictionary, num_topics=4, passes=100) # train model
>>> model.update(corpus, author2doc) # update the author-topic model with additional documents
>>> author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]
>>> print(author_vecs) #Prints top authors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't top authors, this is all authors.

----------
chunk : int
The chunk numer of the sparse document vector on which inference needs to be done.
author2doc : dict of (strm list of ints)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ints -> int
strm -> str,

here and everywhere


Returns
-------
list of (int, float) as a 2-tuple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list of (int, float)

@@ -19,14 +19,31 @@
The model is closely related to Latent Dirichlet Allocation. The AuthorTopicModel class
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be references to classes like

:class:`~gensim.models...`


A tutorial can be found at
https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks/atmodel_tutorial.ipynb.
The model was introduced by Rosen-Zvi and co-authors in 2004 and is described in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it's a good idea to add basic description here

  • what's an use cases for this model
  • how this works (in general)

@menshikh-iv
Copy link
Contributor

Thank you @souravsingh

@menshikh-iv menshikh-iv merged commit c396ad9 into piskvorky:develop Apr 3, 2018
Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes.

chunksize : int, optional
Controls the size of the mini-batches.
passes : int, optional
Number of times the model makes a pass over the entire training data.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

training data => training corpus (consistency helps with clarity)

Threshold value of gamma(topic difference between consecutive two topics)
until which the iterations continue.
serialized : bool, optional
Indicates whether the input corpora to the model are simple lists
Copy link
Owner

@piskvorky piskvorky Apr 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are "input corpora"? Where do they come from?

What does "or saved to the hard-drive" mean? If false, "input corpora" are saved to a hard-drive? Why are we saving input?

The documentation of these two parameters (serialized and serialization_path) is confusing. Who should use this and why?

This is important because the docstring example at the top actually uses these parameters (that's how I found them, I wanted to know what it is).

@@ -306,14 +309,21 @@ def __init__(self, corpus=None, num_topics=100, id2word=None, author2doc=None, d
self.update(corpus, author2doc, doc2author, chunks_as_numpy=use_numpy)

def __str__(self):
"""Get a string representation of object.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this docstring completely (no added value, just adds fluff).


Raises
------
AssertionError
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AssertionError is for programmer mistakes. Bad user input is either ValueError or TypeError.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question for the author of this code, not for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants