-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix docstrings for gensim.sklearn_api
. Fix #1667
#1895
Conversation
@@ -5,11 +5,6 @@ | |||
# Copyright (C) 2017 Radim Rehurek <radimrehurek@seznam.cz> | |||
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html | |||
|
|||
""" | |||
Scikit learn interface for gensim for easy use of gensim with scikit-learn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some __doc__
definitely needed
gensim/sklearn_api/lsimodel.py
Outdated
Base LSI module | ||
"""Base LSI module. | ||
|
||
Scikit learn interface for `gensim.models.lsimodel` for easy use of gensim with scikit-learn. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use links
:class:`~gensim.model.lsimodel.LsiModel`
here and everywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, explicit mention "if you want to read more about it, please look into original class :class:...
"
gensim/sklearn_api/lsimodel.py
Outdated
|
||
Parameters | ||
---------- | ||
num_topics : int, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wdyt about the link to original method only (for avoiding duplication)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also thought about that and I am not sure what is better. On one hand now we have duplication but on the other hand its easier for the developer and user to see the documentation in one tab. Because not all parameters are propagated to the inner model, some of the parameters will be visible in the wrapper and some in the original model (you would need 2 tabs open). I am a bit in favor of duplicating but not 100% sure so if you prefer I will remove duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, maybe combine both approaches: mentioned parameter & type here, but for description - sent the user to the parameter from original class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok that sounds reasonable, I will apply asap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@steremma we discuss this questions again and this isn't good idea, because it's OK if user look into documentation online (and have a link), but if user use python
/jupyter
, he will call something like help(model)
or model?
and for this case, links don't work :( (and this is the main problem). For this reason - can you return descriptions for parameters, copy-paste is the lesser evil than docstring, that exists, but useless, if you can't read it in your interpreter.
Also, the link to original class must be in any case too.
- `lsimodel` - `text2bow` - `phrases` * Added `doc` in every file * Provided sphinx style links to parameter types referencing gensim classes. * Propagated arguments are still duplicated for readability - maybe remove?
…r explanation of their meaning the reader is redirected to the original models documentation
gensim/sklearn_api/lsimodel.py
Outdated
|
||
Parameters | ||
---------- | ||
num_topics : int, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@steremma we discuss this questions again and this isn't good idea, because it's OK if user look into documentation online (and have a link), but if user use python
/jupyter
, he will call something like help(model)
or model?
and for this case, links don't work :( (and this is the main problem). For this reason - can you return descriptions for parameters, copy-paste is the lesser evil than docstring, that exists, but useless, if you can't read it in your interpreter.
Also, the link to original class must be in any case too.
gensim/sklearn_api/lsimodel.py
Outdated
-------- | ||
Integrate with sklearn Pipelines: | ||
|
||
>>> model = LsiTransformer(num_topics=15, id2word=id2word) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great (examples is the really nice idea), but please be sure, that example executable (i.e. you write all needed imports & define data).
You can check that all works easily with python -m doctest path/to/file/with/examples.py
@steremma please be accurate with PEP8 checks, look to https://travis-ci.org/RaRe-Technologies/gensim/jobs/343473932#L509 and resolve it. |
@menshikh-iv this PR is a work in progress I only push to sync between different workstations (I opened it so that you can easily track the status). Sometimes I have to leave one workstation quickly so I just push to continue from another one. In the end the commits will be squashed so I hope it won't be a problem. We can close the PR for now and re-open when its ready for review and merge. |
…into sklearn-api-docs
gensim/sklearn_api/d2vmodel.py
Outdated
>>> # Lets represent each document using a 50 dimensional vector | ||
>>> model = D2VTransformer(min_count=1, size=50) | ||
>>> docvecs = model.fit_transform(common_texts) | ||
>>> assert docvecs.shape == (len(common_texts), 50) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe something more interesting (like calculation similarity between documents)?
gensim/sklearn_api/lsimodel.py
Outdated
|
||
>>> # Fit our pipeline to some corpus | ||
>>> corpus = [id2word.doc2bow(i.split()) for i in data.data] | ||
>>> fitted_pipeline = pipe.fit(corpus, data.target) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add some evaluation here (what's score here?)
gensim/sklearn_api/lsimodel.py
Outdated
>>> | ||
>>> # Create an ID to word mapping using some corpus included in sklearn. | ||
>>> cats = ['rec.sport.baseball', 'sci.crypt'] | ||
>>> data = fetch_20newsgroups(subset='train', categories=cats, shuffle=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use gensim-data
here (20-news available)
gensim/sklearn_api/w2vmodel.py
Outdated
>>> | ||
>>> # What is the vector representation of the word 'graph'? | ||
>>> wordvecs = model.fit(common_texts).transform(['graph', 'system']) | ||
>>> assert wordvecs.shape == (2, 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe call most_similar
?
…into sklearn-api-docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, you missed ldaseqmodel
, please add this too
gensim/sklearn_api/atmodel.py
Outdated
|
||
For more information on the inner workings please take a look at the original class. The model's internal workings | ||
are heavily based on `"The Author-Topic Model for Authors and Documents", Osen-Zvi et. al 2004 | ||
<https://mimno.infosci.cornell.edu/info6150/readings/398.pdf>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe mention paper only in original class (not here), wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will replace references with some text <http://...>
_
gensim/sklearn_api/hdp.py
Outdated
Examples | ||
-------- | ||
|
||
>>> from gensim.test.utils import common_dictionary, common_corpus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unindent example please (here and everywhere)
gensim.sklearn_api
. Fix #1667
Current PR will fix #1895 |
In this PR I am working on the source documentation for the sklearn API.
More wrappers will be submitted in subsequent pushes to the same PR until they are all complete.