-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Subpackage refactor. Fix 1584 #1607
Conversation
@macks22 please review my commit about CoherenceModel refactoring, maybe you have any suggestions about it. |
…-refactor # Conflicts: # gensim/corpora/textcorpus.py # gensim/topic_coherence/indirect_confirmation_measure.py
EPSILON = 1e-12 # Should be small. Value as suggested in paper. | ||
|
||
|
||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The presence of these docstrings throughout the module makes me think the existing module structure was actually a pretty useful way to separate these things out. Have you considered adding a coherence_utils
package with from <existing_module> import *
instead of throwing them all in the same module? This seems cleaner to me. I think it makes things easier to find and navigate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, maybe simple moving & renaming coherence_model->gensim.models.coherence_utils
OR gensim.models.coherence_inner
is better and you are right (because I don't think that this variant is sufficiently consistent), for this reason, I'm asking your for review, thanks 👍
@@ -1278,3 +1277,456 @@ def lazy_flatten(nested_list): | |||
yield sub | |||
else: | |||
yield el | |||
|
|||
|
|||
class PorterStemmer(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a similar comment applies here to my comment above about the coherence_utils
module. I think this would be better organized as a package, with subpackages for preprocessing
, corpora
, etc. I think the existing utils
module was already getting a bit bloated. I would hypothesize very long files with a wide variety of things makes it hard to find things for folks who are new to the codebase. I might even advocate a more granular decomposition such as: utils/stemming
, utils/lemmatizing
, utils/corpora
, utils/helpers
, or something of that sort, with aggregation in the utils/__init__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I agree (maybe not very "detailed" utils, but splitting gensim.utils
to gensim.utils.utils
(old) and gensim.utils.preprocessing
is better)
return " ".join(w for w in s.split() if w not in STOPWORDS) | ||
|
||
|
||
RE_PUNCT = re.compile('([%s])+' % re.escape(string.punctuation), re.UNICODE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are your thoughts on moving this into some kind of namespace object called regexes
, or something of that sort, either as a module or a class with these as class attributes?
The dedicated namespace would make it easier to find these, provide a single place for users to look for existing regex patterns, and also make it easier to document them. The downside I see is moving their definitions away from where they are used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you, maybe simple storage class will be better for this purposes.
@menshikh-iv I left a few comments, all regarding modularity. I really like the work you've been doing to make the code better organized and more consistently styled. I think this is a good direction, and I hope my comments are useful. Cheers! |
Continued in #1618 |
Fixes #1584.