Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Subpackage refactor. Fix 1584 #1607

Closed
wants to merge 8 commits into from
Closed

Conversation

menshikh-iv
Copy link
Contributor

@menshikh-iv menshikh-iv commented Oct 3, 2017

Fixes #1584.

@menshikh-iv menshikh-iv added the breaks backward-compatibility Change breaks backward compatibility label Oct 3, 2017
@menshikh-iv
Copy link
Contributor Author

@macks22 please review my commit about CoherenceModel refactoring, maybe you have any suggestions about it.

…-refactor

# Conflicts:
#	gensim/corpora/textcorpus.py
#	gensim/topic_coherence/indirect_confirmation_measure.py
EPSILON = 1e-12 # Should be small. Value as suggested in paper.


"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The presence of these docstrings throughout the module makes me think the existing module structure was actually a pretty useful way to separate these things out. Have you considered adding a coherence_utils package with from <existing_module> import * instead of throwing them all in the same module? This seems cleaner to me. I think it makes things easier to find and navigate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, maybe simple moving & renaming coherence_model->gensim.models.coherence_utils OR gensim.models.coherence_inneris better and you are right (because I don't think that this variant is sufficiently consistent), for this reason, I'm asking your for review, thanks 👍

@@ -1278,3 +1277,456 @@ def lazy_flatten(nested_list):
yield sub
else:
yield el


class PorterStemmer(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a similar comment applies here to my comment above about the coherence_utils module. I think this would be better organized as a package, with subpackages for preprocessing, corpora, etc. I think the existing utils module was already getting a bit bloated. I would hypothesize very long files with a wide variety of things makes it hard to find things for folks who are new to the codebase. I might even advocate a more granular decomposition such as: utils/stemming, utils/lemmatizing, utils/corpora, utils/helpers, or something of that sort, with aggregation in the utils/__init__.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I agree (maybe not very "detailed" utils, but splitting gensim.utils to gensim.utils.utils (old) and gensim.utils.preprocessing is better)

return " ".join(w for w in s.split() if w not in STOPWORDS)


RE_PUNCT = re.compile('([%s])+' % re.escape(string.punctuation), re.UNICODE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts on moving this into some kind of namespace object called regexes, or something of that sort, either as a module or a class with these as class attributes?

The dedicated namespace would make it easier to find these, provide a single place for users to look for existing regex patterns, and also make it easier to document them. The downside I see is moving their definitions away from where they are used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you, maybe simple storage class will be better for this purposes.

@macks22
Copy link
Contributor

macks22 commented Oct 6, 2017

@menshikh-iv I left a few comments, all regarding modularity. I really like the work you've been doing to make the code better organized and more consistently styled. I think this is a good direction, and I hope my comments are useful. Cheers!

@menshikh-iv
Copy link
Contributor Author

Continued in #1618

@menshikh-iv menshikh-iv deleted the subpackage-refactor branch October 10, 2017 10:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaks backward-compatibility Change breaks backward compatibility
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants