-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the 'keep_tokens' parameter to 'filter_extremes' and test it #1210
Add the 'keep_tokens' parameter to 'filter_extremes' and test it #1210
Conversation
Add the optional 'keep_tokens' parameter to the 'filter_extremes' method in dictionary.py. This parameter can contain a list of tokens, which will be kept regardless of the 'no_below' and 'no_above' settings. This can be useful if the research goal is to enforce certain tokens to appear in topics, and still be able to filter all other extremes. If 'keep_tokens' is not given, the functionality of 'filter_extremes' is unchanged. Unit tests are also provided to assert examples of the above.
The travis-ci check failed because:
I do not know how this can affect my commit, as not providing the optional 'keep_tokens' parameter should not change any functionality in dictionary.py, see the unit tests. How can this commit be included successfully? |
@toliwa It's not related to your code changes, just the new release of scipy 0.19 Changing it to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep to a single iteration
gensim/corpora/dictionary.py
Outdated
# add ids of keep_tokens elements to good_ids | ||
if keep_tokens: | ||
keep_ids = [self.token2id[v] for v in keep_tokens if v in self.token2id] | ||
good_ids_copy = (v for v in itervalues(self.token2id) if no_below <= self.dfs.get(v, 0) <= no_above_abs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please keep to a single iteration of token2id
by adding a or v in keep_tokens
check if keep_tokens
is present
Please merge in the latest develop to make the tests pass. |
Create good_ids only once as per optimization suggestion, regardless if 'keep_tokens' is provided or not.
Thanks for the new feature! |
* Add the 'keep_tokens' parameter to 'filter_extremes' and test it Add the optional 'keep_tokens' parameter to the 'filter_extremes' method in dictionary.py. This parameter can contain a list of tokens, which will be kept regardless of the 'no_below' and 'no_above' settings. This can be useful if the research goal is to enforce certain tokens to appear in topics, and still be able to filter all other extremes. If 'keep_tokens' is not given, the functionality of 'filter_extremes' is unchanged. Unit tests are also provided to assert examples of the above. * Create good_ids only once Create good_ids only once as per optimization suggestion, regardless if 'keep_tokens' is provided or not.
if no_below <= self.dfs.get(v, 0) <= no_above_abs) | ||
if keep_tokens: | ||
keep_ids = [self.token2id[v] for v in keep_tokens if v in self.token2id] | ||
good_ids = (v for v in itervalues(self.token2id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code style: no vertical indent in gensim -- please use hanging indent.
Add the optional 'keep_tokens' parameter to the 'filter_extremes'
method in dictionary.py. This parameter can contain a list of tokens,
which will be kept regardless of the 'no_below' and 'no_above' settings.
This can be useful if the research goal is to enforce certain tokens to
appear in topics, and still be able to filter all other extremes.
If 'keep_tokens' is not given, the functionality of 'filter_extremes' is
unchanged.
Unit tests are also provided to assert examples of the above.