-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the off-by-one bug in the TFIDF model. #2392
Conversation
Fixes #2375. Use len to compute the number of features. Since the ids are zero-indexed, Using max causes an off-by-one bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution. Looks good. Left you some comments.
As a side note, is it possible to unit test this thing somehow? I notice the only visible output is a logging statement, so tests may be tricky/useless, but just wanted to make sure.
gensim/models/tfidfmodel.py
Outdated
@@ -390,7 +390,7 @@ def initialize(self, corpus): | |||
self.num_nnz = numnnz | |||
self.dfs = dfs | |||
# and finally compute the idf weights | |||
n_features = max(dfs) if dfs else 0 | |||
n_features = len(dfs) if dfs else 0 | |||
logger.info( | |||
"calculating IDF weights for %i documents and %i features (%i matrix non-zeros)", | |||
self.num_docs, n_features, self.num_nnz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.num_docs, n_features, self.num_nnz | |
self.num_docs, len(dfs), self.num_nnz |
Looks good, thanks @AMR-KELEG ! Can you check whether the same issues is present somewhere else, too? (e.g. from that same refactoring) |
gensim/models/tfidfmodel.py
Outdated
logger.info( | ||
"calculating IDF weights for %i documents and %i features (%i matrix non-zeros)", | ||
self.num_docs, n_features, self.num_nnz | ||
self.num_docs, 1 + max([-1] + list(dfs.keys())), self.num_nnz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.num_docs, 1 + max([-1] + list(dfs.keys())), self.num_nnz | |
self.num_docs, max(dfs.keys()) + 1 if dfs else 0, self.num_nnz |
I have checked the commit and I don't think this edit was introduced in any other files. I have also searched the whole project for this line |
Fixes #2375.
Use len to compute the number of features.
Since the ids are zero-indexed, Using max causes an off-by-one bug.