-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add smart information retrieval system for TfidfModel
. Fix #1785
#1791
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start! Please be careful with more backward compatibility
gensim/models/tfidfmodel.py
Outdated
def __init__(self, corpus=None, id2word=None, dictionary=None, | ||
wlocal=utils.identity, wglobal=df2idf, normalize=True): | ||
def __init__(self, corpus=None, id2word=None, dictionary=None, smartirs="ntc", | ||
wlocal=None, wglobal=None, normalize=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to support backward compatibility, why you change default values?
Good solution - by default support default behavior and smartirs=None
, but if user set smartirs
- use it (and ignore wlocal, wglobal, ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It is backward compatible. Please check the smartirs value. :)
- This can be followed but I refrained because I removed
df2idf
function totally because it is redundant. Are you sure I should add it back?
gensim/models/tfidfmodel.py
Outdated
w_tf, w_df, w_n = smartirs | ||
|
||
if w_tf not in 'nlabL': | ||
raise ValueError('Expected term frequency weight to be one of nlabL, except got ' + w_tf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better use 'nlabL'
instead nlabL
(readability), same for got ..
gensim/models/tfidfmodel.py
Outdated
|
||
if self.wlocal is None: | ||
if n_tf == "n": | ||
self.wlocal = lambda tf, mean=None, _max=None: tf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to use simple define (instead of lambda) for avoiding pickle problems (here and everywhere)
gensim/models/tfidfmodel.py
Outdated
@@ -127,11 +162,6 @@ def initialize(self, corpus): | |||
|
|||
# and finally compute the idf weights | |||
n_features = max(dfs) if dfs else 0 | |||
logger.info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why you remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This showed the progress of the pre - computation
step. Now that the values are no longer being pre-computed I think we need to log this in a different way. How do you want it to go?
gensim/test/test_sklearn_api.py
Outdated
@@ -498,7 +498,6 @@ def testPersistence(self): | |||
original_matrix = self.model.transform(original_bow) | |||
passed = numpy.allclose(loaded_matrix, original_matrix, atol=1e-1) | |||
self.assertTrue(passed) | |||
|
|||
def testModelNotFitted(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add more tests (for new functionality)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have that in my checklist but before that I need to pass the already present tests.
gensim/models/tfidfmodel.py
Outdated
elif n_tf == "b": | ||
return 1 if tf > 0 else 0 | ||
elif n_tf == "L": | ||
return (1 + math.log(tf)) / (1 + math.log(mean)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the value is none of the enumerated options? Will the error make sense to the user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolve_weights
take care of that.
gensim/models/tfidfmodel.py
Outdated
vector = [ | ||
(termid, self.wlocal(tf) * self.idfs.get(termid)) | ||
for termid, tf in bow if self.idfs.get(termid, 0.0) != 0.0 | ||
(termid, self.wlocal(tf, mean=np.mean(np.array(bow), axis=1), _max=np.max(bow, axis=1)) * self.wglobal(self.dfs[termid], self.num_docs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks wasteful (creating arrays, only to throw them away). What are the performance implications of these changes? Do you have a benchmark before/after?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the approach.
gensim/models/tfidfmodel.py
Outdated
@@ -77,11 +95,69 @@ def __init__(self, corpus=None, id2word=None, dictionary=None, | |||
If `dictionary` is specified, it must be a `corpora.Dictionary` object | |||
and it will be used to directly construct the inverse document frequency | |||
mapping (then `corpus`, if specified, is ignored). | |||
|
|||
`smartirs` or SMART (System for the Mechanical Analysis and Retrieval of Text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taken from wikipedia.
gensim/models/tfidfmodel.py
Outdated
w_tf, w_df, w_n = smartirs | ||
|
||
if w_tf not in 'nlabL': | ||
raise ValueError('Expected term frequency weight to be one of \'nlabL\', except got ' + w_tf + '\'') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: you should use raise ValueError("...'nlabL'...")
for avoiding \
.
gensim/models/tfidfmodel.py
Outdated
""" | ||
return add + math.log(1.0 * totaldocs / docfreq, log_base) | ||
return add + np.log(float(totaldocs) / docfreq) / np.log(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log_base
doesn't used here
gensim/models/tfidfmodel.py
Outdated
def __init__(self, corpus=None, id2word=None, dictionary=None, | ||
wlocal=utils.identity, wglobal=df2idf, normalize=True): | ||
def __init__(self, corpus=None, id2word=None, dictionary=None, wlocal=utils.identity, | ||
wglobal=df2idf, normalize=True, smartirs=None): | ||
""" | ||
Compute tf-idf by multiplying a local component (term frequency) with a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you convert all docstrings in this file to numpy-style, according to my previous comment #1780 (comment)
gensim/models/tfidfmodel.py
Outdated
self.num_docs, self.num_nnz, self.idfs = None, None, None | ||
self.smartirs = smartirs | ||
|
||
if self.normalize is True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happen if self.normalize
isn't bool?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case if smartirs
is not None
, self.normalize
will be defined by the values of smartirs
. If smartirs
is None
and self.normalize
is not a bool then self.normalize
must be a function. Just like wlocal
and wglobal
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some tips about docstrings
@@ -973,13 +973,13 @@ def testTransform(self): | |||
|
|||
def testSetGetParams(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to add more tests (also, check situations, when you pass smartirs
and wlocal
for example)
gensim/models/tfidfmodel.py
Outdated
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
def resolve_weights(smartirs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docstrings needed too (for all stuff here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that Checks for validity of smartirs parameter.
is enough. Do you have anything else in mind as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markroxor need to add "Parameters" (type, description), "Raises" (type, reason), "Returns" (type, description)
gensim/models/tfidfmodel.py
Outdated
# not strictly necessary and could be computed on the fly in TfidfModel__getitem__. | ||
# this method is here just to speed things up a little. | ||
return {termid: wglobal(df, total_docs) for termid, df in iteritems(dfs)} | ||
|
||
|
||
def wlocal_g(tf, n_tf): # TODO rename it (to avoid confusion) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget about renaming
gensim/models/tfidfmodel.py
Outdated
return x | ||
elif n_n == "c": | ||
return matutils.unitvec(x) | ||
# TODO write byte-size normalisation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to fix it too
gensim/models/tfidfmodel.py
Outdated
If `dictionary` is specified, it must be a `corpora.Dictionary` object | ||
and it will be used to directly construct the inverse document frequency | ||
mapping (then `corpus`, if specified, is ignored). | ||
wlocals : user specified function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of
wlocals : user specified function
Default for `wlocal` is identity (other options: math.sqrt, math.log1p, ...)
should be
wlocals : function, optional
description of parameter, with links to different implementations (if function defined in gensim, link should be like :func:`~gensim.some.func`)
everywhere
gensim/models/tfidfmodel.py
Outdated
and returns a sparse vector. | ||
Parameters | ||
---------- | ||
corpus : dictionary.doc2bow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type should be iterable of iterable of (int, int)
gensim/models/tfidfmodel.py
Outdated
id2word : dict | ||
id2word is an optional dictionary that maps the word_id to a token. | ||
In case id2word isn’t specified the mapping id2word[word_id] = str(word_id) will be used. | ||
dictionary :corpora.Dictionary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type should be
:class:`~gensim.corpora.dictionary.Dictionary`
gensim/models/tfidfmodel.py
Outdated
`normalize=True` means set to unit length (default); `False` means don't | ||
normalize. You can also set `normalize` to your own function that accepts | ||
and returns a sparse vector. | ||
smartirs : {'None' ,'str'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str, optional
gensim/models/tfidfmodel.py
Outdated
|
||
for more information visit https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System | ||
|
||
Returns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returns
no needed in __init__
method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @markroxor 👍
gensim/models/tfidfmodel.py
Outdated
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
def resolve_weights(smartirs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markroxor need to add "Parameters" (type, description), "Raises" (type, reason), "Returns" (type, description)
gensim/models/tfidfmodel.py
Outdated
""" | ||
return add + math.log(1.0 * totaldocs / docfreq, log_base) | ||
return add + np.log(float(totaldocs) / docfreq) / np.log(log_base) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's a reason to use np.log
instead of math.log
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consistency, I am using np
everywhere. Anyways I do'nt see any problem with that. Do you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem this is only a question :)
gensim/models/tfidfmodel.py
Outdated
return (1 + np.log(tf) / np.log(2)) / (1 + np.log(tf.mean(axis=0) / np.log(2))) | ||
|
||
|
||
def updated_wglobal(docfreq, totaldocs, n_df): # TODO rename it (to avoid confusion) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove TODO
(about renaming)
gensim/test/test_tfidfmodel.py
Outdated
# nnn | ||
model = tfidfmodel.TfidfModel(self.corpus, smartirs='nnn') | ||
transformed_docs = [model[docs[0]], model[docs[1]]] | ||
expected_docs = [[(3, 2), (4, 2), (5, 3), (6, 2), (7, 3), (8, 2)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use hanging indents instead of vertical (here and everywhere)
gensim/test/test_tfidfmodel.py
Outdated
model4 = tfidfmodel.TfidfModel.load(datapath('tfidf_model.tst')) | ||
self.assertTrue(model3.idfs == model4.idfs) | ||
tstvec = [] | ||
self.assertTrue(np.allclose(model3[tstvec], model4[tstvec])) # try projecting an empty vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check non-empty case too
gensim/test/test_tfidfmodel.py
Outdated
model4 = tfidfmodel.TfidfModel.load(datapath('tfidf_model.tst.bz2')) | ||
self.assertTrue(model3.idfs == model4.idfs) | ||
tstvec = [] | ||
self.assertTrue(np.allclose(model3[tstvec], model4[tstvec])) # try projecting an empty vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as in testPersistence
gensim/test/test_tfidfmodel.py
Outdated
|
||
# Test persistence between old and new compressed model. | ||
model3 = tfidfmodel.TfidfModel(self.corpus, smartirs="ntc") | ||
model4 = tfidfmodel.TfidfModel.load(datapath('tfidf_model.tst.bz2')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be better if you add gensim version in model name ("old" in comment isn't really informative)
gensim/test/test_tfidfmodel.py
Outdated
@@ -58,6 +74,13 @@ def testPersistence(self): | |||
tstvec = [] | |||
self.assertTrue(np.allclose(model[tstvec], model2[tstvec])) # try projecting an empty vector | |||
|
|||
# Test persistence between old and new model. | |||
model3 = tfidfmodel.TfidfModel(self.corpus, smartirs="ntc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, need to test than models with smartirs
save/load correctly
gensim/test/test_tfidfmodel.py
Outdated
expected_docs = [model[docs[0]], model[docs[1]]] | ||
|
||
self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0])) | ||
self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1])) | ||
# endclass TestTfidfModel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove this line # ...
I don't think that it is required. As this is part of docstrings and this function is implicitly used and can not be invoked by the user. Are you sure I should add docstrings? |
@markroxor I understand what you mean, very desirable add docstrings for all stuff (this isn't only for public-api users, this is for contributors too), this isn't critical as public API, but desirable. |
gensim/models/tfidfmodel.py
Outdated
@@ -272,7 +272,7 @@ def __getitem__(self, bow, eps=1e-12): | |||
|
|||
vector = [ | |||
(termid, tf * self.idfs.get(termid)) | |||
for termid, tf in zip(termid_array, tf_array) if self.idfs.get(termid, 0.0) != 0.0 | |||
for termid, tf in zip(termid_array, tf_array) if self.idfs.get(termid, 0.0) > eps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comparing in absolute value safer (easier to read and comprehend intent; plus less room for error).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a discussion with @menshikh-iv stating otherwise. @menshikh-iv I am reverting this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@piskvorky I did the same here, but you asked me to revert it.
gensim/models/tfidfmodel.py
Outdated
(termid, self.wlocal(tf) * self.idfs.get(termid)) | ||
for termid, tf in bow if self.idfs.get(termid, 0.0) != 0.0 | ||
(termid, tf * self.idfs.get(termid)) | ||
for termid, tf in zip(termid_array, tf_array) if self.idfs.get(termid, 0.0) != 0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is brittle; better compare floats for equality using eps
instead. For our purposes, two floats are "equal" if their absolute difference is smaller than eps
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused, #1791 (comment) and #1791 (comment) contradict each other, what you mean @piskvorky?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To compare floats for equality, use abs(x - y) < eps
. Not x == y
, nor x - y < eps
.
TfidfModel
. Fix #1785 [MRG]
TfidfModel
. Fix #1785 [MRG]TfidfModel
. Fix #1785
…y#1785 (piskvorky#1791) * fixing appveyor * verify weights * verify weights * smartirs ready * change old tests * remove lambdas * address suggestions * minor fix * pep8 fix * fix pickle problem * flake8 fix * fix bug in docstring * added few tests * fix normalize issue for pickling * fix normalize issue for pickling * test without sklearn api * hanging idents and new tests * add docstring * add docstring * better way cmparing floats * old way of cmp floats * doc fix[1] * doc fix[2] * fix description TODOs * fix irksome comparision
For more information check issue #1785.
Tests have failed locally because
lambda
functions cannot be serialized bypickle
.Can't monkey patch regular functions either as it is not supported either.
TODO: