-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make remove_stopwords()
behavior more consistent
#3171
Comments
Hi @gojomo, I would like to work on this issue.
Should we keep two different names for the above two functions? For example, we can keep the name of the function in Would like to know your thoughts on the above two or any other function name you have in mind. |
It'd be good to have @piskvorky's preferences here, but my rough approach might be:
Thanks for your interest! |
I wasn't aware of Yes, IMO such functions should live in
For now, consolidating preprocessing under the |
Thanks, @gojomo, and @piskvorky for this detailed explanation.
So, I will rename the function in There are other generic preprocessing function in |
Yes. Such one-liners are weird, but if they must exist, let them exist under |
ok, I will make those changes and link a PR soon. |
Once it's working in its new location, I think we may then see some further ideas on clearer naming. @piskvorky - do you suggest the 1-liners that move into |
Is an alias even needed? |
If they all keep the same names, sure. But if usage from the disfavored place is to be discouraged and even eventually removed, aliases would be required to attach the deprecation warnings. |
In that case, should I keep the function names in Currently, I am doing this - import gensim.parsing.preprocessing as preprocessing
@deprecated("Function will be removed in future releases,use gensim.parsing.preprocessing.strip_multiple_whitespaces() instead")
def strip_multiple_whitespaces(s):
return preprocessing.strip_multiple_whitespaces(s) |
If the functions are truly identical in signature & functionality, I think the approach you've shown, with a deprecating-forwarding-function, is the diligent incremental improvement that's sure to warn, rather than break, any uses of the |
I don't get it. If the signature & functionality is the same, what's the deprecation for? No code will break if you just import the names from their location, so why bother users with some warnings? |
If it was an error to have duplicated (or misplaced) such functions, I'd have thought that we'd want to eventually fix the error, and shrink the module where they don't belong. It's a 'code smell' to leave such redundant cruft in place indefinitely, making source files & API docs larger without tangible benefit, and at a cost in code readability. There might well be zero outside users who are calling the misplaced "There should be one— and preferably only one —obvious way to do it." Personally, I'd just remove the redundant versions ASAP – "now is better than never" – add a release note, and let people who hit an error due to the change (if any such people exist) fix their code. |
I still don't get it. TextCorpus needs to import these functions anyway, because its code makes use of them. So the imports will be there either way. My question was, why re-wrap them in a deprecation? How does that help us maintainers, or the users? Or are you suggesting removing the one-liners altogether (= rewriting TextCorpus), instead of moving them to |
Yes, we need to imports those functions in Textcorpus thus any According to @gojomo and what I have understood (please correct me if I am wrong) - But as there would be very few users who are using |
Aha. OK, if the functions keep the same name & merely move to a more-sensible place ( But the concern remains with the
(And, the concern might repeat if any other |
Yeah, I don't think anyone's using So let's do whatever makes sense to us. IMO that means giving the functions suitable names and moving them under |
(triggered by SO question: https://stackoverflow.com/questions/67944732/using-my-own-stopword-list-with-gensim-corpora-textcorpus-textcorpus/67951592#67951592)
Gensim has two
remove_stopwords()
functions with similar, but slightly-different behavior that risks confusing users.gensim.parsing.preprocessing.remove_stopwords
takes a space-delimited string, and always consults the current value ofgensim.parsing.preprocessing.STOPWORDS
(including any user reassignments).By contrast,
gensim.corpora.textcorpus.remove_stopwords
takes a list-of-tokens, allows the specification of an alternate list-of-stopwords, but if allowed to use its 'default' stopwords, uses only the value ofgensim.parsing.preprocessing.STOPWORDS
that was captured when the function was defined (which could miss later user redefinitions).Avoiding the reuse of identical function names that take different argument-types would help avoid user/code-reviewer confusion. Also, capturing the value of a variable that might change leads to confusing function-behavior: the token version could just as easily consult
gensim.parsing.preprocessing.STOPWORDS
on each call (as the string version already does).The text was updated successfully, but these errors were encountered: