Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make remove_stopwords() behavior more consistent #3171

Closed
gojomo opened this issue Jun 12, 2021 · 17 comments · Fixed by #3180
Closed

make remove_stopwords() behavior more consistent #3171

gojomo opened this issue Jun 12, 2021 · 17 comments · Fixed by #3180
Labels
difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple)

Comments

@gojomo
Copy link
Collaborator

gojomo commented Jun 12, 2021

(triggered by SO question: https://stackoverflow.com/questions/67944732/using-my-own-stopword-list-with-gensim-corpora-textcorpus-textcorpus/67951592#67951592)

Gensim has two remove_stopwords() functions with similar, but slightly-different behavior that risks confusing users.

gensim.parsing.preprocessing.remove_stopwords takes a space-delimited string, and always consults the current value of gensim.parsing.preprocessing.STOPWORDS (including any user reassignments).

By contrast, gensim.corpora.textcorpus.remove_stopwords takes a list-of-tokens, allows the specification of an alternate list-of-stopwords, but if allowed to use its 'default' stopwords, uses only the value of gensim.parsing.preprocessing.STOPWORDS that was captured when the function was defined (which could miss later user redefinitions).

Avoiding the reuse of identical function names that take different argument-types would help avoid user/code-reviewer confusion. Also, capturing the value of a variable that might change leads to confusing function-behavior: the token version could just as easily consult gensim.parsing.preprocessing.STOPWORDS on each call (as the string version already does).

@gojomo gojomo added difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple) labels Jun 12, 2021
@rock420
Copy link
Contributor

rock420 commented Jun 17, 2021

Hi @gojomo, I would like to work on this issue.
I need some clarification regarding this -

Avoiding the reuse of identical function names that take different argument-types would help avoid user/code-reviewer confusion.

Should we keep two different names for the above two functions? For example, we can keep the name of the function in gensim.corpora.textcorpus as remove_stopwords_token() or something more understandable to the user.
Or
Taking the gensim.parsing.preprocessing.remove_stopwords as standard and using it in the gensim.corpora.textcorpus.TextCorpus for preprocessing.

Would like to know your thoughts on the above two or any other function name you have in mind.

@gojomo
Copy link
Collaborator Author

gojomo commented Jun 17, 2021

It'd be good to have @piskvorky's preferences here, but my rough approach might be:

  • see which one is more used in more places throughout our code - let that one keep its existing name/signature (principle of: cause minimal adjustment pain for others) - I suspect that will be the preprocessing one, as that sounds more generic/central.
  • try to redefine the other one to be more clearly named, and be clearly related to the other - perhaps by relocating into the same module as the other (directly alongside it in the source code), and calling/being-called-by the other as appropriate. If it's the list-of-tokens one that gets renamed, I'd find remove_stopword_tokens() more naturally descriptive in an imperative language style, as in "remove the stopword tokens"".
  • ensure both respect the global STOPWORDS (if the user has redefined it), and both offer similar ability to specify alternate stopwords
  • wrap the old function that's been removed with a deprecation warning pointing to the new alternative, so any changes don't immediately break old code, but that someday the less-preferable, confusing, older function can be removed entirely. at the same time, ensure our code uses the preferred, not the deprecated, function.

Thanks for your interest!

@piskvorky
Copy link
Owner

piskvorky commented Jun 17, 2021

I wasn't aware of gensim.corpora.textcorpus.remove_stopwords, which seems to have been added in #1459.

Yes, IMO such functions should live in gensim.preprocessing – and better yet, outside of Gensim. I'd prefer Gensim to be pre-processing free, for reasons stated in the tutorials:

  1. (primarily) Preprocessing is domain-specific and, similarly to evaluation, works best when tuned to your specific application. I don't have much faith in general-purpose NLP.

  2. (secondary) There are other, more focused NLP libraries for text preprocessing. (But see my caveat above)

For now, consolidating preprocessing under the preprocessing module is preferable, thanks.

@rock420
Copy link
Contributor

rock420 commented Jun 18, 2021

Thanks, @gojomo, and @piskvorky for this detailed explanation.

For now, consolidating preprocessing under the preprocessing module is preferable, thanks.

So, I will rename the function in gensim.corpora.textcorpus ( as it's not used in any other module) and will put that under the preprocessing module.

There are other generic preprocessing function in gensim.corpora.textcorpus ( remove_short(), lower_to_unicode(), strip_multiple_whitespaces() ). Should I also move those functions and similar preprocessing functions present in other modules into the preprocessing module.

@piskvorky
Copy link
Owner

Yes. Such one-liners are weird, but if they must exist, let them exist under preprocessing.

@rock420
Copy link
Contributor

rock420 commented Jun 18, 2021

ok, I will make those changes and link a PR soon.

@gojomo
Copy link
Collaborator Author

gojomo commented Jun 18, 2021

Once it's working in its new location, I think we may then see some further ideas on clearer naming.

@piskvorky - do you suggest the 1-liners that move into .preprocessing. need deprecated aliases passing through from their old names in .textcorpus.?

@piskvorky
Copy link
Owner

Is an alias even needed? from gensim.preprocessing import X, Y, Z ought to do the same job.

@gojomo
Copy link
Collaborator Author

gojomo commented Jun 18, 2021

If they all keep the same names, sure. But if usage from the disfavored place is to be discouraged and even eventually removed, aliases would be required to attach the deprecation warnings.

@rock420
Copy link
Contributor

rock420 commented Jun 19, 2021

But if usage from the disfavored place is to be discouraged and even eventually removed, aliases would be required to attach the deprecation warnings.

In that case, should I keep the function names in .preprocessing same as in .textcorpus?
Also there are some functions, like, strip_multiple_whitespaces() which exist in both textcorpus and preprocessing module with same name and functionality.

Currently, I am doing this -

import gensim.parsing.preprocessing as preprocessing

@deprecated("Function will be removed in future releases,use gensim.parsing.preprocessing.strip_multiple_whitespaces() instead")
def strip_multiple_whitespaces(s):
    return preprocessing.strip_multiple_whitespaces(s)

@gojomo
Copy link
Collaborator Author

gojomo commented Jun 19, 2021

If the functions are truly identical in signature & functionality, I think the approach you've shown, with a deprecating-forwarding-function, is the diligent incremental improvement that's sure to warn, rather than break, any uses of the .textcorpus. versions. (Of course, there might be zero users-in-the-field calling those. It's unfortunate the duplicate behavior slipped in, making more work now to gracefully remove it.)

@piskvorky
Copy link
Owner

piskvorky commented Jun 19, 2021

I don't get it. If the signature & functionality is the same, what's the deprecation for? No code will break if you just import the names from their location, so why bother users with some warnings?

@gojomo
Copy link
Collaborator Author

gojomo commented Jun 19, 2021

If it was an error to have duplicated (or misplaced) such functions, I'd have thought that we'd want to eventually fix the error, and shrink the module where they don't belong.

It's a 'code smell' to leave such redundant cruft in place indefinitely, making source files & API docs larger without tangible benefit, and at a cost in code readability. There might well be zero outside users who are calling the misplaced .textcorpus. versions... but if they're left in place as options with no discouragement, future users might mix-and-match calls, multiplying the prospect future code readers have to wonder, & then research: "is there a reason for, and difference between, these two modules' similar-looking functions?"

"There should be one— and preferably only one —obvious way to do it."

Personally, I'd just remove the redundant versions ASAP – "now is better than never" – add a release note, and let people who hit an error due to the change (if any such people exist) fix their code.

@piskvorky
Copy link
Owner

piskvorky commented Jun 19, 2021

I still don't get it. TextCorpus needs to import these functions anyway, because its code makes use of them.

So the imports will be there either way. My question was, why re-wrap them in a deprecation? How does that help us maintainers, or the users?

Or are you suggesting removing the one-liners altogether (= rewriting TextCorpus), instead of moving them to preprocessing?

@rock420
Copy link
Contributor

rock420 commented Jun 20, 2021

So the imports will be there either way.

Yes, we need to imports those functions in Textcorpus thus any user code using the previous implementation will not break. Technically it will be correct and users will not face any issue.

According to @gojomo and what I have understood (please correct me if I am wrong) -
If we let users call those functions directly from textcorpus (like textcorpus.strip_multiple_whitespaces()) without any discouragement (by not giving any deprecating warning for such calls), it might create some sort of confusions for them and future users (who will see both the calls textcorpus.strip_multiple_whitespaces() & preprocessing.strip_multiple_whitespaces() across different codes).

But as there would be very few users who are using gensim preprocessing functions directly (as libraries like nltk are more preferable for such tasks), deprecation warnings may not provide much benefit.

@gojomo
Copy link
Collaborator Author

gojomo commented Jun 21, 2021

I still don't get it. TextCorpus needs to import these functions anyway, because its code makes use of them.

So the imports will be there either way. My question was, why re-wrap them in a deprecation? How does that help us maintainers, or the users?

Or are you suggesting removing the one-liners altogether (= rewriting TextCorpus), instead of moving them to preprocessing?

Aha. OK, if the functions keep the same name & merely move to a more-sensible place (preprocessing), then I'm overthinking this, since anywhere that imports a function also unavoidably exports it. (Though then, without appearing in the module docs.)

But the concern remains with the textcorpus.remove_stopwords() function itself. If it both moves and its name changes for clarity - which I think is a good idea, given the potential confusion with preprocessing.remove_stopwords() – should textcorpus still essentially provide a remove_stopwords function with the old signature/behavior, or not? And if it does, is there a warning that it is superfluous & deprecated, or does it remain as a monument to a past error forever? For example, would the following line in textcorpus.py be a service or disservice to Gensim users?

from gensim.preprocessing. import remove_stopword_tokens as remove_stopwords

(And, the concern might repeat if any other textcorpus utility preprocessing functions, when moved to preprocessing, deserve any other naming/parameter changes to be more clearly consistent with other functions there.)

@piskvorky
Copy link
Owner

Yeah, I don't think anyone's using textcorpus.remove_tokens. And if they do, the update to their code should be trivial.

So let's do whatever makes sense to us. IMO that means giving the functions suitable names and moving them under preprocessing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants