make `remove_stopwords()` behavior more consistent #3171

gojomo · 2021-06-12T18:04:12Z

(triggered by SO question: https://stackoverflow.com/questions/67944732/using-my-own-stopword-list-with-gensim-corpora-textcorpus-textcorpus/67951592#67951592)

Gensim has two remove_stopwords() functions with similar, but slightly-different behavior that risks confusing users.

gensim.parsing.preprocessing.remove_stopwords takes a space-delimited string, and always consults the current value of gensim.parsing.preprocessing.STOPWORDS (including any user reassignments).

By contrast, gensim.corpora.textcorpus.remove_stopwords takes a list-of-tokens, allows the specification of an alternate list-of-stopwords, but if allowed to use its 'default' stopwords, uses only the value of gensim.parsing.preprocessing.STOPWORDS that was captured when the function was defined (which could miss later user redefinitions).

Avoiding the reuse of identical function names that take different argument-types would help avoid user/code-reviewer confusion. Also, capturing the value of a variable that might change leads to confusing function-behavior: the token version could just as easily consult gensim.parsing.preprocessing.STOPWORDS on each call (as the string version already does).

The text was updated successfully, but these errors were encountered:

rock420 · 2021-06-17T11:18:15Z

Hi @gojomo, I would like to work on this issue.
I need some clarification regarding this -

Avoiding the reuse of identical function names that take different argument-types would help avoid user/code-reviewer confusion.

Should we keep two different names for the above two functions? For example, we can keep the name of the function in gensim.corpora.textcorpus as remove_stopwords_token() or something more understandable to the user.
Or
Taking the gensim.parsing.preprocessing.remove_stopwords as standard and using it in the gensim.corpora.textcorpus.TextCorpus for preprocessing.

Would like to know your thoughts on the above two or any other function name you have in mind.

gojomo · 2021-06-17T16:40:01Z

It'd be good to have @piskvorky's preferences here, but my rough approach might be:

see which one is more used in more places throughout our code - let that one keep its existing name/signature (principle of: cause minimal adjustment pain for others) - I suspect that will be the preprocessing one, as that sounds more generic/central.
try to redefine the other one to be more clearly named, and be clearly related to the other - perhaps by relocating into the same module as the other (directly alongside it in the source code), and calling/being-called-by the other as appropriate. If it's the list-of-tokens one that gets renamed, I'd find remove_stopword_tokens() more naturally descriptive in an imperative language style, as in "remove the stopword tokens"".
ensure both respect the global STOPWORDS (if the user has redefined it), and both offer similar ability to specify alternate stopwords
wrap the old function that's been removed with a deprecation warning pointing to the new alternative, so any changes don't immediately break old code, but that someday the less-preferable, confusing, older function can be removed entirely. at the same time, ensure our code uses the preferred, not the deprecated, function.

Thanks for your interest!

piskvorky · 2021-06-17T17:58:35Z

I wasn't aware of gensim.corpora.textcorpus.remove_stopwords, which seems to have been added in #1459.

Yes, IMO such functions should live in gensim.preprocessing – and better yet, outside of Gensim. I'd prefer Gensim to be pre-processing free, for reasons stated in the tutorials:

(primarily) Preprocessing is domain-specific and, similarly to evaluation, works best when tuned to your specific application. I don't have much faith in general-purpose NLP.
(secondary) There are other, more focused NLP libraries for text preprocessing. (But see my caveat above)

For now, consolidating preprocessing under the preprocessing module is preferable, thanks.

rock420 · 2021-06-18T04:54:20Z

Thanks, @gojomo, and @piskvorky for this detailed explanation.

For now, consolidating preprocessing under the preprocessing module is preferable, thanks.

So, I will rename the function in gensim.corpora.textcorpus ( as it's not used in any other module) and will put that under the preprocessing module.

There are other generic preprocessing function in gensim.corpora.textcorpus ( remove_short(), lower_to_unicode(), strip_multiple_whitespaces() ). Should I also move those functions and similar preprocessing functions present in other modules into the preprocessing module.

piskvorky · 2021-06-18T05:34:15Z

Yes. Such one-liners are weird, but if they must exist, let them exist under preprocessing.

rock420 · 2021-06-18T05:41:16Z

ok, I will make those changes and link a PR soon.

gojomo · 2021-06-18T16:08:43Z

Once it's working in its new location, I think we may then see some further ideas on clearer naming.

@piskvorky - do you suggest the 1-liners that move into .preprocessing. need deprecated aliases passing through from their old names in .textcorpus.?

piskvorky · 2021-06-18T19:23:48Z

Is an alias even needed? from gensim.preprocessing import X, Y, Z ought to do the same job.

gojomo · 2021-06-18T20:57:57Z

If they all keep the same names, sure. But if usage from the disfavored place is to be discouraged and even eventually removed, aliases would be required to attach the deprecation warnings.

rock420 · 2021-06-19T06:41:13Z

But if usage from the disfavored place is to be discouraged and even eventually removed, aliases would be required to attach the deprecation warnings.

In that case, should I keep the function names in .preprocessing same as in .textcorpus?
Also there are some functions, like, strip_multiple_whitespaces() which exist in both textcorpus and preprocessing module with same name and functionality.

Currently, I am doing this -

import gensim.parsing.preprocessing as preprocessing

@deprecated("Function will be removed in future releases,use gensim.parsing.preprocessing.strip_multiple_whitespaces() instead")
def strip_multiple_whitespaces(s):
    return preprocessing.strip_multiple_whitespaces(s)

gojomo · 2021-06-19T06:59:00Z

If the functions are truly identical in signature & functionality, I think the approach you've shown, with a deprecating-forwarding-function, is the diligent incremental improvement that's sure to warn, rather than break, any uses of the .textcorpus. versions. (Of course, there might be zero users-in-the-field calling those. It's unfortunate the duplicate behavior slipped in, making more work now to gracefully remove it.)

piskvorky · 2021-06-19T07:43:07Z

I don't get it. If the signature & functionality is the same, what's the deprecation for? No code will break if you just import the names from their location, so why bother users with some warnings?

gojomo · 2021-06-19T09:31:49Z

If it was an error to have duplicated (or misplaced) such functions, I'd have thought that we'd want to eventually fix the error, and shrink the module where they don't belong.

It's a 'code smell' to leave such redundant cruft in place indefinitely, making source files & API docs larger without tangible benefit, and at a cost in code readability. There might well be zero outside users who are calling the misplaced .textcorpus. versions... but if they're left in place as options with no discouragement, future users might mix-and-match calls, multiplying the prospect future code readers have to wonder, & then research: "is there a reason for, and difference between, these two modules' similar-looking functions?"

"There should be one— and preferably only one —obvious way to do it."

Personally, I'd just remove the redundant versions ASAP – "now is better than never" – add a release note, and let people who hit an error due to the change (if any such people exist) fix their code.

piskvorky · 2021-06-19T11:53:24Z

I still don't get it. TextCorpus needs to import these functions anyway, because its code makes use of them.

So the imports will be there either way. My question was, why re-wrap them in a deprecation? How does that help us maintainers, or the users?

Or are you suggesting removing the one-liners altogether (= rewriting TextCorpus), instead of moving them to preprocessing?

rock420 · 2021-06-20T16:17:09Z

So the imports will be there either way.

Yes, we need to imports those functions in Textcorpus thus any user code using the previous implementation will not break. Technically it will be correct and users will not face any issue.

According to @gojomo and what I have understood (please correct me if I am wrong) -
If we let users call those functions directly from textcorpus (like textcorpus.strip_multiple_whitespaces()) without any discouragement (by not giving any deprecating warning for such calls), it might create some sort of confusions for them and future users (who will see both the calls textcorpus.strip_multiple_whitespaces() & preprocessing.strip_multiple_whitespaces() across different codes).

But as there would be very few users who are using gensim preprocessing functions directly (as libraries like nltk are more preferable for such tasks), deprecation warnings may not provide much benefit.

gojomo · 2021-06-21T18:08:08Z

I still don't get it. TextCorpus needs to import these functions anyway, because its code makes use of them.

So the imports will be there either way. My question was, why re-wrap them in a deprecation? How does that help us maintainers, or the users?

Or are you suggesting removing the one-liners altogether (= rewriting TextCorpus), instead of moving them to preprocessing?

Aha. OK, if the functions keep the same name & merely move to a more-sensible place (preprocessing), then I'm overthinking this, since anywhere that imports a function also unavoidably exports it. (Though then, without appearing in the module docs.)

But the concern remains with the textcorpus.remove_stopwords() function itself. If it both moves and its name changes for clarity - which I think is a good idea, given the potential confusion with preprocessing.remove_stopwords() – should textcorpus still essentially provide a remove_stopwords function with the old signature/behavior, or not? And if it does, is there a warning that it is superfluous & deprecated, or does it remain as a monument to a past error forever? For example, would the following line in textcorpus.py be a service or disservice to Gensim users?

from gensim.preprocessing. import remove_stopword_tokens as remove_stopwords

(And, the concern might repeat if any other textcorpus utility preprocessing functions, when moved to preprocessing, deserve any other naming/parameter changes to be more clearly consistent with other functions there.)

piskvorky · 2021-06-21T19:46:10Z

Yeah, I don't think anyone's using textcorpus.remove_tokens. And if they do, the update to their code should be trivial.

So let's do whatever makes sense to us. IMO that means giving the functions suitable names and moving them under preprocessing.

gojomo added difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple) labels Jun 12, 2021

rock420 mentioned this issue Jun 22, 2021

Isolate generic preprocessing functions #3180

Merged

2 tasks

mpenkov closed this as completed in #3180 Aug 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

make `remove_stopwords()` behavior more consistent #3171

make `remove_stopwords()` behavior more consistent #3171

gojomo commented Jun 12, 2021

rock420 commented Jun 17, 2021

gojomo commented Jun 17, 2021 •

edited

Loading

piskvorky commented Jun 17, 2021 •

edited

Loading

rock420 commented Jun 18, 2021 •

edited

Loading

piskvorky commented Jun 18, 2021

rock420 commented Jun 18, 2021

gojomo commented Jun 18, 2021 •

edited

Loading

piskvorky commented Jun 18, 2021

gojomo commented Jun 18, 2021

rock420 commented Jun 19, 2021 •

edited

Loading

gojomo commented Jun 19, 2021

piskvorky commented Jun 19, 2021 •

edited

Loading

gojomo commented Jun 19, 2021

piskvorky commented Jun 19, 2021 •

edited

Loading

rock420 commented Jun 20, 2021

gojomo commented Jun 21, 2021

piskvorky commented Jun 21, 2021

make remove_stopwords() behavior more consistent #3171

make remove_stopwords() behavior more consistent #3171

Comments

gojomo commented Jun 12, 2021

rock420 commented Jun 17, 2021

gojomo commented Jun 17, 2021 • edited Loading

piskvorky commented Jun 17, 2021 • edited Loading

rock420 commented Jun 18, 2021 • edited Loading

piskvorky commented Jun 18, 2021

rock420 commented Jun 18, 2021

gojomo commented Jun 18, 2021 • edited Loading

piskvorky commented Jun 18, 2021

gojomo commented Jun 18, 2021

rock420 commented Jun 19, 2021 • edited Loading

gojomo commented Jun 19, 2021

piskvorky commented Jun 19, 2021 • edited Loading

gojomo commented Jun 19, 2021

piskvorky commented Jun 19, 2021 • edited Loading

rock420 commented Jun 20, 2021

gojomo commented Jun 21, 2021

piskvorky commented Jun 21, 2021

make `remove_stopwords()` behavior more consistent #3171

make `remove_stopwords()` behavior more consistent #3171

gojomo commented Jun 17, 2021 •

edited

Loading

piskvorky commented Jun 17, 2021 •

edited

Loading

rock420 commented Jun 18, 2021 •

edited

Loading

gojomo commented Jun 18, 2021 •

edited

Loading

rock420 commented Jun 19, 2021 •

edited

Loading

piskvorky commented Jun 19, 2021 •

edited

Loading

piskvorky commented Jun 19, 2021 •

edited

Loading