Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Skip common English words in phrases #2979

Merged
merged 12 commits into from
Oct 14, 2020
Merged

Conversation

piskvorky
Copy link
Owner

@piskvorky piskvorky commented Oct 10, 2020

Follow up from #2976 (comment) : add a list of English "common words" to skip during phrase construction.

These English words are an optional switch; the default doesn't change.

Backward incompatible change: this PR renamed the common_terms parameter of gensim.models.phrases.Phrases() to connector_words, with the same meaning.

Fixes #2520. Fixes #1465.

@piskvorky piskvorky force-pushed the phrases_common_words branch from cc6ce47 to 9e503a4 Compare October 10, 2020 20:14
@piskvorky
Copy link
Owner Author

piskvorky commented Oct 10, 2020

I still cannot access CircleCI (not even from an incognito window) – @mpenkov can you please check what it's complaining about? Thanks.

@piskvorky
Copy link
Owner Author

piskvorky commented Oct 11, 2020

10 random phrases that were discarded (left column) vs newly introduced (right column) by this PR. Trained on a toy corpus
of text8, default settings:

phrases(develop) - phrases(this PR) phrases(this PR) - phrases(develop)
crash_in died_of_a_stroke
the_remainder from_abram
prefixed_by prefectures_of_greece
the_serfs new_creative
of_rakis conquered_by_the_assyrian
releasing_a stream_of_knowledge
an_inventive tend_to_consider
overheard_by war_in_iraq
smoothly_at destruction_of_troy
an_offshore four_preludes

@gojomo
Copy link
Collaborator

gojomo commented Oct 11, 2020

I'm not sure text8 (small, weird) is a great reference dataset for evaluating this feature. But I can believe that most users of Phrases on English-text will benefit from enabling this common_words auto-expansion of the original Phrases algorithm beyond just bigrams.

Still, the case against making it the silent default is:

  • users following the exact same high-level interface with the same training text as in Gensim 0.x through 3.8.3 shouldn't be surprised by a different set of phrases being created, especially phrases of more-than-2-unigrams
  • explicit is better than implicit, and algorithms that aren't inherently linked to a specific language shouldn't be language-specialized without clear user opt-in
  • AFAICT, this isn't based on any published work with evaluated costs/benefits, but is lossely-inspired (but still different from ) a feature in Elasticsearch. A few spot/ad-hoc cases where it "looks good" doesn't seem enough to justify a default-for-everyone deviation from the published, well-characterized statistical algorithm on which the model was originally based.

If this previously-underappreciated feature is now liked enough to recommend more widely, I believe a better way to do that would be to update docs/tutorials/change-notes to highlight the availability of the ENGLISH_COMMON_TERMS preset, & use it where appropriate, with discussion of when it is or isn't appropriate.

@piskvorky
Copy link
Owner Author

I agree. I'll make this optional.

@piskvorky piskvorky changed the title [WIP] Skip common English words in phrases [MRG] Skip common English words in phrases Oct 12, 2020
@piskvorky
Copy link
Owner Author

@mpenkov please review & merge (after checking that circleci fail).

@piskvorky piskvorky requested review from gojomo and mpenkov October 12, 2020 12:37
@piskvorky piskvorky force-pushed the phrases_common_words branch from 712d34d to b00b393 Compare October 12, 2020 12:39
gensim/models/phrases.py Outdated Show resolved Hide resolved
@piskvorky
Copy link
Owner Author

piskvorky commented Oct 14, 2020

Merging to keep branches clean. @mpenkov review welcome as always, even if it's later.

@piskvorky piskvorky merged commit 1d0a8bd into develop Oct 14, 2020
@piskvorky piskvorky deleted the phrases_common_words branch October 14, 2020 17:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants