Implement filter_extremes #169

henrifroese · 2020-08-26T15:36:55Z

We add a new function hero.filter_extremes(s: TokenSeries, max_words=None, min_df=1, max_df=1.0) that removes from all documents the words whose document frequency is above or below a threshold; additionally, only the max_words most frequent words are kept. The name follows gensim's similar function.

Excerpt from docstring to explain functionality:

Decrease the size of your documents by
filtering out words by their frequency.

It is often useful to reduce the size of your dataset
by dropping words in order to
reduce noise and improve performance.
This function removes all words/tokens from
all documents where the
document frequency (= number of documents a term appears in) is
below min_df or above max_df.

When min_df or max_df is an integer, then document frequency
is the absolute number of documents that a term
appears in. When it's a float, it is the
proportion of documents a term appears in.

Additionally, only max_words many words are kept.

Parameters

max_words : int, default to None
The maximum number of words/tokens that
are kept, according to term frequency descending.
If None, will consider all features.

min_df : int or float, default to 1
Remove words that have a document frequency
lower than min_df. If float, it represents a
proportion of documents; if integer, an absolute count.

max_df : int or float, default to 1.0
Remove words that have a document frequency
higher than max_df. If float, it represents a
proportion of documents; if integer, an absolute count.
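
To illustrate the int/float convention for min_df and max_df, here is a minimal sketch of how a threshold could be turned into an absolute document count. The helper name resolve_df_threshold is hypothetical and is not part of texthero; the actual implementation in this PR may handle the conversion differently.

```python
from typing import Union


def resolve_df_threshold(value: Union[int, float], n_documents: int) -> float:
    """Hypothetical helper (not texthero API): convert a min_df/max_df
    value into an absolute number of documents."""
    if isinstance(value, int):
        # An integer is already an absolute document count.
        return value
    # A float is a proportion of the corpus, so scale it by the corpus size.
    return value * n_documents


# With 10 documents, the integer 1 and the float 1.0 mean different things:
print(resolve_df_threshold(1, 10))    # absolute count: 1 document
print(resolve_df_threshold(1.0, 10))  # proportion: all 10 documents
```

Under this reading, an integer 1 for max_df would remove every word that appears in more than one document, whereas the float 1.0 removes nothing.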

Example

>>> import texthero as hero
>>> import pandas as pd
>>> s = pd.Series(
...        [
...         "Here one two one one one go there",
...         "two go one one one two two two is important",
...     ]
... )
>>> s.pipe(hero.tokenize).pipe(hero.filter_extremes, 3)
0              [one, two, one, one, one, go]
1    [two, go, one, one, one, two, two, two]
dtype: object
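
For readers who want to see the idea without texthero, the behaviour described in the docstring can be sketched in plain Python on lists of tokens. This is an illustration only, not the code in this PR: the function name filter_extremes_sketch is hypothetical, and both the order of the steps (document-frequency filter first, then the max_words cut) and the alphabetical tie-breaking are assumptions.

```python
from collections import Counter
from typing import List, Optional, Union


def filter_extremes_sketch(
    docs: List[List[str]],
    max_words: Optional[int] = None,
    min_df: Union[int, float] = 1,
    max_df: Union[int, float] = 1.0,
) -> List[List[str]]:
    """Illustrative re-implementation on plain token lists (not texthero's code)."""
    n_docs = len(docs)
    # Document frequency: each token counts at most once per document.
    df = Counter(token for doc in docs for token in set(doc))
    # Term frequency: total number of occurrences over all documents.
    tf = Counter(token for doc in docs for token in doc)

    # Integers are absolute counts, floats are proportions of the corpus.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs

    # Keep only tokens whose document frequency lies within [low, high].
    kept = [token for token in tf if low <= df[token] <= high]
    # Order by descending term frequency (ties broken alphabetically, an assumption).
    kept.sort(key=lambda token: (-tf[token], token))
    if max_words is not None:
        kept = kept[:max_words]

    vocabulary = set(kept)
    return [[token for token in doc if token in vocabulary] for doc in docs]


docs = [
    "Here one two one one one go there".split(),
    "two go one one one two two two is important".split(),
]
print(filter_extremes_sketch(docs, max_words=3))
```

On the two documents from the example above, the three most frequent tokens are one, two and go, so the sketch keeps exactly those tokens in each document.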

Note: so many lines are changed only because this builds upon the DocumentTermDF (see #156).

henrifroese · 2020-08-29T11:51:38Z

Note: Black (our formatter) rolled out V20.8b1 three days ago. This creates errors with our ./tests.sh in preprocessing because of whitespace. We will investigate this further, but for now we pin the Black version in .travis.yml and setup.cfg to the last working version (19.10b1).

EDIT: found the cause; see the issue we opened in the Black repository.

jbesomi · 2020-09-08T11:45:12Z

Thanks. Will review once the previous PRs are merged

jbesomi · 2020-09-14T13:35:38Z

Waiting for #162 to be merged; the conflicts will then need to be resolved (and this will simplify the code).

mk2510 · 2020-09-22T13:14:36Z

We have now merged all changes from master, and this branch is ready to be reviewed and merged 🐙 🥇

mk2510 and others added 16 commits August 18, 2020 22:06

added MultiIndex DF support

fa342a9

support MultiIndex as function parameter; returns MultiIndex where Representation was returned * missing: correct test Co-authored-by: Henri Froese <hf2000510@gmail.com>

beginning with tests

59a9f8c

implemented correct sparse support

19c52de

*missing: test adopting for new types Co-authored-by: Henri Froese <hf2000510@gmail.com>

Merge branch 'master_upstream' into change_representation_to_multicolumn

66e566c

added back list() and rm .tolist()

41f55a8

rm .tolist() and added list()

217611a

Adopted the test to the new dataframes

6a3b56d

wrong format

b8ff561

Address most review comments.

e3af2f9

Add more unittests for representation

77ad80e

Fix the term_frequency formula. Simplify the function body.

3fbeaa5

Co-authored-by: Henri Froese <henri.froese@yahoo.com>

Implement filter_extremes.

1e8857a

Missing: Tests & Docstring Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>

Add Docstring to filter_extremes

a7ddccb

added test for filter extremes

4cdf2c1

Merge remote-tracking branch 'origin/filter_extremes' into filter_ext…

df8587b

…remes

added example in docstring + typing

a5f3736

added example

8a6fde0

format with new black version

d6cc5f8

henrifroese mentioned this pull request Aug 28, 2020

👩‍💻 API next steps: checklist #85

Open

17 tasks

henrifroese added 2 commits August 28, 2020 17:05

Fix formatting errors by rolling back black update.

86c1c09

Black just rolled out V20.8b1. This creates errors with our ./tests.sh -> switch back

Merge branch 'filter_extremes' of https://github.com/SummerOfCode-NoH…

3946bd8

…ate/texthero into filter_extremes

Finish fixing formatting.

b0ca92c

henrifroese mentioned this pull request Aug 29, 2020

Doctests fail with new Black version. #171

Closed

henrifroese added the enhancement New feature or request label Sep 6, 2020

jbesomi marked this pull request as draft September 14, 2020 13:34

mk2510 added 5 commits September 22, 2020 12:35

Merge branch 'master_upstream' into fix_formula_in_term_frequency

5ed8283

fixed merge issues

efd9fde

fix formatting

c1dd5eb

Merge branch 'fix_formula_in_term_frequency' into filter_extremes

5e9c33d

fixed merge issues

87eef82

Merge branch 'master' into filter_extremes

9cc2f8d

mk2510 marked this pull request as ready for review September 22, 2020 19:51
