Implement filter_extremes #169

Open · wants to merge 27 commits into base: master

Commits (27)
- fa342a9 added MultiIndex DF support (mk2510, Aug 18, 2020)
- 59a9f8c beginning with tests (henrifroese, Aug 19, 2020)
- 19c52de implemented correct sparse support (mk2510, Aug 19, 2020)
- 66e566c Merge branch 'master_upstream' into change_representation_to_multicolumn (mk2510, Aug 21, 2020)
- 41f55a8 added back list() and rm .tolist() (mk2510, Aug 21, 2020)
- 217611a rm .tolist() and added list() (mk2510, Aug 21, 2020)
- 6a3b56d Adopted the test to the new dataframes (mk2510, Aug 21, 2020)
- b8ff561 wrong format (mk2510, Aug 21, 2020)
- e3af2f9 Address most review comments. (henrifroese, Aug 21, 2020)
- 77ad80e Add more unittests for representation (henrifroese, Aug 21, 2020)
- 3fbeaa5 Fix the term_frequency formula. Simplify the function body. (mk2510, Aug 25, 2020)
- 1e8857a Implement filter_extremes. (henrifroese, Aug 26, 2020)
- a7ddccb Add Docstring to filter_extremes (henrifroese, Aug 26, 2020)
- 4cdf2c1 added test for filter extrems (mk2510, Aug 26, 2020)
- df8587b Merge remote-tracking branch 'origin/filter_extremes' into filter_ext… (mk2510, Aug 26, 2020)
- a5f3736 added example in docstring + typing (mk2510, Aug 26, 2020)
- 8a6fde0 addded example (mk2510, Aug 26, 2020)
- d6cc5f8 format with new black version (mk2510, Aug 26, 2020)
- 86c1c09 Fix formatting errors by rolling back black update. (henrifroese, Aug 28, 2020)
- 3946bd8 Merge branch 'filter_extremes' of https://github.com/SummerOfCode-NoH… (henrifroese, Aug 28, 2020)
- b0ca92c Finish fixing formatting. (henrifroese, Aug 29, 2020)
- 5ed8283 Merge branch 'master_upstream' into fix_formula_in_term_frequency (mk2510, Sep 22, 2020)
- efd9fde fixed merge issues (mk2510, Sep 22, 2020)
- c1dd5eb fix formatting (mk2510, Sep 22, 2020)
- 5e9c33d Merge branch 'fix_formula_in_term_frequency' into filter_extremes (mk2510, Sep 22, 2020)
- 87eef82 fixed merge issues (mk2510, Sep 22, 2020)
- 9cc2f8d Merge branch 'master' into filter_extremes (mk2510, Sep 22, 2020)
42 changes: 42 additions & 0 deletions tests/test_preprocessing.py
@@ -381,3 +381,45 @@ def test_remove_hashtags(self):
        s_true = pd.Series("Hi , we will remove you")

        self.assertEqual(preprocessing.remove_hashtags(s), s_true)

    """
    Filter Extremes
    """

    def test_filter_extremes(self):
        s = pd.Series(
            [
                "Here one two one one one go there",
                "two go one one one two two two is important",
            ]
        )
        s_result = s.pipe(preprocessing.tokenize).pipe(preprocessing.filter_extremes, 3)
        s_true = pd.Series(
            [
                ["one", "two", "one", "one", "one", "go"],
                ["two", "go", "one", "one", "one", "two", "two", "two"],
            ]
        )
        pd.testing.assert_series_equal(s_result, s_true)

    def test_filter_extremes_min_and_max(self):
        s = pd.Series(
            [
                "Here one two one one one go there",
                "two go one one one two two two is important",
                "one two three four this is good",
                "here one one important statement",
            ]
        )
        s_result = s.pipe(preprocessing.tokenize).pipe(
            preprocessing.filter_extremes, min_df=2, max_df=3
        )
        s_true = pd.Series(
            [
                ["two", "go"],
                ["two", "go", "two", "two", "two", "is", "important"],
                ["two", "is"],
                ["important"],
            ]
        )
        pd.testing.assert_series_equal(s_result, s_true)
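Note (not part of the diff): to trace where the expected outputs in `test_filter_extremes_min_and_max` come from, here is a small standalone sketch that recomputes the document frequencies by hand. It assumes plain whitespace splitting, which matches `hero.tokenize` on this simple corpus.

```python
import pandas as pd

# Corpus from test_filter_extremes_min_and_max.
docs = [
    "Here one two one one one go there",
    "two go one one one two two two is important",
    "one two three four this is good",
    "here one one important statement",
]

# Document frequency: count each token at most once per document.
doc_freq = pd.Series(
    [token for doc in docs for token in set(doc.split())]
).value_counts()

# With min_df=2 and max_df=3, "one" (df=4) is dropped, as is every
# token appearing in a single document ("Here", "there", "three", ...).
print(doc_freq[(doc_freq >= 2) & (doc_freq <= 3)])
# Survivors: two (df=3), go, is, important (df=2 each).
```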
2 changes: 1 addition & 1 deletion texthero/nlp.py
@@ -158,7 +158,7 @@ def pos_tag(s: TextSeries) -> pd.Series:
    coarse-grained POS has a NOUN value, then the refined POS will give more
    details about the type of the noun, whether it is singular, plural and/or
    proper.

    You can use the spacy `explain` function to find out which fine-grained
    POS it is.
87 changes: 81 additions & 6 deletions texthero/preprocessing.py
@@ -14,6 +14,7 @@

from texthero import stopwords as _stopwords
from texthero._types import TokenSeries, TextSeries, InputSeries
from texthero import representation

from typing import List, Callable, Union

@@ -49,7 +50,7 @@ def lowercase(s: TextSeries) -> TextSeries:
"""
Lowercase all texts in a series.


Examples
--------
>>> import texthero as hero
@@ -143,8 +144,8 @@ def replace_punctuation(s: TextSeries, symbol: str = " ") -> TextSeries:
    Replace all punctuation with a given symbol.

    Replace all punctuation from the given
    Pandas Series with a custom symbol.
    It considers as punctuation characters all :data:`string.punctuation`
    symbols `!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~).`


@@ -367,7 +368,6 @@ def remove_stopwords(
    0    Texthero
    dtype: object

    """
    return replace_stopwords(s, symbol="", stopwords=stopwords)

@@ -861,7 +861,7 @@ def replace_hashtags(s: TextSeries, symbol: str) -> TextSeries:
"""Replace all hashtags from a Pandas Series with symbol

A hashtag is a string formed by # concatenated with a sequence of
characters, digits and underscores. Example: #texthero_123.
characters, digits and underscores. Example: #texthero_123.

Parameters
----------
@@ -889,7 +889,7 @@ def remove_hashtags(s: TextSeries) -> TextSeries:
"""Remove all hashtags from a given Pandas Series

A hashtag is a string formed by # concatenated with a sequence of
characters, digits and underscores. Example: #texthero_123.
characters, digits and underscores. Example: #texthero_123.

Examples
--------
@@ -906,3 +906,78 @@
    with a custom symbol.
    """
    return replace_hashtags(s, " ")


@InputSeries(TokenSeries)
def filter_extremes(
    s: TokenSeries, max_words=None, min_df=1, max_df=1.0
) -> TokenSeries:
    """
    Decrease the size of your documents by
    filtering out words by their frequency.

    It is often useful to reduce the size of your dataset
    by dropping words in order to
    reduce noise and improve performance.
    This function removes all words/tokens from
    all documents where the document frequency
    (i.e. the number of documents a term appears in) is

    - below min_df, or
    - above max_df.

    When min_df or max_df is an integer, the document frequency
    is the absolute number of documents that a term
    appears in; when it is a float, it is the
    proportion of documents a term appears in.

    Additionally, at most max_words words are kept,
    selected by descending term frequency.

    Parameters
    ----------
    max_words : int, optional, default=None
        The maximum number of words/tokens that
        are kept, according to term frequency descending.
        If None, all words are considered.

    min_df : int or float, optional, default=1
        Remove words that have a document frequency
        lower than min_df. If float, it represents a
        proportion of documents; if integer, absolute counts.

    max_df : int or float, optional, default=1.0
        Remove words that have a document frequency
        higher than max_df. If float, it represents a
        proportion of documents; if integer, absolute counts.

    Examples
    --------
    >>> import texthero as hero
    >>> import pandas as pd
    >>> s = pd.Series(
    ...     [
    ...         "Here one two one one one go there",
    ...         "two go one one one two two two is important",
    ...     ]
    ... )
    >>> s.pipe(hero.tokenize).pipe(hero.filter_extremes, 3)
    0              [one, two, one, one, one, go]
    1    [two, go, one, one, one, two, two, two]
    dtype: object
    """
    # Use term_frequency to do the filtering
    # for us (we cannot do this faster, as we
    # need to build the document-term matrix
    # anyway to filter by min_df and max_df).
    s_term_frequency = representation.term_frequency(
        s, max_features=max_words, min_df=min_df, max_df=max_df
    )

    # The remaining tokens are exactly the subcolumn names
    # in the term_frequency DocumentTermDF.
    tokens_to_keep = set(s_term_frequency.columns)

    # Go through the documents and only keep tokens in tokens_to_keep.
    # FIXME: Parallelize this after #162 is merged.
    return s.apply(
        lambda token_list: [token for token in token_list if token in tokens_to_keep]
    )
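For intuition, this is a minimal pure-pandas sketch of the same document-frequency filtering (a hypothetical helper, not the PR's implementation: it ignores `max_words` and does not reuse the document-term matrix that `representation.term_frequency` builds anyway):

```python
from collections import Counter

import pandas as pd

def filter_extremes_sketch(s: pd.Series, min_df=1, max_df=1.0) -> pd.Series:
    # Document frequency: count each token at most once per document.
    doc_freq = Counter(token for tokens in s for token in set(tokens))
    n_docs = len(s)
    # Floats are interpreted as proportions of documents, ints as counts.
    lower = min_df * n_docs if isinstance(min_df, float) else min_df
    upper = max_df * n_docs if isinstance(max_df, float) else max_df
    tokens_to_keep = {t for t, df in doc_freq.items() if lower <= df <= upper}
    # Keep only surviving tokens, preserving order and multiplicity.
    return s.apply(lambda tokens: [t for t in tokens if t in tokens_to_keep])
```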
14 changes: 7 additions & 7 deletions texthero/representation.py
@@ -78,7 +78,7 @@ def count(

    min_df : float in range [0.0, 1.0] or int, optional, default=1
        When building the vocabulary, ignore terms that have a document
        frequency (number of documents they appear in) strictly
        lower than the given threshold.
        If float, the parameter represents a proportion of documents;
        if integer, absolute counts.
@@ -154,7 +154,7 @@ def term_frequency(

    min_df : float in range [0.0, 1.0] or int, optional, default=1
        When building the vocabulary, ignore terms that have a document
        frequency (number of documents they appear in) strictly
        lower than the given threshold.
        If float, the parameter represents a proportion of documents;
        if integer, absolute counts.
@@ -233,7 +233,7 @@ def tfidf(s: pd.Series, max_features=None, min_df=1, max_df=1.0,) -> pd.DataFrame:

    min_df : float in range [0.0, 1.0] or int, optional, default=1
        When building the vocabulary, ignore terms that have a document
        frequency (number of documents they appear in) strictly
        lower than the given threshold.
        If float, the parameter represents a proportion of documents;
        if integer, absolute counts.
@@ -378,7 +378,7 @@ def nmf(
    natural language processing to find clusters of similar
    texts (e.g. some texts in a corpus might be about sports
    and some about music, so they will differ in the usage
    of technical terms; see the example below).

    Given a document-term matrix (so in
    texthero usually a Series after applying
@@ -424,7 +424,7 @@
    >>> # As we can see, the third document, which
    >>> # is a mix of sports and music, is placed
    >>> # between the two axes (the topics) while
    >>> # the other documents are placed right on
    >>> # one topic axis each.

    See also
@@ -575,11 +575,11 @@ def kmeans(
    Performs the K-means clustering algorithm on the given input.

    K-means clustering is used in natural language processing
    to separate texts into k clusters (groups)
    (e.g. some texts in a corpus might be about sports
    and some about music, so they will differ in the usage
    of technical terms; the K-means algorithm uses this
    to separate them into two clusters).

    Given a document-term matrix (so in
    texthero usually a Series after applying
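Note: the `min_df`/`max_df` wording above mirrors scikit-learn's `CountVectorizer`, which texthero's vectorizing functions build on. A standalone sketch of just the thresholds, on a hypothetical toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "one two two go",
    "one three go",
    "one four",
]

# min_df=2 drops terms in fewer than 2 documents ("two", "three", "four");
# max_df=0.9 drops terms in more than 90% of documents ("one", df=3/3).
vectorizer = CountVectorizer(min_df=2, max_df=0.9, token_pattern=r"\w+")
vectorizer.fit(corpus)
print(sorted(vectorizer.vocabulary_))  # ['go']
```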