
[WIP] Refactor documentation API Reference for gensim.summarization #1709

Merged
merged 29 commits, Dec 12, 2017
Changes from 2 commits

29 commits:
1c6009c
Added docstrings in textcleaner.py
yurkai Nov 12, 2017
851b02c
Merge branch 'develop' into fix-1668
menshikh-iv Nov 12, 2017
5cbb184
Added docstrings to bm25.py
yurkai Nov 13, 2017
31be095
syntactic_unit.py docstrings and typo
yurkai Nov 14, 2017
c6c608b
added doctrings for graph modules
yurkai Nov 16, 2017
d5247c1
keywords draft
yurkai Nov 17, 2017
3031cd0
keywords draft updated
yurkai Nov 20, 2017
4d7b0a9
keywords draft updated again
yurkai Nov 21, 2017
2c8ef28
keywords edited
yurkai Nov 22, 2017
254dce7
pagerank started
yurkai Nov 23, 2017
a2c2102
pagerank summarizer docstring added
yurkai Nov 25, 2017
1a87934
fixed types in docstrings in commons, bm25, graph and keywords
yurkai Nov 27, 2017
0ca8332
fixed types, examples and types in docstrings
yurkai Nov 28, 2017
ed188ae
Merge branch 'develop' into fix-1668
menshikh-iv Dec 11, 2017
20b19d6
fix pep8
menshikh-iv Dec 11, 2017
6ec29bf
fix doc build
menshikh-iv Dec 11, 2017
e2a2e60
fix bm25
menshikh-iv Dec 11, 2017
d7056e4
fix graph
menshikh-iv Dec 11, 2017
400966c
fix graph[2]
menshikh-iv Dec 11, 2017
44f617c
fix commons
menshikh-iv Dec 11, 2017
d2fed6c
fix keywords
menshikh-iv Dec 11, 2017
84b0f3a
fix keywords[2]
menshikh-iv Dec 11, 2017
ba8b1b6
fix mz_entropy
menshikh-iv Dec 11, 2017
2a283d7
fix pagerank_weighted
menshikh-iv Dec 12, 2017
6bd1584
fix graph rst
menshikh-iv Dec 12, 2017
7ec89fa
fix summarizer
menshikh-iv Dec 12, 2017
fa5efce
fix syntactic_unit
menshikh-iv Dec 12, 2017
0014d88
fix textcleaner
menshikh-iv Dec 12, 2017
1a0166a
fix
menshikh-iv Dec 12, 2017
181 changes: 176 additions & 5 deletions gensim/summarization/textcleaner.py
@@ -3,6 +3,13 @@
#
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""Text Cleaner

This module contains functions and processors used for processing text,
extracting sentences from text, working with acronyms and abbreviations.
Contributor: Need to add examples/highlights/motivation here (after you finish with docstrings in this file).

"""


from gensim.summarization.syntactic_unit import SyntacticUnit
from gensim.parsing.preprocessing import preprocess_documents
from gensim.utils import tokenize
@@ -22,28 +29,102 @@


SEPARATOR = r'@'
"""str: Special separator used in abbreviations."""
RE_SENTENCE = re.compile(r'(\S.+?[.!?])(?=\s+|$)|(\S.+?)(?=[\n]|$)', re.UNICODE)
"""SRE_Pattern: Pattern to split text into sentences."""
Contributor: Problem with building here.

AB_SENIOR = re.compile(r'([A-Z][a-z]{1,2}\.)\s(\w)', re.UNICODE)
"""SRE_Pattern: Pattern for detecting abbreviations. (Example: Sgt. Pepper)"""
AB_ACRONYM = re.compile(r'(\.[a-zA-Z]\.)\s(\w)', re.UNICODE)
"""SRE_Pattern: Pattern for detecting acronyms."""
AB_ACRONYM_LETTERS = re.compile(r'([a-zA-Z])\.([a-zA-Z])\.', re.UNICODE)
"""SRE_Pattern: Pattern for detecting acronyms spelled with periods.
(Example: P.S. I love you)"""
UNDO_AB_SENIOR = re.compile(r'([A-Z][a-z]{1,2}\.)' + SEPARATOR + r'(\w)', re.UNICODE)
"""SRE_Pattern: Pattern like AB_SENIOR, but with SEPARATOR between abbreviation
and next word."""
UNDO_AB_ACRONYM = re.compile(r'(\.[a-zA-Z]\.)' + SEPARATOR + r'(\w)', re.UNICODE)
"""SRE_Pattern: Pattern like AB_ACRONYM, but with SEPARATOR between abbreviation
and next word."""
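To make the pattern pairs concrete, a minimal round-trip sketch (the sample string is illustrative, not part of the PR):

>>> AB_SENIOR.sub(r'\1@\2', 'Sgt. Pepper taught the band')
'Sgt.@Pepper taught the band'
>>> UNDO_AB_SENIOR.sub(r'\1 \2', 'Sgt.@Pepper taught the band')
'Sgt. Pepper taught the band'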


def split_sentences(text):
"""Splits and returns list of sentences from given text. It preserves
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Examples section should be nice (here and everywhere).

abbreviations set in `AB_SENIOR` and `AB_ACRONYM`.

Parameters
----------
text : str
Input text.

Returns
-------
str:
Contributor: Doesn't match with return type.

List of sentences from text.
"""
processed = replace_abbreviations(text)
return [undo_replacement(sentence) for sentence in get_sentences(processed)]
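Given the request for Examples sections, a possible doctest sketch for split_sentences (sample text is illustrative):

>>> split_sentences("Mr. Smith went to Washington. He was a senator.")
['Mr. Smith went to Washington.', 'He was a senator.']

The abbreviation "Mr." is shielded by AB_SENIOR during splitting, so it does not terminate the first sentence.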


def replace_abbreviations(text):
"""Replaces blank space to @ separator after abbreviation and next word.

Parameters
----------
sentence : str
Input sentence.

Returns
-------
str:
Sentence with changed separator.

Example
-------
>>> replace_abbreviations("God bless you, please, Mrs. Robinson")
God bless you, please, Mrs.@Robinson
"""
return replace_with_separator(text, SEPARATOR, [AB_SENIOR, AB_ACRONYM])


def undo_replacement(sentence):
"""Replaces `@` separator back to blank space after each abbreviation.

Parameters
----------
sentence : str
Input sentence.

Returns
-------
str
Sentence with changed separator.

Example
-------
>>> undo_replacement("God bless you, please, Mrs.@Robinson")
'God bless you, please, Mrs. Robinson'
"""
return replace_with_separator(sentence, r" ", [UNDO_AB_SENIOR, UNDO_AB_ACRONYM])


def replace_with_separator(text, separator, regexs):
"""Returns text with replaced separator if provided regular expressions
were matched.

Parameters
----------
text : str
Input text.
separator : str
The separator between words to be replaced.
regexs : str
Contributor: doesn't match.

List of regular expressions.

Returns
-------
str
Text with replaced separators.
"""
replacement = r"\1" + separator + r"\2"
result = text
for regex in regexs:
@@ -52,11 +133,49 @@ def replace_with_separator(text, separator, regexs):
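A possible doctest for replace_with_separator, reusing the module-level patterns (illustrative input):

>>> replace_with_separator("Mrs. Robinson", "@", [AB_SENIOR, AB_ACRONYM])
'Mrs.@Robinson'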


def get_sentences(text):
"""Sentence generator from provided text. Sentence pattern set in `RE_SENTENCE`.

Parameters
----------
text : str
Input text.

Yields
------
str
Single sentence extracted from text.

Example
-------
>>> text = "Does this text contains two sentences? Yes, it is."
>>> for sentence in get_sentences(text):
>>> print(sentence)
Does this text contains two sentences?
Yes, it is.
"""
for match in RE_SENTENCE.finditer(text):
yield match.group()


def merge_syntactic_units(original_units, filtered_units, tags=None):
"""Processes given sentences and its filtered (tokenized) copies into
SyntacticUnit type. Also adds tags if they are provided to produced units.
Returns a SyntacticUnit list.

Parameters
----------
original_units : list
List of original sentences.
filtered_units : list
List of tokenized sentences.
tags : list
Contributor: list -> list, optional

List of strings used as tags for each unit. None as default.
Contributor: Don't write about default parameter if this isn't special.


Returns
-------
list
SyntacticUnit for each input item.
Contributor (@menshikh-iv, Nov 13, 2017): Need to use link to type, like :class:`~gensim.summarization.syntactic_unit.SyntacticUnit`

"""
units = []
for i in xrange(len(original_units)):
if filtered_units[i] == '':
@@ -74,21 +193,59 @@ def merge_syntactic_units(original_units, filtered_units, tags=None):
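A sketch of how merge_syntactic_units could be exercised, assuming SyntacticUnit (from gensim.summarization.syntactic_unit) exposes text and token attributes:

>>> units = merge_syntactic_units(["Mrs. Robinson is here."], ["mrs robinson"])
>>> units[0].text, units[0].token
('Mrs. Robinson is here.', 'mrs robinson')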


def join_words(words, separator=" "):
"""Merges words to a string using separator (blank space as default).

Parameters
----------
words : list
List of words.
separator : str
Contributor: str -> str, optional

The separator between elements. Blank set as default.
Contributor: Blank? I see " ", not ""

Returns
-------
str
String of merged words with separator between them.
"""
return separator.join(words)
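A couple of doctest lines that could round out join_words:

>>> join_words(["gensim", "summarization"])
'gensim summarization'
>>> join_words(["g", "e", "n"], separator="")
'gen'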


def clean_text_by_sentences(text):
""" Tokenizes a given text into sentences, applying filters and lemmatizing them.
Returns a SyntacticUnit list. """
"""Tokenizes a given text into sentences, applying filters and lemmatizing them.
Returns a SyntacticUnit list.

Parameters
----------
text : str
Input text.

Returns
-------
list
SyntacticUnit objects for each sentence.
"""
original_sentences = split_sentences(text)
filtered_sentences = [join_words(sentence) for sentence in preprocess_documents(original_sentences)]

return merge_syntactic_units(original_sentences, filtered_sentences)
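A usage sketch for clean_text_by_sentences: the filtered tokens depend on gensim's preprocessing (stemming, stop-word removal), so only the preserved original text is asserted here, assuming each SyntacticUnit keeps the original sentence in its text attribute:

>>> units = clean_text_by_sentences("Mr. Smith went to Washington. He was a senator.")
>>> [u.text for u in units]
['Mr. Smith went to Washington.', 'He was a senator.']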


def clean_text_by_word(text, deacc=True):
""" Tokenizes a given text into words, applying filters and lemmatizing them.
Returns a dict of word -> syntacticUnit. """
"""Tokenizes a given text into words, applying filters and lemmatizing them.
Returns a dictionary of word -> syntacticUnit.

Parameters
----------
text : list
Contributor: type doesn't match.

Input text.
deacc : bool
Contributor: bool, optional - here and everywhere.

Remove accentuation (default True).

Returns
-------
dict
Words as keys, SyntacticUnit objects as values.
"""
text_without_acronyms = replace_with_separator(text, "", [AB_ACRONYM_LETTERS])
original_words = list(tokenize(text_without_acronyms, to_lower=True, deacc=deacc))
filtered_words = [join_words(word_list, "") for word_list in preprocess_documents(original_words)]
@@ -101,5 +258,19 @@ def clean_text_by_word(text, deacc=True):
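A possible example for clean_text_by_word; exact processed tokens depend on the preprocessing filters, so this sketch only checks membership (assuming keys are the lower-cased original words):

>>> result = clean_text_by_word("Obama speaks to the media in Illinois")
>>> "obama" in result
True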


def tokenize_by_word(text):
"""Tokenizes input text. Before tokenizing transforms text to lower case and
removes accentuation and acronyms set `AB_ACRONYM_LETTERS`.
Returns generator of words.

Parameters
----------
text : list
Contributor: type doesn't match.

Input text.

Returns
-------
generator
Words contained in processed text.
"""
text_without_acronyms = replace_with_separator(text, "", [AB_ACRONYM_LETTERS])
return tokenize(text_without_acronyms, to_lower=True, deacc=True)
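Finally, a doctest sketch for tokenize_by_word, showing how acronym periods are collapsed before tokenizing (illustrative input):

>>> list(tokenize_by_word("P.S. I love you"))
['ps', 'i', 'love', 'you']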