- Added `rmarkdown` to `SUGGESTS`.

- Fix error in `text_locate()` arising from changes to r-devel. Details here: https://stat.ethz.ch/pipermail/r-devel/2020-February/079061.html

(no changes)

- Implement `length<-` for `corpus_text` objects (see the sketch below).
- Added `str()` method for `corpus_text` objects. Currently just a minimal implementation; this may change in the future.
- Add ANSI styling to `corpus_frame` objects.

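A minimal sketch of the new methods, assuming `length<-` follows base R truncation semantics (the texts and names here are invented):

```r
library(corpus)

x <- as_corpus_text(c(a = "One two three.", b = "Four five.", c = "Six."))
length(x)      # 3

# Truncate to the first two texts (assuming base-R vector semantics)
length(x) <- 2
names(x)       # "a" "b"

# Minimal str() method added in this release
str(x)
```
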
- Remove `text_length()`. Use `text_ntoken()` instead.
- Remove `as_utf8()`, `utf8_valid()`, `utf8_normalize()`, `utf8_encode()`, `utf8_format()`, `utf8_print()`, and `utf8_width()`; these functions are in the utf8 package now.

- Fix bug in `print.corpus_frame(, row.names = FALSE)`.
- Fix failing test on R-devel.

- Fix failing tests on testthat 2.0.0.

- Remove `weights` argument from `term_stats()` and `term_matrix()`.

- Allow user-supplied stemming functions in `text_filter()`.
- Add `new_stemmer()` function to make a stemming function from a set of (term, stem) pairs (see the sketch below).
- Add `stem_snowball()` function for the Snowball stemming algorithms (similar to `SnowballC::wordStem`, but only stemming "letter" tokens, not "number", "punct", or "symbol").
- Apply filter combine rules before stemming rather than after.

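A minimal sketch of the new stemming entry points; the exact signatures (in particular the `algorithm` argument name) should be checked against the package documentation:

```r
library(corpus)

# Build a stemmer from (term, stem) pairs; terms without an entry are
# assumed to pass through unchanged
stemmer <- new_stemmer(c("dogs", "ran"), c("dog", "run"))
text_tokens("The dogs ran away", filter = text_filter(stemmer = stemmer))

# Snowball stemming: only "letter" tokens are stemmed, so the number and
# punctuation below should come back untouched
stem_snowball(c("running", "12345", "!"), algorithm = "en")
```
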
- Remove dropped tokens rather than replace them with `NA`.

- Replace white-space in types with connector (`_`).
- Switch to `"radix"` sort algorithm for consistent, fast term ordering on all platforms, regardless of locale.
- Set `combine = NULL` by default for text filters.
- Make `map_quote` only change apostrophe and single quote characters, not double quote.

- Fix spurious rchk warnings.

- Fix failing tests on R version 3.3.

- Deprecate `text_length()` function in favor of `text_ntoken()`.
- Removed deprecated functions `abbreviations()`, `as_corpus()`, `as_text()`, `corpus()`, `is_corpus()`, `is_text()`, `stopwords()`, `term_frame()`.
- Removed deprecated `random` argument from `text_locate()`.

- New package website, http://corpustext.com

- Add support for tm `Corpus` and quanteda `corpus` objects; all functions expecting text (`text_tokens()`, `term_matrix()`, etc.) should work seamlessly on these objects.
- Add `gutenberg_corpus()` for downloading a corpus from Project Gutenberg.
- Add `...` arguments to all text functions, for overriding individual `text_filter()` properties (see the example below).
- Add `sentiment_afinn`, the AFINN sentiment lexicon.
- Add `text_sample()` for getting a random sample of term instances.
- Add `na.omit()`, `na.exclude()`, `na.fail()` implementations for `corpus_frame` and `corpus_text`.

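A small illustration of the per-call `...` overrides, using invented sample texts (`drop_punct` and `stopwords_en` are existing filter properties and data objects):

```r
library(corpus)

x <- c("A rose is a rose is a rose.", "The rose is red.")

# Override individual text_filter() properties for a single call
term_stats(x, drop = stopwords_en, drop_punct = TRUE)

# A tm Corpus or quanteda corpus could be passed here in the same way
term_matrix(x)
```
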
- Switch `as_utf8()` default argument to `normalize = FALSE`.
- Re-order `as_corpus_text()` and `as_corpus_frame()` arguments; make both accept `...` arguments to override individual text filter properties.
- Add missing single-letter initials to English abbreviation list.

- Adaptively increase buffer size for `read_ndjson()` so that large files can be read quickly.
- Make `summary()` on a `corpus_text` object report statistics for the number of tokens and types.
- Switch to 2-letter language codes for stemming algorithms.

- Fix bug in `utf8_normalize()` when the input contains a backslash (`\`).
- Fix bug in `term_matrix()` column names (non-ASCII names were getting replaced by Unicode escapes).
- Work around R Windows bug in converting native to UTF-8; described at https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329 .

- Make comparison operations on text vectors keep names if arguments have them.

- Renamed `corpus()`, `as_corpus()` and `is_corpus()` to `corpus_frame()`, `as_corpus_frame()` and `is_corpus_frame()` to avoid name clashes with other packages (see the example below).
- Renamed `as_text()` and `is_text()` to `as_corpus_text()` and `is_corpus_text()` to avoid name clashes with other packages.
- Rename `term_frame()` to `term_counts()`.
- Deprecate `text_locate()` `random` argument; use `text_sample()` instead.
- Remove old deprecated `term_counts()` function; use `term_stats()` instead.
- Deprecate `abbreviations()` and `stopwords()` functions in favor of data objects: `abbreviations_en`, `stopwords_en`, `stopwords_fr`, etc.

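A sketch of the renamed constructor; the column names here are arbitrary, on the assumption that `corpus_frame()` accepts columns the way `data.frame()` does:

```r
library(corpus)

# Behaves like data.frame(), but text columns print left-aligned and
# truncated to the screen width
x <- corpus_frame(title = c("doc1", "doc2"),
                  text  = c("First document text.", "Second document text."))
print(x)
```
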
- Fix buffer overrun for case folding some Greek letters.

- Fix memory leak in `read_ndjson()`.
- Fix memory leak in JSON object deserialization.

- Fix memory leak in `term_stats()`.

- Rename `wnaffect` to `affect_wordnet`.

- Add `corpus()`, `as_corpus()`, `is_corpus()` functions.
- Make `text_split()` split into evenly-sized blocks, with the `size` argument specifying the maximum block size (see the sketch below).
- Added `text_stats()` function.
- Added `text_match()` function to return matching terms as a factor.
- Implemented text subset assignment operators `[<-` and `[[<-`.
- Added `utf8_normalize()` function for translating to NFC normal form, applying case and compatibility maps.
- Added `text_sub()` for getting token sub-sequences.
- Added `text_length()` for text length, including `NA` tokens.

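A sketch of block splitting and token sub-sequences on an invented text (note that `text_length()` from this release was later replaced by `text_ntoken()`, per the entries above):

```r
library(corpus)

x <- "One. Two. Three. Four. Five."

# At most two sentences per block; block sizes are balanced, so five
# sentences come back as evenly-sized groups rather than 2 + 2 + 1 runts
text_split(x, units = "sentences", size = 2)

# First three tokens of the text
text_sub(x, 1, 3)
```
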
- Add new vignette, "Introduction to corpus".

- Add `random` argument to `text_locate` for random order.
- Change `format.corpus_frame` to use elastic column widths for text.
- Allow `rows = -1` for `print.corpus_frame` to print all rows.
- Following quanteda, add "will" to the English stop word list.

- Add special handling for hyphens so that, for example, "world-wide" is a single token (but "-world-wide-" is three tokens).

- Merged "url" and "symbol" word categories. Removed "other" word category (ignore these characters).

- Change stemmer so that it only modifies tokens of kind "letter", preserving "number", "symbol", "url", etc.

- Switched to more efficient `c.corpus_text()` function.
- Make `text_locate()` return "text" column as a factor.
- Constrain text `names()` to be unique, non-missing.
- Added `names` argument to `as_text()` for overriding default names.

- Added checks for underflow in `read_ndjson()` double deserialization.
- Fixed bug in `text_filter<-` where assignment did not make a deep copy of the object.
- Fixed bug in `utf8_format()`, `utf8_print()`, `utf8_width()` where internal double quotes were not escaped.
- Fixed rchk, UBSAN warnings.

- Renamed `term_counts()` to `term_stats()`.
- Removed deprecated functions `token_filter()` and `sentence_filter()`.
- Removed `term` column from `text_locate()` output.
- Removed `map_compat` option from `text_filter()`; use `utf8_normalize()` instead if you need to apply compatibility maps.

- Added `text_filter()` generic.
- Added `text_filter()<-` setter for text vectors (see the sketch below).
- Use a text's `text_filter()` attribute as a default in all `text_*` functions expecting a filter argument.
- Added the Federalist Papers dataset (`federalist`).
- Added functions for validating and converting to UTF-8: `as_utf8()`, `utf8_valid()`.
- Added functions for formatting and printing utf8 text: `utf8_encode()`, `utf8_format()`, `utf8_print()`, `utf8_valid()`, and `utf8_width()`.

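A minimal sketch of the getter/setter pair on an invented text:

```r
library(corpus)

x <- as_corpus_text("The Quick Brown Fox")

# Attach a filter to the text object; text_* functions use it as the default
text_filter(x) <- text_filter(map_case = FALSE)
text_tokens(x)   # tokens keep their original case
```
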
- Handle @mentions, #hashtags, and URLs in word tokenizer.

- `term_counts()` now reports the `support` for each term (the number of texts containing the term), and has options for restricting output by the minimum and maximum support.
- Added new class `corpus_frame` to support better printing of data frame objects: left-align text data, truncate output to screen width, display emoji on Mac OS. Use this class for all data frame return values.
- Added a "unicode" vignette.

- Converted the "chinese" demo to a vignette. Thanks to Will Lowe for contributing.

- Make `text_split()` and `term_frame()` return parent text as a factor.
- Remove `stringsAsFactors` option from `read_ndjson()`; deserialize all JSON string fields as character by default.
- `read_ndjson()` de-serializes zero-length arrays as `integer()`, `logical()`, etc. instead of as `NULL`.
- Allow user interrupts (control-C) in all long-running C computations.

- Deprecate `token_filter()` and `sentence_filter()`.

- Fixed a bug where `read_ndjson()` would de-serialize a boolean `null` as `FALSE` instead of `NA`.

- Add `text_locate()`, for searching for specific terms in text, reporting the contexts of the occurrences ("Key words in context"); see the example below.
- Add `text_count()` and `text_detect()` for counting term occurrences or checking for a term's existence in a text.
- Add `text_types()` and `text_ntype()` for returning the unique types in a text, or counting types.
- Add `text_nsentence()` for counting sentences.
- Add `term_frame()`, reporting term frequencies as a data frame with columns `"text"`, `"term"`, and `"count"`.

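A quick example of the new search functions on invented sample texts:

```r
library(corpus)

x <- c("The quick brown fox jumped.", "The fox was quick and brown.")

text_locate(x, "fox")     # key words in context, one row per occurrence
text_count(x, "quick")    # number of occurrences in each text
text_detect(x, "brown")   # TRUE/FALSE for each text
```
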
- Add `transpose` argument to `term_matrix()`.
- Add new version of `format.corpus_text()` that is faster and aware of character widths, in particular, Emoji and East Asian character widths.
- Normalize token filter `combine`, `drop`, `drop_except`, `stem_except` arguments, to allow passing cased versions of these arguments.
- Set `combine = abbreviations("english")` by default.

- Rename `tokens()` to `text_tokens()` for consistency; add `text_ntoken()`.
- Rename `term_counts()` `min` and `max` arguments to `min_count` and `max_count`.

- Fixed bug where `"u.s"` (a unigram) stems to `"u.s"` (a bigram), which then caused problems for `term_matrix()` select. Thanks to Dmitriy Selivanov for reporting: patperry#3.

- Add `ngrams` options for `term_counts()` and `term_matrix()` (see the example below).
- Add sentence break suppressions (special handling for abbreviations); the default behavior for `text_split(, "sentences")` is to use a set of English abbreviations as suppressions.
- Add option to treat CR and LF like spaces when determining sentence boundaries; this is now the default.

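An example of the `ngrams` option, written against the current function name (`term_counts()` was later renamed to `term_stats()`, per the entries above):

```r
library(corpus)

x <- "New York City is in New York State."

# Count unigrams and bigrams in a single pass
term_stats(x, ngrams = 1:2)
```
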
- Add `term_counts()` `min` and `max` options for excluding terms with counts below or above specified limits.
- Add `term_counts()` `limit` option to limit the number of reported terms.
- Add `term_counts()` `types` option for reporting the types that make up a term.
- Add `abbreviations()` function with abbreviation lists for English, French, German, Italian, Portuguese, and Russian (from the Unicode Common Locale Data Repository).
- Add more refined control over `token_filter()` drop categories: merged "kana" and "ideo" into "letter"; split off "punct", "mark", and "other" from "symbol".

- Rename `text_filter()` to `token_filter()`.
- Remove `select` argument from `token_filter()`, but add `select` to `term_matrix()` arguments.
- Replace `sentences()` function with `text_split()`, which has options for breaking into multi-sentence blocks or multi-token blocks.
- Remove `remove_control`, `map_dash`, and `remove_space` type normalization options from `text_filter()`.
- Remove `ignore_empty` token filter option.

- Rename `"text"` class to `"corpus_text"` to avoid name clashes with grid. Thanks to Jeroen Ooms for reporting: patperry/corpus#1
- Rename `"jsondata"` to `"corpus_json"` for consistency.

- Fix bug in `read_ndjson()` for reading factors with missing values.

- Add `term_counts()` function to tabulate term frequencies.
- Add `term_matrix()` function to compute a term frequency matrix.
- Add `text_filter()` option (`stem_except`) to exempt specific terms from stemming.
- Add `text_filter()` option (`drop`) to drop specific terms, along with option (`drop_except`) to exempt specific terms from dropping.
- Add `text_filter()` option (`combine`) to combine multi-word phrases like "new york city" into a single term (see the sketch below).
- Add `text_filter()` option (`select`) to select specific terms (excluding all words that are not on this list).

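A sketch of the new filter options on an invented sentence, using the current `stopwords_en` data object in place of the then-current `stopwords()` function; in current releases the combined phrase is joined with the `_` connector, per the entries above:

```r
library(corpus)

f <- text_filter(combine = "new york city",  # merge the phrase into one term
                 drop = stopwords_en,        # drop specific terms
                 drop_punct = TRUE)
text_tokens("I visited New York City.", filter = f)
# expected tokens: "visited", "new_york_city"
```
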
- Add `stopwords()` function.
- Make `read_ndjson()` decode JSON strings as character or factor (according to whether `stringsAsFactors` is `TRUE`) except for fields named `"text"`, which get decoded as text objects (see the example below).

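A minimal end-to-end example of `read_ndjson()` on a throwaway file (the field names are invented); the special handling of `"text"` fields described here was changed in later releases, per the entries above:

```r
library(corpus)

path <- tempfile(fileext = ".ndjson")
writeLines(c('{"text": "Hello, world!", "year": 2017}',
             '{"text": "Goodbye.", "year": 2018}'), path)

data <- read_ndjson(path)
data$year   # 2017 2018
data$text   # the "text" field is decoded as a corpus text object
```
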
- Rename `text_filter()` options `fold_case`, `fold_dash`, `fold_quote` to `map_case`, `map_dash`, `map_quote`.

- Allow `read_ndjson()` to read from connections, not just files, by reading the file contents into memory first. Use this by default instead of memory mapping.

- Add `text_filter()` options `drop_symbol`, `drop_number`, `drop_letter`, `drop_kana`, and `drop_ideo`; these options replace the matched tokens with `NA`.

- Fix internal function namespace clashes on Linux and other similar platforms.

- Rename `text_filter()` option `drop_empty` to `ignore_empty`.

- Support for serializing dataset and text objects via `readRDS()` and other native routines. Unfortunately, this support doesn't come for free, and the objects take a little bit more memory.
- Add support for stemming via the Snowball library.

- More convenient interface for accessing JSON arrays.

- Make `read_ndjson()` return a data frame by default, not a `"jsondata"` object.

- Rename `as.text()`/`is.text()` to `as_text()`/`is_text()`; make `as_text()` retain names, work on S3 objects.
- Rename `read_json()` to `read_ndjson()` to not clash with jsonlite.
- Rename `"dataset"` type to `"jsondata"`.

- First CRAN release.

- Added Windows support.

- Added support for setting names on text objects.

- Added documentation.

- First milestone, with support for JSON decoding, text segmentation, and text normalization.