tdhock committed Nov 1, 2023
1 parent 7a023b6 commit f85776e
Showing 1 changed file with 23 additions and 21 deletions.
44 changes: 23 additions & 21 deletions README.org
@@ -41,7 +41,6 @@ Please read and cite my related R Journal papers, if you use this code!
#> 4: setosa Petal Width 0.2
nc::capture_melt_multiple(one.iris, part=".*", "[.]", column=".*")
#> Species part Length Width
#> <fctr> <char> <num> <num>
#> 1: setosa Petal 1.4 0.2
#> 2: setosa Sepal 5.1 3.5
nc::capture_melt_multiple(one.iris, column=".*", "[.]", dim=".*")
@@ -83,7 +82,7 @@ The main functions provided in nc are:
strings/files, using data.table =by= syntax.
- [[https://cloud.r-project.org/web/packages/nc/vignettes/v3-capture-melt.html][Vignette 3]] discusses =capture_melt_single= and
=capture_melt_multiple= which match a regex to the column names of a
wide data frame, then melt the matching columns. These functions are
wide data frame, then melt/reshape the matching columns. These functions are
especially useful when more than one separate piece of information
can be captured from each column name, e.g. the iris column names
=Petal.Width=, =Sepal.Width=, etc each have two pieces of
@@ -126,17 +125,15 @@ an older package that provides [[https://cloud.r-project.org/web/packages/namedC
| str_match_all_variable | capture_all_str |
| df_match_variable | capture_first_df |

For an overview of these functions, see my
[[https://github.com/tdhock/namedCapture-article][R journal paper
about namedCapture]] for a usage explanation, and a detailed
comparison with other R regex packages. The main differences between
the functions in =nc= and =namedCapture= are:
For an overview of these functions, and a detailed comparison with
other R regex packages, see my [[https://github.com/tdhock/namedCapture-article][R journal (2019) paper about
namedCapture]]. The main differences between the functions in =nc= and
=namedCapture= are:
- Main =nc= functions all have the =capture_= prefix for easy auto-completion.
- Internally =nc= uses un-named capture groups, whereas =namedCapture=
uses named capture groups. This allows =nc= to support the ICU
engine in addition to PCRE and RE2.
- Output in =nc= is always a data.table (=namedCapture= functions
output either a character matrix or a data.frame).
- Subject names and the capture group named =name= are not treated
specially (in =namedCapture= they are used for rownames of output).
- =nc::capture_first_df= does not prefix subject column names to
capture group column names, whereas
=namedCapture::df_match_variable= does.
@@ -146,31 +143,36 @@ the functions in =nc= and =namedCapture= are:
- By default the =nc::capture_first_vec= stops with an error if any
subjects do not match, whereas =namedCapture::str_match_variable=
returns NA/missing rows.
- Subject names and the capture group named =name= are not treated
specially (in =namedCapture= they are used for rownames of output).
- =nc::capture_all_str= only supports capturing multiple matches in a
single subject, whereas =namedCapture::str_match_all_named= supports
multiple subjects.
For multiple subjects, use =DT[, nc::capture_all_str(subject), by]=
For handling multiple subjects using =nc=,
use =DT[, nc::capture_all_str(subject), by]=
(see [[https://cloud.r-project.org/web/packages/nc/vignettes/v2-capture-all.html][vignette 2]] for more info).
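
The =DT[, nc::capture_all_str(subject), by]= idiom above can be sketched as follows. This is a minimal illustration rather than code from the README: the table =DT=, its columns =id= and =subject=, and the group names =variable= and =value= are made up for the example. Each named argument defines a capture group, and a function placed after a pattern is assumed to convert that group's type.

```r
library(data.table)

# Hypothetical input: one subject string per id, each holding
# zero or more "name=number" pairs.
DT <- data.table(
  id = c("a", "b"),
  subject = c("x=1 y=2", "z=3"))

# capture_all_str finds every match within a single subject, so
# data.table's by= is used to run it once per subject.
result <- DT[, nc::capture_all_str(
  subject,
  variable = "[a-z]+",
  "=",
  value = "[0-9]+", as.integer),
  by = id]

print(result)
```

The result should have one row per match, with the grouping column =id= followed by one column per capture group.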

There are some new functions in =nc= which are not present in
There are several new functions in =nc= which are not present in
=namedCapture=:
- =nc::capture_melt_single= inputs a data.frame, tries to match a
regex to its column names, then melts matching input column names to
a single output column.
- =nc::capture_melt_multiple= inputs a data.frame, tries to
match a regex to its column names, then melts matching input columns
to several output columns of different types.
- =nc::capture_melt_single= and =nc::capture_melt_multiple= use regex
for wide-to-tall data reshaping, see [[https://cloud.r-project.org/web/packages/nc/vignettes/v3-capture-melt.html][Vignette 3]] and my
[[https://journal.r-project.org/archive/2021/RJ-2021-029/index.html][R Journal (2021)]] paper for more info.
- =nc::capture_first_glob= is for reading several regularly named
files into R, see its =help()= page for more info.
- Helper function =nc::measure= can be used to create the
=measure.vars= argument of =data.table::melt=, and
=nc::capture_longer_spec= can be used to create the =spec= argument
of =tidyr::pivot_longer=. See their =help()= pages for more info.
- Helper function =nc::field= is provided for defining patterns (with
no repetition) that match subjects like variable=value, and create a
column/group named variable.
See [[https://cloud.r-project.org/web/packages/nc/vignettes/v2-capture-all.html][vignette 2]] for more info.
- Helper function =nc::alternatives_with_shared_groups= is provided
for defining a pattern containing alternatives with shared
groups. See [[https://cloud.r-project.org/web/packages/nc/vignettes/v5-helpers.html][vignette 5]] for more info.
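
As a rough illustration of the =nc::field= helper described in the list above, the sketch below parses subjects of the form variable=value. The subject strings and the extra =task= group are invented for the example, and it assumes that =nc::field(name, between, pattern)= is shorthand for the literal name, the separator, and a capture group with that name.

```r
# Hypothetical subjects of the form JobID=<number>_<number>.
subject <- c("JobID=13937810_25", "JobID=14022192_1")

# nc::field("JobID", "=", ...) matches the literal text "JobID=",
# then captures what follows in a group (and column) named JobID.
jobs <- nc::capture_first_vec(
  subject,
  nc::field("JobID", "=", "[0-9]+", as.integer),
  "_",
  task = "[0-9]+", as.integer)

print(jobs)
```

Writing the field name once avoids repeating it in both the pattern and the group name, which is the kind of repetition the helper is meant to remove.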

The new reshaping functions provide functionality similar to packages
tidyr, stats, data.table, reshape, reshape2, cdata, utils, etc. The
main difference is that =nc::capture_melt_*= support named capture
regular expressions with type conversion, which (1) makes it easier to
create/maintain a complex regex, and (2) results in less repetition in
user code. For a detailed comparison see [[https://github.com/tdhock/nc-article][my paper about nc]].
user code. For a detailed comparison see [[https://github.com/tdhock/nc-article][my R Journal (2021) paper about nc]].
