forked from MangoTheCat/rematch2
-
Notifications
You must be signed in to change notification settings - Fork 6
/
README.Rmd
175 lines (138 loc) · 4.1 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# rematch2
> Match Regular Expressions with a Nicer 'API'
<!-- badges: start -->
[![R-CMD-check](https://github.com/r-lib/rematch2/workflows/R-CMD-check/badge.svg)](https://github.com/r-lib/rematch2/actions)
[![Codecov test coverage](https://codecov.io/gh/r-lib/rematch2/branch/main/graph/badge.svg)](https://app.codecov.io/gh/r-lib/rematch2?branch=main)
[![CRAN status](https://www.r-pkg.org/badges/version/rematch2)](https://CRAN.R-project.org/package=rematch2)
<!-- badges: end -->
A small wrapper on regular expression matching functions `regexpr`
and `gregexpr` to return the results in tidy data frames.
---
- [Installation](#installation)
- [Rematch vs rematch2](#rematch-vs-rematch2)
- [Usage](#usage)
- [First match](#first-match)
- [All matches](#all-matches)
- [Match positions](#match-positions)
- [License](#license)
## Installation
Stable version:
```{r eval = FALSE}
install.packages("rematch2")
```
Development version:
```{r eval = FALSE}
pak::pak("r-lib/rematch2")
```
## Rematch vs rematch2
Note that `rematch2` is not compatible with the original `rematch` package.
There are at least three major changes:
* The order of the arguments for the functions is different. In
`rematch2` the `text` vector is first, and `pattern` is second.
* In the result, `.match` is the last column instead of the first.
* `rematch2` returns `tibble` data frames. See
https://github.com/tidyverse/tibble.
## Usage
### First match
```{r}
library(rematch2)
```
With capture groups:
```{r}
dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
"76-03-02", "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)
```
Named capture groups:
```{r}
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
```
A slightly more complex example:
```{r}
github_repos <- c(
"metacran/crandb",
"jeroenooms/curl@v0.9.3",
"jimhester/covr#47",
"hadley/dplyr@*release",
"r-lib/remotes@550a3c7d3f9e1493a2ba",
"/$&@R64&3"
)
owner_rx <- "(?:(?<owner>[^/]+)/)?"
repo_rx <- "(?<repo>[^/@#]+)"
subdir_rx <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx <- "(?:@(?<ref>[^*].*))"
pull_rx <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"
subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx <- sprintf(
"^(?:%s%s%s%s|(?<catchall>.*))$",
owner_rx, repo_rx, subdir_rx, subtype_rx
)
re_match(text = github_repos, pattern = github_rx)
```
### All matches
Extract all names, and also first names and last names:
```{r}
name_rex <- paste0(
"(?<first>[[:upper:]][[:lower:]]+) ",
"(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
" Ben Franklin and Jefferson Davis",
"\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
```
```{r}
not$first
not$last
not$.match
```
### Match positions
`re_exec` and `re_exec_all` are similar to `re_match` and `re_match_all`,
but they also return match positions. These functions return match
records. A match record has three components: `match`, `start`, `end`, and
each component can be a vector. It is similar to a data frame in this
respect.
```{r}
pos <- re_exec(notables, name_rex)
pos
```
Unfortunately R does not allow hierarchical data frames (i.e. a column of a
data frame cannot be another data frame), but `rematch2` defines some
special classes and an `$` operator, to make it easier to extract parts
of `re_exec` and `re_exec_all` matches. You simply query the `match`,
`start` or `end` part of a column:
```{r}
pos$first$match
pos$first$start
pos$first$end
```
`re_exec_all` is very similar, but these queries return lists, with
arbitrary number of matches:
```{r}
allpos <- re_exec_all(notables, name_rex)
allpos
```
```{r}
allpos$first$match
allpos$first$start
allpos$first$end
```
## License
MIT © Mango Solutions, Gábor Csárdi