Ddoc
$(D_S Regular Expressions,
$(P
$(SMALL by Dmitry Olshansky, the author of std.regex )
)
$(H3 Introduction)
$(P String processing is a daily routine that most applications have to deal with in one way or another.
It should come as no surprise that many programming languages have standard libraries equipped
with a variety of specialized functions for common string manipulation needs.
The D programming language standard library, among others, offers a nice assortment in $(STD string),
as well as generic functions from $(STD algorithm) that work with strings.
Still, no amount of fixed functionality could cover all needs, as naturally flexible text data
needs flexible solutions.
)
$(P This is where $(LUCKY regular expressions), often succinctly called regexes, come in handy.
Regexes are a simple yet powerful language for defining patterns for sets of strings.
Combined with pattern matching, data extraction and substitution, they form a Swiss Army knife of text processing.
They are considered so important that a number of programming languages provide built-in support for regular expressions. Being built-in, however, does $(B not) necessarily
imply $(B faster processing) or having more features. It's just a matter of providing
$(B convenient and friendly syntax) for typical operations, and integrating it well.
)
$(P The D programming language provides a standard library module $(STD regex).
Being a highly expressive systems language, D allows regexes to be $(I implemented efficiently)
within the language itself, yet have a good level of readability and usability.
And there are a few things a pure D implementation brings to the table that are completely unbelievable
in a traditional compiled language; more on that at the end of the article.
)
$(P By the end of the article you'll have a good understanding of regular expression capabilities in this library,
and how to utilize its API in the most straightforward and efficient way. Examples in this article assume
that the reader has a fair understanding of regex elements, but it's not required.
)
$(H3 A warm up)
$(P How do you check if something is a phone number by looking at it? )
$(P Yes, it's something with numbers, and there may be a country code in front of that...
Sticking to an international format should make it more strict. As this is the first time, let's
put together a full program:
---
import std.stdio, std.regex;
void main(string[] argv)
{
string phone = argv[1]; // assuming phone is passed as the first argument on the command line
if(matchFirst(phone, r"^\+[1-9][0-9]* [0-9 ]*$"))
writeln("It looks like a phone number.");
else
writeln("Nope, it's not a phone number.");
}
---
And that's it! Let us keep in mind, however, the limits of what regular expressions can do: to truly establish the
validity of a phone number, one has to try dialing it or contact the authority.
)
$(P Let's drill down into this tiny example because it actually showcases a lot of interesting things:
$(UL
$(LI A raw string literal of the form r"...", which allows writing a regex pattern in its natural notation. )
$(LI The $(D matchFirst) function, which finds the first match in a string, if any. To check whether there was a match, just
test the return value in a boolean context, such as an $(D if) statement. )
$(LI To match special regex characters like +, *, (, ), [, ] and $ literally, escape them with a backslash (\). )
$(LI Unless there is a lot of text processing going on, it's perfectly fine to pass a plain string as a pattern.
The internal representation used to do the actual work is cached to speed up subsequent calls. )
)
)
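$(P As a minimal sketch of the escaping rule above, the following checks for a literal dollar sign and literal parentheses.
The helper $(D looksLikeApproxPrice) and its sample inputs are made up for illustration; only $(D matchFirst) and $(D empty) come from $(STD regex).
)
---
import std.regex;

// Hypothetical helper (not part of std.regex):
// true if text has the shape "$<digits> (approx)".
bool looksLikeApproxPrice(string text)
{
    // \$, \( and \) match the literal characters; an unescaped $ would be
    // the end anchor and unescaped parentheses would form a group.
    return !matchFirst(text, r"^\$[0-9]+ \(approx\)$").empty;
}

void main()
{
    assert(looksLikeApproxPrice("$5 (approx)"));
    assert(!looksLikeApproxPrice("$5 approx"));
}
---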
$(P Continuing with the phone number example, it would be useful to get the exact value of
the country code, as well as the whole number. For the sake of experiment let's also explicitly obtain
a compiled regex pattern via $(D regex) to see how it works.
)
---
string phone = "+31 650 903 7158"; // fictional, any coincidence is just that
auto phoneReg = regex(r"^\+([1-9][0-9]*) [0-9 ]*$");
auto m = matchFirst(phone, phoneReg);
assert(m); // also boolean context - test for non-empty
assert(!m.empty); // same as the line above
assert(m[0] == "+31 650 903 7158");
assert(m[1] == "31");
// you shouldn't need the regex object type all too often
// but here it is for the curious
static assert(is(typeof(phoneReg) : Regex!char));
---
$(H3 To search and replace)
$(P While getting the first match is a common theme in string validation, another frequent need is
to extract all matches found in a piece of text. Picking an easy task, let's see how to
filter out all white space-only lines. There is no special routine for looping over input
like $(D search()) or similar as found in some libraries.
Instead, $(D std.regex) provides a natural syntax for looping via plain foreach.
)
---
import std.file, std.regex, std.stdio;
auto buffer = std.file.readText("regex.d");
foreach (m; matchAll(buffer, regex(r"^.*[^\p{WhiteSpace}]+.*$","m")))
{
writeln(m.hit); // hit is an alias for m[0]
}
---
$(P It may look and feel like a built-in, but it just follows the common conventions to achieve that.
In this case matchAll returns an object that follows the "protocol" of an input range
simply by having the right set of methods. An input range is a lot like an iterator found
in other languages. Likewise, the result of matchFirst and each element of matchAll
is a random access range, a thing that behaves like a "view" of an array.
)
---
auto m = matchAll("Ranges are hot!", r"(\w)\w+(\w)"); // at least 3 "word" symbols
assert(m.front[0] == "Ranges"); // front - first of input range
// m.captures is a historical alias for the first element of match range (.front).
assert(m.captures[1] == m.front[1]);
auto sub = m.front;
assert(sub[2] == "s");
foreach (item; sub)
writeln(item); // will show lines: Ranges, R, s
---
$(P By playing by the rules, $(STD regex) gets some nice benefits in interaction with other modules, e.g.
this is how one could count the non-blank lines in a text buffer:
)
---
import std.algorithm, std.file, std.regex;
auto buffer = std.file.readText(r"std\typecons.d");
// a distinct variable name avoids shadowing std.algorithm.count; count returns size_t
auto lineCount = count(matchAll(buffer, regex(r"^.*\P{WhiteSpace}+.*$", "m")));
---
$(P
A seasoned regex user catches instantly that Unicode properties are supported with Perl-style \p{xxx};
to spice things up, all of the Scripts and Blocks are supported as well. Let us duly note that \P{xxx} means not
having the xxx property, i.e. here not a white space character. Unicode is a vital subject to know, and it won't suffice
to try to cover it here. For details see the accessible $(STD uni) documentation and level 1 of conformance
as per Unicode standard $(LINK2 http://Unicode.org/reports/tr18/, UTS 18).
)
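$(P A small sketch of the two property forms, assuming the Cyrillic script name is accepted by \p{...} in the same way as the WhiteSpace property used above.
The helper $(D firstCyrillicWord) and the sample strings are hypothetical.
)
---
import std.regex;

// Hypothetical helper: the first run of Cyrillic letters in text,
// or an empty string when there is none.
string firstCyrillicWord(string text)
{
    auto m = matchFirst(text, r"\p{Cyrillic}+");
    return m.empty ? "" : m.hit;
}

void main()
{
    assert(firstCyrillicWord("hello мир!") == "мир");
    assert(firstCyrillicWord("hello world") == "");
    // \P{Cyrillic} is the complement: the first character that is NOT Cyrillic
    assert(matchFirst("мир!", r"\P{Cyrillic}").hit == "!");
}
---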
$(P Another thing of importance is the option string "m", where m stands for multi-line mode.
Historically, utilities that supported regex patterns (unix grep, sed, etc.) processed text line by line.
At that time anchors like ^ and $ meant the start and end of the whole input buffer, which was the same as the start and end of the line.
As regular expressions became more ubiquitous, the need to recognize multiple lines in a chunk of text became apparent.
In multi-line mode the anchors ^ and $ are defined to also match right after and right before a line break, respectively.
For the curious, a modern (Unicode) line break is defined as (\n | \v | \r | \f | \u0085 | \u2028 | \u2029 | \r\n).
Needless to say, one need not turn on multi-line mode if not using any of ^, $.
)
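$(P To make the effect of the "m" option concrete, here is a minimal sketch that runs the same anchored pattern with and without it.
The helper $(D countWholeLineWords) and the sample text are assumptions made for this illustration.
)
---
import std.range : walkLength;
import std.regex;

// Hypothetical helper: counts the matches of ^\w+$ under the given option string.
size_t countWholeLineWords(string text, string flags)
{
    return matchAll(text, regex(r"^\w+$", flags)).walkLength;
}

void main()
{
    auto text = "one\ntwo\nthree";
    // multi-line mode: ^ and $ also match around every line break
    assert(countWholeLineWords(text, "m") == 3);
    // default mode: ^ and $ match only at the ends of the whole input,
    // and \w+ cannot span the line breaks in between
    assert(countWholeLineWords(text, "") == 0);
}
---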
$(P Now that search has been covered, the section title suggests that it's about time to do some replacements.
For instance to replace all dates in a text from "MM/dd/YYYY" format to a sortable version of "YYYY-MM-dd":
)
---
auto text = readText(...);
auto replaced = replaceAll(text, r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})".regex, "$3-$1-$2");
---
$(P $(D r"pattern".regex) is just another notation for writing $(D regex("pattern")), called
$(LINK2 spec/function.html#pseudo-member, UFCS), that some may find
slicker.
As can be seen, the replacement is controlled by a format string not unlike the one in C's famous printf.
The $1, $2, $3 are substituted with the content of the sub-expressions.
Aside from referencing sub-matches, one can include the whole part of input preceding the match via $$(BACKTICK) and the content following right after the match via $'.
)
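$(P The format string can be tried out on short inputs. The sketch below wraps the date rewrite in a hypothetical helper $(D toSortableDates);
the sample strings are invented, and the whole-match placeholder $& is an addition not mentioned above.
)
---
import std.regex;

// Hypothetical helper wrapping the date rewrite from the text.
string toSortableDates(string text)
{
    auto dateRe = regex(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})");
    return replaceAll(text, dateRe, "$3-$1-$2");
}

void main()
{
    assert(toSortableDates("Due 12/25/2012.") == "Due 2012-12-25.");
    // $& stands for the whole match
    assert(replaceFirst("noon", regex("n"), "[$&]") == "[n]oon");
    // $' inserts the input that follows the match
    assert(replaceFirst("abc", regex("b"), "<$'>") == "a<c>c");
}
---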
$(P Now let's aim for something bigger, this time to show that $(STD regex) can do things that
are unattainable by classic textual substitution alone. Imagine you want to translate a web shop catalog so
that it displays prices in your currency. Yes, one can use a calculator or estimate it in one's head,
given the current exchange rate. Being programmers we can do better, so let's put together a simple program that
converts text to use correct prices everywhere. For the sake of example let it be UK pounds and US dollars.
)
---
import std.conv, std.regex, std.range, std.file, std.stdio;
import std.string : format;
void main(string[] argv)
{
immutable ratio = 1.5824; // UK pounds to US dollar as of this writing
auto toDollars(Captures!string price)
{
real value = to!real(price["integer"]);
if (!price["fraction"].empty)
value += 0.01*to!real(price["fraction"]);
return format("$%.2f",value * ratio);
}
string text = std.file.readText(argv[1]);
auto converted = replaceAll!toDollars(text,
regex(r"£\s*(?P<integer>[0-9]+)(\.(?P<fraction>[0-9]{2}))?","g"));
write(converted);
}
---
$(P Getting current conversion rates and supporting more currencies is left as an exercise for the reader.
What is at work here is the so-called replace with delegate, analogous to the callout ability found in other languages
and regex libraries. The magic is simple: whenever replace finds a match, it calls a user-supplied callback
on the captured piece, then uses the return value as the replacement.
)
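$(P The same callback mechanism in its smallest form might look like the sketch below.
The helper $(D doubleNumbers), the nested $(D twice) function and the sample sentence are hypothetical and serve only to show the call shape.
)
---
import std.conv : to;
import std.regex;

// Hypothetical helper: doubles every integer found in text.
string doubleNumbers(string text)
{
    // The callback receives the captures of one match
    // and returns the text that replaces it.
    static string twice(Captures!string m)
    {
        return to!string(to!int(m.hit) * 2);
    }
    return replaceAll!twice(text, regex(r"[0-9]+"));
}

void main()
{
    assert(doubleNumbers("3 apples and 10 pears") == "6 apples and 20 pears");
}
---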
$(P And I just can't resist spicing this example up with yet another feature: named groups.
Names work just like aliases for the numbers of captured subexpressions,
meaning that with the same exact regular expression one could as well change the affected lines to:
---
real value = to!real(price[1]);
if (!price[3].empty)
value += 0.01*to!real(price[3]);
---
That, however, is less readable and not as future-proof.
)
$(P Also note that optional captures are still represented; they are simply empty strings if not matched.
)
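$(P Both points, names as aliases for numbers and empty optional captures, can be checked in a few lines.
This is only a sketch: the helper $(D fractionOf), its "00" default and the inputs are assumptions, and the currency sign is left out of the pattern for brevity.
)
---
import std.regex;

// Hypothetical helper: the two fraction digits of a price,
// or "00" when the optional fraction group did not take part in the match.
string fractionOf(string price)
{
    auto m = matchFirst(price,
        regex(r"(?P<integer>[0-9]+)(\.(?P<fraction>[0-9]{2}))?"));
    if (m.empty)
        return "";
    // the name and the number refer to the same group
    assert(m["fraction"] == m[3]);
    return m["fraction"].length == 0 ? "00" : m["fraction"];
}

void main()
{
    assert(fractionOf("12.34") == "34");
    assert(fractionOf("12") == "00"); // optional capture is present but empty
}
---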
$(H3 Split it up)
$(P As the core functionality has already been presented, let's move on to some extras.
Sometimes it's useful to do almost the opposite of searching: split up input using a regex as the separator.
Like the following sample, which outputs text sentence by sentence:
)
---
foreach (sentence; splitter(argv[1], regex(r"(?<=[.?!]+(?![?!]))\s*")))
writeln(sentence);
---
$(P Again, the result of splitter is a range, thus allowing foreach iteration.
Notice the usage of lookaround in the regex; it's a neat trick here, as stripping off the final punctuation is
not our intention. Breaking down this example, the (?<=[.?!]) part looks behind for the first ., ? or !.
This gets us halfway to our goal, because \s* also matches between elements of punctuation like "?!",
so a negative lookahead is introduced $(I inside the lookbehind) to make sure we are past all of the punctuation marks.
Admittedly, the barrage of ? and ! makes this regex look more obscure than it actually is.
Observe that there are no restrictions on the contents of lookaround expressions:
one can go for lookahead inside lookbehind and so on.
However, in general it's recommended to use them sparingly, keeping them as the weapon of last resort.
)
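$(P For comparison, here is a sketch of $(D splitter) with a much simpler separator and no lookaround, which makes the result easy to check by hand.
The helper $(D splitFields) and the comma-separated input are hypothetical.
)
---
import std.array : array;
import std.regex;

// Hypothetical helper: splits on commas, swallowing the white space around them.
string[] splitFields(string line)
{
    // splitter is lazy; array collects the pieces into a string[]
    return splitter(line, regex(r"\s*,\s*")).array;
}

void main()
{
    assert(splitFields("a, b ,c") == ["a", "b", "c"]);
}
---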
$(H3 Static regex)
$(P Let's stop going for features and start thinking about performance. D has something to offer here.
For one, there is the ability to precompile a constant regex at compile-time:
)
---
static r = regex("Boo-hoo");
assert(matchFirst("What was that? Boo-hoo?", r));
---
$(P Importantly, it's the exact same Regex object that works through all of the API we've seen so far.
It takes next to nothing to initialize: the ready-made structure is just copied over from the data segment.
)
$(P Initialization takes roughly 1 μs of run-time. The run-time version took around 10-20 μs on my machine; keep in mind that it was a trivial pattern.
)
$(P Stepping further in this direction, there is the ability to construct specialized
D code that matches a given regex and compile it instead of using the default run-time engine.
Isn't it so often the case that code starts with regular expressions only to be later rewritten
into heaps of scary-looking code to meet speed requirements? No need to rewrite: we've got you covered.
)
$(P Recalling the phone number example:
)
---
// Only about five characters differ from the run-time version
string phone = "+31 650 903 7158";
auto phoneReg = ctRegex!r"^\+([1-9][0-9]*) [0-9 ]*$";
auto m = matchFirst(phone, phoneReg);
assert(m);
assert(m[0] == "+31 650 903 7158");
assert(m[1] == "31");
---
$(P Interestingly it looks almost exactly the same (a total of 5 letters changed), yet it does all of
the hard work: it generates D code for this pattern, compiles it (again) and masquerades under the same API.
This is the key point: the API stays consistent, yet gets us the significant speed-up we sought.
This fosters quick iteration with $(D regex) and, if desired, decent speed with $(D ctRegex) in the release build (at the cost of slower builds).
)
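$(P One way to exploit the shared API is to write the matching code once and decide later which kind of regex to pass in.
The sketch below assumes a hypothetical generic helper, $(D countNumbers), whose regex type is a template parameter; the pattern and the input are made up.
)
---
import std.range : walkLength;
import std.regex;

// Hypothetical helper: generic over the regex type, so the same code
// accepts a run-time regex as well as a ctRegex.
size_t countNumbers(RegEx)(string text, RegEx re)
{
    return matchAll(text, re).walkLength;
}

void main()
{
    auto runtimeRe = regex(r"[0-9]+");         // built when the program runs
    auto compileTimeRe = ctRegex!(r"[0-9]+");  // matcher generated during compilation
    assert(countNumbers("a1 b22 c333", runtimeRe) == 3);
    assert(countNumbers("a1 b22 c333", compileTimeRe) == 3);
}
---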
$(P In this particular case it matched roughly 50% faster for me, though I haven't
done a comprehensive analysis of this case.
That being said, there is no doubt the ctRegex facility is going to improve immensely over time.
We have only scratched the surface of new exciting possibilities.
More data on real-world patterns, performance of CT-regex and other benchmarks is available
$(LINK2 https://github.com/DmitryOlshansky/FReD, here).
)
$(H3 Conclusion)
$(P This article is a walk-through of $(D std.regex) focused on showcasing the API.
By following a series of easy yet meaningful tasks, its features were shown in combinations
that underline the elegance and flexibility of this library solution.
The good thing is that not only is the API natural, it also follows established
standards and integrates well with the rest of Phobos.
Putting together its major features for a short-list, $(STD regex) is:
)
$(UL
$(LI Fully Unicode-aware, meeting the standard's requirements for full level 1 Unicode support.)
$(LI Rich in modern extensions, including unlimited generalized lookaround.
That makes porting regexes from other libraries a breeze.)
$(LI Lean in API, which consists of a few flexible tools: $(D matchFirst)/$(D matchAll), $(D replaceFirst)/$(D replaceAll) and $(D splitter).)
$(LI Uniform and powerful, with unique abilities like precompiling a regex or generating
a specially tailored engine at compile-time with ctRegex. All of this fits within the common interface without a hitch.)
)
$(P The format of this article is intentionally more of an overview; it doesn't stop to talk in depth about
certain capabilities like case-insensitive matching (simple casefolding rules of Unicode),
backreferences and lazy quantifiers. Even more features are coming to add more expressive power
and reach greater performance.
)
)
Macros:
H3 = <h3>$0</h3>
DOLLAR = $
STD = $(LINK2 phobos/std_$0.html, std.$0)
SUBNAV=$(SUBNAV_ARTICLES)