-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathbook.tex
2731 lines (2205 loc) · 130 KB
/
book.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\hypertarget{acknowledgments}{%
\chapter{Acknowledgments}\label{acknowledgments}}
I thank my friend Miki Tebeka, who invited me to write this book, albeit
in slightly different form than the version you see. I am very grateful
to my friend Brad Huntting and partner Mary Ann Sushinsky who provided
clever ideas in the directions of these puzzles. Thanks to my colleague
Lucy Wan, who provided a final proofreading, finding the many silly
typos missed on many prior reads.
A number of other friends and family members listened to me enumerate
the foibles of another publisher who clings to a cargo-culted toolchain.
With ambivalence, I thank Noam Chomsky for arranging computability into
a neat hierarchy, with regular expressions at the bottom.
\newpage
\begin{figure}
\centering
\includegraphics{images/Liberty.png}
\caption{La Liberté éclairant le monde}
\end{figure}
\hypertarget{rights-of-woman}{%
\chapter{Rights of (Wo)Man}\label{rights-of-woman}}
``Whatever is my right as a man is also the right of another; and it
becomes my duty to guarantee as well as to possess.''
\begin{quote}
― Thomas Paine
\end{quote}
\begin{center}\rule{0.5\linewidth}{0.5pt}\end{center}
This book is copyright of David Mertz, 2021.
It is licensed as Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA
4.0). The source is available at:
\begin{quote}
https://github.com/DavidMertz/RegEx\_Puzzles.
\end{quote}
Please feel free to utilize it within these terms (but give me credit).
\begin{figure}
\centering
\includegraphics{images/Striated_Verso.png}
\caption{Striated\_Verso}
\end{figure}
\hypertarget{credits}{%
\chapter{Credits}\label{credits}}
Cover image: ``Alien DNA'' by Sven Geier, 2015. Used by permission.
Back cover photo by Mary Ann Sushinsky, 2018. Used by permission.
Images by Jay Trolinger (https://www.spoonflower.com/profiles/ormolu)
used by permission: Basket-Verso; Root5spiral-Verso; Striated-Verso;
Olives-Verso; Basket-Recto; Naive-Scribble-Verso; Root5spiral-Recto;
Naive-Scribble-Recto; Striated-Recto; Olives-Recto
Leo Reynolds (CC BY-NC-SA 2.0): joker-48067975746
Pixabay (https://pixabay.com/service/license/): clown-28772
Dmitry Fomin (CC0 1.0): Atlas\_deck\_joker\_black
freesvg.org (Public Domain): johnny-automatic-right-hand;
johnny-automatic-left-hand; johnny-automatic-left-hand;
clown-1549219095; Prismatic-DNA-Helix-Circles-3
OpenClipArt (Public Domain): Elegant-Flourish-Frame-Extrapolated-19
Samuel MacGregor Liddel Mathews, ``The Goetia: The Lesser Key of Solomon
the King'' (1904, Public Domain): N\_A\_B\_E\_R\_I\_U\_S
\emph{Stuck in the Middle with You}, by Gerry Rafferty and Joe Egan
(Stealer Wheels), 1973:
\begin{quote}
Clowns to the left of me! / Jokers to the right! / Here I am stuck in
the middle with you.
\end{quote}
\begin{figure}
\centering
\includegraphics{images/clown-1549219095.svg}
\caption{clown-1549219095}
\end{figure}
\hypertarget{preface}{%
\chapter{Preface}\label{preface}}
Regular expressions---sometimes given the playful back formation
\emph{regexen} or more neutrally \emph{regex}---are a powerful and
compact way of describing patterns in text. Many tutorials and ``cheat
sheets'' exist to understand their syntax and semantics in a formally
correct manner. I encourage you to read some of those, if you have not
already.
These puzzles begin at a certain point where the formal descriptions
leave off. As you work with regexen, you will find subtle pitfalls. A
pattern that seems like it should obviously match one thing actually
matches something slightly different than you intended. Or perhaps a
match pattern has ``pathological'' behavior and takes far too long. Or
sometimes it is simply that a more concise pattern would be clearer in
describing what you actually wish to match.
A great many programming languages, libraries, and tools support regular
expressions, with relatively minor variations in the syntax used. Such
software includes \texttt{{[}efr{]}?grep}, \texttt{sed}, \texttt{awk},
\emph{Perl}, \emph{Java}, \emph{.NET}, \emph{JavaScript}, \emph{Julia},
\emph{XML Schema}, or indeed, pretty much every other programming
language via a library.
For this book, we will use Python to pose these puzzles. In particular,
we will use the standard library module \texttt{re}. Often code samples
are used in puzzles and in explanation; where I wish to show the output
from code, the example emulates the Python shell with lines starting
with \texttt{\textgreater{}\textgreater{}\textgreater{}} (or continuing
with \texttt{...}). Outputs are echoed without a prompt in this case.
Where code defines a function that is not necessarily executed in the
mention, only the plain code is shown.
While you are reading this book, I strongly encourage you to keep open
an interactive Python environment. Many tools enable this, such as the
Python REPL (read-evaluate-print-loop) itself, or the IPython enhanced
REPL, or Jupyter notebooks, or the IDLE editor that comes with Python,
or indeed most modern code editors and IDEs (integrated development
environments). A number of online regular expression testers are also
available, although those will not capture the the Python calling
details. Explanations will follow each puzzle, but trying to work it out
in code before reading it is worthwhile. C'mon, not thinking about a
puzzle before reading the solution is a cop-out.
\hypertarget{quantifiers-and-special-sub-patterns}{%
\chapter{Quantifiers and Special
Sub-Patterns}\label{quantifiers-and-special-sub-patterns}}
Solving the puzzles in this section will require you to have a good
understanding of the different quantifiers that regular expressions
provide, and to pay careful attention to when you should use subpatterns
(themselves likely quantified).
\begin{figure}
\centering
\includegraphics{images/Prismatic-DNA-Helix-Circles-3.svg}
\caption{Prismatic-DNA-Helix-Circles-3}
\end{figure}
\newpage
\hypertarget{wildcard-scope}{%
\section{Wildcard Scope}\label{wildcard-scope}}
A powerful element of Python regular expression syntax---shared by many
other regex engines---is the option of creating either ``greedy'' or
``non-greedy'' matches. The former matches as much as it possibly can,
as long as it finds the later part of a pattern. The latter matches as
little as it possibly can to reach the next part of a pattern.
Suppose you have these two regular expressions:
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{pat1 }\OperatorTok{=}\NormalTok{ re.}\BuiltInTok{compile}\NormalTok{(}\VerbatimStringTok{r\textquotesingle{}x.*y\textquotesingle{}}\NormalTok{) }\CommentTok{\# greedy}
\NormalTok{pat2 }\OperatorTok{=}\NormalTok{ re.}\BuiltInTok{compile}\NormalTok{(}\VerbatimStringTok{r\textquotesingle{}x.*?y\textquotesingle{}}\NormalTok{) }\CommentTok{\# non{-}greedy}
\end{Highlighting}
\end{Shaded}
And also the following block of text that you want to match. You can
think of it as a sort of \emph{lorem ipsum} that only has `X' words, if
you will.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{txt }\OperatorTok{=} \StringTok{"""}
\StringTok{xenarthral xerically xenomorphically xebec xenomania}
\StringTok{xenogenic xenogeny xenophobically xenon xenomenia}
\StringTok{xylotomy xenogenies xenografts xeroxing xenons xanthous}
\StringTok{xenoglossy xanthopterins xenoglossy xeroxed xenophoby}
\StringTok{xenoglossies xanthoxyls xenoglossias xenomorphically}
\StringTok{xeroxes xanthopterin xebecs xenodochiums xenodochium}
\StringTok{xylopyrography xanthopterines xerochasy xenium xenic}
\StringTok{"""}
\end{Highlighting}
\end{Shaded}
You'd like to match all and only words that start with `X' and end with
`Y'. What pattern makes sense to use, and why? The code to find the
words can look like:
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{xy\_words }\OperatorTok{=}\NormalTok{ re.findall(\_pat, txt)}
\end{Highlighting}
\end{Shaded}
Before you turn the page\ldots{}
\textbf{Think about what each pattern will match.}
\includegraphics{images/Elegant-Flourish-Frame-Extrapolated-19.svg}
\newpage
Did this puzzle fool you? Welcome to the world of regular expressions!
Both \texttt{pat1} and \texttt{pat2} match the wrong thing, but in
different ways.
If you liked \texttt{pat1}, you've greedily matched too much. The `y'
might occur in later words (per line), and the match won't end until the
last `y' on a line.
\begin{Shaded}
\begin{Highlighting}[]
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}} \ControlFlowTok{for}\NormalTok{ match }\KeywordTok{in}\NormalTok{ re.findall(pat1, txt):}
\NormalTok{... }\BuiltInTok{print}\NormalTok{(match)}
\NormalTok{...}
\NormalTok{xenarthral xerically xenomorphically}
\NormalTok{xenogenic xenogeny xenophobically}
\NormalTok{xylotomy}
\NormalTok{xenoglossy xanthopterins xenoglossy xeroxed xenophoby}
\NormalTok{xenoglossies xanthoxyls xenoglossias xenomorphically}
\NormalTok{xylopyrography xanthopterines xerochasy}
\end{Highlighting}
\end{Shaded}
On each line, the greedy pattern started at the first `x', which is
often not what you want. Moreover, most lines match multiple words, with
only the line beginning with `xylotomy' happening to be the isolated
word we actually want. The line that begins with `xeroxes' is not
matched at all, which is what we want.
If you liked \texttt{pat2} you often get words, but at other times
either too much \emph{or too little} might be matched. For example, if
`xy' occurs in a longer word, either as a prefix or in the middle, it
can also match.
\begin{Shaded}
\begin{Highlighting}[]
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}} \ControlFlowTok{for}\NormalTok{ match }\KeywordTok{in}\NormalTok{ re.findall(pat2, txt):}
\NormalTok{... }\BuiltInTok{print}\NormalTok{(match)}
\NormalTok{...}
\NormalTok{...}
\NormalTok{xenarthral xerically}
\NormalTok{xenomorphically}
\NormalTok{xenogenic xenogeny}
\NormalTok{xenophobically}
\NormalTok{xy}
\NormalTok{xenoglossy}
\NormalTok{xanthopterins xenoglossy}
\NormalTok{xeroxed xenophoby}
\NormalTok{xenoglossies xanthoxy}
\NormalTok{xenoglossias xenomorphically}
\NormalTok{xy}
\NormalTok{xanthopterines xerochasy}
\end{Highlighting}
\end{Shaded}
By being non-greedy, we stop when the first `y' is encountered, but as
you see, that still is not quite what we want.
What we actually need to focus on for this task is the \emph{word
boundaries}. Things that are not lowercase letters cannot be part of
matches. In this simple case, non-letters are all spaces and newlines,
but other characters might occur in other texts.
We can be greedy to avoid matching prefixes or infixes, but we also want
to ignore non-letter characters.
\begin{Shaded}
\begin{Highlighting}[]
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ pat3 }\OperatorTok{=}\NormalTok{ re.}\BuiltInTok{compile}\NormalTok{(}\VerbatimStringTok{r\textquotesingle{}x[a{-}z]*y\textquotesingle{}}\NormalTok{)}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}} \ControlFlowTok{for}\NormalTok{ match }\KeywordTok{in}\NormalTok{ re.findall(pat3, txt):}
\NormalTok{... }\BuiltInTok{print}\NormalTok{(match)}
\NormalTok{...}
\NormalTok{xerically}
\NormalTok{xenomorphically}
\NormalTok{xenogeny}
\NormalTok{xenophobically}
\NormalTok{xylotomy}
\NormalTok{xenoglossy}
\NormalTok{xenoglossy}
\NormalTok{xenophoby}
\NormalTok{xanthoxy}
\NormalTok{xenomorphically}
\NormalTok{xylopyrography}
\NormalTok{xerochasy}
\end{Highlighting}
\end{Shaded}
Everything we matched, anywhere on each line, had an `x', some other
letters (perhaps including `x's or 'y's along the way), then a 'y'.
Whatever came after each match was a non-letter character.
\newpage
\hypertarget{words-and-sequences}{%
\section{Words and Sequences}\label{words-and-sequences}}
In the previous problem, we identified words that started with `x' and
ended with `y'. You may have noticed, however, that we had already
included the assumption that all the words started with `x'. Perhaps
your solution was clever enough not to fall for the danger shown in this
puzzle. Namely, perhaps not all words will actually start with `x' to
begin with; i.e.~if we try to apply our previous regex to such text.
\begin{Shaded}
\begin{Highlighting}[]
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ txt }\OperatorTok{=} \StringTok{"""}
\StringTok{expurgatory xylometer xenotime xenomorphically exquisitely}
\StringTok{xylology xiphosurans xenophile oxytocin xylogen}
\StringTok{xeriscapes xerochasy inexplicably exabyte inexpressibly}
\StringTok{extremity xiphophyllous xylographic complexly vexillology}
\StringTok{xanthenes xylenol xylol xylenes coextensively}
\StringTok{"""}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ pat3 }\OperatorTok{=}\NormalTok{ re.}\BuiltInTok{compile}\NormalTok{(}\VerbatimStringTok{r\textquotesingle{}x[a{-}z]*y\textquotesingle{}}\NormalTok{)}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ re.findall(pat3, txt)}
\NormalTok{[}\StringTok{\textquotesingle{}xpurgatory\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xy\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xenomorphically\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xquisitely\textquotesingle{}}\NormalTok{,}
\StringTok{\textquotesingle{}xylology\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xy\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xy\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xerochasy\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xplicably\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xaby\textquotesingle{}}\NormalTok{,}
\StringTok{\textquotesingle{}xpressibly\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xtremity\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xiphophy\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xy\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xly\textquotesingle{}}\NormalTok{,}
\StringTok{\textquotesingle{}xillology\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xy\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xy\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xy\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xtensively\textquotesingle{}}\NormalTok{]}
\end{Highlighting}
\end{Shaded}
As you can see, we matched a number of substrings within words, not only
whole words. What pattern can you use to actually match only words that
start with `x' and end with `y'?
Before you turn the page\ldots{}
\textbf{Think about what defines word boundaries.}
\includegraphics{images/Elegant-Flourish-Frame-Extrapolated-19.svg}
\begin{figure}
\centering
\includegraphics{images/Basket_Recto.png}
\caption{Basket\_Recto}
\end{figure}
\newpage
There are a few ways you might approach this task. The easiest is to use
the explicit ``word boundary'' special \emph{zero-width match} pattern,
spelled as \texttt{\textbackslash{}b} in Python and many other regular
expression engines.
\begin{Shaded}
\begin{Highlighting}[]
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ pat4 }\OperatorTok{=}\NormalTok{ re.}\BuiltInTok{compile}\NormalTok{(}\VerbatimStringTok{r\textquotesingle{}\textbackslash{}bx[a{-}z]*y\textbackslash{}b\textquotesingle{}}\NormalTok{)}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ re.findall(pat4, txt)}
\NormalTok{[}\StringTok{\textquotesingle{}xenomorphically\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xylology\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xerochasy\textquotesingle{}}\NormalTok{]}
\end{Highlighting}
\end{Shaded}
Less easy ways to accomplish this include using lookahead and lookbehind
to find non-matching characters that must ``bracket'' the actual match.
For example:
\begin{Shaded}
\begin{Highlighting}[]
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ pat5 }\OperatorTok{=} \VerbatimStringTok{r\textquotesingle{}(?\textless{}=\^{}|(?\textless{}=[\^{}a{-}z]))x[a{-}z]+y(?=$|[\^{}a{-}z])\textquotesingle{}}\NormalTok{)}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ re.findall(pat5, txt)}
\NormalTok{[}\StringTok{\textquotesingle{}xenomorphically\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xylology\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xerochasy\textquotesingle{}}\NormalTok{]}
\end{Highlighting}
\end{Shaded}
One trick here is that when we perform a lookbehind assertion, it must
have a fixed width of the match. However, words in our list might either
follow spaces or occur at the start of a line. So we need to create an
alternation between the zero-width lookbehind and the one non-letter
character lookbehind. For the lookahead element, it is enough to say it
is \emph{either} end-of-line (\texttt{\$}) \emph{or} is a non-letter
(\texttt{{[}\^{}a-z{]}}).
\newpage
\hypertarget{endpoint-classes}{%
\section{Endpoint Classes}\label{endpoint-classes}}
This puzzle continues the word matching theme of the last two puzzles.
However, here we have a new wrinkle. We would like to identify
\emph{both} words that start with `x' and end with `y', but \emph{also}
words that start with `y' and end with `x'.
Remembering the word boundary special zero-width pattern we already saw,
a first try at this task might be:
\begin{Shaded}
\begin{Highlighting}[]
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ txt }\OperatorTok{=} \StringTok{"""}
\StringTok{expurgatory xylometer yex xenomorphically exquisitely}
\StringTok{xylology xiphosurans xenophile yunx oxytocin xylogen}
\StringTok{xeriscapes xerochasy inexplicably yonderly inexpressibly}
\StringTok{extremity xerox xylographic complexly vexillology}
\StringTok{xanthenes xylenol xylol yexing xylenes coextensively}
\StringTok{\textgreater{}\textgreater{}\textgreater{} pat6 = re.compile(r\textquotesingle{}}\CharTok{\textbackslash{}b}\StringTok{[xy][a{-}z]*[xy]}\CharTok{\textbackslash{}b}\StringTok{\textquotesingle{})}
\StringTok{\textgreater{}\textgreater{}\textgreater{} re.findall(pat6, txt)}
\StringTok{[\textquotesingle{}yex\textquotesingle{}, \textquotesingle{}xenomorphically\textquotesingle{}, \textquotesingle{}xylology\textquotesingle{}, \textquotesingle{}yunx\textquotesingle{}, \textquotesingle{}xerochasy\textquotesingle{},}
\StringTok{\textquotesingle{}yonderly\textquotesingle{}, \textquotesingle{}xerox\textquotesingle{}]}
\StringTok{"""}
\end{Highlighting}
\end{Shaded}
What went wrong there? Clearly we matched some words we do not want,
even though all of them began with `x' or `y' and ended with `x' or `y'.
Before you turn the page\ldots{}
\textbf{Try to refine the regular expression to match what we want.}
\includegraphics{images/Elegant-Flourish-Frame-Extrapolated-19.svg}
\newpage
The first pattern shown allows for either `x' or `y' to occur at either
the beginning or the end of a word. The word boundaries are handled
fine, but this allows words both beginning and ending with `x', and
likewise beginning and ending with `y'. The character classes at each
end of the overall pattern are independent.
This may seem obvious on reflection, but it is very much like errors I
myself have made embarrassingly many times in real code. A robust
approach is simply to list everything you want as alternatives in a
pattern.
\begin{Shaded}
\begin{Highlighting}[]
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ pat7 }\OperatorTok{=}\NormalTok{ re.}\BuiltInTok{compile}\NormalTok{(}\VerbatimStringTok{r\textquotesingle{}\textbackslash{}b((x[a{-}z]*y)|(y[a{-}z]*x))\textbackslash{}b\textquotesingle{}}\NormalTok{)}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ [m[}\DecValTok{0}\NormalTok{] }\ControlFlowTok{for}\NormalTok{ m }\KeywordTok{in}\NormalTok{ re.findall(pat7, txt)]}
\NormalTok{[}\StringTok{\textquotesingle{}yex\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xenomorphically\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xylology\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}yunx\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xerochasy\textquotesingle{}}\NormalTok{]}
\end{Highlighting}
\end{Shaded}
In this solution, there is a little bit of Python-specific detail in the
function API. The function \texttt{re.findall()} returns tuples when a
pattern contains multiple groups. Group 1 will be the whole word, but
one or the other of group 2 and 3 will be blank, i.e.:
\begin{Shaded}
\begin{Highlighting}[]
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ re.findall(pat7, txt)}
\NormalTok{[(}\StringTok{\textquotesingle{}yex\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}yex\textquotesingle{}}\NormalTok{),}
\NormalTok{(}\StringTok{\textquotesingle{}xenomorphically\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xenomorphically\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}\textquotesingle{}}\NormalTok{),}
\NormalTok{(}\StringTok{\textquotesingle{}xylology\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xylology\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}\textquotesingle{}}\NormalTok{),}
\NormalTok{(}\StringTok{\textquotesingle{}yunx\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}yunx\textquotesingle{}}\NormalTok{),}
\NormalTok{(}\StringTok{\textquotesingle{}xerochasy\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}xerochasy\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}\textquotesingle{}}\NormalTok{)]}
\end{Highlighting}
\end{Shaded}
\newpage
\hypertarget{a-configuration-format}{%
\section{A Configuration Format}\label{a-configuration-format}}
This exercise requires just a little bit of Python itself, but is mainly
about choosing the right regular expression. Let's suppose you have a
configuration format that looks like this:
\begin{verbatim}
config = """
3 = foobar
14=baz
9= fizzbuzz
21=more_stuff,here
"""
\end{verbatim}
With a little bit of code, and using a regular expression, you wish to
convert text in this format to a dictionary mapping the numbers to the
left of the equal sign to the strings to the right. For example, the
above example would produce:
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{\{}\DecValTok{3}\NormalTok{: }\StringTok{\textquotesingle{}foobar\textquotesingle{}}\NormalTok{, }\DecValTok{14}\NormalTok{: }\StringTok{\textquotesingle{}baz\textquotesingle{}}\NormalTok{, }\DecValTok{9}\NormalTok{: }\StringTok{\textquotesingle{}fizzbuzz\textquotesingle{}}\NormalTok{, }\DecValTok{21}\NormalTok{: }\StringTok{\textquotesingle{}more\_stuff,here\textquotesingle{}}\NormalTok{\}}
\end{Highlighting}
\end{Shaded}
Before you turn the page\ldots{}
\textbf{Remember that shapes have edges.}
\includegraphics{images/Elegant-Flourish-Frame-Extrapolated-19.svg}
\newpage
As the example shows, there seems to be flexibility in spaces around the
two sides of the equal sign. We should probably assume zero or more
spaces are permitted on either side. The format is probably slightly
surprising in that we would more commonly use words on the left and
numbers on the right in most formats; but it is well-defined enough, and
we can stipulate it has a purpose.
The easiest way to capture the relevant information is probably by using
groups for each side, which will be exposed by \texttt{re.findall()} and
other regular expression functions. We \emph{almost} get the right
answer with this:
\begin{Shaded}
\begin{Highlighting}[]
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}} \BuiltInTok{dict}\NormalTok{(re.findall(}\VerbatimStringTok{r\textquotesingle{}\^{}(\textbackslash{}d+) *= *(.*)$\textquotesingle{}}\NormalTok{, s, re.MULTILINE))}
\NormalTok{\{}\StringTok{\textquotesingle{}3\textquotesingle{}}\NormalTok{: }\StringTok{\textquotesingle{}foobar\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}14\textquotesingle{}}\NormalTok{: }\StringTok{\textquotesingle{}baz\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}9\textquotesingle{}}\NormalTok{: }\StringTok{\textquotesingle{}fizzbuzz\textquotesingle{}}\NormalTok{,}
\StringTok{\textquotesingle{}21\textquotesingle{}}\NormalTok{: }\StringTok{\textquotesingle{}more\_stuff,here\textquotesingle{}}\NormalTok{\}}
\end{Highlighting}
\end{Shaded}
Notice that we required the ``multiline'' modifier to match on each line
of the string. The one problem is that the puzzle requested that numbers
appear as numbers, not as strings of digits. There are a number of ways
we might achieve that in Python, but one easy one is:
\begin{Shaded}
\begin{Highlighting}[]
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ \{}\BuiltInTok{int}\NormalTok{(k): v }\ControlFlowTok{for}\NormalTok{ k, v }\KeywordTok{in}
\NormalTok{ re.findall(}\VerbatimStringTok{r\textquotesingle{}\^{}(\textbackslash{}d+) *= *(.*)$\textquotesingle{}}\NormalTok{, s, re.MULTILINE)\}}
\NormalTok{\{}\DecValTok{3}\NormalTok{: }\StringTok{\textquotesingle{}foobar\textquotesingle{}}\NormalTok{, }\DecValTok{14}\NormalTok{: }\StringTok{\textquotesingle{}baz\textquotesingle{}}\NormalTok{, }\DecValTok{9}\NormalTok{: }\StringTok{\textquotesingle{}fizzbuzz\textquotesingle{}}\NormalTok{, }
\DecValTok{21}\NormalTok{: }\StringTok{\textquotesingle{}more\_stuff,here\textquotesingle{}}\NormalTok{\}}
\end{Highlighting}
\end{Shaded}
\begin{figure}
\centering
\includegraphics{images/Root5spiral_Recto.png}
\caption{Root5spiral\_Recto}
\end{figure}
\newpage
\hypertarget{the-human-genome}{%
\section{The Human Genome}\label{the-human-genome}}
Genomics commonly uses a format called FASTA to represent genetic
sequences. This puzzle uses a subset of the overall format. Let's
provide just a few quick tips. The letters `A', `C', `G', `T' represent
nucleotide bases in DNA. FASTA may also contain the symbol `N' for
``unknown nucleotide'' and `-' for ``gap of indeterminate length.''
As well, in biological organisms, spans of DNA are terminated by
``telomeres,'' which are special sequences indicating that the read
mechanism should stop transcription and form a protein. Telomeres are
often repeated as much as thousands of times at the ends of sequences.
In a gross simplification for this puzzle, let's suppose that three or
more repetitions of a telomere indicate the end of a sequence for a
protein. In vertebrates, the telomere used is `TTAGGG'.
In this puzzle, we will ignore the marking of the start of a
protein-encoding region, and just assume that all of our strings begin a
potential protein encoding.
You would like to create a regular expression that represents a
``specific protein encoding'' from a (simplified) FASTA fragment. In
particular, we need to know exactly which nucleotides are present, and
gaps or unknown nucleotides will prevent a match. Moreover, even the
telomere repetitions at the end are not permitted (for this puzzle) to
have gaps or unknowns.
For this puzzle, assume that all the FASTA symbols are on a single line.
Normally as published they have a fixed width less than 80 characters;
but newlines are simply ignored. An example of a match:\footnote{Some
characters shown have Unicode combining diacritics to draw your eye to
features. Technically, therefore, some characters shown are not
actually the FASTA codes they look similar to.}
\begin{Shaded}
\begin{Highlighting}[]
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}} \ImportTok{from}\NormalTok{ textwrap }\ImportTok{import}\NormalTok{ wrap}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}} \BuiltInTok{print}\NormalTok{(}\StringTok{\textquotesingle{}}\CharTok{\textbackslash{}n}\StringTok{\textquotesingle{}}\NormalTok{.join(wrap(valid, }\DecValTok{60}\NormalTok{)))}
\NormalTok{CCCTGAATAATCAAGGTCACAGACCAGTTAGAATGGTTTAGTGTGGAAAGCGGGAAACGA}
\NormalTok{AAAGCCTCTCTGAATCCTGCGCACCGAGATTCTCCCAAGGCAAGGCGAGGGGCTGTATTG}
\NormalTok{CAGGGTTCAACTGCAGCGTCGCAACTCAAATGCAGCATTCCTAATGCACACATGACACCC}
\NormalTok{AAAATATAACAGACATATTACTCATGGAGGGTGAGGGTGAGGGTGAGGGT̠T̠A̠G̠G̠G̠T̠T̠A̠G̠G̠}
\NormalTok{G̠T̠T̠A̠G̠G̠G̠T̠T̠A̠G̠G̠G̠T̠T̠A̠G̠G̠G̠T̠T̠A̠G̠G̠G̠T̠T̠A̠G̠G̠G̠T̠T̠A̠G̠G̠G̠T̠T̠A̠G̠G̠G̠T̠T̠A̠G̠G̠G̠}
\end{Highlighting}
\end{Shaded}
Using a good pattern, we can identify everything up to, but not
including, the telomere repetitions.
\begin{Shaded}
\begin{Highlighting}[]
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ coding }\OperatorTok{=}\NormalTok{ re.search(pat, valid).group()}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}} \BuiltInTok{print}\NormalTok{(}\StringTok{\textquotesingle{}}\CharTok{\textbackslash{}n}\StringTok{\textquotesingle{}}\NormalTok{.join(wrap(coding, }\DecValTok{60}\NormalTok{)))}
\NormalTok{CCCTGAATAATCAAGGTCACAGACCAGTTAGAATGGTTTAGTGTGGAAAGCGGGAAACGA}
\NormalTok{AAAGCCTCTCTGAATCCTGCGCACCGAGATTCTCCCAAGGCAAGGCGAGGGGCTGTATTG}
\NormalTok{CAGGGTTCAACTGCAGCGTCGCAACTCAAATGCAGCATTCCTAATGCACACATGACACCC}
\NormalTok{AAAACTATAACAGACATATTACTCATGGAGGGTGAGGGTGGGGGTGAGGG}
\end{Highlighting}
\end{Shaded}
The next two are failures. The first does not have sufficient
repetitions. The second has a non-specific nucleotide symbol:
\begin{Shaded}
\begin{Highlighting}[]
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}} \BuiltInTok{print}\NormalTok{(}\StringTok{\textquotesingle{}}\CharTok{\textbackslash{}n}\StringTok{\textquotesingle{}}\NormalTok{.join(wrap(bad\_telomere, }\DecValTok{60}\NormalTok{)))}
\NormalTok{CCCTGAATAATCAAGGTCACAGACCAGTTAGAATGGTTTAGTGTGGAAAGCGGGAAACGA}
\NormalTok{AAAGCCTCTCTGAATCCTGCGCACCGAGATTCTCCCAAGGCAAGGCGAGGGGCTGTATTG}
\NormalTok{CAGGGTTCAACTGCAGCGTCGCAACTCAAATGCAGCATTCCTAATGCACACATGACACCC}
\NormalTok{AAAATATAACAGACATATTACTCATGGAGGGTGAGGGTGAGGGTGAGGGT̠T̠A̠G̠G̠G̠T̠T̠A̠G̠G̠}
\NormalTok{G̠TT̠T̠A̠G̠G̠G̠T̠T̠A̠G̠G̠G̠TT̠T̠A̠G̠G̠G̠GT̠T̠A̠G̠G̠G̠GT̠T̠A̠G̠G̠G̠AT̠T̠A̠G̠G̠G̠T̠T̠A̠G̠G̠G̠TTTAGG}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ re.search(pat, bad\_telomere) }\KeywordTok{or} \StringTok{"No Match"}
\CommentTok{\textquotesingle{}No Match\textquotesingle{}}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}} \BuiltInTok{print}\NormalTok{(}\StringTok{\textquotesingle{}}\CharTok{\textbackslash{}n}\StringTok{\textquotesingle{}}\NormalTok{.join(wrap(unknown\_nucleotide, }\DecValTok{60}\NormalTok{)))}
\NormalTok{CCCTGAATAATCAAGGTCACAGACCAGTTAGAATGGTTTAGTGTGGAAAGCGGGAAACGA}
\NormalTok{AAAGCCTCN̅CTGAATCCTGCGCACCGAGATTCTCCCAAGGCAAGGCGAGGGGCTGTATTG}
\NormalTok{CAGGGTTCAACTGCAGCGTCGCAACTCAAATGCAGCATTCCTAATGCACACATGACACCC}
\NormalTok{AAAATATAACAGACATATTACTCATGGAGGGTGAGGGTGAGGGTGAGGGTTAGGGTTAGG}
\NormalTok{GTTTAGGGTTAGGGTTAGGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGG}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ re.search(pat, unknown\_nucleotide) }\KeywordTok{or} \StringTok{"No Match"}
\CommentTok{\textquotesingle{}No Match\textquotesingle{}}
\end{Highlighting}
\end{Shaded}
In the one mismatch, the first several, but not all trailing bases are
valid telomeres. In the second mismatch, the `N' symbol is used. Both of
these are valid FASTA encoding, but not the sequences specified for
puzzle.
Before you turn the page\ldots{}
\textbf{Remember the central dogma of molecular biology.}
\includegraphics{images/Elegant-Flourish-Frame-Extrapolated-19.svg}
\newpage
There are a few key aspects to keep in mind in designing your regular
expression. You want to make sure that your pattern begins at the start
of the candidate sequence. Otherwise, you could easily match only a
valid tail of it.
From there, any sequence of `C', `A', `T', and `G' symbols is permitted.
However, you definitely want to be non-greedy in matching them since no
telomeres should be included. The telomere may be repeated any number of
times, but at least three. Optionally, repeated telomeres can be
required to continue until the end of the candidate sequence, so we must
match \texttt{\$} \emph{inside} the lookahead pattern.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{pat }\OperatorTok{=} \VerbatimStringTok{r\textquotesingle{}\^{}([CATG]+?)(?=(TTAGGG)\{3,\}$)\textquotesingle{}}
\end{Highlighting}
\end{Shaded}
\hypertarget{pitfalls-and-sand-in-the-gears}{%
\chapter{Pitfalls and Sand in the
Gears}\label{pitfalls-and-sand-in-the-gears}}
As compact and expressive as regular expressions can be, there are times
when they simply go disastrously wrong. Be careful to avoid pitfalls,
and at least to understand and identify where such difficulties arise.
\newpage
\hypertarget{catastrophic-backtracking}{%
\section{Catastrophic Backtracking}\label{catastrophic-backtracking}}
In this puzzle, we imagine a certain message protocol (as we do in many
of the other puzzles). We have n message alphabet that consists of the
following symbols:
\begin{longtable}[]{@{}lll@{}}
\toprule
Codepoint & Name & Appearance\tabularnewline
\midrule
\endhead
U+25A0 & Black Square & ■\tabularnewline
U+25AA & Black Small Square & ▪\tabularnewline
U+25CB & White Circle & ○\tabularnewline
U+25C9 & Fisheye & ◉\tabularnewline
U+25A1 & White Square & □\tabularnewline
U+25AB & White Small Square & ▫\tabularnewline
U+25B2 & Black Up Triangle & ▲\tabularnewline
U+25CF & Black Circle & ●\tabularnewline
U+2404 & End Transmition & \texttt{!} (herein)\tabularnewline
\bottomrule
\end{longtable}
These geometric characters are attractive and are chosen to avoid
thinking of matches in terms of natural language words that some other
puzzles utilize. However, feel free in solving it to substitute letters
or numerals, which are probably easier to type in your shell. As long as
the correspondences are one-to-one, it doesn't matter what symbols are
used.
A message in this protocol consists of alternating blocks, belonging to
either ``type 1'' or ``type 2.'' Each block has at least one symbol in
it, but type 1 can have any of black square, black up triangle, white
circle, fisheye, or white square, in any number and order of each. Type
2 blocks, in contrast, may have white small square, white square, black
small square, black circle, or black up triangle, in the same way.
Optionally, a space may separate blocks, but it is not required.
The ``end of transmission'' character indicates the end of a message. An
``obvious'' pattern to describe a valid message apparently matches
appropriately. Some examples are shown below:
\begin{verbatim}
Regex: (^(([■▲○◉□]+) ?([▫□▪●▲]+) ?)+)!
Structure 1/2/1/2 | Message '■▲◉▫■▪▫!' is Valid
Structure 1 2 1 2 | Message '■▲◉ ▫ ■ ▪▫!' is Valid
Missing terminator | Message '■▲◉▫■▪▫' is Invalid
Structure 1 1 2 1 | Message '▲▲▲ ■◉■ ▫▫● ◉○○!' is Invalid
\end{verbatim}
The regex pattern shown actually \emph{is} correct in a mathematical
sense. However, it can also become unworkably slow when checking some
messages. For example:
\begin{verbatim}
Quick match |
'■▲○◉□▫□▪●◉◉▫▪▪●●□□▲▲○○◉■◉■▲▲□□◉▲!' is Valid
| Checked in 0.00 seconds
Quick failure |
'■▲○◉■▲▫▪●●■◉■▲▲◉◉◉■□□□▫▫▪●●●▫■◉■!' is Invalid
| Checked in 0.00 seconds
Failure | '▲□□▲▲□□▲▲▲□□□□□□□□▲▲□▲□▲□▲X' is Invalid
| Checked in 4.42 seconds
Slow failure | '▲□□▲▲▲□□▲▲▲□□□□□□□□▲▲□▲□▲□▲X' is Invalid
| Checked in 8.62 seconds
Exponential | '▲▲▲▲▲▲□□▲▲▲□□□□□□□□▲▲□▲□▲□▲▲X' is Invalid
| Checked in 17.59 seconds
One more symbol | '▲▲▲▲□▲□□▲▲□▲□□□□□□□□▲▲□▲□▲□▲▲' is Invalid
| Checked in 31.53 seconds
\end{verbatim}
Why does this happen? Both the valid and the first invalid pattern timed
are longer than those that detect mismatches slowly. You can see that
the time to determine the last four messages are invalid approximately
doubles with each additional character.
Before you look at the explanation, both determine why this occurs and
come up with a solution using an alternate regular expression that still
validates the message format. Your solution should take a small fraction
of a second in all cases (even messages that are thousands of symbols
long).
Note that as in other puzzles that use special characters for visual
presentation, you may find experimenting easier if you substitute
letters or numerals that are easy to type for the symbols used here. It
doesn't change the nature of the puzzle at all; it merely might make it
easier to use your keyboard.
Before you turn the page\ldots{}
\textbf{Try hard to avoid catastrophes.}
\includegraphics{images/Elegant-Flourish-Frame-Extrapolated-19.svg}
\newpage
The reason why the slow-failing messages fail is somewhat obvious to
human eyes. None of them end with the ``end-of-transmission'' character.
As you can see, whether they end with an entirely invalid symbol
\texttt{X}, or simply with a valid symbol and no terminator, is not
significant.
You may want to think about why the quick-failing message also fails.
Pause for a moment.
But then notice that the final symbol in that message is ``black
square'' which can only occur in type 1 blocks; in the specification, a
type 2 block must always come immediatey before the end-of-transmission
terminator. Nonetheless, the regular expression engine figures that out
in (significantly) less than 1/100th of a second.
What you need to notice is that the symbol set overlaps between type 1
blocks and type 2 blocks. Therefore, it is ambiguous whether a given
symbol belongs to a given block or the next block. If we simply look for
a match, \emph{one possible} match is found quickly, when it exists. For
example, this message that has only the ambiguous ``white square'' and
``black up triangle'' validates immediately:
\begin{verbatim}
Ambiguous quick | '▲▲▲▲□▲□□▲▲□▲□□□□□□□□▲▲□▲□▲□▲▲!' is Valid
| Checked in 0.00 seconds
\end{verbatim}
However we do not know how many blocks of type 1 and how many of type 2
were created in the match (pedantically, I know enough about the
internals of the regular expression engine to know that answer, but we
are not guaranteed by the API; it could be different in a later version
of the library without breaking compatibility).
Regular expressions are not smart enough to look ahead to the final
symbol to make sure it is a terminator, unless we tell it to do so. The
produced answer is still \emph{eventually} correct, but it is not as
efficient as we would like.
The engine tries every possible permutation of ``some symbols in this
block, some in that''---which is of exponential complexity on the length
of the message---before it finally decides that none match.
Let's solve that by doing a little extra work for the engine.
Specifically, before we try to identify alternating type 1 and type 2
blocks, let's just make sure that the entire message is made up of valid
symbols that end with the terminator symbol. That check will complete
almost instantly, and will eliminate very many (but not all) ways of
encountering catastrophic backtracking.
\begin{verbatim}
Regex: (^(?=^[■▲○◉□▫▪● ]+!)(([■▲○◉□]+) ?([▫□▪●▲]+) ?)+)!
Structure 1/2/1/2 | Message '■▲◉▫■▪▫!' is Valid
Structure 1 2 1 2 | Message '■▲◉ ▫ ■ ▪▫!' is Valid
Missing terminator | Message '■▲◉▫■▪▫' is Invalid
Structure 1 1 2 1 | Message '▲▲▲ ■■■ ▫▫▫ ○○○!' is Invalid
Quick match |
'■▲○◉□▫□▪●◉◉▫▪▪●●□□▲▲○○◉■◉■▲▲□□◉▲!' is Valid
| Checked in 0.00 seconds
Quick failure |
'■▲○◉■▲▫▪●●■◉■▲▲◉◉◉■□□□▫▫▪●●●▫■◉■!' is Invalid
| Checked in 0.00 seconds
Failure | '▲□□▲▲□□▲▲▲□□□□□□□□▲▲□▲□▲□▲X' is Invalid
| Checked in 0.00 seconds
Slow failure | '▲□□▲▲▲□□▲▲▲□□□□□□□□▲▲□▲□▲□▲X' is Invalid
| Checked in 0.00 seconds
Exponential | '▲▲▲▲▲▲□□▲▲▲□□□□□□□□▲▲□▲□▲□▲▲X' is Invalid
| Checked in 0.00 seconds
One more symbol | '▲▲▲▲□▲□□▲▲□▲□□□□□□□□▲▲□▲□▲□▲▲' is Invalid
| Checked in 0.00 seconds
Ambiguous quick | '▲▲▲▲□▲□□▲▲□▲□□□□□□□□▲▲□▲□▲□▲▲!' is Valid
| Checked in 0.00 seconds
\end{verbatim}
\newpage
\hypertarget{playing-dominoes}{%
\section{Playing Dominoes}\label{playing-dominoes}}
Dominoes is an old family of games dating at least from the Yuan Dynasty
(around 1300 CE). The game is played with tiles on which each half of
one side is marked, generally with a number of dots corresponding to a
number. Specific games vary in their rules, but most require matching
the symbol or number on half of a tile with the corresponding symbol on
another tile.
There are, in fact, Unicode characters for all the domino tiles that
have zero to six dots on each half. We will come back to those
characters in the next puzzle. As a reminder, some of those Unicode
characters are listed in this table.
\includegraphics{images/Dominoes-examples.png}
The actual codepoints are hard to enter, and hard to see unless they are
displayed at a large font size (as here). But to illustrate the ``game''
our regex will play, we can show examples of, first, a valid/winning
pattern:
\includegraphics{images/Dominoes-good.png}
And second, an invalid/losing pattern:
\includegraphics{images/Dominoes-bad.png}
\newpage
In this game, tiles are placed in linear order, and two may occur
adjacently only if they have the same number of dots where they
``touch.'' Unlike with physical tiles, these symbols may not be turned
around, but maintain the same left-right order.
Because of the display and entry problems mentioned, we play an
alternative version of this game in which ``tiles'' are spelled as ASCII
characters. For example, the winning and losing patterns shown as
Unicode characters are as follows in their ASCII versions:
\begin{verbatim}
# Winning
{1:3}{3:3}{3:6}{6:1}{1:3}{3:3}{3:3}
# Losing
{1:3}{3:3}{6:1}{1:3}{3:3}{3:6}{3:3}
\end{verbatim}
Plays may be of any length. Infinitely many tiles, with ends having the
numbers 1-6 in every combination, are available. Write a regular
expression that distinguishes every winning play from a losing play.
Note that any character sequence that doesn't define a series of one or
more tiles is trivially losing.
Before you turn the page\ldots{}
\textbf{You might do this more efficiently than your first thought.}
\includegraphics{images/Elegant-Flourish-Frame-Extrapolated-19.svg}
\newpage
Because of our ASCII encoding we have a shortcut available for the
regular expression that can judge whether a play is winning. This would
not be available with the icon characters for the domino tiles.
The same digit must occur at the end of one tile, and again at the start
of the next tile. Therefore, we can shortcut specifically matching '3's
with '3's and '5's with '5's. Instead, we can just use a lookahead to
match a backreference group.
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Mismatched ends in bad, malformed syntax in awful}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ good }\OperatorTok{=} \StringTok{\textquotesingle{}}\SpecialCharTok{\{1:3\}\{3:3\}\{3:6\}\{6:1\}\{1:3\}\{3:3\}\{3:3\}}\StringTok{\textquotesingle{}}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ bad }\OperatorTok{=} \StringTok{\textquotesingle{}}\SpecialCharTok{\{1:3\}\{3:3\}\{6:1\}\{1:3\}\{3:3\}\{3:6\}\{3:3\}}\StringTok{\textquotesingle{}}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ awful }\OperatorTok{=} \StringTok{\textquotesingle{}}\SpecialCharTok{\{1:3\}\{\{}\StringTok{3:5}\SpecialCharTok{\}\}\{5:2\}}\StringTok{\textquotesingle{}}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}}\NormalTok{ pat }\OperatorTok{=} \VerbatimStringTok{r\textquotesingle{}\^{}((\{[1{-}6]:([1{-}6])\})(?=$|\{\textbackslash{}3))+$\textquotesingle{}}
\OperatorTok{\textgreater{}\textgreater{}\textgreater{}} \ControlFlowTok{for}\NormalTok{ play }\KeywordTok{in}\NormalTok{ (good, bad, awful):}
\NormalTok{... match }\OperatorTok{=}\NormalTok{ re.search(pat, play)}
\NormalTok{... }\ControlFlowTok{if}\NormalTok{ match:}
\NormalTok{... }\BuiltInTok{print}\NormalTok{(match.group(), }\StringTok{"wins!"}\NormalTok{)}
\NormalTok{... }\ControlFlowTok{else}\NormalTok{:}
\NormalTok{... }\BuiltInTok{print}\NormalTok{(play, }\StringTok{"loses!"}\NormalTok{)}
\NormalTok{...}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
{1:3}{3:3}{3:6}{6:1}{1:3}{3:3}{3:3} wins!
{1:3}{3:3}{6:1}{1:3}{3:3}{3:6}{3:3} loses!
{1:3}{{3:5}}{5:2} loses!
\end{verbatim}
\begin{figure}
\centering
\includegraphics{images/johnny-automatic-right-hand.svg}
\caption{johnny-automatic-right-hand}
\end{figure}
\newpage
\hypertarget{advanced-dominoes}{%
\section{Advanced Dominoes}\label{advanced-dominoes}}
As the last puzzle showed, there are Unicode characters for domino
tiles. In the last puzzle, we played a game of evaluating whether a
particular sequence of ``tiles''---represented by ASCII sequences---was
winning plays. However, in that last puzzle, we took a shortcut by
taking advantage of the internal structure of the ASCII representation.
It is not too hard to match domino tiles as their Unicode characters.
For example, this pattern matches any linear sequence of (horizontal)
tiles:
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{pat }\OperatorTok{=}\NormalTok{ (}\VerbatimStringTok{r\textquotesingle{}[\textbackslash{}N\{Domino Tile Horizontal{-}00{-}00\}{-}\textquotesingle{}}
\StringTok{\textquotesingle{}}\CharTok{\textbackslash{}N\{Domino Tile Horizontal{-}06{-}06\}}\StringTok{]+)\textquotesingle{}}
\end{Highlighting}
\end{Shaded}
Most of those sequences will not be winning plays, of course. Recall the
examples of winning and losing plays from the prior lesson:
Winning:
\includegraphics{images/Dominoes-good.png}
Losing:
\includegraphics{images/Dominoes-bad.png}
For this game we will simplify in two ways. First, rather than use
hard-to-enter and hard-to-see tile icons, we will use ASCII characters.
In fact, if we only want the tiles with numbers from 1-6 on their ends,
that gives us exactly 36 of them. Conveniently, that happens to be the
same number of symbols as there are numerals plus capital letters (in
English).
However, this puzzle is simplified further by only utilizing four of the
36 possible tiles. Each of those is given the following ASCII
representation. The letters are not mnemonic, but at least they are easy
to type.
\newpage
\begin{longtable}[]{@{}llc@{}}
\toprule
Codepoint & Name & Substitute\tabularnewline
\midrule
\endhead
U+1F03B & Domino Tile Horizontal-01-03 & A\tabularnewline
U+1F049 & Domino Tile Horizontal-03-03 & B\tabularnewline
U+1F04C & Domino Tile Horizontal-03-06 & C\tabularnewline
U+1F05C & Domino Tile Horizontal-06-01 & D\tabularnewline
\bottomrule