Regex bug: Unicode hex ranges not supported #46137

ProvocaTeach · 2022-07-22T07:08:45Z

Trying to form Unicode hex ranges in a regular expression causes a LoadError:

julia> r"[\x{00A0}-\x{10FFFD}]"

yields

ERROR: LoadError: PCRE compilation error: range out of order in character class at offset 11
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compile(pattern::String, options::UInt32)
   @ Base.PCRE ./pcre.jl:155
 [3] compile(regex::Regex)
   @ Base ./regex.jl:82
 [4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
   @ Base ./regex.jl:47
 [5] Regex(pattern::String)
   @ Base ./regex.jl:70
 [6] var"@r_str"(__source__::LineNumberNode, __module__::Module, pattern::Any, flags::Vararg{Any})
   @ Base ./regex.jl:119
in expression starting at REPL[45]:1

The result should be a regex that matches all Unicode codepoints from U+00A0 to U+10FFFD.
Julia version: 1.7.3


fredrikekre · 2022-07-22T11:21:14Z

man pcre says the code points have to be valid Unicode, but that range contains a bunch of invalid ones:

julia> count(x -> !isvalid(Char(x)), 0x00A0:0x10FFFD)
2048

In any case, if this is indeed a bug it is a bug in PCRE2 and not Julia.

ProvocaTeach · 2022-07-24T10:46:46Z

Unfortunately, the bug applies to all Unicode ranges, not just ones with invalid characters. Even simply typing

julia> r"[\x{00A0}-\x{00A5}]"

throws a LoadError:

ERROR: LoadError: PCRE compilation error: range out of order in character class at offset 11
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compile(pattern::String, options::UInt32)
   @ Base.PCRE ./pcre.jl:155
 [3] compile(regex::Regex)
   @ Base ./regex.jl:82
 [4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
   @ Base ./regex.jl:47
 [5] Regex(pattern::String)
   @ Base ./regex.jl:70
 [6] var"@r_str"(__source__::LineNumberNode, __module__::Module, pattern::Any, flags::Vararg{Any})
   @ Base ./regex.jl:119
in expression starting at REPL[94]:1

Please reopen this issue?

fredrikekre · 2022-07-24T10:51:01Z

It is still an error from PCRE, but this one doesn't show up in other environments (e.g. https://regex101.com/) so perhaps there is some compile setting that is different.

inkydragon · 2022-07-31T14:28:20Z

Workaround: \N{U+XXXX}

The escape sequence \N{U+<hex digits>} is recognized as another way of specifying a Unicode character by code point in a UTF mode.
https://www.pcre.org/current/doc/html/pcre2unicode.html

julia> '和'
'和': Unicode U+548C (category Lo: Letter, other)

julia> '平'
'平': Unicode U+5E73 (category Lo: Letter, other)

julia> contains("和", r"[\N{U+548C}-\N{U+5E73}]")
true

julia> contains("平", r"[\N{U+548C}-\N{U+5E73}]")
true

julia> contains("aaa", r"[\N{U+548C}-\N{U+5E73}]")
false

julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
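
The same workaround should carry over to the smaller range from the original report (U+00A0 to U+00A5). A minimal sketch, assuming Julia's default r"..." options (UTF mode); the variable names are only illustrative:

```julia
# \N{U+XXXX} names a code point directly, so it avoids the \x{...} parsing issue.
nbsp_to_yen = r"[\N{U+00A0}-\N{U+00A5}]"

in_range  = occursin(nbsp_to_yen, "\u00A3")   # U+00A3 lies inside the range
out_range = occursin(nbsp_to_yen, "a")        # U+0061 lies outside the range
println((in_range, out_range))
```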

mgkuhn · 2022-08-03T11:26:25Z

SOLUTION: The command

julia> r"[\x{00A0}-\x{10FFFD}]"

is short for

julia> using Base.PCRE
julia> Regex(raw"[\x{00A0}-\x{10FFFD}]",
             PCRE.UTF | PCRE.MATCH_INVALID_UTF | PCRE.ALT_BSUX | PCRE.UCP,
             PCRE.NO_UTF_CHECK)
ERROR: PCRE compilation error: range out of order in character class at offset 11
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compile(pattern::String, options::UInt32)
   @ Base.PCRE ./pcre.jl:155
 [3] compile(regex::Regex)
   @ Base ./regex.jl:82
 [4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
   @ Base ./regex.jl:47
 [5] top-level scope
   @ REPL[4]:1

In the latter form, we can experiment with the compile and match option flags that are passed to the PCRE2 library to specify exactly which flavour of regular-expression behaviour we want.

Doing that, I quickly found that dropping the PCRE.ALT_BSUX compile option suppresses this compilation error:

julia> Regex(raw"[\x{00A0}-\x{10FFFD}]",
                    PCRE.UTF | PCRE.MATCH_INVALID_UTF | PCRE.UCP,
                    PCRE.NO_UTF_CHECK)
Regex("[\\x{00A0}-\\x{10FFFD}]",0x040a0000)
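
To check that a regex compiled this way also matches as intended, here is a small sketch. The helper name perl_style_regex is made up for illustration, and Base.PCRE is an internal module, so its constants and the three-argument Regex constructor may change between Julia versions:

```julia
using Base.PCRE

# Hypothetical helper: the same flags as in the call above (the r"..." defaults
# minus PCRE.ALT_BSUX), so that \x{...} keeps its Perl meaning.
perl_style_regex(pattern::AbstractString) =
    Regex(pattern, PCRE.UTF | PCRE.MATCH_INVALID_UTF | PCRE.UCP, PCRE.NO_UTF_CHECK)

re = perl_style_regex(raw"[\x{00A0}-\x{00A5}]")
hit  = occursin(re, "\u00A3")   # U+00A3 is inside the range
miss = occursin(re, "a")        # U+0061 is outside the range
println((hit, miss))
```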

Now it is time to actually read the PCRE2 documentation:

man pcre2
man pcre2api

There we find indeed the answer:

         PCRE2_ALT_BSUX

       This  option  request  alternative  handling of three escape sequences,
       which makes PCRE2's behaviour more like  ECMAscript  (aka  JavaScript).
       When it is set:

       (1) \U matches an upper case "U" character; by default \U causes a com‐
       pile time error (Perl uses \U to upper case subsequent characters).

       (2) \u matches a lower case "u" character unless it is followed by four
       hexadecimal  digits,  in  which case the hexadecimal number defines the
       code point to match. By default, \u causes a compile time  error  (Perl
       uses it to upper case the following character).

       (3)  \x matches a lower case "x" character unless it is followed by two
       hexadecimal digits, in which case the hexadecimal  number  defines  the
       code  point  to  match. By default, as in Perl, a hexadecimal number is
       always expected after \x, but it may have zero, one, or two digits (so,
       for example, \xz matches a binary zero character followed by z).

       ECMAscript 6 added additional functionality to \u. This can be accessed
       using the PCRE2_EXTRA_ALT_BSUX extra option  (see  "Extra  compile  op‐
       tions" below).  Note that this alternative escape handling applies only
       to patterns. Neither of these options affects  the  processing  of  re‐
       placement strings passed to pcre2_substitute().

In other words, Julia asks PCRE2 to implement a slightly more JavaScript-compatible version of regular expressions than the more Perl-compatible flavor it would have given us by default. The man page doesn't say so explicitly, but the way I read it, \x{xxxx} is not part of the ECMAscript syntax, so with this option \x{xxxx} is treated exactly like x{xxxx}. You therefore get the same error with

julia> r"[x{00A0}-x{10FFFD}]"
ERROR: LoadError: PCRE compilation error: range out of order in character class at offset 9

And it suddenly all makes sense, because }-x is indeed an out-of-order range.
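
One way to probe this reading is to test which characters such a class actually accepts. The sketch below assumes Julia's default r"..." options (with PCRE.ALT_BSUX set); the variable names are only illustrative:

```julia
# Under ALT_BSUX, \x not followed by two hex digits is a literal "x",
# so this class should contain the characters x, {, 4, 1 and }.
literal_class = r"^[\x{41}]$"
accepts_brace = occursin(literal_class, "{")
accepts_A     = occursin(literal_class, "A")   # U+0041, the Perl reading of \x{41}

# The forms that ALT_BSUX does document: \x with two hex digits, \u with four.
two_digit  = occursin(r"^[\xA0-\xA5]$", "\u00A3")
four_digit = occursin(r"^[\u00A0-\u00A5]$", "\u00A3")
println((accepts_brace, accepts_A, two_digit, four_digit))
```

For code points in the Basic Multilingual Plane, the \uXXXX form is therefore another way to write a range under the default options.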

I guess that the choice in favour of ECMAscript syntax for \u, \U and \x deserves to be examined, justified, and documented. (Ideally, I think the Julia manual should contain a self-contained reference of the regular-expression syntax supported.)

So this is clearly not a bug in the PCRE2 C library, but at least an omission in the Julia manual.

mgkuhn · 2022-08-03T13:06:54Z

Digging through the commit history of where the choice of JavaScript-compatible \x\u\U in Julia regular expressions via PCRE.ALT_BSUX came from:

afa1404 in Jan 2015 replaced PCRE compile option PCRE.JAVASCRIPT_COMPAT with PCRE2 option PCRE.ALT_BSUX while upgrading from PCRE to PCRE2, i.e. this seems to be just adjusting to the new API
7909e3d in Mar 2013 added PCRE.JAVASCRIPT_COMPAT to “fix r"\u2220" bug mentioned in make S"..." and "..." throw errors identically #107”

The latter commit was made by @nolta as a “band-aid”.

String literals, macro/raw string literals and the resulting differences in quote and backslash escaping clearly had a rather tortuous history in the evolution of Julia. Note that at no point in issue #107 is there any discussion about whether Julia's flavour of PCRE should be more like Perl or more like JavaScript. The choice of the JavaScript variant just happened to cause one error message in one example to disappear, if I understood that discussion correctly.

They wanted match(r"\u2200", "\u2200") to match, whereas in Perl-compatible regular-expression syntax it would have had to be match(r"\x{2200}", "\u2200") because in Perl RE, \u means “uppercase the next character”. Note that in this example, the first \u is interpreted by PCRE2, whereas the second is part of Julia's string literal syntax. They are not the same syntax, but just happen to overlap in this particular example, whereas e.g. a slight variant such as match(r"\U102200", "\U102200") does not match.
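
The overlap and the non-overlap described here can be checked directly. This is a sketch under Julia's default r"..." options (PCRE.ALT_BSUX set); the variable names are illustrative only:

```julia
# \u2200 in the regex is read by PCRE2 (ALT_BSUX: \u plus four hex digits);
# "\u2200" in the string literal is read by Julia. Both denote U+2200.
bmp_match = occursin(r"\u2200", "\u2200")

# \U has no code-point meaning under ALT_BSUX: it is a literal "U",
# so this pattern is the seven characters U102200.
astral_match  = occursin(r"\U102200", "\U102200")
literal_match = occursin(r"\U102200", "U102200")
println((bmp_match, astral_match, literal_match))
```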

fredrikekre closed this as completed Jul 22, 2022

fredrikekre reopened this Jul 24, 2022

inkydragon added the "external dependencies" label Jul 31, 2022

mgkuhn added the "domain:docs", "domain:unicode" and "domain:strings" labels and removed the "external dependencies" label Aug 3, 2022

mgkuhn mentioned this issue Aug 3, 2022

make S"..." and "..." throw errors identically #107

Closed
