Support metachars in character sets, including 'negated' short-hands like \W #14

anentropic · 2020-05-10T20:40:42Z

fixes #4
fixes #13

I needed this fixed so I had a go at it...

First thing to note: right at the end, having got everything else working, I found problems with Python 2.7. The issue described at https://stackoverflow.com/a/28783870/202168 means we can't properly define the inverted character ranges that this PR relies on to support a set like [^\W_\-].

I haven't looked too deeply into workarounds or feature switches that could make it compatible - currently in this PR I've just dropped Python 2.7 support. I appreciate you may not want to do that, but at the moment I'm not sure how hard it would be to get some 2.7 support back.

I have borrowed your Hypothesis tests from fix-sets branch...

I opted to do the regex transformations in the AST domain (i.e. using parsed output from sre_parse.parse) which I think made things a lot easier. After modifying the AST we can use sre_compile.compile to get a compiled regex object, as with re.compile.

One unexpected problem with this approach was revealed when running the Hypothesis test: their from_regex strategy relies on the .pattern attr of the compiled regex to get the regex pattern as a string. But sometimes sre_compile fails to set one:

In [79]: sre_parse.parse(".")
Out[79]: [(ANY, None)]

In [80]: parsed = sre_parse.parse(".")

In [82]: c = sre_compile.compile(parsed)

In [83]: c.pattern

i.e. it's None. Also I think it is set once on parse and not recalculated, so even when it's present the pattern attr would not reflect any of our AST modifications.

One option would have been to write a custom un-parser to stringify the parsed AST. But looking at the test, it did this:

    jr = js_regex.compile(pattern)
    for _ in range(3):
        value = data.draw(st.from_regex(jr))
        assert jr.search(value)

This is a fairly weak property: it only really tests that we compiled a regex that Hypothesis' from_regex can read, and that from_regex works properly. It doesn't test any particular property of our transformed regex.

So, what I've done is a bit heavy-handed, but using PyMiniRacer ("Minimal, modern embedded V8 for Python") and randexp.js (an equivalent of Hypothesis from_regex in JS) we have access to an 'oracle' in our Hypothesis test that can check our transform is valid. Surprisingly it's not unusably slow (!) and this dependency is only needed for tests.

Doing this meant that we need to reproduce JS behaviour accurately (for example, the exact Unicode chars belonging to \s in JS), otherwise randexp.js will generate examples we can't match.
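As a rough illustration of why this matters, the sketch below compares Python's own `\s` with a hand-written set of the code points that, as far as I understand the ECMAScript spec, JS `\s` matches (WhiteSpace plus LineTerminator). The `JS_SPACE_SET` string and the `whitespace_differences` helper are assumptions made for this example, not the definitions used in this PR.

```python
import re

# Assumption: code points matched by \s in JS, written out as an explicit set.
JS_SPACE_SET = (
    r"[\t\n\x0b\x0c\r \xa0\u1680\u2000-\u200a"
    r"\u2028\u2029\u202f\u205f\u3000\ufeff]"
)

js_space = re.compile(JS_SPACE_SET)
py_space = re.compile(r"\s")


def whitespace_differences(limit=0x10000):
    """Return (only_python, only_js) lists of code points below `limit`."""
    only_python, only_js = [], []
    for cp in range(limit):
        ch = chr(cp)
        in_py = py_space.fullmatch(ch) is not None
        in_js = js_space.fullmatch(ch) is not None
        if in_py and not in_js:
            only_python.append(cp)
        elif in_js and not in_py:
            only_js.append(cp)
    return only_python, only_js


only_python, only_js = whitespace_differences()
print([hex(cp) for cp in only_python])
print([hex(cp) for cp in only_js])
```

Any code point in either list is a case where reusing Python's `\s` unchanged would disagree with the JS oracle, e.g. U+FEFF is whitespace for JS but not for Python.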

Anyway, let me know what you think. I'm happy to make any changes you want, if the Python 2.7 thing is not a show-stopper and the general approach is ok.

anentropic added 10 commits April 29, 2020 18:55

working impl modfying in sre_parse ast domain, BUT weirdly breaks tests checking default re.parse behavior!

bf1ceb3

fix for weird interaction with normal re

5c24d81

working with hypothesis test using randexp.js as an oracle

8015dc8

implement the negated char classes \W \D \S without using negation

48f92ec

add py-mini-racer to test deps

98618d4

update comments

4f83053

linting

e457236

fixes for mypy check

f06e25b

install new deps for py2.7 tests

b21bed7

remove python 2.7 support

8623d96
