Skip to content
This repository has been archived by the owner on Aug 26, 2020. It is now read-only.

Support metachars in character sets, including 'negated' short-hands like \W #14

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

anentropic
Copy link

fixes #4
fixes #13

I needed this fixed so I had a go at it...

First thing to note: right at the end having got everything else working I found problems with Python 2.7 (this https://stackoverflow.com/a/28783870/202168 means we can't properly define the inverted character ranges that allowed supporting a set like [^\W_\-] with this PR).

I haven't looked too deeply into workarounds or feature switches that could make it compatible - currently in this PR I've just dropped Python 2.7 support. I appreciate you may not want to do that, but at the moment I'm not sure how hard it would be to get some 2.7 support back.

I have borrowed your Hypothesis tests from fix-sets branch...

I opted to do the regex transformations in the AST domain (i.e. using parsed output from sre_parse.parse) which I think made things a lot easier. After modifying the AST we can use sre_compile.compile to get a compiled regex object, as with re.compile.

One unexpected problem with this approach revealed when running the Hypothesis test... their from_regex strategy relies on looking at the .pattern attr of the compiled regex to get the regex pattern as a string. But sometimes sre_compile fails to set one:

In [79]: sre_parse.parse(".")
Out[79]: [(ANY, None)]

In [80]: parsed = sre_parse.parse(".")

In [82]: c = sre_compile.compile(parsed)

In [83]: c.pattern

i.e. it's None. Also I think it is set once on parse and not recalculated, so even when it's present the pattern attr would not reflect any of our AST modifications.

One option would have been to write a custom un-parser to stringify the parsed AST. But looking at the test, it did this:

    jr = js_regex.compile(pattern)
    for _ in range(3):
        value = data.draw(st.from_regex(jr))
        assert jr.search(value)

This is a fairly weak property, it only really tests that we compiled a regex that Hypothesis' from_regex can read, and that from_regex works properly, but no particular properties of our transformed regex.

So, what I've done is a bit heavy-handed, but using PyMiniRacer ("Minimal, modern embedded V8 for Python") and randexp.js (an equivalent of Hypothesis from_regex in JS) we have access to an 'oracle' in our Hypothesis test that can check our transform is valid. Surprisingly it's not unusably slow (!) and this dependency is only needed for tests.

Doing this meant that we need to accurately reproduce e.g. exact unicode chars belonging to \s in JS otherwise randexp.js will generate examples we can't match.

Anyway, let me know what you think. I'm happy to make any changes you want, if the Python 2.7 thing is not a show-stopper and the general approach is ok.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Question re special handling for BELL \a Wrong handling for metacharacters in sets, e.g. [\w0-9]
1 participant