This repository has been archived by the owner on Aug 26, 2020. It is now read-only.
Support metachars in character sets, including 'negated' short-hands like \W #14
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
fixes #4
fixes #13
I needed this fixed so I had a go at it...
First thing to note: right at the end having got everything else working I found problems with Python 2.7 (this https://stackoverflow.com/a/28783870/202168 means we can't properly define the inverted character ranges that allowed supporting a set like
[^\W_\-]
with this PR).I haven't looked too deeply into workarounds or feature switches that could make it compatible - currently in this PR I've just dropped Python 2.7 support. I appreciate you may not want to do that, but at the moment I'm not sure how hard it would be to get some 2.7 support back.
I have borrowed your Hypothesis tests from
fix-sets
branch...I opted to do the regex transformations in the AST domain (i.e. using parsed output from
sre_parse.parse
) which I think made things a lot easier. After modifying the AST we can usesre_compile.compile
to get a compiled regex object, as withre.compile
.One unexpected problem with this approach revealed when running the Hypothesis test... their
from_regex
strategy relies on looking at the.pattern
attr of the compiled regex to get the regex pattern as a string. But sometimessre_compile
fails to set one:i.e. it's
None
. Also I think it is set once on parse and not recalculated, so even when it's present thepattern
attr would not reflect any of our AST modifications.One option would have been to write a custom un-parser to stringify the parsed AST. But looking at the test, it did this:
This is a fairly weak property, it only really tests that we compiled a regex that Hypothesis'
from_regex
can read, and thatfrom_regex
works properly, but no particular properties of our transformed regex.So, what I've done is a bit heavy-handed, but using PyMiniRacer ("Minimal, modern embedded V8 for Python") and randexp.js (an equivalent of Hypothesis
from_regex
in JS) we have access to an 'oracle' in our Hypothesis test that can check our transform is valid. Surprisingly it's not unusably slow (!) and this dependency is only needed for tests.Doing this meant that we need to accurately reproduce e.g. exact unicode chars belonging to
\s
in JS otherwise randexp.js will generate examples we can't match.Anyway, let me know what you think. I'm happy to make any changes you want, if the Python 2.7 thing is not a show-stopper and the general approach is ok.