Make regexp/no-dupe-disjunctions account for nested alternatives #404
Thank you for your wonderful work!!
I made two comments.
Co-authored-by: Yosuke Ota <otameshiyo23@gmail.com>
Sorry for the inconvenience. I just found a bug in |
Again, sorry for the inconvenience, @ota-meshi. I think I'm done. Could you please review?
LGTM!
This fixes #402.
To implement this, I needed to introduce 2 new concepts to the code base:
Nested alternatives. E.g. if `a(c|b)[de]` is the root alternative, then `b`, `c`, `d`, and `e` are nested alternatives.
Partial NFAs. E.g. for the root alternative `(a|b)(c|d|e)f` and nested alternative `d`, the language of the partial NFA will be `(a|b)df`.
.These 2 concepts are useful because they allow us to ask: Is there some part of an alternative that is a subset?
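To make the two concepts concrete, here is a minimal sketch in TypeScript. It is not the code from this PR: the `Elem` type and the `print`, `ch`, and `nestedAlternatives` helpers are hypothetical names on a toy AST, and it handles only one level of nesting. For each nested alternative it builds the pattern whose language a partial NFA would accept, by substituting that alternative for its enclosing group or character class.

```typescript
// Hypothetical miniature AST for illustration only; these are not the
// plugin's real types.
type Elem =
  | { kind: "char"; value: string }
  | { kind: "group"; alternatives: Elem[][] } // (x|y)
  | { kind: "class"; chars: string[] }; // [xy]

interface NestedAlternative {
  nested: string; // the nested alternative, printed as a pattern
  partial: string; // the root alternative restricted to that nested alternative
}

const ch = (value: string): Elem => ({ kind: "char", value });

// Print a sequence of elements back into regex syntax.
function print(elems: Elem[]): string {
  return elems
    .map((e) => {
      if (e.kind === "char") return e.value;
      if (e.kind === "group") {
        return "(" + e.alternatives.map((a) => print(a)).join("|") + ")";
      }
      return "[" + e.chars.join("") + "]";
    })
    .join("");
}

// List every directly nested alternative of a root alternative, together
// with the pattern obtained by replacing its parent with it.
function nestedAlternatives(root: Elem[]): NestedAlternative[] {
  const result: NestedAlternative[] = [];
  root.forEach((elem, i) => {
    let options: Elem[][];
    if (elem.kind === "group") {
      options = elem.alternatives;
    } else if (elem.kind === "class") {
      // Each character of a class counts as its own nested alternative.
      options = elem.chars.map((c) => [ch(c)]);
    } else {
      return; // plain characters contain no nested alternatives
    }
    for (const option of options) {
      const partial = [...root.slice(0, i), ...option, ...root.slice(i + 1)];
      result.push({ nested: print(option), partial: print(partial) });
    }
  });
  return result;
}

// Root alternative (a|b)(c|d|e)f from the example above.
const root: Elem[] = [
  { kind: "group", alternatives: [[ch("a")], [ch("b")]] },
  { kind: "group", alternatives: [[ch("c")], [ch("d")], [ch("e")]] },
  ch("f"),
];

const forD = nestedAlternatives(root).find((n) => n.nested === "d");
console.log(forD?.partial); // (a|b)df
```

Each `partial` pattern could then be compared against the other root alternatives to ask whether that part of the alternative is a subset or duplicate. The actual rule works on NFAs rather than on printed patterns.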
This is actually quite a bit more general than simple de-sugaring of character classes. This PR allows us to detect partial subsets and duplications within alternatives, and that includes character classes.
Side note: I expected this to massively impact performance. We are potentially creating a bunch more NFAs and DFAs than before, but (surprisingly) I couldn't measure any difference when I ran this PR against Prism's 2k regexes. (It did find 5 bugs though.)