Negative lookaround assertions sometimes leak capture groups #89702

Closed
jirkamarsik mannequin opened this issue Oct 20, 2021 · 3 comments
Labels
3.9 only security fixes 3.10 only security fixes topic-regex type-bug An unexpected behavior, bug, or error

Comments

jirkamarsik mannequin commented Oct 20, 2021

BPO 45539
Nosy @ezio-melotti, @jirkamarsik

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2021-10-20.15:52:58.840>
labels = ['expert-regex', 'type-bug', '3.9', '3.10']
title = 'Negative lookaround assertions sometimes leak capture groups'
updated_at = <Date 2021-10-21.16:19:16.835>
user = 'https://github.com/jirkamarsik'

bugs.python.org fields:

activity = <Date 2021-10-21.16:19:16.835>
actor = 'mrabarnett'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Regular Expressions']
creation = <Date 2021-10-20.15:52:58.840>
creator = 'jirkamarsik'
dependencies = []
files = []
hgrepos = []
issue_num = 45539
keywords = []
message_count = 2.0
messages = ['404479', '404615']
nosy_count = 3.0
nosy_names = ['ezio.melotti', 'mrabarnett', 'jirkamarsik']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue45539'
versions = ['Python 3.9', 'Python 3.10']

jirkamarsik mannequin commented Oct 20, 2021

When you have capture groups inside a negative lookaround assertion, the strings captured by those capture groups can sometimes survive the failure of the assertion and feature in the returned Match object.

Here it is illustrated with lookbehinds and lookaheads:

>>> re.search(r"(?<!(a)c)de", "abde").group(1)
'a'
>>> re.search(r"(?!(a)c)ab", "ab").group(1)
'a'

Even though the search for the expression '(a)c' fails when trying to match 'c', the string 'a' is still reported as having been successfully matched by capture group 1. The expected behavior would be for the capture group 1 to not have a match.
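A minimal sketch for checking whether a given interpreter shows the leak described above. The helper name `group_after_negative_lookaround` is hypothetical and not part of `re`; the patterns and inputs are the two from this report. The result depends on the Python version in use, so no fixed output is claimed.

```python
import re

def group_after_negative_lookaround(pattern, text):
    """Return group 1 of the first match (None if the group is unset)."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError("pattern unexpectedly failed to match")
    return match.group(1)

# The two cases from the report: group 1 sits inside a negative lookaround.
lookbehind = group_after_negative_lookaround(r"(?<!(a)c)de", "abde")
lookahead = group_after_negative_lookaround(r"(?!(a)c)ab", "ab")

# True when this interpreter still reports the capture from the failed assertion.
leaks = lookbehind is not None or lookahead is not None
print("lookbehind:", lookbehind, "lookahead:", lookahead, "leaks:", leaks)
```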

I believe this behavior is unintentional and results from Python not cleaning up after the asserted subexpression fails (cleanup that could be done, for example, by running the asserted subexpression in a new stack frame). My reasons are as follows.

  1. This behavior is not being systematically enforced.
    We can observe this behavior only in certain cases. Modifying the expression to use the branching operator | inside the asserted subexpression leads to the expected behavior.
>>> re.search(r"(?<!(a)c|(a)d)de", "abde").group(1) is None
True
>>> re.search(r"(?!(a)c|(a)d)ab", "ab").group(1) is None
True
  2. Other languages do not leak capture groups from negative lookarounds.

    Node.js (ECMAScript):

/(?<!(a)c)de/.exec("abde")[1]
undefined
/(?!(a)c)ab/.exec("ab")[1]
undefined
/(?<!(a)c|(a)d)de/.exec("abde")[1]
undefined
/(?!(a)c|(a)d)ab/.exec("ab")[1]
undefined

MRI (Ruby):

irb(main):001:0> /(?<!(a)c)de/.match("abde")[1]
<unsupported>
irb(main):002:0> /(?!(a)c)ab/.match("ab")[1]
=> #<MatchData "ab" 1:nil>
irb(main):003:0> /(?<!(a)c|(a)d)de/.match("abde")[1]
<unsupported>
irb(main):004:0> /(?!(a)c|(a)d)ab/.match("ab")[1]
=> #<MatchData "ab" 1:nil 2:nil>

JShell (Java):

jshell> Matcher m = java.util.regex.Pattern.compile("(?<!(a)c)de").matcher("abde")
jshell> m.find()
jshell> m.group(1)
$3 ==> null
jshell> Matcher m = java.util.regex.Pattern.compile("(?<!(a)c|(a)d)de").matcher("abde")
jshell> m.find()
jshell> m.group(1)
$6 ==> null
jshell> Matcher m = java.util.regex.Pattern.compile("(?!(a)c)ab").matcher("ab")
m ==> java.util.regex.Matcher[pattern=(?!(a)c)ab region=0,2 lastmatch=]
jshell> m.find()
jshell> m.group(1)
$9 ==> null
jshell> Matcher m = java.util.regex.Pattern.compile("(?!(a)c|(a)d)ab").matcher("ab")
m ==> java.util.regex.Matcher[pattern=(?!(a)c|(a)d)ab region=0,2 lastmatch=]
jshell> m.find()
jshell> m.group(1)
$12 ==> null

  3. Not leaking capture groups from negative lookarounds is symmetric to how capture groups are treated in failed matches.
    When regular expression engines fail to match a regular expression, they do not provide a partial match object that contains the state of capture groups at the time when the matcher failed. Instead, the state of the matcher is discarded and some bottom value is returned (None, null or undefined). Similarly, one would expect nested subexpressions to behave the same way, so that capture groups from failed match attempts are discarded.
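The points in the list above can be sketched in Python. A failed top-level search returns `None`, a group from an abandoned alternative is reported as unset, and the `|` variants from item 1 report group 1 as unset. The pattern `(a)c|(a)b` is an illustration added here and does not appear in the original report.

```python
import re

# A failed top-level search yields None, not a partial Match exposing "a".
failed = re.search(r"(a)c", "ab")

# Within a successful match, the group from the abandoned first alternative
# is unset; only the alternative that matched keeps its capture.
branch_groups = re.search(r"(a)c|(a)b", "ab").groups()

# With "|" inside the negative lookaround, group 1 is reported as unset.
lookbehind_branch = re.search(r"(?<!(a)c|(a)d)de", "abde").group(1)
lookahead_branch = re.search(r"(?!(a)c|(a)d)ab", "ab").group(1)

print(failed, branch_groups, lookbehind_branch, lookahead_branch)
```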

@jirkamarsik jirkamarsik mannequin added 3.9 only security fixes topic-regex type-bug An unexpected behavior, bug, or error labels Oct 20, 2021
mrabarnett mannequin commented Oct 21, 2021

It's definitely a bug.

In order for the pattern to match, the negative lookaround must match, which means that its subexpression mustn't match, so none of the groups in that subexpression have captured.

@mrabarnett mrabarnett mannequin added 3.10 only security fixes labels Oct 21, 2021
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
ghost commented Apr 11, 2022

This bug was fixed in 356997c.
The fix will not be backported to the 3.10 branch; only the 3.11+ branches are fixed.

Python 3.10.4

>>> re.search(r"(?<!(a)c)de", "abde").groups()
('a',)
>>> re.search(r"(?!(a)c)ab", "ab").groups()
('a',)

Python 3.11 a7+

>>> re.search(r"(?<!(a)c)de", "abde").groups()
(None,)
>>> re.search(r"(?!(a)c)ab", "ab").groups()
(None,)
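For code that has to behave identically on 3.10 and 3.11+, one possible approach (not proposed in this thread) is to avoid capture groups inside negative lookarounds by using non-capturing groups. The sketch below assumes the text captured inside the lookaround is never needed by the caller.

```python
import re

# Original form: group 1 lives inside the negative lookbehind and may be
# reported as 'a' on 3.10 and earlier.
capturing = re.compile(r"(?<!(a)c)de")

# Assumed workaround: (?:...) groups without capturing, so there is no
# group whose value could differ between versions.
non_capturing = re.compile(r"(?<!(?:a)c)de")

m = non_capturing.search("abde")
print(m.group(0), m.groups())
```

Both patterns match the same text; they differ only in how many groups the Match object exposes.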
