Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<regex>: Successful negative lookahead assertions do not matche to capture groups #5269

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

alexprabhat99
Copy link
Contributor

Fixes: #5245

@alexprabhat99 alexprabhat99 requested a review from a team as a code owner February 4, 2025 19:02
@AlexGuteniev
Copy link
Contributor

I'd recommend running tests locally.
You can use regex to run all regex tests, this command works for me:
python tests\utils\stl-lit\stl-lit.py -Dnotags=ASAN ..\..\llvm-project\libcxx\test ..\..\tests\std ..\..\tests\tr1 --filter=(regex^|/re\.)

stl/inc/regex Outdated
@@ -3610,7 +3607,6 @@ bool _Matcher<_BidIt, _Elem, _RxTraits, _It>::_Match_pat(_Node_base* _Nx) { // c
_Node_end_group* _Node = static_cast<_Node_end_group*>(_Nx);
_Node_capture* _Node0 = static_cast<_Node_capture*>(_Node->_Back);
if (_Cap || _Node0->_Idx != 0) { // update capture data
_Tgt_state._Grp_valid[_Node0->_Idx] = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deletion does not just correct the matches for negative assertions. With this change, no capture group at all will ever be matched.

Changes to fix #5245 actually have to be applied here:

STL/stl/inc/regex

Lines 3575 to 3590 in fc15609

case _N_neg_assert:
case _N_assert:
{ // check assert
_It _Ch = _Tgt_state._Cur;
bool _Neg = _Nx->_Kind == _N_neg_assert;
_Bt_state_t<_It> _St = _Tgt_state;
if (_Match_pat(static_cast<_Node_assert*>(_Nx)->_Child) == _Neg) {
// restore initial state and indicate failure
_Tgt_state = _St;
_Failed = true;
} else {
_Tgt_state._Cur = _Ch;
}
break;
}

Specifically, _Tgt_state._Cur = _Ch; insufficiently resets the state after a successful negative assertion. (But it is the appropriate reset for positive assertions.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have restored the initial state after a successful negative assertion.

@StephanTLavavej StephanTLavavej added bug Something isn't working regex Everyone's favorite header labels Feb 4, 2025
Copy link
Contributor

@muellerj2 muellerj2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a test (like the one in #5245) should be added to verify that the behavior is correct after this PR.

Edit: It would also be nice to add test coverage which verifies that captures are recorded correctly for positive assertions as well.

@@ -3583,7 +3580,11 @@ bool _Matcher<_BidIt, _Elem, _RxTraits, _It>::_Match_pat(_Node_base* _Nx) { // c
_Tgt_state = _St;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing: This state reset after a match failure is both unnecessary and pointless, so I think it should be removed. No caller of _Match_pat() can assume that the match state hasn't changed after failure, because the function does not fully reset the state after trying to match two or more NFA nodes anyway.

stl/inc/regex Outdated
@@ -3583,7 +3580,11 @@ bool _Matcher<_BidIt, _Elem, _RxTraits, _It>::_Match_pat(_Node_base* _Nx) { // c
_Tgt_state = _St;
_Failed = true;
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The opening brace after else leads to unnecessary nesting. You can just remove the braces and write else if instead.

stl/inc/regex Outdated
if (_Neg) {
_Tgt_state = _St;
} else {
_Tgt_state._Cur = _Ch;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing: We can assign _St._Cur to _Tgt_state._Cur here instead of _Ch and eliminate the local variable _Ch. This should also decrease the amount of stack this function uses per call a bit, slightly reducing the danger of stack overflow (as in #997).

Comment on lines +3581 to +3582
} else if (!_Neg) {
_Tgt_state._Cur = _St._Cur;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not equivalent to the prior commit. Now the state is no longer being reset for a successful negative lookahead assertion at all.

(My comment about a superfluous _Tgt_state = _St; statement points a different line, namely the one in the failure case. The one removed by the last commit in the successful case isn't superfluous at all.)

This suggests we really need more test coverage for negative assertions if no test catches this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regex Everyone's favorite header
Projects
Status: Work In Progress
Development

Successfully merging this pull request may close these issues.

<regex>: Successful negative lookahead assertions sometimes mistakenly assign matches to capture groups
5 participants