Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make (*THEN) work - first draft #22215

Draft
wants to merge 14 commits into
base: blead
Choose a base branch
from
Draft

Conversation

florian-pe
Copy link

Currently, a global flag (RExC_seen |= REG_CUTGROUP_SEEN;) is set in regcomp.c when (*THEN) is encountered. Then in S_regmatch() and when encountering a BRANCH node, a choice point is puhsed with PUSH_YES_STATE_GOTO if the cutgroup flag is set, otherwise a normal choice point is pushed with PUSH_STATE_GOTO. Since this flag is global to the regex, every BRANCH node of the regex will push will a "yes state" if the regex contain a single (*THEN), instead of BRANCH nodes pushing regular choice points.

Given that "sayNO" (e.g. simple backtracking) performs a cut if a "yes state" (choice point and cut point) remains and if no_final is true (which it is because backtracking after having encountered (*THEN), to the choice point pushed by (*THEN) causes the regex engine to execute the op CUTGOUP_next_fail which assigns no_final to 1) and that every alternation is also a new cut point, the cut operation has the effect to perform a simple backtracking operation as far as alternations are concerned, because every alternation pushes a choice point/cut point (e.g "yes state"). Normally, only the BRANCH node of the alternation directly containing (*THEN) should push a "yes state".

One solution is to mark the appropriate BRANCH as being the outtermost alternation of a (*THEN) group and check this mark/flag inside case BRANCH (which it already does currently) and push a new choice point or a choice point+cut point accordingly with PUSH_YES_STATE_GOTO or respectively PUSH_STATE_GOTO. An alternative solution is to create a distinct BRANCH_cutgroup regnode type which would always push a choice point+cut point while the BRANCH regnode would always push a regular choice point.

In order to mark BRANCH nodes as containing a (*THEN), a field (has_cutgroup) in the regex parser and compiler context (RExC_state_t) is set to 1 in regbranch() when encountering (*THEN). This field is looked up by S_reg() (regbranch()'s caller) when creating the BRANCH regnode and set the regnode_head data.u_8.flags field of this new regnode accordingly. This field (has_cutgroup) is first saved before calling regbranch(), then set to 0, then call regbranch() is made. After regbranch() returned and after the has_cutgroup field has been checked when creating the regnode, it is restored to its previous value in order to not forget having seen a previous (*THEN) group in a situation like: / (?: a (*THEN) (?: b (*THEN) | c) | d) /.

What needs to be verified is first the S_reg() function, make sure each regmatch() call is preceded by RExC_has_cutgroup = 0, that the RExC_has_cutgroup is reset to had_cutgroup after regmatch() has returned and before S_reg() itself returns, and finally that all new regnode BRANCH take into account has_cutgroup and that the setting flags to 1 doesn't create conflicts elsewhere.

Secondly, TRIEs need to somehow conserve the information that a particular BRANCH node was marked as containing a cutgroup, and appropriately perform the cut in TRIE_next_fail.

Currently, a global flag (RExC_seen |= REG_CUTGROUP_SEEN;) is set in regcomp.c when (*THEN) is encountered. Then in S_regmatch() and when encountering a BRANCH node, a choice point is puhsed with PUSH_YES_STATE_GOTO if the cutgroup flag is set, otherwise a normal choice point is pushed with PUSH_STATE_GOTO. Since this flag is global to the regex, every BRANCH node of the regex will push will a "yes state" if the regex contain a single (*THEN), instead of BRANCH nodes pushing regular choice points.

Given that "sayNO" (e.g. simple backtracking) performs a cut if a "yes state" (choice point and cut point) remains and if no_final is true (which it is because backtracking after having encountered (*THEN), to the choice point pushed by (*THEN) causes the regex engine to execute the op CUTGOUP_next_fail which assigns no_final to 1) and that every alternation is also a new cut point, the cut operation has the effect to perform a simple backtracking operation as far as alternations are concerned, because every alternation pushes a choice point/cut point (e.g "yes state"). Normally, only the BRANCH node of the alternation directly containing (*THEN) should push a "yes state".

One solution is to mark the appropriate BRANCH as being the outtermost alternation of a (*THEN) group and check this mark/flag inside case BRANCH (which it already does currently) and push a new choice point or a choice point+cut point accordingly with PUSH_YES_STATE_GOTO or respectively PUSH_STATE_GOTO. An alternative solution is to create a distinct BRANCH_cutgroup regnode type which would always push a choice point+cut point while the BRANCH regnode would always push a regular choice point.

In order to mark BRANCH nodes as containing a (*THEN), a field (has_cutgroup) in the regex parser and compiler context (RExC_state_t) is set to 1 in regbranch() when encountering (*THEN). This field is looked up by S_reg() (regbranch()'s caller) when creating the BRANCH regnode and set the regnode_head data.u_8.flags field of this new regnode accordingly. This field (has_cutgroup) is first saved before calling regbranch(), then set to 0, then call regbranch() is made. After regbranch() returned and after the has_cutgroup field has been checked when creating the regnode, it is restored to its previous value in order to not forget having seen a previous (*THEN) group in a situation like: / (?: a (*THEN) (?: b (*THEN) | c) | d) /.

!!! This commit doesn't attempt to fix the logic of TRIE regnodes concerning (*THEN) !!!
@florian-pe
Copy link
Author

This pull request is related to the issues
#22193
#21531
and possibly any other issue relating to (*THEN). Possibly this ones for example:
#21755
#17069
#12853
#11443

@demerphq and any other perl developper experienced with the regex engine should approve this pull request before merging it, and possibly modify it or follow with the TRIE modification.

@xenu
Copy link
Member

xenu commented May 14, 2024

I'm not really familiar with the issue, but since this changes the behaviour of regexes, it definitely needs tests that demonstrate that.

@florian-pe
Copy link
Author

florian-pe commented May 14, 2024

Yes, it changes the behavior of regexes but that new behavior is how the regexes are supposed to behave according to the documentation in perlre.
I believe that (*THEN) never worked properly since this feature was originally released in 5.10. The reason for this is that barely anyone uses this feature and also that it isn't properly tested by the test suite. Only 4 tests contain this pattern and not one contains an alternation.

So far, this commit seems to solve these issues:
#22193
#21531
#17069

@florian-pe
Copy link
Author

it isn't properly tested by the test suite. Only 4 tests contain this pattern and not one contains an alternation.

I'm sorry, there *are* more advanced tests for (*THEN) in pat_advanced.t, they were invisible in grep -r's output.

@tonycoz tonycoz added the defer-next-dev This PR should not be merged yet, but await the next development cycle label May 22, 2024
@jkeenan jkeenan removed the defer-next-dev This PR should not be merged yet, but await the next development cycle label Jun 10, 2024
@khwilliamson
Copy link
Contributor

@florian-pe I'm hoping you will continue to work on this

initialize RExC_has_cutgroup
remove erroneous instructions inserted in a previous comit
use the BRANCH_HAS_CUTGROUP macro
@florian-pe
Copy link
Author

florian-pe commented Sep 5, 2024

@khwilliamson Yes, I started working on it again. I thought I had the final solution but there is another piece of logic that is missing.

Some of the added instructions in S_reg() to keep track of the branches containing (*THEN) were wrong and I forgot to initialize the new field of RExC_state_t in Perl_re_op_compile().

I've fixed make_trie() and S_regmatch() so that, in the case of a "jump" trie branch containing a (*THEN)/cutgroup, the trie word is marked as such in trie->wordinfo[]. Later in S_regmatch(), after having matched the trie words and just before jumping into the rest of the branch, it looks up trie->wordinfo[] to see if the branch contains a cutgroup it uses PUSH_YES_STATE_GOTO(TRIE_next) or otherwise it uses PUSH_STATE_GOTO(TRIE_next).

If the regex matching continues, visit a CUTGROUP node and then backtracks into the "cut point" (yes_state) pushed by CUTGROUP, a jump is made to TRIE_next_fail which then jumps itsel to the regex node that follows (another "jump trie branch" or the pattern following the alternation, ie E in: /(?: A ... (*THEN) ... B | C | D) E /x).

The TRIE fix assumes that, when the rest of the branch of a jump trie word backtracks or performs a cut, the control flow must necessarily go to the next trie word in the regex branch order, ie not jump over a trie word/branch. I believe this to be true because 2 trie words must necessarily be separated by at least one alternation. If in the future (or if it is already the case and I'm not aware of it), the TRIE would be modified, in such a way that /(?:abc|abd|def)/ would be tranformed into the degenerate TRIE: <ab[cd]> <def>, where [cd] is a character class, the fix would stop to work, but again, if a general pattern would follow abc or abd in their respective branch, that optimization would not make sense unless this "TRIE" dispatch on the correct jump branch based on the last character matched.

Now, this is the hiccup I've found:

/ \A (?:(?: a (*THEN) | .. (*ACCEPT)) (*THEN) | ... (*ACCEPT) ) (*FAIL) /x  abcde   y   $&  abc

Both current perl and my perl branch match "ab". This is because currently, when backtracking into (*THEN)/CUTGROUP, the choice points of the backtrack stack are popped until reaching the last "cut point"/yes_state. The correct logic for (*THEN) to pop all the choice points (could be happen be implemented in O(1)) that have been pushed since the its alternation branch. For that, each CUTGROUP must contain the id of its matching alternation branch and the choice points pushed by branch (at least those containing a cutgroup) need to store their id in the choice point they push. It should be easy.

Apart from that, make test outputed at the end All tests successful. and all the regex tests were OK, except the ones with _thr that reported skipped. These tests don't include the test above however.

When I'm done I'll make a new clean pull requests with better commit messages. I need to setup git to be able to makes pushes on my own fork first...

@demerphq
Copy link
Collaborator

demerphq commented Sep 5, 2024

FWIW, this is on my todo list to review. Feel free to poke me to remind me to do so.

@florian-pe
Copy link
Author

florian-pe commented Sep 8, 2024

@demerphq
I'm glad to hear it. I started making a moderate refactor of S_regmatch() but I would like to have input from you before spending too much time on it for nothing.

The refactor is about making the backtracking's logic cleaner, it's not about modifying the code of the regops themselves in the measure possible, except for the BRANCH and the backtracking contol verbs. It would also serve as preliminary work to correctly implement (*THEN) and also to directly fix (via the refactor) any bug concerning backtracking control verbs and their interactions. This issue for example #14980 and maybe others. I also believe it could serve as ground work for reworking the capturing group logic, in the future.

The important pieces that would be replaced are: the macros sayYES, sayNO and the various flags modifying their behavior (no_final, do_cutgroup, etc..), the whole state_num = st->resume_state + no_final; and state_num = st->resume_state + 1; /* failure = success + 1 */, PUSH_STATE_GOTO and PUSH_YES_STATE_GOTO. The struct regmatch_state would be modified but only superficially. Each union member would be converted into a separated struct definition but their own specific payload would be untouched, more on this below.

The core idea of the refactor is to make any regex pattern assertion failure simply do: goto backtrack; (or the use the BACKTRACK; macro), without any flag involved. The code below the backtrack: label (inside the while loop, after the switch statement) would only do one thing, which is pop (well not exactly) the last choice point and restore the regex engine state from it, in particular, it would reset the regex program state with state_num = st->resume_state;, never do + 1 or + no_final.

The first implication of this is to modify the various PUSH_STATE_GOTO call. In the example of case BRANCH:, it would do directly PUSH_STATE(BRANCH_next_fail);. Most of the various ${OP}_fail perform some state resetting so they would be mostly untouched, the very few ones that only do sayYES would be removed however. case END: and case ACCEPT: would end up doing goto accept; instead of the final sayYES which unwinds the backtrack stack via a chain of yes_state and sayYES.

The header of the loop isn't pretty but there isn't a reason to change it for now.

The code of push_state: and push_yes_state: would be moved into macros and inlined in each regop but only by necessity. If code duplication is an issue, it would possible to make most regop keep doing goto push_state and only inline in the regop that require it.

After the switch statement and inside the while loop, there would be the label and code of backtrack: and cut:.

After the while loop, there would be additionally the labels accept: and reject:. The code after final_exit: and inside if (reginfo->info_aux_eval) would stay the same, except for the resulting value, more on this below. The backtrack stack would be reset to its original state after final_exit:.

In order to make the more complex patterns work, the backtrack stack would now contain 3 kind of frames: choice point frames, function-like frames and cut frames. The S_regmatch() would contain 3 corresponding local variables, one for the last choice_point, one for the current func_frame and one for the last cut_frame which would be char/void pointer to their respective stack frame.

The choice point frame would be either a regmatch_branch, a regmatch_trie or a quantifier frame, the WHILEM kind of frame. Each choice point would contain all the fields at the top of the current struct regmatch_state above the union. regmatch_branch would also contain all the u.branch fields and similarly for regmatch_trie. The field prev_yes_state would be renamed to prev_cut_state. Finally, each choice point frame would contain the additional fields prev_choice_point_state (because the backtrack stack is now an heterogenous array of choice points, func frames and cut frames) and prev_func_frame.

When backtracking, the choice_point local variable would be reset to choice_point->prev_choice_point_state as well as func_frame and cut_state in the same manner and unconditionally. In particular, if a cut frame sits on top of the topmost choice point frame in the backtrack stack and onto backtracking, the cut frame is automatically skipped over.

Cut frames are never backtracked into. Their purpose is to remove zero or more choice point frames in one step. Cut frames contain a pointer to the previous cut frame and a pointer to the "previous" choice point. When performing a cut, every choice point above the cut frame will be discarded and the choice_point variable will be reset to the prev_choice_point_state field inside the cut frame as well as the previous cut state. The "previous" choice point can be last highest choice point but can also be a deeper choice point in the case of (*THEN) of the test example in my last comment.

Critically, when backtracking or performing a cut, the backtrack stack "top" is reset to whatever is the highest frame in use among the newly resetted choice point, the newly resetted cut point and the current func frame that have been newly resetted in the case of a backtracking or that remained untouched in the case of a cut. There may be dead frames between the highest frame in use and the 2nd highest frame in use but won't be ever made use of again because they are no longer in the choice point frame linked list, the cut frame linked list or the func frame linked list.

Func frames are directly analogous to function call frame of the function call stack of a classical imperative language. They don't contain any backtracking information. They can come from pattern calls or recursive calls but they also can be the first frame of a quantifier or be an eval frame. In the case of a quantifier, they contain the current repeat count. All func frames contain a prev_func_frame field that is equivalent to a return address in the function call stack.

When doing a return, the current func_frame variable is reset but the func_frame is only "popped" if the last choice point frame is below it. If a choice point frame in use is above it, that means that a backtracking event could directly reenter this func frame, or indirectly through a caller or a callee func frame. If the last choice point is below the returning func frame but a cut point frame in use is above it (if it is even possible), the func frame is "popped" and the cut is discarded and resetted (I think).

Since the backtrack stack is segmented, in order to perform a stack frame height comparison, each frame should also contain a depth field.

The S_push_slab function wouldn't be touched but the PL_regmatch_state would become a char or void pointer and be renamed into PL_regmatch_top, because now that each frame has its own typedef struct and its own size, each stack frame should be allocated before being pushed as oppossed to allocate the next one, since we don't know in advance the size of the next frame pushed. This may reduce the memory consumption of the backtrack stack a little.

Looking forward, the regexp_paren_pair struct could be pushed onto the backtrack stack instead of on the save stack, (not exactly how I proposed in anoter issue BTW, what I proposed was besides the point). Basically, when encountering an OPEN or CLOSE, the specific numbered capture regexp_paren_pair struct of the logical paren array would be saved onto the backtrack stack (a 4th kind of frame), then the new start or end field would be written to in the logcial paren pair array. Each saved regexp_paren_pair struct would be prepended into a linked list whose head is the last choice point. When backtracking, each paren struct of the choice point's paren_pair linked list would be used to reset the logical paren pair array.

When it comes to using these stack frames:

  • a BRANCH or TRIE would push a choice point, and a additional cut frame if the current branch contains a cutgroup

  • an independent pattern would push a cut frame. If the pattern inside fails, then it backtracks over the cut frame, if it succeeds, a cut is performed.

  • eval groups (?{}) (??{}) also act as independent patterns so they also push a cut frame for removing all the choice points left after the succeed. If they fail then they backtrack over the cut frame. Since a special clean up is needed after the CODE ran, a choice point frame should be pushed (to the clean up regop case) before pushing the cut frame.

  • lookarounds are the most complex case and they need to push at least once choice point and a cut frame

  • SKIP, PRUNE and COMMIT push a choice point, which when backtrack into goes into the corresponding _fail case which set reginfo->cutpoint and goto reject;

  • CUTGROUP performs a cut as soon as it is reached because all the choice points since the start of the branch are not reachable in any way. This is subtly different from SKIP, PRUNE and COMMIT which can't perform a cut as soon as they are forwardly visited because it's possbile to backtrack over them if they were located in an independent pattern (?> ) or before (*THEN).

I don't know all the details yet concerning conditional patterns and quantifiers, but most quantifiers would push one func frame and a various amount of choice points.

EDIT:
I forgot to say that the result value would need to distinguish reject from reject from a COMMIT or SKIP, but it may or may not be necessary.
Sorry for all the typos.

@florian-pe
Copy link
Author

florian-pe commented Sep 8, 2024

Now that I think about it, it's possible for COMMIT, PRUNE and SKIP to perform a cut when they are visited, but if they are inside any kind of independent subpattern (real independent subpattern, lookaround or eval), they can only cut to the cut point pushed at the start of the independent subpattern, or if they are inside a branch containing a cutgroup, they can only cut to the start of the branch. Otherwise, it should be possible for them to directly cut to the bottom of backtrack stack, thereby removing all previous choice points, when they are visited.

In all cases they should still push a choice point to their respective _fail state after having performed the cut, and in the case of independent subpattern and branch-with-cutgroup, that choice point pushed by COMMIT, PRUNE or SKIP should be placed above the cut point pushed by the outter pattern and the cut performed by COMMIT, PRUNE or SKIP shouldn't discard the outter pattern's cut frame in the first place.

The semantics of cut: I've defined in my previous post leads to the cut of COMMIT, PRUNE, SKIP removing the cut frame pushed by an independent subpattern or a branch-with-cutgroup. The solution is to either use a flag to modify the behavior of cut:, or save the removed cut's data, perform the cut, and push an identical cut frame and then finally push the choice point.

EDIT:
or yet again, a simple hack would be to save the pointer of value of PL_regmatch_top, PL_regmatch_slab, and of the local variable cut_state, let the cut update each one of those variable and savagely reset them afterward instead of performing a clean allocation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants