Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong behavior of (*THEN) when it follows a pattern containing an alternation #21531

Open
florian-pe opened this issue Sep 28, 2023 · 4 comments

Comments

@florian-pe
Copy link

The following code outputs "abc" when it should output "abcd".

say $& if "abcde" =~ / ^ a (?: (?: (??{""}) b | bc (*ACCEPT) ) (*THEN) (*FAIL) | bcd ) /x;

The purpose of (??{""}) is to prevent an optimization that modify the original semantics of the pattern.

Here is the description of (*THEN) in perlre which is not entirely precise.

"(*THEN)" "(*THEN:NAME)"
  This is similar to the "cut group" operator "::" from Raku.  Like  "(*PRUNE)",  this  verb  always
  matches,  and  when  backtracked  into  on  failure,  it  causes  the regex engine to try the next
  alternation in the innermost enclosing group (capturing or otherwise) that has alternations.

The relevant part is it causes the regex engine to try the next alternation in the innermost enclosing group (capturing or otherwise) that has alternations..

If the real meaning is "the outter innermost enclosing group" then the current behavior is wrong since the outter group containing (*THEN) is (?: (?: (??{""}) b | bc (*ACCEPT) ) (*THEN) (*FAIL) | bcd )

Otherwise the behavior is still wrong because here the group that the regex engine backtracks into ((?: (??{""}) b | bc (*ACCEPT) )) is not the innermost enclosing group and the regex engine backtracks into a pattern directly before/on the left side of (*THEN) which is precisely what this group is supposed to prevent in the first place.

@demerphq
Copy link
Collaborator

I agree this is not behavior correctly. I am not sure I follow the argument behind the "if the real meaning" part, or the following paragraph, but i do agree that it should match "abcd".

I will try to find time to look into this, but i don't know when.

@demerphq
Copy link
Collaborator

FWIW, the issue seems to be that we set PREGf_SKIP for this pattern, which then disables retrying the stclass. I am not sure what the story is there. Running make test_reonly with the PREGf_SKIP logic disabled as so:

$ git diff
diff --git a/regcomp.c b/regcomp.c
index 6e35d95d2a..7679c56c7e 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -1917,7 +1917,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
             first = REGNODE_AFTER(first);
             goto again;
         }
-        if (sawplus && !sawminmod && !sawlookahead
+        if (0 && sawplus && !sawminmod && !sawlookahead
             && (!sawopen || !RExC_sawback)
             && !(RExC_seen & REG_PESSIMIZE_SEEN)) /* May examine pos and $& */
             /* x+ must match at the 1st pos of run of x's */

I see this:

TEST_ARGS='re/*.t ../ext/re/t/*.t' PERL_TEST_HARNESS_ASAP=1 TESTFILE=harness  ./runtests choose
re/alpha_assertions.t ......... ok      
re/anyof.t .................... ok         
re/begin-once.t ............... ok   
re/bigfuzzy_not_utf8.t ........ ok   
re/charset.t .................. ok      
re/fold_grind_8.t ............. ok     
re/fold_grind_T.t ............. skipped: Couldn't find a UTF-8 turkic locale
re/fold_grind_a.t ............. ok    
re/fold_grind_aa.t ............ ok     
re/fold_grind_d.t ............. ok    
re/fold_grind_l.t ............. ok     
re/fold_grind_u.t ............. ok     
re/keep_tabs.t ................ ok    
re/no_utf8_pm.t ............... ok   
re/opt.t ...................... ok     
re/overload.t ................. ok    
re/pat.t ...................... ok         
re/pat_advanced.t ............. ok      
re/pat_advanced_thr.t ......... ok      
re/pat_psycho.t ............... ok     
re/pat_psycho_thr.t ........... ok     
re/pat_re_eval.t .............. ok       
re/pat_re_eval_thr.t .......... ok       
re/pat_rt_report.t ............ ok         
re/pat_rt_report_thr.t ........ ok         
re/pat_special_cc.t ........... ok   
re/pat_special_cc_thr.t ....... ok   
re/pat_thr.t .................. ok         
re/pos.t ...................... ok   
re/qr-72922.t ................. ok     
re/qr.t ....................... ok   
re/qr_gc.t .................... ok   
re/qrstack.t .................. ok   
re/recompile.t ................ ok     
re/reg_60508.t ................ ok   
re/reg_email.t ................ ok    
re/reg_email_thr.t ............ ok    
re/reg_eval.t ................. ok   
re/reg_eval_scope.t ........... ok     
re/reg_fold.t ................. ok         
re/reg_mesg.t ................. ok      
re/reg_namedcapture.t ......... ok   
re/reg_nc_tie.t ............... ok     
re/reg_nocapture.t ............ ok     
re/reg_pmod.t ................. ok     
re/reg_posixcc.t .............. ok      
re/regex_sets.t ............... ok    
re/regex_sets_compat.t ........ ok      
re/regexp.t ................... ok      
re/regexp_noamp.t ............. ok      
re/regexp_nonull.t ............ ok      
re/regexp_normal.t ............ ok      
re/regexp_notrie.t ............ ok      
re/regexp_qr.t ................ ok      
re/regexp_qr_embed.t .......... ok      
re/regexp_qr_embed_thr.t ...... ok      
re/regexp_trielist.t .......... ok      
re/regexp_unicode_prop.t ...... ok         
re/regexp_unicode_prop_thr.t .. ok         
re/rt122747.t ................. ok   
re/rxcode.t ................... ok     
re/script_run.t ............... ok     
re/speed.t .................... ok     
re/speed_thr.t ................ ok     
re/stclass_threads.t .......... 3/6 # Failed test 4 - Behavior appears to be sub quadratic (19.3, 154.48) at re/stclass_threads.t line 77
# Failed test 5 - Behavior is linear and not quadratic (154.48, 1504.498) at re/stclass_threads.t line 78
# Failed test 6 - Behavior is linear as expected at re/stclass_threads.t line 79
re/stclass_threads.t .......... Failed 3/6 subtests 
re/subst.t .................... ok       
re/substT.t ................... ok       
re/subst_amp.t ................ ok    
re/subst_wamp.t ............... ok       
re/uniprops01.t ............... ok       
re/uniprops02.t ............... ok       
re/uniprops03.t ............... ok       
re/uniprops04.t ............... ok       
re/uniprops05.t ............... ok       
re/uniprops06.t ............... ok       
re/uniprops07.t ............... ok       
re/uniprops08.t ............... ok       
re/uniprops09.t ............... ok       
re/uniprops10.t ............... ok       
re/user_prop_race_thr.t ....... ok   
../ext/re/t/intflags.t ........ ok   
../ext/re/t/lexical_debug.t ... ok     
../ext/re/t/qr.t .............. ok   
../ext/re/t/re.t .............. ok     
../ext/re/t/re_funcs.t ........ 1/40 
#   Failed test '/[ab]+/ has skip'
#   at t/re_funcs.t line 85.
# Looks like you failed 1 test of 40.
../ext/re/t/re_funcs.t ........ Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/40 subtests 
../ext/re/t/re_funcs_u.t ...... ok    
../ext/re/t/reflags.t ......... ok     
../ext/re/t/regop.t ........... ok     
../ext/re/t/strict.t .......... ok     

Test Summary Report
-------------------
re/stclass_threads.t        (Wstat: 0 Tests: 6 Failed: 3)
  Failed tests:  4-6
../ext/re/t/re_funcs.t      (Wstat: 256 (exited 1) Tests: 40 Failed: 1)
  Failed test:  35
  Non-zero exit status: 1
Files=89, Tests=388144, 228 wallclock secs (34.14 usr  0.33 sys + 199.12 cusr 13.55 csys = 247.14 CPU)
Result: FAIL
Finished test run at Fri Sep 29 11:19:47 2023.
make: *** [makefile:867: test_reonly] Error 4

Some of the related logic looks like it has changed since the last time I looked at it, and I suspect this is a regression from those changes, but i don't know for sure.

@florian-pe
Copy link
Author

Yes, sorry this paragraph it's a bit dodgy, I was trying to explain and show that the perldoc description is a bit imprecise but on second thought, the description seems adequate and the "outtermost" adjective would be redundant.

@leonerd
Copy link
Contributor

leonerd commented Feb 29, 2024

For reference, this behaviour hasn't changed since 5.14:

$ eachperl -E 'say $& if "abcde" =~ / ^ a (?: (?: (??{""}) b | bc (*ACCEPT) ) (*THEN) (*FAIL) | bcd ) /x;'

  --- perl5.14.4 --- 
abc

  --- perl5.16.3 --- 
abc

  --- perl5.18.4 --- 
abc

...

  --- perl5.38.2 --- 
abc

  --- perl5.39.8 --- 
abc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants