Many warnings in regcomp.c can occur twice #14061

p5pRT · 2014-09-01T03:43:12Z

Migrated from rt.perl.org#122671 (status was 'resolved')

Searchable as RT122671$

p5pRT · 2014-09-01T03:43:12Z

From @khwilliamson

This is a bug report for perl from khw@khw.(none),
generated with the help of perlbug 1.40 running under perl 5.21.4.

The regex compiler has many of its possible warnings emitted in the
first of the two main passes. The problem is that this pass can be
run twice causing the same warning to be emitted twice. This repeated
pass occurs when something the parser encounters makes it decide it
needs to re-encode the pattern as UTF-8. At that point it cleans up
what it's doing at the moment, upgrades the pattern, and restarts. Any
warning already raised by the time it decides to upgrade will be raised
again. (If the pattern is already in UTF-8 at the start of parsing,
there is no a problem.)

Flags:
category=core
severity=low

Site configuration information for perl 5.21.4:

Configured by khw at Sun Aug 31 08:55:05 MDT 2014.

Summary of my perl5 (revision 5 version 21 subversion 4) configuration:
Commit id: 9a7aaf5
Platform:
osname=linux, osvers=3.13.0-35-generic,
archname=x86_64-linux-thread-multi-ld
uname='linux khw 3.13.0-35-generic #62-ubuntu smp fri aug 15
01:58:42 utc 2014 x86_64 x86_64 x86_64 gnulinux '
config_args='-des -Uversiononly -Dprefix=/home/khw/blead -Dusedevel
-D'optimize=-g' -A'optimize=-g' -A'optimize=-O0'
-Accflags='-DPERL_BOOL_AS_CHAR' -Dman1dir=none -Dman3dir=none
-DDEBUGGING -DDEBUGGING -Dcc=g++ -Dusemorebits -Dusethreads'
hint=recommended, useposix=true, d_sigaction=define
useithreads=define, usemultiplicity=define
use64bitint=define, use64bitall=define, uselongdouble=define
usemymalloc=n, bincompat5005=undef
Compiler:
cc='g++', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DPERL_BOOL_AS_CHAR
-fwrapv -DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector
-I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
-D_FORTIFY_SOURCE=2',
optimize='-g -g -O0',
cppflags='-D_REENTRANT -D_GNU_SOURCE -DPERL_BOOL_AS_CHAR -fwrapv
-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector
-I/usr/local/include'
ccversion='', gccversion='4.8.2', gccosandvers=''
intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=define,
longdblsize=16, longdblkind=3
ivtype='long', ivsize=8, nvtype='long double', nvsize=16,
Off_t='off_t', lseeksize=8
alignbytes=16, prototype=define
Linker and Libraries:
ld='g++', ldflags =' -fstack-protector -L/usr/local/lib'
libpth=/usr/include/c++/4.8 /usr/include/x86_64-linux-gnu/c++/4.8
/usr/include/c++/4.8/backward /usr/local/lib
/usr/lib/gcc/x86_64-linux-gnu/4.8/include-fixed
/usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib
/usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
libc=libc-2.19.so, so=so, useshrplib=false, libperl=libperl.a
gnulibc_version='2.19'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fPIC', lddlflags='-shared -g -g -O0 -L/usr/local/lib
-fstack-protector'

@INC for perl 5.21.4:
/home/khw/perl/blead/lib
/home/khw/blead/lib/perl5/site_perl/5.21.4/x86_64-linux-thread-multi-ld
/home/khw/blead/lib/perl5/site_perl/5.21.4
/home/khw/blead/lib/perl5/5.21.4/x86_64-linux-thread-multi-ld
/home/khw/blead/lib/perl5/5.21.4
/home/khw/blead/lib/perl5/site_perl/5.21.3
/home/khw/blead/lib/perl5/site_perl/5.21.2
/home/khw/blead/lib/perl5/site_perl/5.21.1
/home/khw/blead/lib/perl5/site_perl/5.20.0
/home/khw/blead/lib/perl5/site_perl/5.19.12
/home/khw/blead/lib/perl5/site_perl/5.19.11
/home/khw/blead/lib/perl5/site_perl/5.19.10
/home/khw/blead/lib/perl5/site_perl
.

Environment for perl 5.21.4:
HOME=/home/khw
LANG=en_US.UTF-8
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)

PATH=/home/khw/bin:/home/khw/perl5/perlbrew/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/usr/local/games:/home/khw/iands/www:/home/khw/cxoffice/bin
PERL5OPT=-w
PERL_BADLANG (unset)
PERL_POD_PEDANTIC=1
SHELL=/bin/ksh

p5pRT · 2014-09-01T05:03:16Z

From @cpansprout

On Sun Aug 31 20:43:12 2014, public@khwilliamson.com wrote:

This is a bug report for perl from khw@khw.(none),
generated with the help of perlbug 1.40 running under perl 5.21.4.

-----------------------------------------------------------------
The regex compiler has many of its possible warnings emitted in the
first of the two main passes. The problem is that this pass can be
run twice causing the same warning to be emitted twice. This repeated
pass occurs when something the parser encounters makes it decide it
needs to re-encode the pattern as UTF-8. At that point it cleans up
what it's doing at the moment, upgrades the pattern, and restarts.
Any
warning already raised by the time it decides to upgrade will be
raised
again. (If the pattern is already in UTF-8 at the start of parsing,
there is no a problem.)

C.f. commit 688e039. Maybe a similar technique could be used to fix the others.

--

Father Chrysostomos

p5pRT · 2014-09-01T05:03:16Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2014-09-01T12:06:57Z

From @jkeenan

On Sun Aug 31 20:43:12 2014, public@khwilliamson.com wrote:

This is a bug report for perl from khw@khw.(none),
generated with the help of perlbug 1.40 running under perl 5.21.4.

-----------------------------------------------------------------
The regex compiler has many of its possible warnings emitted in the
first of the two main passes. The problem is that this pass can be
run twice causing the same warning to be emitted twice.

Can you provide some examples?

p5pRT · 2014-09-01T15:41:11Z

From @khwilliamson

On 08/31/2014 11:03 PM, Father Chrysostomos via RT wrote:

On Sun Aug 31 20:43:12 2014, public@khwilliamson.com wrote:

This is a bug report for perl from khw@khw.(none),
generated with the help of perlbug 1.40 running under perl 5.21.4.

-----------------------------------------------------------------
The regex compiler has many of its possible warnings emitted in the
first of the two main passes. The problem is that this pass can be
run twice causing the same warning to be emitted twice. This repeated
pass occurs when something the parser encounters makes it decide it
needs to re-encode the pattern as UTF-8. At that point it cleans up
what it's doing at the moment, upgrades the pattern, and restarts.
Any
warning already raised by the time it decides to upgrade will be
raised
again. (If the pattern is already in UTF-8 at the start of parsing,
there is no a problem.)

C.f. commit 688e039. Maybe a similar technique could be used to fix the others.

The problem is that the first pass can be started anywhere during it.
If the restart occurs before it got to the place where the message would
have been output, and we rely on a global flag, the flag will be set at
the time the message would be output in the re-parse, so it will never
get output. (I haven't investigated, but I hope that 688e039
doesn't suffer from the same problem, that is that the study doesn't
abort during the middle of itself, and then the restudy happens at
varying possible points of completion of the original study.)

A generic example of this happening is:
$ perl -Ilib -le 'qr/(?-o:foo\x{100})/'
Useless (?-o) - don't use /o modifier in regex; marked by <-- HERE in
m/(?-o <-- HERE :foo\x{100})/ at -e line 1.
Useless (?-o) - don't use /o modifier in regex; marked by <-- HERE in
m/(?-o <-- HERE :foo\x{100})/ at -e line 1.

The solution that seems the simplest is to just defer warnings until
pass2. That works for all but the parts of pass2 that can get repeated.
I think the only such part is study_chunk.

On closer examination, I found that a number of warnings currently are
deferred to pass 2, and hence are not subject to this bug.

However, actual errors should be tested for and compilation aborted in
the first pass. That means that if a pattern has both errors and
warnings, the warnings will not be displayed until the errors get fixed.
That really is no different than if a pattern has two errors. The 2nd
isn't shown until the first one gets fixed, as compilation is
(necessarily) stopped after that first one. But it would be a change in
behavior from current output, albeit only for fatally flawed patterns

p5pRT · 2014-09-01T16:25:57Z

From @cpansprout

On Mon Sep 01 08:41:11 2014, public@khwilliamson.com wrote:

However, actual errors should be tested for and compilation aborted in
the first pass. That means that if a pattern has both errors and
warnings, the warnings will not be displayed until the errors get
fixed.
That really is no different than if a pattern has two errors. The
2nd
isn't shown until the first one gets fixed, as compilation is
(necessarily) stopped after that first one. But it would be a change
in
behavior from current output, albeit only for fatally flawed patterns

I think that would be an acceptable change.

--

Father Chrysostomos

p5pRT · 2014-09-04T00:54:43Z

From @cpansprout

On Mon Sep 01 08:41:11 2014, public@khwilliamson.com wrote:

The solution that seems the simplest is to just defer warnings until
pass2.

I have a question: Does that mean non-utf8 patterns might miss the warnings because they don’t need a second pass?

--

Father Chrysostomos

p5pRT · 2014-09-04T01:37:50Z

From @khwilliamson

On 09/03/2014 06:54 PM, Father Chrysostomos via RT wrote:

On Mon Sep 01 08:41:11 2014, public@khwilliamson.com wrote:

The solution that seems the simplest is to just defer warnings until
pass2.

I have a question: Does that mean non-utf8 patterns might miss the warnings because they don’t need a second pass?

No. Everything has what is called a "second" pass. But that pass could
actually be a 3rd pass. It's because what's called the first pass may
have to be done twice. Ugh!. The first pass is to calculate how much
space is needed for the pattern. If the pattern isn't already in UTF-8,
and somewhere in the middle of calculating the space, it discovers it
needs to be encoded in UTF-8 (typically by finding something like a
\x{foo} that means a literal character above ordinal 255), it upgrades
the pattern to UTF-8, which may expand the upper-latin1 characters that
have already been processed into two bytes each, making the current
length calculation wrong. So it starts over, redoing the whole "first"
pass.

Only the portion of the parsing up to the point where the too-big
character is found is wasted. The worst case is if it is the final
character in the pattern.

p5pRT · 2014-09-05T10:46:31Z

From @demerphq

On 4 September 2014 03:37, Karl Williamson <public@khwilliamson.com> wrote:

On 09/03/2014 06:54 PM, Father Chrysostomos via RT wrote:

On Mon Sep 01 08:41:11 2014, public@khwilliamson.com wrote:

The solution that seems the simplest is to just defer warnings until
pass2.

I have a question: Does that mean non-utf8 patterns might miss the
warnings because they don’t need a second pass?

No. Everything has what is called a "second" pass. But that pass could
actually be a 3rd pass. It's because what's called the first pass may have
to be done twice. Ugh!. The first pass is to calculate how much space is
needed for the pattern. If the pattern isn't already in UTF-8, and
somewhere in the middle of calculating the space, it discovers it needs to
be encoded in UTF-8 (typically by finding something like a \x{foo} that
means a literal character above ordinal 255), it upgrades the pattern to
UTF-8, which may expand the upper-latin1 characters that have already been
processed into two bytes each, making the current length calculation
wrong. So it starts over, redoing the whole "first" pass.

Only the portion of the parsing up to the point where the too-big
character is found is wasted. The worst case is if it is the final
character in the pattern.

One day I or someone will get around to making the compiler build an optree
in the first pass, and then do an optimising second pass, and then a opcode
generating third pass.

Instead of what we do now which is a memory sizing first pass, possibly
executed twice, followed by an opcode generating second pass, followed by
an optimizing third pass.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2014-09-07T03:21:08Z

From @khwilliamson

Fixed in blead by 499333d
--
Karl Williamson

p5pRT · 2014-09-07T03:21:09Z

@khwilliamson - Status changed from 'open' to 'pending release'

p5pRT · 2015-06-02T03:41:27Z

From @khwilliamson

Thanks for submitting this ticket

The issue should be resolved with the release today of Perl v5.22. If you find that the problem persists, feel free to reopen this ticket

--
Karl Williamson for the Perl 5 porters team

p5pRT · 2015-06-02T03:41:28Z

@khwilliamson - Status changed from 'pending release' to 'resolved'

p5pRT closed this as completed Jun 2, 2015

p5pRT added the Severity Low label Oct 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Many warnings in regcomp.c can occur twice #14061

Many warnings in regcomp.c can occur twice #14061

p5pRT commented Sep 1, 2014

p5pRT commented Sep 1, 2014

p5pRT commented Sep 1, 2014

p5pRT commented Sep 1, 2014

p5pRT commented Sep 1, 2014

p5pRT commented Sep 1, 2014

p5pRT commented Sep 1, 2014

p5pRT commented Sep 4, 2014

p5pRT commented Sep 4, 2014

p5pRT commented Sep 5, 2014

p5pRT commented Sep 7, 2014

p5pRT commented Sep 7, 2014

p5pRT commented Jun 2, 2015

p5pRT commented Jun 2, 2015

Many warnings in regcomp.c can occur twice #14061

Many warnings in regcomp.c can occur twice #14061

Comments

p5pRT commented Sep 1, 2014

p5pRT commented Sep 1, 2014

From @khwilliamson

p5pRT commented Sep 1, 2014

From @cpansprout

p5pRT commented Sep 1, 2014

p5pRT commented Sep 1, 2014

From @jkeenan

p5pRT commented Sep 1, 2014

From @khwilliamson

p5pRT commented Sep 1, 2014

From @cpansprout

p5pRT commented Sep 4, 2014

From @cpansprout

p5pRT commented Sep 4, 2014

From @khwilliamson

p5pRT commented Sep 5, 2014

From @demerphq

p5pRT commented Sep 7, 2014

From @khwilliamson

p5pRT commented Sep 7, 2014

p5pRT commented Jun 2, 2015

From @khwilliamson

p5pRT commented Jun 2, 2015