Skip to content

Conversation

khwilliamson
Copy link
Contributor

@khwilliamson khwilliamson commented Sep 26, 2025

There are a bunch of functions that are passed both a string pointer and a pointer to the string end. This commit allows you to declare the latter in embed.fnc, and have an assert(s < e) generated that goes in the PERL_ARGS_ASSERT for the function.

This is extended for functions where it should instead be assert(s <= e) and assert(s <= cur)

  • This set of changes does not require a perldelta entry.

@khwilliamson khwilliamson force-pushed the EPTR branch 2 times, most recently from d2f1327 to 4da32a7 Compare September 28, 2025 14:05
Comment on lines -189 to +191
die_at_end "$plain_func: u flag only usable with m" if $flags =~ /u/;
die_at_end "$plain_func: u flag only usable with m"
if $flags =~ /u/;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit regen/embed.pl: Comments/white-space

Just a side-note: personally I don't like this style very much but I have no idea what the style-guide says about it (if there even is one).

If I would write I would probably switch to:

if ($flags =~ /u/) {
    die_at_end "$plain_func: u flag only usable with m"
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is less desirable than if it could be on the same line, but this paradigm is fairly common in our source. Perhaps it is because I'm a native English speaker, but this feature of Perl where the 'if' is trailing makes things more understandable at a glance to me even when on a separate line, than the alternative.

regen/embed.pl Outdated
Comment on lines 325 to 345
warn "$func: $arg needs NN or NULLOK\n";
warn "$func: $arg needs NN, [EMS]PTR, or NULLOK\n";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit regen/embed.pl: Add ability to assert(s < e)

A bit of nit-picking: EPTRQ is also an option. Might be implied by EPTR tho.
Slightly wondering if it would be clearer if they are spelled out, i.e.: $arg needs one of NN, SPTR, MPTR, EPTR, EPTRQ, NULLOK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reworded the message

Comment on lines +417 to +478
if ( @bounded_strings
&& ! defined $bounded_strings[-1]{$ptr_type})
{
$bounded_strings[-1]{$ptr_type} = \%entry;
}
else {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit regen/embed.pl: Add ability to assert(s < e)

(Still going through the changes - so comment my be premature but still adding it for now)

At this point it's not clear to me why we can't just add the assert() here and why the @bounded_strings is needed.
Based on the comments I have some vague idea on what it's trying to handle but without deeper looking (and/or testing) that is not yet clear to me.

Can you provide an example of the cases that it handles?

Update:

After some more thinking I guess the main problem is that in for example:

ATdp	|char * |delimcpy	|SPTR char *to				\
				|EPTRQ const char *to_end		\
				|SPTR const char *from			\
				|EPTRQ const char *from_end		\

it needs a way to "link" to and to_end together.

And at the same time also wants to support:

ATdp	|char * |delimcpy	|EPTRQ char *to_end				\
				|SPTR const char *to		\
				|SPTR const char *from			\
				|EPTRQ const char *from_end		\

(i.e. to and to_end switching places)

The real problem: when we see SPTR/EPTR/... the code needs to figure to what string it belong.
I think it might be good to explicitly state that difficulty somewhere in the comments.

Also wondering if it should be made explicit if multiple pointers are present.
That is: using something like:

ATdp	|char * |delimcpy	|SPTR1 char *to				\
				|EPTRQ1 const char *to_end		\
				|SPTR2 const char *from			\
				|EPTRQ2 const char *from_end		\

so that there is no need to make an assumption that they come in pair/in triplets.

It would also make it more flexible since then something like:

ATdp	|char * |delimcpy	|SPTR char *to				\
				|SPTR const char *from			\
				|EPTRQ const char *to_end		\
				|EPTRQ const char *from_end		\

could also be supported. (But I'm not sure if that is really worth the trouble/effort)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added extensive comments to this point in the code that hopefully explain things adequately at this point.

Comment on lines +448 to +497
if (1 == ( (defined $string->{S})
+ (defined $string->{M})
+ (defined $string->{E})))
{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit regen/embed.pl: Add ability to assert(s < e)

I'm wondering.. for MPTR: should it enforce that it has both a SPTR and an EPTR?
The docs contains:

: MPTR means that not only must this pointer parameter be non-NULL, it
: points to a position in a character string such that
: SPTR <= MPTR < EPTR. The called function is free to look at any
: position in the range starting at SPTR up to but not including
: EPTR.

but if there is only a SPTR or only a EPTR then it can not guarantee that..
Maybe it should raise an error or a warning in such a case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also added more extensive comments to embed.fnc about how you only need two constraints, and MPTR or SPTR are chosen based on the semantics of the parameter, even though the generated constraint could be the same in either case.

embed.fnc Outdated
Comment on lines 199 to 203
: MPTR means that not only must this pointer parameter be non-NULL, it
: points to a position in a character string such that
: SPTR <= MPTR < EPTR. The called function is free to look at any
: position in the range starting at SPTR up to but not including
: EPTR.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit regen/embed.pl: Add ability to assert(s < e)

How does this work in combination with EPTRQ?
Just a dummy example:

                |EPTRQ char *to_end				\
				|SPTR const char *to		\
				|MPTR const char *to_middle

Will the assert be: to <= to_middle && to_middle <= to_end?
Or will it still be: to <= to_middle && to_middle < to_end?

(I'm guessing the former, also not sure if it should be explicitly stated or not)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comments I've added as a result of yours here will make this clear.

regen/embed.pl Outdated
Comment on lines 473 to 476
push @asserts, "assert($lower->{deref}$lower->{argname}"
. " <$upper->{equal} "
. "$upper->{deref}$upper->{argname})";
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit regen/embed.pl: Add ability to assert(s < e)

The code works, it just required some thought to figure how it works.
which makes me wonder if it wouldn't be clearer to just write it more explicit, i.e. something like (untested):

if ($string->{S} and $string{E}) {
    push @asserts, "assert($string->{S}->{deref}$string->{S}->{argname} <assert($string->{E}->{equal} assert($string->{E}->{deref}$string->{E}->{argname})";
}
if ($string->{S} and $string{M}) {
    push @asserts, "assert($string->{S}->{deref}$string->{S}->{argname} <= assert($string->{M}->{deref}$string->{M}->{argname})";
} 
if ($string->{M} and $string{E}) {
    push @asserts, "assert($string->{M}->{deref}$string->{M}->{argname} <assert($string->{E}->{equal} assert($string->{E}->{deref}$string->{E}->{argname})";
} 

Or I guess something like: (assuming v5.40.0, untested)

    for my ($lower_i, $upper_i) (("S", "E"), ("S", "M"), ("M", "E")) {
        my $lower = $string->{$lower_i} or next;
        my $upper = $string->{$upper_i} or next;

        push @asserts, "assert($lower->{deref}$lower->{argname}"
                                        . " <$upper->{equal} "
                                        . "$upper->{deref}$upper->{argname})";
    }

(But I'm guessing this might generate an extra assert? With the current code when SPTR, MPTR and EPTR are all set I guess it creates the asserts: assert SPTR <= MPTR and assert MPTR < EPTR; while with the untested code above it would result in: assert SPTR < EPTR, assert SPTR <= MPTR, assert MPTR < EPTR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is commented to avoid using modern-only constructs; I tried then using your first suggestion, and found that there was too much repetition of complicated expressions for my taste. I think the loop I have is easier to maintain. But I've added a bunch of comments to explain it.

Copy link

@bram-perl bram-perl Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a fyi, if you want to avoid modern-only construct you could write it as (untested):

    for my $i (["S", "E"], ["S", "M"], ["M", "E"]) {
        my ($lower_i, $upper_i) = @$i;
        my $lower = $string->{$lower_i} or next;
        my $upper = $string->{$upper_i} or next;

        push @asserts, "assert($lower->{deref}$lower->{argname}"
                                        . " <$upper->{equal} "
                                        . "$upper->{deref}$upper->{argname})";
    }

or:

    for my $i (["S", "E"], ["S", "M"], ["M", "E"]) {
        my $lower = $string->{$i->[0]} or next;
        my $upper = $string->{$i->[1]} or next;

        push @asserts, "assert($lower->{deref}$lower->{argname}"
                                        . " <$upper->{equal} "
                                        . "$upper->{deref}$upper->{argname})";
    }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is commented to avoid using modern-only constructs

It indeed says:

require 5.004;  # keep this compatible, an old perl is all we may have before
                # we build the new one

But it doesn't make any sense to me. This file isn't executed while perl is built. It's only executed by people who develop the interpreter, who all should have access to modern Perl.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your solution here is a lot clearer than mine; so I've adopted it, but added a statement to skip s < e if there is s <= m < e

embed.fnc Outdated
|NN const U8 * const s \
|NN const U8 * const e \
|SPTR const U8 * const s \
|EPTR const U8 * const e \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit embed.fnc: Add ptr assertions for apparently non-problematic

I then ran the test suite, and reverted any that had problems.

Out of curiosity: do you still have the list of the failing ones somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure; I plan to go make every function with these kinds of parameters to have constraints added to embed.fnc.. Maybe I should do that now for each of them, using EPTRQ. Then I or someone else could go through them to see why there is an apparent discrepancy.

Comment on lines +727 to +729
PERL_ARGS_ASSERT_IS_UTF8_CHAR_HELPER_;
assert(0 == (flags & ~UTF8_DISALLOW_ILLEGAL_INTERCHANGE));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit embed.fnc: Add ptr assertions for apparently non-problematic

I think it would be better to move some of these changes in a separate commit.
In this commit two things happen:

  • the assert calls are moved to the beginning of the function (since we can now do that)
  • one of the asserts (assert(e > s)) is moved the PERL_ARGS_ASSERT macro.

Thinking some more: how I would likely do it:

  • commit 1: add SPTR/EPTR/MPTR in embed.fnc except for the functions in the utf8.c file
  • commit 2: move the asserts to the beginning of the function (and keep the assert e > s)
  • commit 3: add the SPTR/EPTR/MPTR in embed.fnc for these function and remove the assert(e > s).

In commit 3 it should be obvious that these are now moved to the PERL_ARGS_ASSERT macro. In the current commit it's not obvious because there are many changes in proto.h)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, for all affected .c files

Inspection showed these were wrongly categorized
It's best to have these at the beginning to guarantee that the asserts
come before any invalid use of the parameters.

The functions done here will have their asserts manipulated in a few
commits hence.  Doing this now will highlight just the changes that
one causes.
Wrap to fit in 80 column terminal window
This:

* reorders some comments to make more sense
* makes minor clarifications
* removes comments that no longer make sense
* adds comments about moving ARGS_ASSERT macros to the top of their
  functions when the code around them gets changed anyway.
Where s is a pointer into a string, and e is the end of it.
This uses the previous commit's new abilities on a couple of sample
functions, so you can see the changes it makes in two small bites.
I went through the declarations in embed.fnc and added PTR constraints
for all the ones that looked to have pointers to the beginning and end
of a string.  I then ran the test suite, and reverted any that had
problems.

Then I looked at the code for each one remaining to see if it was
equipped to handle the case where the end == the beginning, and removed
those.

This is the result.  Testing in the field may reveal others that the
test suite missed; we can fix those as they occur.

I removed now redundant asserts that were in the functions, and now are
included in the ARGS_ASSERT macros
This is the first use of the new MPTR constraint that is used to
generate an assertion that a pointer is somewhere in the middle of a
string.

I removed now redundant asserts that were in the functions, and now are
included in the ARGS_ASSERT macros
Generally, a pointer to a string upper bound actually is to one beyond
the actual final byte in the string.  This is sanctioned by the C
Standard, and allows you to just subtract the lower bound from it to get
its length, without having to add 1.

But some functions are written to tolerate the upper bound pointer being
set to the actual final byte.  The EPTRQ constraint in embed.fnc is used
for those; the assertion becomes 'l <= u' instead of strictly less-than.

This commit is the first to use this type of constraint, and it applies
it only to those functions whose documentation or behavior clearly
indicate this is expected.

I removed now redundant asserts that were in the functions, and now are
included in the ARGS_ASSERT macros

There's a dozen-ish ones where that isn't true.  And they need to be
investigated further before deciding their disposition.

l
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants