Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ParseXS: extensive refactoring of param processing, and more tests #22580

Merged
merged 32 commits into from
Sep 12, 2024

Conversation

iabyn
Copy link
Contributor

@iabyn iabyn commented Sep 7, 2024

This set of about 33 commits:

  • improves the testing infrastructure of ExtUtils::ParseXS - in particular, it makes it possible to include short XS 'files' as heredocs within foo.t files, rather than having to create a separate t/XSfoo.xs file each time;
  • adds lots of new tests;
  • adds some new error conditions;
  • heavily refactors the code which deals with the parsing and code generation of XSUB parameters,
  • and in particular, stops ANSI parameters being handled by the injection of a fake INPUT section.
    There's still more work to do on cleaning up parameter processing, but this was a good place to pause.


{
# Test 'alien' INPUT parameters: ones which are declared in an INPUT
# section but don't appear in the XSUB's signature. This aught to be
Copy link
Contributor

@JRaspass JRaspass Sep 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ought, not aught, surely? Also once more in this para and twice in the commit msg.

{
my $fn = $self->{in_filename};
my $opfn = $Options{filename};
$fn = $opfn if ref $opfn; # allow string ref as a file handle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$opfn (or $$opfn) is the source of the file, not a file handle


{
# Test for 'No INPUT definition' error, particularly that the
# type is outout correctly in the error message.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outout -> output?

@tonycoz
Copy link
Contributor

tonycoz commented Sep 12, 2024

Looks fine except for the minor issues that have been commented on.

Use a var called $retvar rather than a var called $sv and repeatedly
using "RETVAL$sv".
If the passed input 'filename' option is a reference, treat it
as a 'string' input file, i.e. open it with:

    open my $fh, '<', \$string

This allows test files to include short XS file texts as heredocs rather
than needing a separate t/TestFoo.xs test file.
This test file includes a small package called Foo to create a tied
file handle to capture parse output.

Rename the package from 'Foo' to 'Capture' to better reflect its
purpose, and move its methods from the end of the test file to near the
beginning.
XS supports the little-known input line modifiers:

    INPUT:
        int a = ($type)MySvIV($arg);
        int b ; blah($var);
        int c + blurg($var);

They are almost completely untested and unused in core. This commit adds
some basic tests for them, including how an embeded double-quote is
handled.
For the little-known input line modifiers:

    INPUT:
        int a = ($type)MySvIV($arg);
        int b ; blah($var);
        int c + blurg($var);

change how the variable-expansion quoting is handled.

OUTPUT has always used the \a (alert control char) double-quotish
delimiter (i.e.  q\a.....\a ) when passing a template read in from a
typemap file through eval. This avoids problems when the typemap
entry itself includes double-quote characters.

Commit v5.41.0-83-g2f4409bf70 made INPUT typemap entries also use \a.
(For some reason it had always used plain '"' up until then.)

This commit makes the evalling also use \a for typmap and other code
expressions extracted from the optional init part of an INPUT line.

Prior to this commit, the init string was first passed though a quote
escape, s/"/\\"/g, before then being passed to eval using ordinary
double quotes. This was sufficient when the typemap contained ("), but
not when it contained (\"), e.g.

        int c + foo("abc\"def");

So this commit makes it technically more accurate. It also makes the
code more consistent, which will help in some refactoring which will
follow.
In various places ParseXS passes a string (usually containing a
typemap entry) through eval, in order to expand, e.g.:

    ($type)SvIV($var)
into
    (int)SvIV(foo)

However, in some places it used eval just to expand the variable name,
i.e. it does the equivalent of:

    $expanded_name = eval q("$var");

which is effectively a no-op, since variable names (as extracted from
XSUB signatures, INPUT lines etc) will already have been restricted to
something like /\w+/ and so don't contain anything funky which could be
expanded by the eval.

Remove these evals in the interests of code clarify and security. (There
doesn't appear to be anything insecure at the moment, but it's always
best to avoid unnecessary evals.)
This commit does two things: it finishes off adding support for function
pointer types, and it alters which part of ParseXS.pm is responsible for
printing out the variable name when emitting a C var declaration.

The function pointer fix is a by-product of changing how the var's name
is emitted. The change will also allow, in subsequent commits, for the
code to be cleaned-up and simplified. For now, it's just unconditionally
setting a variable called $printed_name to true,  and so all the
now-false branches in various places are now never taken.

In more detail.

INPUT typemaps tend to be in one of two forms:

    $var = ($type)SvFOO($arg)

and
    {
        ... big code block ...
        $var = ...;
    }

For the first, the typemap is used as-is to emit a combined declaration
and initialisation, e.g.

    int some_var = (int)SvFOO(ST(0));

while for the latter, the declaration and initialisation are split:

    int some_var;

    ....

    {
        ... big code  block ...
        some_var = ...;
    }

Before this commit, the caller of generate_init() or output_init() -
usually INPUT_handler() - was responsible for printing out the type
component of the variable declaration, while generate_init() was
responsible for emitting the variable name and any initialiser.

So for example, for the emitted line:

    int some_var = (int)SvFOO(ST(0));

INPUT_handler() would print "int", then the method generate_init(), via
expanding the typemap, would print " some_var = (int)SvFOO(ST(0));"

However, function pointer types are a bit of a problem, because in their
C declaration, the variable name is embedded *within* the type:

    int   my_int;          // var name to the right
    int (* my_fn_ptr)();   // var name embedded

Before this commit, ParseXS would convert a string containing a type
which matches '(*)' into a type including the variable name. So given
an XSUB variable declared as

    INPUT
        int (*)()   my_fn_ptr

then INPUT_handler(), when printing out the "type", would actually
print out "int (* my_fn_ptr)()", and set a flag, printed_name, when
calling generate_init(), to let it know that the variable name has
already been printed, and so doesn't need printing again.

generate_init(), when dealing with a long-form typemap, would
just print ";\n", then emit the big initialiser block later. However
for a short-form typemap it would panic, because it wants to just print
the expanded version of the '$var = ...' typemap, which would lead to
the variable name being printed twice, e.g.

    int (* my_fn_ptr)() my_fn_ptr = ....;

It is also makes the code in generate_init() messy, because it's
constantly doing 'if ($printed_name) { ... } else { ... }'.

This commit fixes the whole mess by the simple expedient of making the
caller always responsible for printing the var name in addition to the
type; so INPUT_handler() will now print out "int foo" rather than "int"
always.

We solve the problem of the typemap starting with '$var = ...' by the
simple expedient of deleting the leading '$var' substring from it before
expanding it.

Note that this commit causes some white-space changes in the C code
being generated: regardless of whether the INPUT typmap started with a
tab, the code now output is usually as if the typmap *did* start with a
tab.
The previous commit made the $printed_name var and printed_name flag
always true. Now remove that flag and var and make all code which tested
it do the 'true' branch unconditionally.
Rename

    $self->{xsub_map_argname_to_islength}
to
    $self->{xsub_map_argname_to_has_length}

A subtle difference: this hash is set when 'length(foo)' is seen in
signature, but it actually used when processing parameter 'foo' (rather
than parameter 'length(foo)'.
Typemap INPUT and OUTPUT templates, e.g.

    $var = ($type)SvIv($arg)

are fed through an eval in double-quotish context to expand the template
variables. However, ParseXS.pm currently includes a whole bunch of other
string content in the evals. This is mostly harmless, but makes the code
confusing.

This commit pulls out all the extra string content to be outside the
eval, so that only the bare typemap template gets evaled.

For example this commit changes

    $self->eval_input_typemap_code(qq/print qq\a $init\\n\a/, $argsref);

to

    printf " %s\n",
      $self->eval_input_typemap_code("qq\a$init\a", $argsref);

so that now the 'print' isn't evalled,  and there's no need for \\n
etc.

Also, add a test that default values are still template-expanded.
I think this was an accident of implementation rather than being
intended, but it turns out that some distos (including Dynaloader) rely
upon that behaviour.

In theory there should be no visible change in behaviour.
The expression

    $self->{xsub_map_argname_to_default}->{$var}

appears several times within generate_init(). Make the code easier on
the eye by assigning it to a lexical.
Consolidate all the code which does $arg = "ST(" . ($num-1) . ")"
into one place, where edge-case handling can be centralised.
This method is only used in one place, so delete it and instead
inline it in the caller. This will also help with future code
refactoring.

It's a private method, so in theory this commit should make no
difference functionally.
Add tests for 'alien' INPUT parameters: ones which are declared in an
INPUT section but don't appear in the XSUB's signature. This ought to be
a compile error, but people rely on it to declare and initialise
variables which ought to be in a PREINIT or CODE section.

Also update a code comment which I got wrong earlier.
Recent refactoring has made it so that the various places in
generate_init() which eval a typemap template are all more or less the
same now.

So only have the call to eval_input_typemap_code() in a single
place.

Handle slight variants in the value of $expr being evalled by modifying
the result of the eval rather than modifying expr before. This is
slightly messier, but should provide the same functionality, and will
enable further refactoring.
Add a new { } scope around the chunk of perl code in generate_init()
which is directly responsible for generating C initialiser code from the
var's type via a typemap lookup.

In a follow-up commit, this block of code will shortly become one branch
of an if-else which will also handle also the case of the initialiser being
supplied rather than being looked up in a typemap.

This commit also shuffles around the locations of various 'my'
declarations to be in more suitable positions either in or outside the
new block, and divides the initialisation of %eval_vars into two halves.

The is no change in functionality, apart from the error message
"Error: No INPUT definition for type '$type'", which now includes the
actual type name (e.g. Foo::Bar) rather than the type munged by tr/:/_/.
The previous commit added a new { } scope. Now re-indent to match.

whitespace only.
Allow generate_init() to be called with an overridden initialiser
template which it will use instead of extracting one from a typemap.
Then use it in INPUT_handler().

This is part of a process to eventually do only parsing in
INPUT_handler() and do only code generation in generate_init().
Allow generate_init() to be called with:

 - a no_init parameter, which disables initialiser generation:
     i.e. skip emitting '= SvIV(ST(0))' or whatever is derived from
     the typemap entry;

 - a defer parameter, which is template code which needs expanding and
    then emitted after all local variables have been declared.

Then use it in INPUT_handler().

Also add tests for XSUB parameters which have *both* default values and
initialisers. Make one of these tests TODO to preserve the current
behaviour, which is that for code like:

    int foo(a = 111)   // 111 is default value
      INPUT:
        int a = 222    // 222 is initialiser

the initialiser (222) is used, but no "if (items < 1) { a = 111 }"
is emitted. The changes in this commit actually fix this bug by default,
but I've added a temporary 'XXX' line of code which artificially disables
the fix - so that this commit makes no functional changes, and the fix
itself can addressed in a separate commit.

This commit is part of a process to eventually do *only* parsing in
INPUT_handler() and to do *only* code generation in generate_init().
If the XSUB name includes '::', the parser assumes that it's a C++
function being called and auto-adds a 'THIS' (or if new(), 'CLASS')
parameter.

Test that this actually happens.
Before this commit, the *caller* of generate_init() was responsible for
emitting the type and var name part of a variable declaration and
initialisation, e.g. "char *s", while generate_init() emitted the rest,
e.g. "= (int)SvIV(ST(0));".

Following this commit, generate_init() is responsible for emitting the
*whole* declaration, including the type and var name.

This commit is a further part of a process to eventually do *only*
parsing in INPUT_handler() and to do *only* code generation in
generate_init().

But in particular, this commit makes it so that in some cases where
formerly INPUT_handler() just printed "type var;\n" and returned, it now
*always* calls generate_init() to do this (i.e. where before the call
was skipped).  This is a good thing in principle, because it means that
default values (handled by generate_init()) and overridden or skipped
initialisers (handled sometimes entirely by INPUT_handler()) are now
handled by the same function, and thus can be done correctly.

For example, various permutations along the lines of:

    int
    foo(a = 1, b = 2, c = NO_INIT)
        int a = my_int_converter($arg)
        int b = NO_INIT
        int c = my_int_converter($arg)

were getting completely screwed up: generally, the

    (if items >= N) { c = my_int_converter(ST(N)) }

wasn't getting emitted.

This commit actually keeps this broken behaviour for now:
it has a couple of 'XXX skip for now' conditions in generate_init(),
and adds a bunch of TODO tests. So in principle this commit
should have no visible change in behaviour. Later, we can decide
whether to fix these behaviours and risk breaking existing XS code.
A length() pseudo-parameter, such as as

    int
    foo(char *s, int length(s))

is documented as being allowed only in ANSI-type function declarations.
In particular it means that this isn't legal:

    int
    foo(s)
        char *s
        int length(s)

The XS parser currently accepts that without complaint, but the
C code it generates is typically malformed: the wrapped C function is
called without a second length arg, and if the length() line comes after
the 'char *' line, the length var will not be initialised.

This commit makes the XS parser treat it as a parsing error,
A test (that was added a year ago) checks that the death()
error-reporting method works. It uses a particular syntax error for the
test. The behaviour of that particular error will be changed in the next
commit: it will call blurt() rather than death(), so death() will no
longer be being tested.

So this commit updates the test so that it uses a different type of
syntax error, which we know will still trigger death().
Originally, the time at which the INPUT part of the XS parser would look
up a var's C type and try to associate it with a typemap was at the
time the relevant C var declaration and initialisation line was being
emitted. If not found, this was raised as a soft error (blurt()), so
compilation could continue (but xsubpp would eventually exit(1)).

Then a second typemap lookup call was added, this time earlier,
during INPUT line processing. This was used to help determine the
prototype (if any) of that argument. That code checks whether a valid
typemap was found, and not, falls back to using '$' as the prototype.
The result of that first typemap lookup is then thrown away.

Then, for some reason I don't quite understand, v5.15.0-499-g5a784a6530
added:

      $self->death("Could not find a typemap for C type '$var_type'")
         if not $typemap;

to the prototype-determining code. This seems a bit redundant, as the
typemap lookup a bit further down will report the error anyway, and the
prototype code can already code with no typemap being found.

But that new typemap error check has two problems. First. It does
death() rather than blurt(), stopping any further parsing.

Second, it fails to spot an overridden typemap. The  three lines below
are all valid: they provide their own initialiser, or skip
initialisation completely, so there's no need for a typemap entry for
the unknown type:

    INPUT:
        UnknownType foo1 = NO_INIT
        UnknownType foo2 = bar();
        UnknownType foo3 = baz($arg);

The code was now reporting any of those three lines as an error.

Then v5.15.2-79-gdcd8b78a7d partially fixed it (by looking for an
initialiser containing '$arg' or 'ST('), so that the foo3 line is now
valid, but the foo1 and foo2 lines were still generating an error.
The foo2 line may not make much sense, but it's legal, and the NO_INIT
line style is documented.

This commit removes the extra typemap check altogether. This has the two
effects of making typemap errors back into blurt()s rather than
death()s, and of making all 'UnknownType foo = ...' style lines legal
again. It may have the downside of breaking again whatever the extra
typemap check was added for, if it turns out that it had a purpose.
Make the regex more readable.
Swap the order of splitting an INPUT line into it's components,
and processing where the variable name is 'length(name)'.

Makes the code cleaner - especially the pattern matching.

The diff is a bit confusing: in reality this commit has just two changes:
- the length()-processing s/// and block of code has been moved to later
  and updated;
- the line-splitting regex has been updated to match 'length(name)' in
  addition to 'name'.
Swap the order of some of the code blocks, so that the parsing code
(mainly regexes) comes first, and the checking and state updating comes
second. This is in preparation for the second part to be moved into a
separate function.

Should be no functional changes.
Put some of this sub's lexical state into a hash. This will will make
the next commit easier, when a chunk of code is moved out into its own
function.

No functional changes.
This private new method takes over the per-parameter state-updating
and checking currently done in INPUT_handler(). This leaves the body of
INPUT_handler() concerned with just the *parsing* of INPUT lines.

This new method will also be used shortly for processing ANSI-style XSUB
signatures directly rather than faking up synthetic INPUT lines.
ExtUtils::ParseXS had no tests for the

    Error: duplicate definition of argument 'foo'

error message.

This commit adds some tests, and also simplifies the duplicate-detecting
code. A redundant check was causing the error to be printed *twice* for
the same parameter in something like:

    void foo(int a)
        int a

Removing this redundancy also allows the removal of the now-unused
$self->{xsub_map_argname_to_seen_type} object field.
Add tests for where the param is declared OUT and then the type is
specified either in the signature or on a separate INPUT line
Initially, XSUBs were declared using a K&R-like syntax, where the
parameters' types were specified later in INPUT lines:

    int
    foo(x, y)
        int x
        int y

Then about 25 years ago, ANSI-style signatures were allowed:

    int
    foo(int x, int y)

but the implementation was *horrible*. Basically the signature parser
just injected a fake INPUT section later on into the src code input
stream, containing lines like 'int x' etc.

As well as being ugly, it made the code hard to understand, and caused
all sorts of potential problems, since some syntax was specific to
ANSI params, like '=foo' default values, while other syntax was specific
to INPUT lines, like '= typemap_override($arg)'.

And in particular, the pseudo-parameter 'length(foo)' made *everything*
more complicated.

The commits leading to this one have been gradually splitting parameter
processing code into separate parsing, checking and code-emitting
functions. We can now make use of this separation so that, instead of
the signature-parsing code pushing a series of strings like 'int x' into
an array, to be injected back into INPUT_handler() later, we can instead
push a series of hashes along the lines of

    { name => 'x', type => 'int, ...}

Then when this list of hashes is processed, it can pass the hash
directly to the checking and code-generating methods, bypassing
INPUT_handler() completely.
@iabyn iabyn merged commit bd12780 into blead Sep 12, 2024
67 checks passed
@iabyn iabyn deleted the davem/xs_refactor4 branch September 12, 2024 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants