ParseXS: extensive refactoring of param processing, and more tests #22580

iabyn · 2024-09-07T20:20:10Z

This set of about 33 commits:

improves the testing infrastructure of ExtUtils::ParseXS - in particular, it makes it possible to include short XS 'files' as heredocs within foo.t files, rather than having to create a separate t/XSfoo.xs file each time;
adds lots of new tests;
adds some new error conditions;
heavily refactors the code which deals with the parsing and code generation of XSUB parameters,
and in particular, stops ANSI parameters being handled by the injection of a fake INPUT section.
There's still more work to do on cleaning up parameter processing, but this was a good place to pause.

JRaspass · 2024-09-08T01:20:47Z

dist/ExtUtils-ParseXS/t/001-basic.t

+
+{
+    # Test 'alien' INPUT parameters: ones which are declared in an INPUT
+    # section but don't appear in the XSUB's signature. This aught to be


ought, not aught, surely? Also once more in this para and twice in the commit msg.

tonycoz · 2024-09-11T00:28:11Z

dist/ExtUtils-ParseXS/lib/ExtUtils/ParseXS.pm

+  {
+    my $fn   = $self->{in_filename};
+    my $opfn = $Options{filename};
+    $fn = $opfn if ref $opfn; # allow string ref as a file handle


$opfn (or $$opfn) is the source of the file, not a file handle

tonycoz · 2024-09-11T04:25:42Z

dist/ExtUtils-ParseXS/t/001-basic.t

+
+{
+    # Test for 'No INPUT definition' error, particularly that the
+    # type is outout correctly in the error message.


outout -> output?

tonycoz · 2024-09-12T05:58:11Z

Looks fine except for the minor issues that have been commented on.

Use a var called $retvar rather than a var called $sv and repeatedly using "RETVAL$sv".

If the passed input 'filename' option is a reference, treat it as a 'string' input file, i.e. open it with: open my $fh, '<', \$string This allows test files to include short XS file texts as heredocs rather than needing a separate t/TestFoo.xs test file.

This test file includes a small package called Foo to create a tied file handle to capture parse output. Rename the package from 'Foo' to 'Capture' to better reflect its purpose, and move its methods from the end of the test file to near the beginning.

XS supports the little-known input line modifiers: INPUT: int a = ($type)MySvIV($arg); int b ; blah($var); int c + blurg($var); They are almost completely untested and unused in core. This commit adds some basic tests for them, including how an embeded double-quote is handled.

For the little-known input line modifiers: INPUT: int a = ($type)MySvIV($arg); int b ; blah($var); int c + blurg($var); change how the variable-expansion quoting is handled. OUTPUT has always used the \a (alert control char) double-quotish delimiter (i.e. q\a.....\a ) when passing a template read in from a typemap file through eval. This avoids problems when the typemap entry itself includes double-quote characters. Commit v5.41.0-83-g2f4409bf70 made INPUT typemap entries also use \a. (For some reason it had always used plain '"' up until then.) This commit makes the evalling also use \a for typmap and other code expressions extracted from the optional init part of an INPUT line. Prior to this commit, the init string was first passed though a quote escape, s/"/\\"/g, before then being passed to eval using ordinary double quotes. This was sufficient when the typemap contained ("), but not when it contained (\"), e.g. int c + foo("abc\"def"); So this commit makes it technically more accurate. It also makes the code more consistent, which will help in some refactoring which will follow.

In various places ParseXS passes a string (usually containing a typemap entry) through eval, in order to expand, e.g.: ($type)SvIV($var) into (int)SvIV(foo) However, in some places it used eval just to expand the variable name, i.e. it does the equivalent of: $expanded_name = eval q("$var"); which is effectively a no-op, since variable names (as extracted from XSUB signatures, INPUT lines etc) will already have been restricted to something like /\w+/ and so don't contain anything funky which could be expanded by the eval. Remove these evals in the interests of code clarify and security. (There doesn't appear to be anything insecure at the moment, but it's always best to avoid unnecessary evals.)

This commit does two things: it finishes off adding support for function pointer types, and it alters which part of ParseXS.pm is responsible for printing out the variable name when emitting a C var declaration. The function pointer fix is a by-product of changing how the var's name is emitted. The change will also allow, in subsequent commits, for the code to be cleaned-up and simplified. For now, it's just unconditionally setting a variable called $printed_name to true, and so all the now-false branches in various places are now never taken. In more detail. INPUT typemaps tend to be in one of two forms: $var = ($type)SvFOO($arg) and { ... big code block ... $var = ...; } For the first, the typemap is used as-is to emit a combined declaration and initialisation, e.g. int some_var = (int)SvFOO(ST(0)); while for the latter, the declaration and initialisation are split: int some_var; .... { ... big code block ... some_var = ...; } Before this commit, the caller of generate_init() or output_init() - usually INPUT_handler() - was responsible for printing out the type component of the variable declaration, while generate_init() was responsible for emitting the variable name and any initialiser. So for example, for the emitted line: int some_var = (int)SvFOO(ST(0)); INPUT_handler() would print "int", then the method generate_init(), via expanding the typemap, would print " some_var = (int)SvFOO(ST(0));" However, function pointer types are a bit of a problem, because in their C declaration, the variable name is embedded *within* the type: int my_int; // var name to the right int (* my_fn_ptr)(); // var name embedded Before this commit, ParseXS would convert a string containing a type which matches '(*)' into a type including the variable name. So given an XSUB variable declared as INPUT int (*)() my_fn_ptr then INPUT_handler(), when printing out the "type", would actually print out "int (* my_fn_ptr)()", and set a flag, printed_name, when calling generate_init(), to let it know that the variable name has already been printed, and so doesn't need printing again. generate_init(), when dealing with a long-form typemap, would just print ";\n", then emit the big initialiser block later. However for a short-form typemap it would panic, because it wants to just print the expanded version of the '$var = ...' typemap, which would lead to the variable name being printed twice, e.g. int (* my_fn_ptr)() my_fn_ptr = ....; It is also makes the code in generate_init() messy, because it's constantly doing 'if ($printed_name) { ... } else { ... }'. This commit fixes the whole mess by the simple expedient of making the caller always responsible for printing the var name in addition to the type; so INPUT_handler() will now print out "int foo" rather than "int" always. We solve the problem of the typemap starting with '$var = ...' by the simple expedient of deleting the leading '$var' substring from it before expanding it. Note that this commit causes some white-space changes in the C code being generated: regardless of whether the INPUT typmap started with a tab, the code now output is usually as if the typmap *did* start with a tab.

The previous commit made the $printed_name var and printed_name flag always true. Now remove that flag and var and make all code which tested it do the 'true' branch unconditionally.

Rename $self->{xsub_map_argname_to_islength} to $self->{xsub_map_argname_to_has_length} A subtle difference: this hash is set when 'length(foo)' is seen in signature, but it actually used when processing parameter 'foo' (rather than parameter 'length(foo)'.

Typemap INPUT and OUTPUT templates, e.g. $var = ($type)SvIv($arg) are fed through an eval in double-quotish context to expand the template variables. However, ParseXS.pm currently includes a whole bunch of other string content in the evals. This is mostly harmless, but makes the code confusing. This commit pulls out all the extra string content to be outside the eval, so that only the bare typemap template gets evaled. For example this commit changes $self->eval_input_typemap_code(qq/print qq\a $init\\n\a/, $argsref); to printf " %s\n", $self->eval_input_typemap_code("qq\a$init\a", $argsref); so that now the 'print' isn't evalled, and there's no need for \\n etc. Also, add a test that default values are still template-expanded. I think this was an accident of implementation rather than being intended, but it turns out that some distos (including Dynaloader) rely upon that behaviour. In theory there should be no visible change in behaviour.

The expression $self->{xsub_map_argname_to_default}->{$var} appears several times within generate_init(). Make the code easier on the eye by assigning it to a lexical.

Consolidate all the code which does $arg = "ST(" . ($num-1) . ")" into one place, where edge-case handling can be centralised.

This method is only used in one place, so delete it and instead inline it in the caller. This will also help with future code refactoring. It's a private method, so in theory this commit should make no difference functionally.

Add tests for 'alien' INPUT parameters: ones which are declared in an INPUT section but don't appear in the XSUB's signature. This ought to be a compile error, but people rely on it to declare and initialise variables which ought to be in a PREINIT or CODE section. Also update a code comment which I got wrong earlier.

Recent refactoring has made it so that the various places in generate_init() which eval a typemap template are all more or less the same now. So only have the call to eval_input_typemap_code() in a single place. Handle slight variants in the value of $expr being evalled by modifying the result of the eval rather than modifying expr before. This is slightly messier, but should provide the same functionality, and will enable further refactoring.

Add a new { } scope around the chunk of perl code in generate_init() which is directly responsible for generating C initialiser code from the var's type via a typemap lookup. In a follow-up commit, this block of code will shortly become one branch of an if-else which will also handle also the case of the initialiser being supplied rather than being looked up in a typemap. This commit also shuffles around the locations of various 'my' declarations to be in more suitable positions either in or outside the new block, and divides the initialisation of %eval_vars into two halves. The is no change in functionality, apart from the error message "Error: No INPUT definition for type '$type'", which now includes the actual type name (e.g. Foo::Bar) rather than the type munged by tr/:/_/.

The previous commit added a new { } scope. Now re-indent to match. whitespace only.

Allow generate_init() to be called with an overridden initialiser template which it will use instead of extracting one from a typemap. Then use it in INPUT_handler(). This is part of a process to eventually do only parsing in INPUT_handler() and do only code generation in generate_init().

Allow generate_init() to be called with: - a no_init parameter, which disables initialiser generation: i.e. skip emitting '= SvIV(ST(0))' or whatever is derived from the typemap entry; - a defer parameter, which is template code which needs expanding and then emitted after all local variables have been declared. Then use it in INPUT_handler(). Also add tests for XSUB parameters which have *both* default values and initialisers. Make one of these tests TODO to preserve the current behaviour, which is that for code like: int foo(a = 111) // 111 is default value INPUT: int a = 222 // 222 is initialiser the initialiser (222) is used, but no "if (items < 1) { a = 111 }" is emitted. The changes in this commit actually fix this bug by default, but I've added a temporary 'XXX' line of code which artificially disables the fix - so that this commit makes no functional changes, and the fix itself can addressed in a separate commit. This commit is part of a process to eventually do *only* parsing in INPUT_handler() and to do *only* code generation in generate_init().

If the XSUB name includes '::', the parser assumes that it's a C++ function being called and auto-adds a 'THIS' (or if new(), 'CLASS') parameter. Test that this actually happens.

Before this commit, the *caller* of generate_init() was responsible for emitting the type and var name part of a variable declaration and initialisation, e.g. "char *s", while generate_init() emitted the rest, e.g. "= (int)SvIV(ST(0));". Following this commit, generate_init() is responsible for emitting the *whole* declaration, including the type and var name. This commit is a further part of a process to eventually do *only* parsing in INPUT_handler() and to do *only* code generation in generate_init(). But in particular, this commit makes it so that in some cases where formerly INPUT_handler() just printed "type var;\n" and returned, it now *always* calls generate_init() to do this (i.e. where before the call was skipped). This is a good thing in principle, because it means that default values (handled by generate_init()) and overridden or skipped initialisers (handled sometimes entirely by INPUT_handler()) are now handled by the same function, and thus can be done correctly. For example, various permutations along the lines of: int foo(a = 1, b = 2, c = NO_INIT) int a = my_int_converter($arg) int b = NO_INIT int c = my_int_converter($arg) were getting completely screwed up: generally, the (if items >= N) { c = my_int_converter(ST(N)) } wasn't getting emitted. This commit actually keeps this broken behaviour for now: it has a couple of 'XXX skip for now' conditions in generate_init(), and adds a bunch of TODO tests. So in principle this commit should have no visible change in behaviour. Later, we can decide whether to fix these behaviours and risk breaking existing XS code.

A length() pseudo-parameter, such as as int foo(char *s, int length(s)) is documented as being allowed only in ANSI-type function declarations. In particular it means that this isn't legal: int foo(s) char *s int length(s) The XS parser currently accepts that without complaint, but the C code it generates is typically malformed: the wrapped C function is called without a second length arg, and if the length() line comes after the 'char *' line, the length var will not be initialised. This commit makes the XS parser treat it as a parsing error,

A test (that was added a year ago) checks that the death() error-reporting method works. It uses a particular syntax error for the test. The behaviour of that particular error will be changed in the next commit: it will call blurt() rather than death(), so death() will no longer be being tested. So this commit updates the test so that it uses a different type of syntax error, which we know will still trigger death().

Originally, the time at which the INPUT part of the XS parser would look up a var's C type and try to associate it with a typemap was at the time the relevant C var declaration and initialisation line was being emitted. If not found, this was raised as a soft error (blurt()), so compilation could continue (but xsubpp would eventually exit(1)). Then a second typemap lookup call was added, this time earlier, during INPUT line processing. This was used to help determine the prototype (if any) of that argument. That code checks whether a valid typemap was found, and not, falls back to using '$' as the prototype. The result of that first typemap lookup is then thrown away. Then, for some reason I don't quite understand, v5.15.0-499-g5a784a6530 added: $self->death("Could not find a typemap for C type '$var_type'") if not $typemap; to the prototype-determining code. This seems a bit redundant, as the typemap lookup a bit further down will report the error anyway, and the prototype code can already code with no typemap being found. But that new typemap error check has two problems. First. It does death() rather than blurt(), stopping any further parsing. Second, it fails to spot an overridden typemap. The three lines below are all valid: they provide their own initialiser, or skip initialisation completely, so there's no need for a typemap entry for the unknown type: INPUT: UnknownType foo1 = NO_INIT UnknownType foo2 = bar(); UnknownType foo3 = baz($arg); The code was now reporting any of those three lines as an error. Then v5.15.2-79-gdcd8b78a7d partially fixed it (by looking for an initialiser containing '$arg' or 'ST('), so that the foo3 line is now valid, but the foo1 and foo2 lines were still generating an error. The foo2 line may not make much sense, but it's legal, and the NO_INIT line style is documented. This commit removes the extra typemap check altogether. This has the two effects of making typemap errors back into blurt()s rather than death()s, and of making all 'UnknownType foo = ...' style lines legal again. It may have the downside of breaking again whatever the extra typemap check was added for, if it turns out that it had a purpose.

Make the regex more readable.

Swap the order of splitting an INPUT line into it's components, and processing where the variable name is 'length(name)'. Makes the code cleaner - especially the pattern matching. The diff is a bit confusing: in reality this commit has just two changes: - the length()-processing s/// and block of code has been moved to later and updated; - the line-splitting regex has been updated to match 'length(name)' in addition to 'name'.

Swap the order of some of the code blocks, so that the parsing code (mainly regexes) comes first, and the checking and state updating comes second. This is in preparation for the second part to be moved into a separate function. Should be no functional changes.

Put some of this sub's lexical state into a hash. This will will make the next commit easier, when a chunk of code is moved out into its own function. No functional changes.

This private new method takes over the per-parameter state-updating and checking currently done in INPUT_handler(). This leaves the body of INPUT_handler() concerned with just the *parsing* of INPUT lines. This new method will also be used shortly for processing ANSI-style XSUB signatures directly rather than faking up synthetic INPUT lines.

ExtUtils::ParseXS had no tests for the Error: duplicate definition of argument 'foo' error message. This commit adds some tests, and also simplifies the duplicate-detecting code. A redundant check was causing the error to be printed *twice* for the same parameter in something like: void foo(int a) int a Removing this redundancy also allows the removal of the now-unused $self->{xsub_map_argname_to_seen_type} object field.

Add tests for where the param is declared OUT and then the type is specified either in the signature or on a separate INPUT line

Initially, XSUBs were declared using a K&R-like syntax, where the parameters' types were specified later in INPUT lines: int foo(x, y) int x int y Then about 25 years ago, ANSI-style signatures were allowed: int foo(int x, int y) but the implementation was *horrible*. Basically the signature parser just injected a fake INPUT section later on into the src code input stream, containing lines like 'int x' etc. As well as being ugly, it made the code hard to understand, and caused all sorts of potential problems, since some syntax was specific to ANSI params, like '=foo' default values, while other syntax was specific to INPUT lines, like '= typemap_override($arg)'. And in particular, the pseudo-parameter 'length(foo)' made *everything* more complicated. The commits leading to this one have been gradually splitting parameter processing code into separate parsing, checking and code-emitting functions. We can now make use of this separation so that, instead of the signature-parsing code pushing a series of strings like 'int x' into an array, to be injected back into INPUT_handler() later, we can instead push a series of hashes along the lines of { name => 'x', type => 'int, ...} Then when this list of hashes is processed, it can pass the hash directly to the checking and code-generating methods, bypassing INPUT_handler() completely.

JRaspass reviewed Sep 8, 2024

View reviewed changes

tonycoz reviewed Sep 11, 2024

View reviewed changes

iabyn added 26 commits September 12, 2024 15:13

ParseXS: generate_output(): tweak code

42c46eb

Use a var called $retvar rather than a var called $sv and repeatedly using "RETVAL$sv".

ParseXS: remove $printed_name and clean up

31136c2

The previous commit made the $printed_name var and printed_name flag always true. Now remove that flag and var and make all code which tested it do the 'true' branch unconditionally.

ParseXS: generate_init(): add $default var

f28cca5

The expression $self->{xsub_map_argname_to_default}->{$var} appears several times within generate_init(). Make the code easier on the eye by assigning it to a lexical.

ParseXS: add ST() method

1378d52

Consolidate all the code which does $arg = "ST(" . ($num-1) . ")" into one place, where edge-case handling can be centralised.

ParseXS: eliminate output_init() method

d8df166

This method is only used in one place, so delete it and instead inline it in the caller. This will also help with future code refactoring. It's a private method, so in theory this commit should make no difference functionally.

ParseXS: generate_init(): re-indent code block

8fb0d13

The previous commit added a new { } scope. Now re-indent to match. whitespace only.

ParseXS: add tests for C++ THIS, CLASS params

eeee920

If the XSUB name includes '::', the parser assumes that it's a C++ function being called and auto-adds a 'THIS' (or if new(), 'CLASS') parameter. Test that this actually happens.

ParseXS: INPUT_handler(): make a regex //x

75b8f80

Make the regex more readable.

iabyn added 6 commits September 12, 2024 15:25

ParseXS: INPUT_handler(): create $param hash

49f9f06

Put some of this sub's lexical state into a hash. This will will make the next commit easier, when a chunk of code is moved out into its own function. No functional changes.

ParseXS: add basic tests for OUT parameter

8755b97

Add tests for where the param is declared OUT and then the type is specified either in the signature or on a separate INPUT line

iabyn force-pushed the davem/xs_refactor4 branch from d191189 to 9c8b8b6 Compare September 12, 2024 14:33

iabyn merged commit bd12780 into blead Sep 12, 2024
67 checks passed

iabyn deleted the davem/xs_refactor4 branch September 12, 2024 18:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ParseXS: extensive refactoring of param processing, and more tests #22580

ParseXS: extensive refactoring of param processing, and more tests #22580

iabyn commented Sep 7, 2024

JRaspass Sep 8, 2024 •

edited

Loading

tonycoz Sep 11, 2024

tonycoz Sep 11, 2024

tonycoz commented Sep 12, 2024

ParseXS: extensive refactoring of param processing, and more tests #22580

ParseXS: extensive refactoring of param processing, and more tests #22580

Conversation

iabyn commented Sep 7, 2024

JRaspass Sep 8, 2024 • edited Loading

Choose a reason for hiding this comment

tonycoz Sep 11, 2024

Choose a reason for hiding this comment

tonycoz Sep 11, 2024

Choose a reason for hiding this comment

tonycoz commented Sep 12, 2024

JRaspass Sep 8, 2024 •

edited

Loading