Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate to pcre2 #515

Merged
merged 10 commits into from
Mar 25, 2023
Merged

Migrate to pcre2 #515

merged 10 commits into from
Mar 25, 2023

Conversation

tobil4sk
Copy link
Member

PCRE 1 is no longer maintained and doesn't get updates, so this PR ports the regexp module of the standard library to use PCRE 2 instead. All build systems have been updated to link PCRE 2.

See also:
HaxeFoundation/haxe#10491

@Uzume
Copy link

Uzume commented Apr 21, 2022

I am surprised this was merged on the deprecated nekovm before here:

@Simn
Copy link
Member

Simn commented Apr 21, 2022

This one isn't green. :P

@tobil4sk
Copy link
Member Author

@Simn Check again ;)

@Simn
Copy link
Member

Simn commented Apr 23, 2022

I'm a bit irritated that there are so many new .c files. Is this just because they reorganized something? I just want to make sure we don't add anything unnecessary here because the stats say that this PR has a net gain of like 26k locs.

@tobil4sk
Copy link
Member Author

It's just because we have an entire copy of the pcre library within include/pcre, so updating to pcre2 requires updating all these files. I think the reasoning behind this has been to make it easier to compile on Windows, so I just followed the legacy and kept the library there. (I did try to see if any of them could be removed, however, all of the ones that are left seem necessary)

@tobil4sk
Copy link
Member Author

And yes, the files were reorganised with pcre2 as far as I can tell. This the documentation file that describes how to build pcre2 on Windows which I followed:
https://github.com/PCRE2Project/pcre2/blob/master/NON-AUTOTOOLS-BUILD

@Simn
Copy link
Member

Simn commented Apr 23, 2022

@ncannasse Please check if this works for you and merge accordingly. We should make this change, but I'd like to make sure it doesn't break anything for you.

@Aurel300
Copy link
Member

By the way, for large dependencies included in the repo you might want to mark them as "vendored" in the attributes: .gitattributes

@tobil4sk
Copy link
Member Author

Ah, I wasn't aware of that. We should probably just mark the entire include directory, although that might be outside the scope of this PR.

@tobil4sk
Copy link
Member Author

#541 should mark these directories as third party libs as you suggested.

@ncannasse
Copy link
Member

Few comments:

I'm quite worried with the following:

// this is reinitialised for each new regex object...
	match_context = pcre2_match_context_create_16(NULL);
	pcre2_set_depth_limit_16(match_context, 3500); 

Seems like a memory leak because the context is never free

@Uzume
Copy link

Uzume commented Apr 26, 2022

@tobil4sk Why are you adding the _16 suffixes all over in src/std/regexp.c? The macros in include/pcre/pcre2.h already take care of adding such when PCRE2_CODE_UNIT_WIDTH is defined appropriately (and they complain verbosely when it isn't)? There is no need for that extra noise (and it is possible such things could become problematic too if at some point the direct API changes and the macro API remains the same, etc.).

I also agree with @ncannasse, if you are calling pcre2_match_context_create then there should be a matching call to pcre2_match_context_free somewhere to balance things (unless I am missing some sort of automatic garbage collection somewhere). match_context should probably be moved into ereg and then pcre2_match_context_free called on it in its finalize (regexp_finalize).

Incidentally, since it seems you are always setting the match recursion depth limit to a constant, you could instead set the compile-time option MATCH_LIMIT_DEPTH (either in include/pcre/config.h or externally in the make files) and then pass NULL for the pcre2_match_context parameter of pcre2_match. That would alleviate having to allocate and deallocate such in src/std/regexp.c ereg.

It seems like the same memory leak was already introduced into nekovm with HaxeFoundation/neko#249 (where oddly you decided to not use the _8 suffixes).

@tobil4sk
Copy link
Member Author

Seems like a memory leak because the context is never free

@ncannasse This is required for the entire lifetime of the program, which is why it could not be freed anywhere. The memory leak existed before as well, however, I guess now is a good time to deal with it. With @Uzume's tip (thanks btw!) about the compile time option, we can avoid this issue altogether.

Why are you adding the _16 suffixes all over in src/std/regexp.c

@Uzume I left the 16 suffixes because they were used before (whereas the neko files didn't have them). I can remove them but I assumed there was some reason to add them if they're optional.

@tobil4sk
Copy link
Member Author

tobil4sk commented Apr 27, 2022

Although, the only issue with this is that now the variable must be set externally (either in config.h or every build system) rather then conveniently in the regexp file. Setting it manually in make/cmake/vs builds is a bit messy, but at the same time having it in config.h means that there's a risk that when updating the library in future, someone may miss this configuration out, which would silently cause problems.

EDIT:
Although, this is already necessary for PCRE2_CODE_UNIT_WIDTH, so I guess it won't do harm to add the extra flag.

@Uzume
Copy link

Uzume commented Apr 27, 2022

@tobil4sk No, I do not think there was a memory leak before. Before we had:

// ...
static pcre16_extra limit;
// ...
// hl_regexp_new_options
// ...
	limit.flags = PCRE_EXTRA_MATCH_LIMIT_RECURSION;
	limit.match_limit_recursion = 3500; // adapted based on Windows 1MB stack size
// ...
// hl_regexp_match
// ...
	int res = pcre16_exec(e->p,&limit,(PCRE_SPTR16)s,pos+len,pos,PCRE_NO_UTF16_CHECK,e->matches,e->nmatches * 3);
// ...

limit was a statically allocated struct that had some fields repeatedly reinitialized in hl_regexp_new_options and was used repeatedly by pcre16_exec in hl_regexp_match.

Now we have:

// ...
static pcre2_match_context_16 *match_context;
// ...
// hl_regexp_new_options
// ...
	// this is reinitialised for each new regex object...
	match_context = pcre2_match_context_create_16(NULL);
	pcre2_set_depth_limit_16(match_context, 3500); // adapted based on Windows 1MB stack size
// ...
// hl_regexp_match
// ...
	int res = pcre2_match_16(e->regex,(PCRE2_SPTR16)s,pos+len,pos,PCRE2_NO_UTF_CHECK,e->match_data,match_context);
// ...

match_context is a statically allocated pointer to a struct that is repeatedly allocated and initialized by pcre2_match_context_create and has a field set by pcre2_set_depth_limit in hl_regexp_new_options and is used repeatedly by pcre2_match in hl_regexp_match.

The repeated calls of pcre2_match_context_create with no matching calls to pcre2_match_context_free is the introduction of a memory leak. This is why I first suggested moving match_context into ereg and adding pcre2_match_context_free to its finalizer. But I also suggested you move to a compile-time option of changing MATCH_LIMIT_DEPTH instead and removing the run-time code. The downside of using the compile-time option is that it affects the generated library and if we were sharing the generated library with other code that did not want such set by default it would be problematic. As it is we are generating the library only for our hashlink vm and I assume pcre is statically linked (or the dynamically linked library is specific to the built hashlink vm)

nekovm uses 8-bit code units (i.e., ASCII, UTF-8, etc.) but hashlink uses 16-bit code units (i.e., UTF-16).

In the old pcre.h API, 8-bit code unit interfaces have no numeric size references but 16-bit and 32-bit code unit interfaces are all designated with 16 or 32 in the names. The old API defines all three code unit interfaces together.

In the new pcre2.h API, all the different code unit interfaces are created by macros that result in _8, _16, or _32 being appended to them but the macros also require the consumer of the interface to define PCRE2_CODE_UNIT_WIDTH to specify the size of the code unit interface the consumer wants to use and it then provides a set of unsuffixed macro interfaces that automatically add the appropriate suffixes to the interface references based upon the provided (and required) PCRE2_CODE_UNIT_WIDTH. PCRE2_CODE_UNIT_WIDTH can also be defined to be 0 but this is for multi-width code unit applications where one must manually specify the suffixed interfaces, however, that is quite rare and in general a consumer only wants access to a single set of code unit interfaces, so this is simplified so they can all use the same code references that have no suffixes and the suffixes will be automatically applied by the judicious choice of how PCRE2_CODE_UNIT_WIDTH is defined.

For example, in the new pcre2.h API, nekovm and hashlink could in theory use a the same consumer interface code (if it used pcre in the same way) so long as they defined PCRE2_CODE_UNIT_WIDTH differently.

I recommend you drop the _16 suffixes in hashlink and now you know why 8 was not used in the nekovm consumer code and why you could have but did not need to add _8 to the new code there.

@tobil4sk
Copy link
Member Author

@Uzume Thanks for the insight!

match_context is a statically allocated pointer to a struct that is repeatedly allocated and initialized by pcre2_match_context_create

Yes, I see the difference now (and where the memory leak would be). The only reason it is repeatedly allocated is because, as far as I could tell, there was no way of allocating something when a module is first loaded, the same way that can be done with neko.

I also suggested you move to a compile-time option of changing MATCH_LIMIT_DEPTH

I think this is probably the best way, because aside from that there is no other reason for having a context object. Also, the context should really be global, as there is currently no reason to have a different context per regex object. Just about whether to put it in config.h or to put it in make and the other build systems.

I recommend you drop the _16 suffixes in hashlink

That explanation makes sense, so I do think now it would make sense to get rid of the suffix ;)

@tobil4sk
Copy link
Member Author

Alternatively, I guess we could check whether the context object is null before allocating it, but I guess that might have some thread safety concerns attached to it.

@Uzume
Copy link

Uzume commented Apr 27, 2022

@tobil4sk

EDIT: Although, this is already necessary for PCRE2_CODE_UNIT_WIDTH, so I guess it won't do harm to add the extra flag.
Agreed. I also think the best method is in the build system (over config.h configuration changes; in fact generation of the config.h could be handled by the build system and the option could be provided that way).

There is a lot of indirection in the code but you can see the difference in how pcre2_set_depth_limit vs. MATCH_LIMIT_DEPTH works via this code:

// pcre2.h
// ...
#define PCRE2_TYPES_LIST \
//...
struct pcre2_real_match_context; \
typedef struct pcre2_real_match_context pcre2_match_context; \
// ...
#define PCRE2_JOIN(a,b) a ## b
#define PCRE2_GLUE(a,b) PCRE2_JOIN(a,b)
#define PCRE2_SUFFIX(a) PCRE2_GLUE(a,PCRE2_LOCAL_WIDTH)
// ...
#define pcre2_match_context            PCRE2_SUFFIX(pcre2_match_context_)
// ...
#define PCRE2_TYPES_STRUCTURES_AND_FUNCTIONS \
PCRE2_TYPES_LIST \
// ...
#define PCRE2_LOCAL_WIDTH 8
PCRE2_TYPES_STRUCTURES_AND_FUNCTIONS
#undef PCRE2_LOCAL_WIDTH

#define PCRE2_LOCAL_WIDTH 16
PCRE2_TYPES_STRUCTURES_AND_FUNCTIONS
#undef PCRE2_LOCAL_WIDTH

#define PCRE2_LOCAL_WIDTH 32
PCRE2_TYPES_STRUCTURES_AND_FUNCTIONS
#undef PCRE2_LOCAL_WIDTH
// ...
#undef PCRE2_SUFFIX
// ...
#if PCRE2_CODE_UNIT_WIDTH == 8 || \
    PCRE2_CODE_UNIT_WIDTH == 16 || \
    PCRE2_CODE_UNIT_WIDTH == 32
#define PCRE2_SUFFIX(a) PCRE2_GLUE(a, PCRE2_CODE_UNIT_WIDTH)
#elif PCRE2_CODE_UNIT_WIDTH == 0
#undef PCRE2_JOIN
#undef PCRE2_GLUE
#define PCRE2_SUFFIX(a) a
// ...
#endif
// ...

pcre2_match_context is defined to be struct pcre2_real_match_context (apparently three times).

// pcre2_intmodedep.h
// ...
typedef struct pcre2_real_match_context {
  pcre2_memctl memctl;
#ifdef SUPPORT_JIT
  pcre2_jit_callback jit_callback;
  void *jit_callback_data;
#endif
  int    (*callout)(pcre2_callout_block *, void *);
  void    *callout_data;
  int    (*substitute_callout)(pcre2_substitute_callout_block *, void *);
  void    *substitute_callout_data;
  PCRE2_SIZE offset_limit;
  uint32_t heap_limit;
  uint32_t match_limit;
  uint32_t depth_limit;
} pcre2_real_match_context;
// ...

struct pcre2_real_match_context is define with the last field named depth_limit.

// pcre2_context.c
// ...
const pcre2_match_context PRIV(default_match_context) = {
  { default_malloc, default_free, NULL },
#ifdef SUPPORT_JIT
  NULL,          /* JIT callback */
  NULL,          /* JIT callback data */
#endif
  NULL,          /* Callout function */
  NULL,          /* Callout data */
  NULL,          /* Substitute callout function */
  NULL,          /* Substitute callout data */
  PCRE2_UNSET,   /* Offset limit */
  HEAP_LIMIT,
  MATCH_LIMIT,
  MATCH_LIMIT_DEPTH };
// ...
PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
pcre2_set_depth_limit(pcre2_match_context *mcontext, uint32_t limit)
{
mcontext->depth_limit = limit;
return 0;
}
// ...

The last field of default_match_context of type pcre2_match_context is initialized to MATCH_LIMIT_DEPTH. While pcre2_set_depth_limit changes the depth_limit field of pcre2_match_context struct.

// pcre2_match.c
// ...
PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
pcre2_match(const pcre2_code *code, PCRE2_SPTR subject, PCRE2_SIZE length,
  PCRE2_SIZE start_offset, uint32_t options, pcre2_match_data *match_data,
  pcre2_match_context *mcontext)
// ...
#ifdef SUPPORT_JIT
// ...
  rc = pcre2_jit_match(code, subject, length, start_offset, options,
    match_data, mcontext);
// ...
#endif  /* SUPPORT_JIT */
// ...
if (mcontext == NULL)
  {
  mcontext = (pcre2_match_context *)(&PRIV(default_match_context));
  mb->memctl = re->memctl;
  }
// ...
mb->match_limit_depth = (mcontext->depth_limit < re->limit_depth)?
  mcontext->depth_limit : re->limit_depth;
// ...

pcre2_match take a last parameter named mcontext of type pointer to pcre2_match_data. If SUPPORT_JIT is defined mcontext gets passed as the last argument to pcre2_jit_match. Otherwise if mcontext is NULL it is updated to point to default_match_context and later depth_limit field of mcontext is potentially used to set match_limit_depth field of whatever mb points to (which is immaterial since it will be the depth_limit field of either the provided pcre2_match_data or default_match_context where that field was initialized to MATCH_LIMIT_DEPTH ).

I shall leave it to you to figure out that in the case of SUPPORT_JIT pcre2_jit_match never touches the depth_limit field of the any pcre2_match_data provided or not (via NULL).

@Uzume
Copy link

Uzume commented Apr 27, 2022

Yes, I see the difference now (and where the memory leak would be). The only reason it is repeatedly allocated is because, as far as I could tell, there was no way of allocating something when a module is first loaded, the same way that can be done with neko.

Alternatively, I guess we could check whether the context object is null before allocating it, but I guess that might have some thread safety concerns attached to it.

@tobil4sk You might also be able to do something along the lines of:

static pcre2_match_context *match_context = pcre2_match_context_create(NULL);

In that case you would still have a memory leak but of just a single solitary pcre2_match_context which you could later repeatedly use pcre2_set_depth_limit on if you wanted. And I am not sure it could really be called a memory leak as per se because it will go away when the process dies just like statically allocated items anyway (meaning it is sort of pseudo static). And there should be no thread safety concerns that way (unless you count repeatedly updating the field but that existed before anyway).

Sadly you cannot do something like this:

static pcre2_match_context match_context;

Because of the way pcre2 has defined pcre2_match_context as an opaque type unlike how pcre_extra and pcre16_extra types were defined in pcre1.

@tobil4sk
Copy link
Member Author

Agreed. I also think the best method is in the build system (over config.h configuration changes; in fact generation of the config.h could be handled by the build system and the option could be provided that way).

The only issue is that if PCRE2_CODE_UNIT_WIDTH is missed out, an error is given, however, if MATCH_LIMIT_DEPTH somehow ends up missed out then nothing is shown (we could do this ourselves I guess). Generating the file via the build systems would be complicated as this would have to be done for each one of them separately (current make, cmake, and Visual Studio are all available as separate options).

Also, currently, there is a comment explaining why this value for MATCH_LIMIT_DEPTH is chosen, however, I'm not sure where this comment could go if the flag is defined in all these different places (probably will just have to be copied to make and cmake).

static pcre2_match_context *match_context = pcre2_match_context_create(NULL);

Unfortunately, it doesn't allow this because it is not a constant value (would have been a good solution though).

@Uzume
Copy link

Uzume commented Apr 27, 2022

Unfortunately, it doesn't allow this because it is not a constant value (would have been a good solution though).

It is a constant value (after initial allocation the pointer would never change during the entire execution) but you are right it is not a so called compile-time constant and static initialization would require such. I had a momentary lapse (sorry about that). It has been a while since I had to be concerned with lifetimes in C.

@Uzume
Copy link

Uzume commented Apr 27, 2022

Upon thinking about this more, I am also concerned with some other issues (that were already in the original code).

For example match_context (and the earlier limit) are globally scoped despite only ever being used in hl_regexp_match. Whether the value has a static lifetime or not, it should probably be moved to be local to that function. If we consider it local to the function, why not have a solution like this:

HL_PRIM bool hl_regexp_match( ereg *e, vbyte *s, int pos, int len ) {
	static pcre2_match_context *match_context = NULL;
	int res;
	if (match_context == NULL) {
		match_context = pcre2_match_context_create(NULL);
		pcre2_set_depth_limit(match_context, 3500); // adapted based on Windows 1MB stack size;
	}
	res = pcre2_match(e->regex,(PCRE2_SPTR16)s,pos+len,pos,PCRE2_NO_UTF_CHECK,e->match_data,match_context);
	e->matched = res >= 0;
	if( res >= 0 )
		return true;
	if( res != PCRE2_ERROR_NOMATCH )
		hl_error("An error occurred while running pcre2_match()");
	return false;
}

It is not a limit of the regular expression but rather a limit of the matching so unless we decide to put this into ereg (for its finalizer), why was this ever initialized in hl_regexp_new_options to begin with?

@tobil4sk
Copy link
Member Author

tobil4sk commented May 7, 2022

@Uzume I think if we can handle it at compile time then that is the preferable solution.

For now I have placed it in each build file as well as config.h just to be safe, in case pcre2 is updated in future and the config.h is replaced, which would cause the setting to be lost. Is this fine? @ncannasse

@tobil4sk
Copy link
Member Author

tobil4sk commented May 7, 2022

If we ever decide to link pcre2 dynamically this will have to be handled at run time again, maybe if that happens using a local static might be a fair solution.

@Uzume
Copy link

Uzume commented May 8, 2022

@Uzume I think if we can handle it at compile time then that is the preferable solution.

For now I have placed it in each build file as well as config.h just to be safe, in case pcre2 is updated in future and the config.h is replaced, which would cause the setting to be lost. Is this fine? @ncannasse

@tobil4sk I am not sure carrying that match depth limit is really still necessary. I can see how you are just porting that from the PCRE1 integration code in hashlink but the internal architecture has of PCRE has changed significantly so I am not sure it is really even interesting much less necessary to consider keeping that in the first place. For example, in NEWS you can see the following text for Version 10.30 14-August-2017:

  1. The main interpreter, pcre2_match(), has been refactored into a new version
    that does not use recursive function calls (and therefore the system stack) for
    remembering backtracking positions. This makes --disable-stack-for-recursion a
    NOOP. The new implementation allows backtracking into recursive group calls in
    patterns, making it more compatible with Perl, and also fixes some other
    previously hard-to-do issues. For patterns that have a lot of backtracking, the
    heap is now used, and there is an explicit limit on the amount, settable by
    pcre2_set_heap_limit() or (*LIMIT_HEAP=xxx). The "recursion limit" is retained,
    but is renamed as "depth limit" (though the old names remain for
    compatibility).

Based on just the above quote, I'd say newer default implementations (not dfa or jit) have traded some stack pressure for heap pressure.

Also in NON-AUTOTOOLS-BUILD in the section labelled STACK SIZE IN WINDOWS ENVIRONMENTS it states:

Prior to release 10.30 the default system stack size of 1MiB in some Windows
environments caused issues with some tests. This should no longer be the case
for 10.30 and later releases.

Previous in the same file and section for PCRE1 it stated:

The default processor stack size of 1Mb in some Windows environments is too
small for matching patterns that need much recursion. In particular, test 2 may
fail because of this. Normally, running out of stack causes a crash, but there
have been cases where the test program has just died silently. See your linker
documentation for how to increase stack size if you experience problems. The
Linux default of 8Mb is a reasonable choice for the stack, though even that can
be too small for some pattern/subject combinations.

PCRE has a compile configuration option to disable the use of stack for
recursion so that heap is used instead. However, pattern matching is
significantly slower when this is done. There is more about stack usage in the
"pcrestack" documentation.

You can see similar text in earlier PCRE2 (the latter two about pcre2stack were later removed):

The code you are trying to port seems to refer to the same Windows 1Mb stack size:

	limit.flags = PCRE_EXTRA_MATCH_LIMIT_RECURSION;
	limit.match_limit_recursion = 3500; // adapted based on Windows 1MB stack size

And I am sure there were other drastic changes during the changed from PCRE1 to PCRE2, so the question is do we need it and do we bother? I am not sure I am in a position to answer that but methinks it definitely makes sense to ask such.

@Uzume
Copy link

Uzume commented May 8, 2022

@tobil4sk

[...]This the documentation file that describes how to build pcre2 on Windows which I followed:
https://github.com/PCRE2Project/pcre2/blob/master/NON-AUTOTOOLS-BUILD

In following NON-AUTOTOOLS-BUILD, I am not sure why you choose to include config.h (which requires externally defining HAVE_CONFIG_H to have any effect) at all. It is not required.

PCRE2 seems to provide four build possibilities with regard to config.h:

  1. Use a configure and make build that creates a config.h from src/config.h.in (which also defines HAVE_CONFIG_H)
  2. Use a cmake and Visual Studio (really MSBuild) or make build that creates a config.h from cmake/pcre2-config.cmake.in (which also defines HAVE_CONFIG_H)
  3. Use a "manual" (for another or custom build system) build, create a config.h from src/config.h.generic (or some other custom method) and define HAVE_CONFIG_H
  4. Use a "manual" (for another or custom build system) build without config.h and do not define HAVE_CONFIG_H

If I am doing a "manual" or custom build (where I am not using the build scripts from the PCRE sources), methinks I would prefer option 4 over option 3 as that route allows one to integrate things with one less extraneous file.

Although, the only issue with this is that now the variable must be set externally (either in config.h or every build system) rather then conveniently in the regexp file. Setting it manually in make/cmake/vs builds is a bit messy, but at the same time having it in config.h means that there's a risk that when updating the library in future, someone may miss this configuration out, which would silently cause problems.

EDIT: Although, this is already necessary for PCRE2_CODE_UNIT_WIDTH, so I guess it won't do harm to add the extra flag.

Option 4 and setting things externally in the build system(s) also alleviates the future risk you mention above, while also allowing one to integrate the code with one less file.

The only issue is that if PCRE2_CODE_UNIT_WIDTH is missed out, an error is given, however, if MATCH_LIMIT_DEPTH somehow ends up missed out then nothing is shown (we could do this ourselves I guess). Generating the file via the build systems would be complicated as this would have to be done for each one of them separately (current make, cmake, and Visual Studio are all available as separate options).

Also, currently, there is a comment explaining why this value for MATCH_LIMIT_DEPTH is chosen, however, I'm not sure where this comment could go if the flag is defined in all these different places (probably will just have to be copied to make and cmake).

Removing config.h alleviates any such issues in the future (unless someone adds it back).

PCRE1 also optionally used a config.h and the previous integration here also used such but my question is why are we doing this?

@tobil4sk
Copy link
Member Author

tobil4sk commented May 8, 2022

There are multiple other settings for specifying static linking and disabling jit support that are defined in this config file and it wouldn't make sense to define the long list of flags everywhere. These settings are pretty obvious as they are mentioned by name in the guide. The only reason there is a risk of missing out the match limit depth setting is because you need to know that it was configured before. I guess we could have a file somewhere explaining how to update the library so that the settings are not lost.

@tobil4sk
Copy link
Member Author

tobil4sk commented May 8, 2022

The stuff about the stack/heap changes sounds promising. If we can get rid of this setting then all the problems are solved. I doubt there were hashlink tests to make sure whatever cases this setting was for were not crashing, but hopefully we can just safely remove it altogether judging by the fact that the stack is no longer used here?

@Uzume
Copy link

Uzume commented May 8, 2022

The stuff about the stack/heap changes sounds promising. If we can get rid of this setting then all the problems are solved.

@tobil4sk The stack is used but not to the same extent so it is much less likely to be causing an issue (which the code you are trying to port seems to be trying to avoid; likely because there was an issue in the past). For that reason, I believe it should not cause issues to now remove it but I have no way to actually test this theory. Maybe @ncannasse knows. He was the one that added that setting in the original PCRE1 integration on 2016-02-21, see df1828b (search for PCRE_EXTRA_MATCH_LIMIT_RECURSION).

There is another potential option too. Since this setting seems only applicable to Windows, we could only set it on Windows builds. It might be possible to do this externally from the build scripts but if we opted to keep the config.h, it could be done something like:

#if defined(_WIN32) && !defined(MATCH_LIMIT_DEPTH)
#define MATCH_LIMIT_DEPTH 3500 // adapted based on Windows 1MB stack size
#endif

Then at least it would not be applied across all builds when obviously it is a tweak specific to Windows.

But frankly I do not think this does the same thing as it used to do. In digging through PCRE2 history I can see changes in config.h.generic (which is generated but still I am sure I could find the source of such if I tried). Currently:

/* The above limit applies to all backtracks, whether or not they are nested.
   In some environments it is desirable to limit the nesting of backtracking
   (that is, the depth of tree that is searched) more strictly, in order to
   restrict the maximum amount of heap memory that is used. The value of
   MATCH_LIMIT_DEPTH provides this facility. To have any useful effect, it
   must be less than the value of MATCH_LIMIT. The default is to use the same
   value as MATCH_LIMIT. There is a runtime method for setting a different
   limit. In the case of pcre2_dfa_match(), this limit controls the depth of
   the internal nested function calls that are used for pattern recursions,
   lookarounds, and atomic groups. */
#ifndef MATCH_LIMIT_DEPTH
#define MATCH_LIMIT_DEPTH MATCH_LIMIT
#endif

Notice: "limit the nesting of backtracking[...] in order to restrict the maximum amount of heap memory that is used."

And earlier:

/* The above limit applies to all backtracks, whether or not they are nested.
   In some environments it is desirable to limit the nesting of backtracking
   more strictly, in order to restrict the maximum amount of heap memory that
   is used. The value of MATCH_LIMIT_RECURSION provides this facility. To have
   any useful effect, it must be less than the value of MATCH_LIMIT. The
   default is to use the same value as MATCH_LIMIT. There is a runtime method
   for setting a different limit. */
#ifndef MATCH_LIMIT_RECURSION
#define MATCH_LIMIT_RECURSION MATCH_LIMIT
#endif

Apparently this changed names from MATCH_LIMIT_RECURSION.

And earlier still:

/* The above limit applies to all calls of match(), whether or not they
   increase the recursion depth. In some environments it is desirable to limit
   the depth of recursive calls of match() more strictly, in order to restrict
   the maximum amount of stack (or heap, if HEAP_MATCH_RECURSE is defined)
   that is used. The value of MATCH_LIMIT_RECURSION applies only to recursive
   calls of match(). To have any useful effect, it must be less than the value
   of MATCH_LIMIT. The default is to use the same value as MATCH_LIMIT. There
   is a runtime method for setting a different limit. */
#ifndef MATCH_LIMIT_RECURSION
#define MATCH_LIMIT_RECURSION MATCH_LIMIT
#endif

Notice: "limit the depth of recursive calls[...] in order to restrict the maximum amount of stack[...] that is used."

So it used to limit stack usage during recursive backtracking but now limits heap usage during backtracking (which no longer uses recursion). I am assuming it was possible to optimize the algorithm to use tail-recursion (if it wasn't already using such) which was then turned in to a loop. Thus less stack allocations and more heap allocations (which doesn't threaten the 1Mb stack default on Windows).

I doubt there were hashlink tests to make sure whatever cases this setting was for were not crashing, but hopefully we can just safely remove it altogether judging by the fact that the stack is no longer used here?

I agree. I doubt there are tests for this and as such, I recommend just ripping this out and see if someone screams. It can always be put back in if necessary.

This should likely also be done for nekovm too as the pcre2_set_depth_limit code seems to be unnecessary. As a side note, there is a newer version of PCRE2 out too. It seems you have integrated pcre2-10.39 dated 2021-10-29 but there is a pcre2-10.40 dated 2022-04-14. When you push an update, rip out the unnecessary pcre2_set_depth_limit code (technically this is also a memory leak but it is not normally repeatedly executed unlike here). I am not sure if you want to bother updating this to 10.40 now or not.

@tobil4sk
Copy link
Member Author

I'm quite worried with the following:
Seems like a memory leak because the context is never free

@ncannasse We have resolved this issue. This setting is obsolete, so we just safely removed the offending lines. Everything is building properly now so it should be safe to merge now, unless there are any other questions/worries?

@tobil4sk
Copy link
Member Author

tobil4sk commented Aug 21, 2022

Actually I noticed the PCRE_JAVASCRIPT_COMPAT flag was removed without being replaced. It seems the pcre2 alternative is PCRE2_ALT_BSUX. I'll quickly add that and we should be good.

However, if either of the PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U and \u are interpreted as ECMAScript interprets them.

https://www.pcre.org/current/doc/html/pcre2compat.html

EDIT: Done now.

Copy link

@Uzume Uzume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tobil4sk It looks good. I hope this gets committed soon. Thanks for all your hard work.

According to my research PCRE_JAVASCRIPT_COMPAT was split into PCRE2_ALT_BSUX, PCRE2_ALLOW_EMPTY_CLASS and PCRE2_MATCH_UNSET_BACKREF. See 2015-01-05: [pcre-dev] PCRE2 is released:

[...] The PCRE_JAVASCRIPT_COMPAT option has been split into
independent functional options PCRE2_ALT_BSUX, PCRE2_ALLOW_EMPTY_CLASS, and
PCRE2_MATCH_UNSET_BACKREF.

Perhaps those should be added too. You might also want to consider switching from PCRE2_ALT_BSUX to PCRE2_EXTRA_ALT_BSUX (which implies the former) adding ECMAscript 6 style \u{hhh..} hexadecimal character codes.

@tobil4sk
Copy link
Member Author

@tobil4sk It looks good. I hope this gets committed soon. Thanks for all your hard work.

Thanks as well for all your help!

PCRE_JAVASCRIPT_COMPAT was split into PCRE2_ALT_BSUX, PCRE2_ALLOW_EMPTY_CLASS and PCRE2_MATCH_UNSET_BACKREF

Thanks for the shout. I missed those patch notes because the compat page seemed to suggest ALT_BSUX was all there was to it.

Just for reference, here is a page explaining PCRE2_ALT_BSUX and PCRE2_EXTRA_ALT_BSUX.: https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC3

The problem with PCRE2_EXTRA_ALT_BSUX is that it requires having a context object again, which is what was causing us the memory leak issue in the first place. However, if adding it matches the behaviour of other haxe targets I guess it makes sense to figure out a way to set it.

@tobil4sk
Copy link
Member Author

Also, reading further through those notes you sent, I wonder if it would be worth implementing a wrapper for pcre2_substitute

"There is a new function called pcre2_substitute() that performs "find and replace" operations."

This would probably be faster than the current replace implementation on hashlink. However, that would probably fit better in a separate PR once this is merged.

@Uzume
Copy link

Uzume commented Aug 30, 2022

Just for reference, here is a page explaining PCRE2_ALT_BSUX and PCRE2_EXTRA_ALT_BSUX.: https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC3

The problem with PCRE2_EXTRA_ALT_BSUX is that it requires having a context object again, which is what was causing us the memory leak issue in the first place. However, if adding it matches the behaviour of other haxe targets I guess it makes sense to figure out a way to set it.

@tobil4sk: That is an interesting read, however, based on that, I wonder about the value of implementing either of PCRE2_ALT_BSUX and/or PCRE2_EXTRA_ALT_BSUX as they only seem alter the interpretation of \x (without braces) and allow alternative versions of the \u escape. It seems to me that such things are already available via \x{hh..} and \N{U+hh..} escape constructs.

Also, reading further through those notes you sent, I wonder if it would be worth implementing a wrapper for pcre2_substitute

"There is a new function called pcre2_substitute() that performs "find and replace" operations."

This would probably be faster than the current replace implementation on hashlink. However, that would probably fit better in a separate PR once this is merged.

I was also thinking along similar lines to improve the Haxe EReg implementations after getting all backends ported from pcre to pcre2 (and among those is definitely replace, including the HL implementation you pointed out). That said, we should probably expunge pcre1 API code everywhere via HaxeFoundation/haxe#10491. I did a little groundwork on the Ocaml backend and was considering the best means to fix the Lua backend (jdonaldson/haxe-deps probably should be merged into HaxeFoundation/haxe so the Luarocks package is controlled by the same repo as the code that depends on it) but I was distracted by other things.

This was done following the steps on in the `NON-AUTOTOOLS-BUILD` file
found in the root of the pcre2 repository:
https://github.com/PhilipHazel/pcre2/blob/pcre2-10.40/NON-AUTOTOOLS-BUILD
CMake now has to specify each pcre file individually, as some are not meant to be compiled by themselves.
These were necessary in pcre1 but are handled by PCRE2_CODE_UNIT_WIDTH
in pcre2
Since pcre 10.30, pcre2_match no longer uses recursive function calls,
so this setting should no longer be needed.

This avoids the memory leak caused by having to configure this setting
using a context structure allocated at run time.
This is the pcre2 equivalent of the PCRE_JAVASCRIPT_COMPAT flag we used
previously. Compare bullet point 4 of the pcre2 compatibility docs:
https://www.pcre.org/current/doc/html/pcre2compat.html
to bullet point 5 of the pcre1 compatibility docs:
https://www.pcre.org/original/doc/html/pcrecompat.html
@tobil4sk
Copy link
Member Author

tobil4sk commented Dec 1, 2022

I've done some investigating regarding how other targets handle these escape characters:

Details regarding Haxe target support for \x and \u

Firstly, \u{...} is not supported by Haxe regex literal syntax.

It is still possible to test it via new EReg("\\u{...}", ""). What this reveals is that no haxe target properly supports the \u{...} syntax. I think technically JavaScript could support it, but we specifically block the u flag:
https://github.com/HaxeFoundation/haxe/blob/48d2d66dde8dd160d6ebafe7048ee6fe80d7aa00/std/js/_std/EReg.hx#L26

A lot of targets (C++, Interp, Lua, Neko, PHP) give an error even for \uhhhh without the brackets. C#, Java, and Python support \uhhhh but give an error for \u{hh...}. JavaScript, Flash, and Hashlink also support \uhhhh but treat \u{hh...} simply as a u.

All targets support \xhh, and most support \x{hh...}. C# and Python give an error when given \x{hh...} and JavaScript, Flash, and Hashlink treat \x{hh...} as x.

TLDR: There is not much consistency, but currently Hashlink is consistent with JavaScript and Flash.

I wonder about the value of implementing either of PCRE2_ALT_BSUX and/or PCRE2_EXTRA_ALT_BSUX

For now I think the best thing to do here is to keep Hashlink's behaviour the same, which is achieved in pcre2 by enabling PCRE2_ALT_BSUX, but not PCRE2_EXTRA_ALT_BSUX.

@tobil4sk
Copy link
Member Author

tobil4sk commented Dec 1, 2022

For now I think the best thing to do here is to keep Hashlink's behaviour the same

@ncannasse If that sounds alright, then I don't think there is anything else left to do here. We managed to get rid of the memory leak issue.

`PCRE2_ALT_BSUX`, `PCRE2_ALLOW_EMPTY_CLASS`,
and `PCRE2_MATCH_UNSET_BACKREF`
are equivalent to the old `PCRE_JAVASCRIPT_COMPAT`.

See:
https://lists.exim.org/lurker/message/20150105.162835.0666407a.en.html
@Uzume
Copy link

Uzume commented Dec 3, 2022

I've done some investigating regarding how other targets handle[...] \x and \u

TLDR: There is not much consistency, but currently Hashlink is consistent with JavaScript and Flash.

For now I think the best thing to do here is to keep Hashlink's behaviour the same, which is achieved in pcre2 by enabling PCRE2_ALT_BSUX, but not PCRE2_EXTRA_ALT_BSUX.

Your investigation and detailed report are much appreciated.

I agree getting the targets at least close to aligned while jettisoning older pcre1 is the first priority but in the larger picture it might make sense to consider changes here down the road (along with other changes to EReg like the pcre2_substitute optimizations to replace mentioned earlier).

I see you also added the aforementioned PCRE2_ALLOW_EMPTY_CLASS and PCRE2_MATCH_UNSET_BACKREF. These are some likely candidates for future removal during a later cleanup. I would like to remove PCRE2_ALT_BSUX and PCRE2_UCP too as their functionality is available via other means anyway (but such changes will affect Haxe code of course).

It seems odd to have so many compatibility issues with something called "PCRE" which by its very name implies compatibility with Perl regex but such is the nature the beast.

@tobil4sk: Thank you.

@Simn
Copy link
Member

Simn commented Mar 25, 2023

I would really like to merge this and get this whole PCRE2 business out of the way. Is there any reason not to do so yet?

@ncannasse ncannasse merged commit 2206f8c into HaxeFoundation:master Mar 25, 2023
@ncannasse
Copy link
Member

All good to me, thanks for everyone who participated in the issue/PR, merge complete!

@tobil4sk
Copy link
Member Author

Thanks! This just leaves hxcpp now... 😅

@tobil4sk tobil4sk deleted the pcre2 branch March 25, 2023 11:03
@Uzume
Copy link

Uzume commented Mar 25, 2023

@ncannasse Thanks for getting this merged.

@tobil4sk Thanks for your hard work and patches.

Maybe PCRE1 can soon be laid to rest from a Haxe perspective and we can look into cleaning up Haxe EReg (with some possible optimizations like wrapping and using pcre2_substitute for replace). I would like to get rid of PCRE2_ALT_BSUX and PCRE2_UCP at some point but that will be a breaking syntactical change for Haxe regex so we should probably consider that at a major version change (where we can document the change and people expect some potential porting issues).

@skial skial mentioned this pull request Mar 29, 2023
1 task
RandomityGuy pushed a commit to RandomityGuy/hashlink that referenced this pull request May 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants