librex-ast is a feature-rich, PCRE2-compatible regular expression engine written in modern C. It provides a complete pipeline for compiling regex patterns into an efficient internal bytecode representation and executing them against subject strings using a backtracking NFA-style virtual machine.
Originally an AST parser, the library has evolved into a complete engine designed for correctness, feature-completeness, and PCRE2 compatibility. It serves as an excellent foundation for building custom tools, educational software, or for anyone interested in the internals of modern regex engines.
This library is a comprehensive implementation of a modern regex engine, capable of passing a large compatibility test suite. However, before using it in a production environment, please be aware of the following:
- Not Yet Performance-Tuned: The VM implementation is focused on correctness and feature support. It has not yet been optimized for high-throughput production workloads and may be significantly slower than mature libraries like PCRE2.
- Simplified Unicode Properties: The
\p{...}
property support is comprehensive but uses a simplified, non-exhaustive set of Unicode data. A production-ready version would require data generated from the full Unicode Character Database (UCD).
Contributions to address these points are highly welcome!
- PCRE2-Compatible Engine: A complete engine that compiles patterns and matches strings.
- Two-Stage Compilation:
- Parser: A robust recursive-descent parser builds a detailed Abstract Syntax Tree (AST) from the pattern, resolving forward references for constructs like
\k<name>
. - Compiler: The AST is then compiled into a compact and efficient bytecode format, where capturing groups become callable subroutines.
- Parser: A robust recursive-descent parser builds a detailed Abstract Syntax Tree (AST) from the pattern, resolving forward references for constructs like
- Backtracking NFA Virtual Machine: The engine uses a custom, non-recursive, stack-based VM to execute the bytecode, providing powerful and correct matching behavior while avoiding C stack overflow.
- Extensive Syntax Support: Implements a wide array of PCRE2/Perl constructs, including atomic groups, possessive quantifiers, advanced assertions, conditionals, and subroutines.
- Structured Error Reporting: The compile function provides a detailed
regex_err
object with an error code, message, and the exact position (line
,col
) of the error in the pattern. - Pluggable Memory Allocators: The entire library can be configured to use custom
malloc
,realloc
, andfree
functions for specialized memory environments. - Clean, Opaque API: The compiled pattern is managed through an opaque
regex_compiled*
handle, hiding internal complexity. - Unicode-Aware: Correctly parses UTF-8 patterns, Unicode escape sequences (
\x{...}
), and Unicode properties (\p{L}
). - Zero Dependencies: Written in pure C99 with no external library dependencies.
The engine operates in three distinct stages:
- Parsing (
regex-parser.c
): The input pattern string is parsed into an Abstract Syntax Tree (AST). This tree represents the logical structure of the regular expression. This stage validates syntax and reports errors. - Compilation (
regex-match.c
): The AST is traversed by a compiler which emits a linear sequence of bytecode instructions. This bytecode is a low-level representation of the matching logic, optimized for the VM. - Execution (
regex-match.c
): The generated bytecode is executed by a lightweight, stack-based virtual machine (VM). The VM performs the actual matching against the subject string using a backtracking NFA algorithm, managing alternative paths on an explicit stack.
Here is a simple example of how to compile a pattern, execute a match, and inspect the results.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "regex-parser.h"
int main(void) {
const char* pattern = "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})";
const char* subject = "The date is 2023-11-28.";
regex_err error = {0};
// 1. Compile the pattern
regex_compiled* rx = regex_compile(pattern, 0, &error);
if (!rx) {
fprintf(stderr, "Failed to compile pattern at position %d: %s\n",
error.pos, error.msg);
return 1;
}
printf("Successfully compiled pattern: \"%s\"\n", pattern);
// 2. Execute the match
regex_match_result result = {0};
int match_status = regex_match(rx, subject, strlen(subject), &result);
if (match_status == 1) { // 1 indicates a successful match
printf("Match found from index %d to %d!\n", result.match_start, result.match_end);
printf(" - Full match: '%.*s'\n",
(int)(result.match_end - result.match_start),
subject + result.match_start);
// Capture groups are 1-indexed for users, but 0-indexed in the result arrays.
for (int i = 0; i < result.capture_count; i++) {
if (result.capture_starts[i] != -1) {
printf(" - Group %d: '%.*s'\n", i + 1,
(int)(result.capture_ends[i] - result.capture_starts[i]),
subject + result.capture_starts[i]);
}
}
// 3. Free the match result's internal arrays
regex_free_match_result(&result, NULL);
} else if (match_status == 0) {
printf("No match found.\n");
} else {
// A return value (other than 1 and 0) is a runtime error code
fprintf(stderr, "Execution error: %s\n", regex_error_message(match_status));
}
// 4. Free the compiled regex object
regex_free(rx);
return 0;
}
regex_compiled
: An opaque pointer representing a compiled regular expression.regex_err
: A struct containing detailed error information (code
,pos
,line
,col
,msg
).regex_match_result
: A struct containing the results of a successful match, including overall match bounds and arrays of capture group bounds.regex_allocator
: A struct allowing you to provide custom memory allocation functions.
-
regex_compiled* regex_compile(const char* pattern, uint32_t flags, regex_err* error)
Compiles a null-terminatedpattern
string using standard library allocators. On success, returns an opaqueregex_compiled*
handle. On failure, returnsNULL
and populates theerror
struct. -
regex_compiled* regex_compile_with_allocator(const char* pattern, uint32_t flags, const regex_allocator* allocator, regex_err* error)
Same asregex_compile
but uses a customregex_allocator
. -
void regex_free(regex_compiled* rx)
Frees all memory associated with a compiled regex handle. -
int regex_match(regex_compiled* rx, const char* subject, size_t subject_len, regex_match_result* result)
Executes a compiled regexrx
against asubject
string.- Returns
1
on a successful match and populates theresult
struct. - Returns
0
if no match is found. - Returns a positive
REGEX_ERR_*
code on a runtime error (e.g.,REGEX_ERR_RECURSION_LIMIT
,REGEX_ERR_MEMORY
).
- Returns
-
void regex_free_match_result(regex_match_result* result, const regex_allocator* allocator)
Frees the capture group arrays within aregex_match_result
struct. PassNULL
for the allocator to use the standard libraryfree
.
-
const char* regex_error_message(int error_code)
Returns a static, human-readable string for a givenREGEX_ERR_*
code. -
void print_regex_ast(const regex_compiled* rx)
A debugging utility to print a human-readable representation of the internal AST to standard output.
REG_IGNORECASE
: Perform case-insensitive matching.REG_MULTILINE
:^
and$
match the start/end of lines, not just the start/end of the string.REG_SINGLELINE
:.
(dot) metacharacter matches all characters, including newlines.REG_EXTENDED
: Ignore unescaped whitespace and comments starting with#
.REG_UNGREEDY
: Invert the greediness of quantifiers (e.g.,*
becomes lazy,*?
becomes greedy).
The engine supports a wide subset of PCRE2/Perl syntax.
- Literal characters (full Unicode support)
.
(dot metacharacter, respectsREG_SINGLELINE
)- Character classes
[abc]
,[^abc]
,[a-z]
- Pre-defined classes:
\d
,\D
,\w
,\W
,\s
,\S
- Anchors:
^
,$
,\A
,\z
,\b
(word boundary),\B
- Greedy:
*
,+
,?
,{n,m}
- Lazy (non-greedy):
*?
,+?
,??
,{n,m}?
- Possessive:
*+
,++
,?+
,{n,m}+
- Capturing groups:
(...)
- Non-capturing groups:
(?:...)
- Named capturing groups:
(?<name>...)
,(?'name'...)
- Atomic groups:
(?>...)
- Branch-reset groups:
(?|...)
- Positive Lookahead:
(?=...)
- Negative Lookahead:
(?!...)
- Positive Lookbehind:
(?<=...)
(fixed-length only) - Negative Lookbehind:
(?<!...)
(fixed-length only)
- Numbered backreferences:
\1
,\2
, etc. - Named backreferences:
\k<name>
or\k'name'
- Numbered subroutine calls:
(?1)
,(?2)
, etc. - Named subroutine calls:
(?&name)
- Full pattern recursion:
(?R)
- Numeric condition:
(?(1)yes|no)
- Named condition:
(?(<name>)yes|no)
or(?('name')yes|no)
- Assertion condition:
(?(?=...)yes|no)
- Inline flags:
(?i)
,(?-m)
- Scoped flags:
(?i:...)
- Unicode properties:
\p{L}
,\P{Sc}
- Unicode escapes:
\x{...}
,\xAB
,\uABCD
- POSIX character classes:
[[:alpha:]]
,[[:digit:]]
, etc. (Unicode-aware)
- Comments:
(?#...)
- Quoted sequences:
\Q...\E
The engine is comprehensive but does not yet implement every feature of PCRE2. The following limitations currently apply:
- Lookbehind Assertions: Must be fixed-length. Variable-length lookbehind (e.g.,
(?<=a*)
) is not supported. The maximum lookbehind length is 255 characters. - Unicode Support:
- The built-in Unicode property (
\p{...}
) and POSIX class ([[:alpha:]]
) support is based on a partial character database and does not cover all Unicode scripts or categories. - Grapheme cluster matching (
\X
) and script run matching are not supported.
- The built-in Unicode property (
- Backreferences & Subroutines:
- The
\g{...}
backreference/subroutine syntax is not supported. Use the alternative forms\k<name>
,(?n)
, or(?&name)
instead. - Recursion/subroutine call depth is limited to 32 (
MAX_CALL_DEPTH
).
- The
- Control Verbs: Verbs that control the backtracking engine, such as
(*SKIP)
,(*FAIL)
,(*ACCEPT)
,(*COMMIT)
, etc., are not supported. - Callouts: The
(?C...)
callout feature for interacting with external C code during a match is not implemented. - Generic Newline Sequences: The
\R
shortcut for matching generic newline sequences is not supported. Use an explicit alternation like\r?\n
or a character class[\r\n]
instead. - Engine Optimizations: No AST or bytecode optimization passes (e.g., peephole optimizations) are currently performed, which may impact performance on complex patterns.
The project uses CMake and has no external dependencies. You will need a C99-compatible compiler (e.g., GCC, Clang, MSVC) and CMake (version 3.10 or later).
Follow these steps to build the library and the test suite from the command line:
# 1. Create a directory for an out-of-source build
mkdir build
cd build
# 2. Configure the project using CMake
# On Linux/macOS with Makefiles (default)
cmake ..
# On Windows, you might generate a Visual Studio solution
# cmake .. -G "Visual Studio 17 2022"
# 3. Compile the project
# This universal command works with most generators (Makefiles, Ninja, etc.)
cmake --build .
This process will generate the following in your build
directory:
- Shared Library:
librex_ast.dll
on Windows, orliblibrex_ast.so
/liblibrex_ast.dylib
on Unix-like systems. - Static Library:
librex_ast.lib
on Windows, orliblibrex_ast.a
on Unix-like systems. - Test Executable:
regex_test
(orregex_test.exe
on Windows).
After a successful build, you can run the test suite from the build
directory:
# On Linux/macOS
./regex_test
# On Windows
.\regex_test.exe
You can customize the build with CMake options passed during the configuration step.
ENABLE_WARNINGS
: Toggles extra compiler warnings (e.g.,-Wall -Wextra
). It isON
by default.
To build with this option disabled:
cmake .. -DENABLE_WARNINGS=OFF
This project is an excellent platform for exploring regex engine design. Contributions are welcome to make it more robust and performant. High-priority areas include:
- Performance Optimization: Profile and optimize the VM loop, reduce memory allocations during matching, and implement bytecode optimizations.
- Full Unicode Support: Integrate code generation scripts to build comprehensive property, script, and character-folding tables from the official Unicode Character Database (UCD).
- JIT Compilation: Explore adding a Just-In-Time (JIT) compiler backend (e.g., targeting x86-64) for performance-critical patterns.
- CI & Testing: Expand the test suite for more comprehensive coverage and set up continuous integration checks.
Please feel free to open an issue to discuss a new feature or submit a pull request.
This project is licensed under the MIT License.
Mounir IDRASSI mounir.idrassi@amcrypto.jp