Skip to content

A feature-rich, zero-dependency, PCRE2-compatible regular expression engine in C99, featuring a parser, bytecode compiler, and a backtracking NFA virtual machine.

License

Notifications You must be signed in to change notification settings

idrassi/librex-ast

Repository files navigation

librex-ast

License: MIT

librex-ast is a feature-rich, PCRE2-compatible regular expression engine written in modern C. It provides a complete pipeline for compiling regex patterns into an efficient internal bytecode representation and executing them against subject strings using a backtracking NFA-style virtual machine.

Originally an AST parser, the library has evolved into a complete engine designed for correctness, feature-completeness, and PCRE2 compatibility. It serves as an excellent foundation for building custom tools, educational software, or for anyone interested in the internals of modern regex engines.


⚠️ Project Status: Advanced & Experimental

This library is a comprehensive implementation of a modern regex engine, capable of passing a large compatibility test suite. However, before using it in a production environment, please be aware of the following:

  • Not Yet Performance-Tuned: The VM implementation is focused on correctness and feature support. It has not yet been optimized for high-throughput production workloads and may be significantly slower than mature libraries like PCRE2.
  • Simplified Unicode Properties: The \p{...} property support is comprehensive but uses a simplified, non-exhaustive set of Unicode data. A production-ready version would require data generated from the full Unicode Character Database (UCD).

Contributions to address these points are highly welcome!


Key Features

  • PCRE2-Compatible Engine: A complete engine that compiles patterns and matches strings.
  • Two-Stage Compilation:
    1. Parser: A robust recursive-descent parser builds a detailed Abstract Syntax Tree (AST) from the pattern, resolving forward references for constructs like \k<name>.
    2. Compiler: The AST is then compiled into a compact and efficient bytecode format, where capturing groups become callable subroutines.
  • Backtracking NFA Virtual Machine: The engine uses a custom, non-recursive, stack-based VM to execute the bytecode, providing powerful and correct matching behavior while avoiding C stack overflow.
  • Extensive Syntax Support: Implements a wide array of PCRE2/Perl constructs, including atomic groups, possessive quantifiers, advanced assertions, conditionals, and subroutines.
  • Structured Error Reporting: The compile function provides a detailed regex_err object with an error code, message, and the exact position (line, col) of the error in the pattern.
  • Pluggable Memory Allocators: The entire library can be configured to use custom malloc, realloc, and free functions for specialized memory environments.
  • Clean, Opaque API: The compiled pattern is managed through an opaque regex_compiled* handle, hiding internal complexity.
  • Unicode-Aware: Correctly parses UTF-8 patterns, Unicode escape sequences (\x{...}), and Unicode properties (\p{L}).
  • Zero Dependencies: Written in pure C99 with no external library dependencies.

Architecture Overview

The engine operates in three distinct stages:

  1. Parsing (regex-parser.c): The input pattern string is parsed into an Abstract Syntax Tree (AST). This tree represents the logical structure of the regular expression. This stage validates syntax and reports errors.
  2. Compilation (regex-match.c): The AST is traversed by a compiler which emits a linear sequence of bytecode instructions. This bytecode is a low-level representation of the matching logic, optimized for the VM.
  3. Execution (regex-match.c): The generated bytecode is executed by a lightweight, stack-based virtual machine (VM). The VM performs the actual matching against the subject string using a backtracking NFA algorithm, managing alternative paths on an explicit stack.

Usage Example

Here is a simple example of how to compile a pattern, execute a match, and inspect the results.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "regex-parser.h"

int main(void) {
    const char* pattern = "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})";
    const char* subject = "The date is 2023-11-28.";

    regex_err error = {0};
    
    // 1. Compile the pattern
    regex_compiled* rx = regex_compile(pattern, 0, &error);

    if (!rx) {
        fprintf(stderr, "Failed to compile pattern at position %d: %s\n", 
                error.pos, error.msg);
        return 1;
    }
    
    printf("Successfully compiled pattern: \"%s\"\n", pattern);

    // 2. Execute the match
    regex_match_result result = {0};
    int match_status = regex_match(rx, subject, strlen(subject), &result);

    if (match_status == 1) { // 1 indicates a successful match
        printf("Match found from index %d to %d!\n", result.match_start, result.match_end);
        printf("  - Full match: '%.*s'\n", 
               (int)(result.match_end - result.match_start), 
               subject + result.match_start);

        // Capture groups are 1-indexed for users, but 0-indexed in the result arrays.
        for (int i = 0; i < result.capture_count; i++) {
            if (result.capture_starts[i] != -1) {
                printf("  - Group %d: '%.*s'\n", i + 1,
                       (int)(result.capture_ends[i] - result.capture_starts[i]),
                       subject + result.capture_starts[i]);
            }
        }
        
        // 3. Free the match result's internal arrays
        regex_free_match_result(&result, NULL);

    } else if (match_status == 0) {
        printf("No match found.\n");
    } else {
        // A return value (other than 1 and 0) is a runtime error code
        fprintf(stderr, "Execution error: %s\n", regex_error_message(match_status));
    }

    // 4. Free the compiled regex object
    regex_free(rx);

    return 0;
}

API Reference

Core Structures

  • regex_compiled: An opaque pointer representing a compiled regular expression.
  • regex_err: A struct containing detailed error information (code, pos, line, col, msg).
  • regex_match_result: A struct containing the results of a successful match, including overall match bounds and arrays of capture group bounds.
  • regex_allocator: A struct allowing you to provide custom memory allocation functions.

Main Functions

  • regex_compiled* regex_compile(const char* pattern, uint32_t flags, regex_err* error)
    Compiles a null-terminated pattern string using standard library allocators. On success, returns an opaque regex_compiled* handle. On failure, returns NULL and populates the error struct.

  • regex_compiled* regex_compile_with_allocator(const char* pattern, uint32_t flags, const regex_allocator* allocator, regex_err* error)
    Same as regex_compile but uses a custom regex_allocator.

  • void regex_free(regex_compiled* rx)
    Frees all memory associated with a compiled regex handle.

  • int regex_match(regex_compiled* rx, const char* subject, size_t subject_len, regex_match_result* result)
    Executes a compiled regex rx against a subject string.

    • Returns 1 on a successful match and populates the result struct.
    • Returns 0 if no match is found.
    • Returns a positive REGEX_ERR_* code on a runtime error (e.g., REGEX_ERR_RECURSION_LIMIT, REGEX_ERR_MEMORY).
  • void regex_free_match_result(regex_match_result* result, const regex_allocator* allocator)
    Frees the capture group arrays within a regex_match_result struct. Pass NULL for the allocator to use the standard library free.

Utility Functions

  • const char* regex_error_message(int error_code)
    Returns a static, human-readable string for a given REGEX_ERR_* code.

  • void print_regex_ast(const regex_compiled* rx)
    A debugging utility to print a human-readable representation of the internal AST to standard output.

Compilation Flags

  • REG_IGNORECASE: Perform case-insensitive matching.
  • REG_MULTILINE: ^ and $ match the start/end of lines, not just the start/end of the string.
  • REG_SINGLELINE: . (dot) metacharacter matches all characters, including newlines.
  • REG_EXTENDED: Ignore unescaped whitespace and comments starting with #.
  • REG_UNGREEDY: Invert the greediness of quantifiers (e.g., * becomes lazy, *? becomes greedy).

Supported Regex Syntax

The engine supports a wide subset of PCRE2/Perl syntax.

Basic Elements

  • Literal characters (full Unicode support)
  • . (dot metacharacter, respects REG_SINGLELINE)
  • Character classes [abc], [^abc], [a-z]
  • Pre-defined classes: \d, \D, \w, \W, \s, \S
  • Anchors: ^, $, \A, \z, \b (word boundary), \B

Quantifiers

  • Greedy: *, +, ?, {n,m}
  • Lazy (non-greedy): *?, +?, ??, {n,m}?
  • Possessive: *+, ++, ?+, {n,m}+

Groups

  • Capturing groups: (...)
  • Non-capturing groups: (?:...)
  • Named capturing groups: (?<name>...), (?'name'...)
  • Atomic groups: (?>...)
  • Branch-reset groups: (?|...)

Assertions

  • Positive Lookahead: (?=...)
  • Negative Lookahead: (?!...)
  • Positive Lookbehind: (?<=...) (fixed-length only)
  • Negative Lookbehind: (?<!...) (fixed-length only)

Backreferences & Subroutines

  • Numbered backreferences: \1, \2, etc.
  • Named backreferences: \k<name> or \k'name'
  • Numbered subroutine calls: (?1), (?2), etc.
  • Named subroutine calls: (?&name)
  • Full pattern recursion: (?R)

Conditionals

  • Numeric condition: (?(1)yes|no)
  • Named condition: (?(<name>)yes|no) or (?('name')yes|no)
  • Assertion condition: (?(?=...)yes|no)

Modifiers & Unicode

  • Inline flags: (?i), (?-m)
  • Scoped flags: (?i:...)
  • Unicode properties: \p{L}, \P{Sc}
  • Unicode escapes: \x{...}, \xAB, \uABCD
  • POSIX character classes: [[:alpha:]], [[:digit:]], etc. (Unicode-aware)

Other

  • Comments: (?#...)
  • Quoted sequences: \Q...\E

Limitations & Unsupported Syntax

The engine is comprehensive but does not yet implement every feature of PCRE2. The following limitations currently apply:

  • Lookbehind Assertions: Must be fixed-length. Variable-length lookbehind (e.g., (?<=a*)) is not supported. The maximum lookbehind length is 255 characters.
  • Unicode Support:
    • The built-in Unicode property (\p{...}) and POSIX class ([[:alpha:]]) support is based on a partial character database and does not cover all Unicode scripts or categories.
    • Grapheme cluster matching (\X) and script run matching are not supported.
  • Backreferences & Subroutines:
    • The \g{...} backreference/subroutine syntax is not supported. Use the alternative forms \k<name>, (?n), or (?&name) instead.
    • Recursion/subroutine call depth is limited to 32 (MAX_CALL_DEPTH).
  • Control Verbs: Verbs that control the backtracking engine, such as (*SKIP), (*FAIL), (*ACCEPT), (*COMMIT), etc., are not supported.
  • Callouts: The (?C...) callout feature for interacting with external C code during a match is not implemented.
  • Generic Newline Sequences: The \R shortcut for matching generic newline sequences is not supported. Use an explicit alternation like \r?\n or a character class [\r\n] instead.
  • Engine Optimizations: No AST or bytecode optimization passes (e.g., peephole optimizations) are currently performed, which may impact performance on complex patterns.

Building

The project uses CMake and has no external dependencies. You will need a C99-compatible compiler (e.g., GCC, Clang, MSVC) and CMake (version 3.10 or later).

Standard Build

Follow these steps to build the library and the test suite from the command line:

# 1. Create a directory for an out-of-source build
mkdir build
cd build

# 2. Configure the project using CMake
# On Linux/macOS with Makefiles (default)
cmake ..

# On Windows, you might generate a Visual Studio solution
# cmake .. -G "Visual Studio 17 2022"

# 3. Compile the project
# This universal command works with most generators (Makefiles, Ninja, etc.)
cmake --build .

This process will generate the following in your build directory:

  • Shared Library: librex_ast.dll on Windows, or liblibrex_ast.so/liblibrex_ast.dylib on Unix-like systems.
  • Static Library: librex_ast.lib on Windows, or liblibrex_ast.a on Unix-like systems.
  • Test Executable: regex_test (or regex_test.exe on Windows).

Running Tests

After a successful build, you can run the test suite from the build directory:

# On Linux/macOS
./regex_test

# On Windows
.\regex_test.exe

Build Options

You can customize the build with CMake options passed during the configuration step.

  • ENABLE_WARNINGS: Toggles extra compiler warnings (e.g., -Wall -Wextra). It is ON by default.

To build with this option disabled:

cmake .. -DENABLE_WARNINGS=OFF

Roadmap & Contributing

This project is an excellent platform for exploring regex engine design. Contributions are welcome to make it more robust and performant. High-priority areas include:

  • Performance Optimization: Profile and optimize the VM loop, reduce memory allocations during matching, and implement bytecode optimizations.
  • Full Unicode Support: Integrate code generation scripts to build comprehensive property, script, and character-folding tables from the official Unicode Character Database (UCD).
  • JIT Compilation: Explore adding a Just-In-Time (JIT) compiler backend (e.g., targeting x86-64) for performance-critical patterns.
  • CI & Testing: Expand the test suite for more comprehensive coverage and set up continuous integration checks.

Please feel free to open an issue to discuss a new feature or submit a pull request.

License

This project is licensed under the MIT License.

Author

Mounir IDRASSI mounir.idrassi@amcrypto.jp

About

A feature-rich, zero-dependency, PCRE2-compatible regular expression engine in C99, featuring a parser, bytecode compiler, and a backtracking NFA virtual machine.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published