Regular expression lib; portable across Java variants (including GWT)

tommyettinger/RegExodus

RegExodus

Regular expression lib; portable across Java variants, including GWT

The Problem

Java applications using libraries like LibGDX can target multiple platforms with minimal changes to the codebase... most of the time. Targeting HTML via Google Web Toolkit, or GWT, involves using a subset of Java's standard library, one that does not include the java.util.regex package and has only a few methods that take Strings to be interpreted as semi-compatible regular expressions. These methods, like String.matches(String), use Java syntax for regular expressions on most targets, but use JavaScript syntax on HTML via GWT. This incompatibility is particularly painful in regard to Unicode, where JS is rather crippled compared to Java's fully-fledged understanding of Unicode.

Matching any letter seems easy enough with [A-Za-z] until the need to match letters in French, German, Hebrew, and more comes up as the application finds I18N necessary. Then, if you need to perform case-insensitive matching, things get even more troubling with naive solutions... There needs to be a better way.
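As a small illustration of the problem, here is how the naive and Unicode-aware approaches differ on a platform that does have java.util.regex. The class and method names are made up for this sketch; only the JDK calls are real.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are not part of RegExodus.
final class LetterMatchingDemo {
    // ASCII-only letter class: misses accented and non-Latin letters.
    static final Pattern ASCII_LETTERS = Pattern.compile("[A-Za-z]+");
    // Unicode-aware letter class, using the general category L.
    static final Pattern UNICODE_LETTERS = Pattern.compile("\\p{L}+");

    static boolean asciiLetters(String s) {
        return ASCII_LETTERS.matcher(s).matches();
    }

    static boolean unicodeLetters(String s) {
        return UNICODE_LETTERS.matcher(s).matches();
    }

    // In java.util.regex, CASE_INSENSITIVE alone only folds US-ASCII;
    // UNICODE_CASE has to be added to fold letters such as U+00E9.
    static boolean equalsIgnoringCase(String pattern, String input, boolean unicodeCase) {
        int flags = Pattern.CASE_INSENSITIVE | (unicodeCase ? Pattern.UNICODE_CASE : 0);
        return Pattern.compile(pattern, flags).matcher(input).matches();
    }
}
```

With the French word "été" (written with \u escapes in the checks that accompany this sketch), asciiLetters rejects it while unicodeLetters accepts it, and the upper-case form only matches case-insensitively when unicodeCase is true.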

A Solution

While working on the SquidLib game development library, several useful classes needed to be marked as incompatible with GWT due to the lack of a useful regular expression implementation that works cross-platform. I set out to find a pure-Java regular expression engine that could run without a serious speed loss on desktop and mobile platforms but could still work on GWT.

I found JRegex, a project by Sergey A. Samokhodkin that was last substantially updated in 2002, and decided to modernize it (using generics in collections that warn without them, using the newer HashMap and ArrayList instead of Hashtable and Vector for better single-threaded performance, and so on). JRegex, at first glance, appeared to meet all the criteria I initially had, and now that it has been modernized, its speed is reasonable for less-intensive usages of regular expressions (when matching or replacing on desktop, expect 0.3x to 0.5x the rate of java.util.regex, probably never faster than the normal regular expressions on desktop but always better on GWT when compared with not having an implementation at all), and it is essentially compatible with a superset of the java.util.regex API.

The downside was, it originally used the Unicode character database that ships with Java... except on GWT. With some tricky code to minimize file sizes that encodes a bitset with a small int array and a String, a technique by gagern using the Node.js Unicode database, I managed to get the full Unicode 13.0.0 category information for the Basic Multilingual Plane (and later, case folding information) in a single small-ish file of Java code. The compression code is not in the distributed jar of source, but is in etc/generator.js, and the end result is distributed in src/main/java/regexodus/Category.java (which also has case folding information, and uses code adapted from libGDX by way of jdkgdxds). Now RegExodus acts like an updated version of JRegex that carries much of Unicode with it, in a jar no more than 1/7 of a megabyte in size (currently). Though testing so far has been light, it seems to be fully compatible with GWT, in development or production mode.
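To give a feel for why a set of BMP characters can be stored compactly, here is a deliberately simplified sketch. It is not the encoding that etc/generator.js produces and not gagern's technique; it only shows the general idea of storing run boundaries in a String instead of one bit per char. The ToggleSet name and its layout are invented for this example.

```java
// Hypothetical, simplified stand-in for a compressed character set.
// Each char in `boundaries` marks a position where membership flips,
// starting from "not a member" at U+0000.
final class ToggleSet {
    private final String boundaries; // must be sorted in ascending order

    ToggleSet(String boundaries) {
        this.boundaries = boundaries;
    }

    boolean contains(char c) {
        // Binary search for how many boundaries are <= c.
        int lo = 0, hi = boundaries.length();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (boundaries.charAt(mid) <= c) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        // An odd number of flips means we are inside a member run.
        return (lo & 1) == 1;
    }

    // 'a' turns membership on, '{' (the char after 'z') turns it off.
    static final ToggleSet ASCII_LOWER = new ToggleSet("a{");
}
```

A category made of a few hundred contiguous runs needs only a few hundred chars in this layout, rather than 65536 bits, which is the kind of saving that makes shipping category data in a single source file practical.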

The name RegExodus comes from both the idea of taking Java regular expressions and setting them free to roam various platforms, and because The Ten Commandments was on TV when I was thinking of names for the project.

Usage

Code-wise, usage should be transparent or require minimal changes if porting from java.util.regex code like Pattern and Matcher; just change the package from java.util.regex.Pattern to regexodus.Pattern, or use the new-in-0.1.6 regexodus.regex package that copies java.util.regex's API more closely. It is possible that GWT's option for "super-sourced" packages to replace unimplemented parts of the JRE may work here to imitate an implementation of java.util.regex with a close approximation, but it hasn't been attempted. Super-sourcing won't be completely compatible at the moment, but is likely to work at least reasonably well with regexodus.regex.
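A minimal sketch of what such a port might look like is below. It is written against java.util.regex so that it runs on a plain JDK; the comments mark the two imports that would change. That the rest compiles unchanged against regexodus.Pattern and regexodus.Matcher is an assumption based on the description above, not something checked here. WordFinder is a made-up class name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher; // when porting: regexodus.Matcher (assumed)
import java.util.regex.Pattern; // when porting: regexodus.Pattern

// Hypothetical example class, not part of RegExodus.
final class WordFinder {
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Collects every run of letters in the text, in order.
    static List<String> words(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            found.add(m.group(0)); // group 0 is the whole match
        }
        return found;
    }
}
```

Keeping the Pattern in a static final field means only the import lines and nothing in the method body should need attention during a port.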

Some usage will be easier if you can fully embrace RegExodus' style of regular expressions, and the classes that use them. The Replacer class has a different API than java.util.regex offers, and you can implement the Substitution interface for more-involved replacements. The Category class has useful Unicode 13.0.0 info that isn't especially easy to get from the JDK, and is great when you want to evaluate if a particular character is, say, a lower-case letter. The documentation for the regex flavor available here is mostly in the class JavaDocs for Pattern.

Installation should be simple if you use a build tool like Maven, Gradle, or the like. For version or snapshot releases you can use JitPack (this repository is recommended if you want snapshots) and Maven Central is an easy alternative for version releases if you aren't able to add a third-party repository. JitPack instructions for common build tools are here, and Maven Central instructions for more build tools are here; the 0.1.16 release is preferred for now, based on the 1.2 line of JRegex. You can also download pre-built jars from the GitHub Releases page, or build from source; this has no dependencies other than JUnit for tests.

Changelog

0.1.2 adds support for a missing Java regex feature, \Q...\E literal sections. It also fixes some not-insignificant issues with features not present in Java's regex implementation, like an array index bug involving \m..., where those character escapes with base-10 numbers could check outside the input string and crash if the escape was at the end of a pattern.
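For reference, this is how \Q...\E behaves in java.util.regex, which is the behavior this release sets out to support. The sketch uses the JDK classes so it runs anywhere; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing literal sections with the JDK engine.
final class LiteralSectionDemo {
    // Wraps the text in \Q...\E so metacharacters such as '+' lose their meaning.
    // Caveat: text that itself contains \E would end the literal section early.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.compile("\\Q" + literal + "\\E").matcher(input).matches();
    }

    // The same text interpreted as an ordinary regular expression.
    static boolean matchesAsRegex(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }
}
```

Given the text 1+1=2, the literal form only matches the exact input 1+1=2, while the plain regex form treats + as a quantifier and matches inputs like 11=2 instead.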

0.1.3 fixes a bug in case-insensitive matching where it would previously only match lower-case text if case-insensitive mode was on. Now it correctly matches both "A" and "a" if given either Pattern.compile("A", "i") or Pattern.compile("a", "i"). This was thought to have been tested, but the test wasn't very good and this behavior may have persisted through several releases.

0.1.4 fixes a nasty bug that broke many long ranges in a character class (spanning between Unicode blocks) where character ranges weren't always what they claimed to be. If you use a version earlier than 0.1.4, updating to this or any more recent version is strongly recommended.

0.1.5 adds additional features to backreferences and replacement, making certain replacement-based operations much more convenient, like iterative replacement that only replaces one match at a time. It also enhances backreferences so you could require that an already-captured group be followed by that same group in reverse character order ("cat" could be required to be followed by "tac"), among other features like locally-case-insensitive backreferences, or even mirrored brackets (if one of "(" or "{" was captured, you could require the backreference to be the correctly matching ")" or "}") for most of the Unicode brackets.

0.1.6 adds an additional compatibility mode for Java regex compatibility, with the new regexodus.regex package that can be swapped in as a mostly-complete replacement for java.util.regex on platforms that don't have it. It also adds some additional pseudo-Unicode categories for matching the tricky rules that govern valid Java identifiers: Js for the start of a Java identifier, and Jp for any subsequent part of a Java identifier. These can be used to match a complete Java identifier with Pattern.compile("\\p{Js}\\p{Jp}*"). A convenience class, ChanceSubstitution, allows an easier way to randomize the times when a replacement is actually performed, leaving the match unchanged otherwise. Matcher.foundStrings is a simple wrapper around the new MatchIterator.asList, which both allow you to get all matching portions of a String as a List of Strings, even if there are no groups in the Matcher's Pattern.
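For comparison, on platforms that do have java.util.regex, the closest JDK spelling of the same rule uses the javaJavaIdentifierStart and javaJavaIdentifierPart properties. The sketch below uses those JDK names rather than Js and Jp so it runs without RegExodus; treating the two as equivalent is an assumption drawn from the description above. IdentifierDemo is a made-up class name.

```java
import java.util.regex.Pattern;

// Hypothetical demo class using the JDK's own identifier properties.
final class IdentifierDemo {
    static final Pattern JAVA_IDENTIFIER =
            Pattern.compile("\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*");

    // True if the whole string has the shape of a Java identifier.
    // This checks characters only; it does not reject keywords such as "class".
    static boolean isIdentifierShaped(String s) {
        return JAVA_IDENTIFIER.matcher(s).matches();
    }
}
```

Strings such as foo_1 and $x pass this check, while 1foo and a-b do not.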

0.1.7 fixes a bug when getting a String from a Pattern that could (and often did, when debugging or serializing to text) overflow the stack. It also adds two new methods to make serializing Patterns easier, and allows you to retrieve the flags from a Pattern. The bug fixed was relatively severe under some circumstances, so updating is recommended.

0.1.8 had serious issues on GWT and has been replaced by 0.1.16.

0.1.9 improves GWT compatibility and adds the Unicode-like categories for horizontal, vertical, and all whitespace as Gh, Gv, and G, respectively (think G for Gap). These whitespace Category values include characters that are conspicuously absent from the Unicode Z categories, such as tabs and all newline characters in current use (\t, \r and \n are all in control categories instead of whitespace under Z). The GWT compatibility changes entailed a package change, taking regexodus.regex and moving it to emu.java.util.regex, but this allows third-party libraries to use the normal Java regex API via GWT's super-source mechanism and have it call RegExodus' shim layer instead, transparently. There are possible issues if other libraries also super-source to implement java.util.regex; libGDX does this and there are probably others out there. I'm not sure what takes precedence in that case, but it seems to work so far in basic GWT testing (SuperDev mode).
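The gap in the Z categories is easy to confirm with the JDK alone. The sketch below uses only java.util.regex and java.lang.Character (the JDK's \h and \v play a role similar to what Gh and Gv are described as doing here, though the exact character sets may differ); the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; uses JDK classes only.
final class WhitespaceCategoryDemo {
    // True if the char is in one of the Unicode separator categories (Zs, Zl, Zp).
    static boolean isSeparator(char c) {
        return Pattern.matches("\\p{Z}", String.valueOf(c));
    }

    // True if the char is in the control category (Cc).
    static boolean isControl(char c) {
        return Character.getType(c) == Character.CONTROL;
    }

    // The JDK's horizontal whitespace class (Java 8 and later).
    static boolean isHorizontal(char c) {
        return Pattern.matches("\\h", String.valueOf(c));
    }

    // The JDK's vertical whitespace class (Java 8 and later).
    static boolean isVertical(char c) {
        return Pattern.matches("\\v", String.valueOf(c));
    }
}
```

A plain space is a separator, but tab, carriage return, and line feed are all control characters that \p{Z} does not match, which is why separate whitespace categories are useful.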

0.1.10 fixes compatibility with GWT 2.8.2 and lets the \p{InBasicLatin} and \P{Greek} types of Unicode block matchers work (for the first time, possibly?). It also updates Unicode Standard compatibility to 11.0.0, though only for the Basic Multilingual Plane.

0.1.11 fixes a 20-year-old bug in Matcher.setTarget(CharSequence, int, int) that affected any targets with a non-zero start. It replaces the utility data structure CharCharMap with an implementation from jdkgdxds, which is largely the same as libGDX's style, and fixes a few long-standing bugs in the old version. It removes the utility data structure CharArrayList because it was completely unnecessary here. There's an option in replacements to upper-case a group captured from the search string, which rounds out the previous lower-casing option. Finally, the Unicode data has been updated to 13.0.0.

0.1.12 is mostly a minor update, but fixes a bug in case mapping where certain chars would have strangely-incorrect results for upper-case or lower-case conversions, such as Category.caseUp('s') returning 'ſ' (the 1700s-era long-S). Some actually-useful parts of PerlSubstitution are now public and documented, where before they were only usable if reading the RegExodus sources. There's also some cleanup on internals, which may help with debugging.

0.1.13 fixes some long-standing, usually minor issues with the equals() method on Pattern and Term (it was extremely rarely used, but could enter an infinite loop). It makes sure Pattern compares the flags, such as case-insensitive mode or ignore-whitespace mode, as part of the Pattern's equality. This release also matches Java's behavior with the \G escape at the start of text.
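The Java behavior being matched can be seen with java.util.regex directly: \G anchors at the start of the input for the first match and at the end of the previous match afterwards. The sketch below uses the JDK classes; ContiguousDigits is a made-up name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing \G with the JDK engine.
final class ContiguousDigits {
    // Each match must begin exactly where the previous one ended
    // (or at the start of the text for the first match).
    private static final Pattern NEXT_DIGIT = Pattern.compile("\\G\\d");

    // Returns the unbroken run of digits at the very start of the text.
    static String leadingDigits(String text) {
        StringBuilder digits = new StringBuilder();
        Matcher m = NEXT_DIGIT.matcher(text);
        while (m.find()) {
            digits.append(m.group());
        }
        return digits.toString();
    }
}
```

For the input 123a45 this collects 123 and then stops, because after the letter there is no position that both follows the previous match and holds a digit; for a12 it collects nothing at all.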

0.1.14 shouldn't be used, because the java.util.regex replacement for GWT has an error in it. This prevents compilation on GWT, so use 0.1.16 instead.

0.1.15 includes some GWT fixes and changes the inherits line you need to:

<inherits name='regexodus.regexodus' />

It also fixes the behavior of Matcher in some cases, and of Category for blocks and non-BMP letters (non-BMP letters still aren't supported; that is now just official).

0.1.16 fixes an issue with state pollution when a Matcher was reused, even when done correctly. The only other change should be Matcher.groupv() being documented and using the empty String "" where there is no match for a group, instead of the String "empty" that it used before.

Credit

This is a modified fork of JRegex, by Sergey A. Samokhodkin, meant to improve compatibility with Android and GWT. This builds off Ed Ropple's work to make JRegex Maven-friendly. This fork started with Ed Ropple's copy of jregex 1.2_01 (available on GitHub). In addition, portions of this code use modified versions of the collections from jdkgdxds (in the regexodus.ds package, CharCharMap is derived from jdkgdxds, which is derived from libGDX). Significant work by the team responsible for the Node.js Unicode database is invaluable here, especially gagern for creating the compression technique that RegExodus uses on Unicode category data.

You can get the original jregex at: http://sourceforge.net/projects/jregex

License

3-Clause BSD. See the file LICENSE in this directory for details.