Support escaped characters in character class #47
Comments
] disallowed as token in second stage. Added details to prepare for #47
In addition to \ and ] (the two characters that must be escaped in a character class), this is also an opportunity to use quotes (" or ') as tokens. They are currently (up to 4.0.1) considered illegal as tokens. One of them could be used if and only if the code contains neither of them: one is turned into a token, and RegPack wraps the packed string with the other. (Remember: if both are present, instances of the wrapping quote inside the code must be escaped, resulting in lost bytes.) => Added as #55
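To make the quote rule concrete, here is a minimal sketch. `chooseQuotes` is a hypothetical helper, not a RegPack function, and wrapping with the rarer quote when both are present is an assumption of this sketch rather than documented behavior.

```javascript
// Hypothetical helper (not part of RegPack): pick the quote that wraps the
// packed string, and report whether the other quote is free to act as a token.
function chooseQuotes(code) {
  const doubles = code.split('"').length - 1; // occurrences of "
  const singles = code.split("'").length - 1; // occurrences of '

  if (doubles === 0 && singles === 0) {
    // Neither quote appears: wrap with one, the other becomes a candidate token.
    return { wrapper: '"', extraToken: "'" };
  }
  if (doubles > 0 && singles > 0) {
    // Both appear: every instance of the wrapper must be escaped (one lost
    // byte each), so this sketch assumes wrapping with the rarer one.
    return { wrapper: doubles <= singles ? '"' : "'", extraToken: null };
  }
  // Exactly one appears: wrap with the absent one, nothing to escape,
  // but the present quote cannot be a token.
  return { wrapper: doubles > 0 ? "'" : '"', extraToken: null };
}
```

Only the first branch yields an extra token, which matches the "if and only if the code contains neither" condition.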
Extra requirement for the token range choice algorithm: as a tiebreaker, prefer "readable" characters (32-126) over control characters (1-31).
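The tiebreaker could be expressed as a sort comparator. This is only a sketch: the `{ first, last }` range shape and the names `isReadable` and `compareRanges` are assumptions, not RegPack's actual data structures.

```javascript
// Readable characters are codes 32-126, as stated in the requirement.
function isReadable(code) {
  return code >= 32 && code <= 126;
}

// Hypothetical comparator: longest range first; on equal length, a range
// whose two bounds are readable sorts before one with a control character.
function compareRanges(a, b) {
  const lenA = a.last - a.first + 1;
  const lenB = b.last - b.first + 1;
  if (lenA !== lenB) return lenB - lenA;
  const readableA = isReadable(a.first) && isReadable(a.last);
  const readableB = isReadable(b.first) && isReadable(b.last);
  return Number(readableB) - Number(readableA);
}

const candidates = [
  { first: 1, last: 5 },   // control characters, length 5
  { first: 40, last: 44 }, // readable, length 5
  { first: 60, last: 70 }, // readable, length 11
];
candidates.sort(compareRanges);
```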
This turns out to be way more complex than expected:
- replace ] and \ by unescaped characters if the last range has leftovers
- avoid ^ at the beginning of the first range
- crusher records strings that were not compressed for lack of a token (packer has one more)
- side impact on PatternViewer for recorded uncompressed strings
After running the whole set of benchmarks, it turns out that this feature is rarely triggered, if ever. Additionally, a shortage of tokens is uncommon: it happens exactly once (with Flappy Dragon Classic), when an already-compressed string occupies most of the ASCII space, including our special characters. So, no improvement on the benchmarks so far.
Characters 92 \ and 93 ] need escaping in a character class.
Currently, and following #45, those characters can neither begin nor end a range in a character class. They are basically ignored, and the range instead ends at 91 [ or begins at 94 ^.
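The current behavior described above can be sketched as follows. `buildRanges`, its parameters and the `{ first, last }` shape are hypothetical and only illustrate the idea; RegPack's own code may be organized differently.

```javascript
// Sketch: group the unused character codes into contiguous ranges, treating
// 92 (\) and 93 (]) as unavailable so that no range begins or ends on them.
// usedCodes: codes already present in the code being packed (assumption).
function buildRanges(usedCodes, first = 1, last = 126) {
  const blocked = new Set([...usedCodes, 92, 93]);
  const ranges = [];
  let start = null;
  // Iterate one step past `last` so that an open range gets closed.
  for (let c = first; c <= last + 1; c++) {
    const free = c <= last && !blocked.has(c);
    if (free && start === null) start = c;
    if (!free && start !== null) {
      ranges.push({ first: start, last: c - 1 });
      start = null;
    }
  }
  return ranges;
}

// With nothing used between 88 (X) and 97 (a), the run is split in two:
// one range ending at 91 [ and one beginning at 94 ^.
const split = buildRanges([], 88, 97);
```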
It might however be useful to include them to gain extra tokens, even at the cost of one extra byte: a range with an escaped character such as [\\-c] costs 4, instead of 3 for a non-escaped range such as [Z-c].
Ranges are currently sorted longest to shortest, and tokens are taken from them in that order. The sorting criteria will need to be reconsidered to account for the variation in cost (3 or 4 bytes): maybe a gain/cost ratio, or an algorithm that finds as many tokens as required at minimal cost.
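One possible shape for the gain/cost idea is sketched below. All names (`rangeCost`, `pickRanges`, the `{ first, last }` shape) are hypothetical. The greedy ordering by ratio is a simple heuristic and is not guaranteed to reach the minimal total cost.

```javascript
// Bytes needed to write one bound inside a character class:
// 92 (\) and 93 (]) must be escaped, so they take two bytes.
const boundCost = (c) => (c === 92 || c === 93 ? 2 : 1);
const rangeSize = (r) => r.last - r.first + 1;

// Cost in bytes of a range inside the brackets, e.g. Z-c is 3 and \\-c is 4.
// Characters strictly inside the range cost nothing, escaped or not.
function rangeCost(r) {
  if (rangeSize(r) === 1) return boundCost(r.first);
  if (rangeSize(r) === 2) return boundCost(r.first) + boundCost(r.last); // no hyphen needed
  return boundCost(r.first) + 1 + boundCost(r.last);
}

// Greedy sketch: take ranges by decreasing tokens-per-byte until enough
// tokens are available.
function pickRanges(ranges, needed) {
  const ratio = (r) => rangeSize(r) / rangeCost(r);
  const sorted = [...ranges].sort((a, b) => ratio(b) - ratio(a));
  const chosen = [];
  let tokens = 0;
  let cost = 0;
  for (const r of sorted) {
    if (tokens >= needed) break;
    chosen.push(r);
    tokens += rangeSize(r);
    cost += rangeCost(r);
  }
  return { chosen, tokens, cost };
}
```

For example, `rangeCost({ first: 92, last: 99 })` is 4 for [\\-c], while `rangeCost({ first: 90, last: 99 })` is 3 for [Z-c], matching the figures above.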