Support escaped characters in character class #47

Siorki · 2016-02-24T22:00:04Z

Characters 92 \ and 93 ] need escaping in character class.
Currently, and following #45, those characters can neither begin nor end a range in a character class. They are basically ignored, and the range instead ends at 91 [ or begins at 94 ^.

It might however be useful to include them to gain extra tokens, even at the cost of one extra byte: a range with an escaped character such as [\\-c] costs 4, instead of 3 for a non-escaped range [Z-c].
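To make the 3-versus-4 byte count concrete, here is a minimal sketch. The helper names (`classChar`, `rangeText`, `rangeCost`) are made up for illustration and are not RegPack functions; the only assumption is that \ and ] each take one extra backslash inside the class.

```javascript
// Per this issue, \ (92) and ] (93) need a backslash inside a character class.
const NEEDS_ESCAPE = new Set([92, 93]);

// Hypothetical helper: one character as written inside [...].
function classChar(code) {
  return (NEEDS_ESCAPE.has(code) ? "\\" : "") + String.fromCharCode(code);
}

// Hypothetical helper: text of the range first..last (character codes, inclusive).
function rangeText(first, last) {
  return first === last
    ? classChar(first)
    : classChar(first) + "-" + classChar(last);
}

// Cost in bytes of the range, not counting the surrounding brackets.
function rangeCost(first, last) {
  return rangeText(first, last).length;
}

console.log(rangeText(90, 99), rangeCost(90, 99)); // Z-c 3
console.log(rangeText(92, 99), rangeCost(92, 99)); // \\-c 4
```

With the same helpers, a range reduced to a single escaped character, such as `rangeCost(92, 92)`, comes out at 2 bytes.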

Ranges are currently sorted longest to shortest, and tokens are taken from them in that order. The sorting criterion will need to be reconsidered to account for the variation in cost (3 or 4 bytes): maybe a gain/cost ratio, or an algorithm that finds as many tokens as required at minimal cost.
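One possible reading of the gain/cost idea, as a sketch only: rank ranges by tokens provided per byte spent, then take ranges until enough tokens are available. The `{first, last}` range shape and the names `describe` and `pickRanges` are assumptions for this example, not the packer's actual data model, and a greedy pass like this is a heuristic that is not guaranteed to reach the minimal total cost.

```javascript
// \ (92) and ] (93) cost one extra byte each when they bound a range.
const ESCAPED = new Set([92, 93]);

// Hypothetical helper: annotate a {first, last} range with token count and byte cost.
function describe(range) {
  const single = range.first === range.last;
  const escapes =
    (ESCAPED.has(range.first) ? 1 : 0) +
    (!single && ESCAPED.has(range.last) ? 1 : 0);
  return {
    first: range.first,
    last: range.last,
    tokens: range.last - range.first + 1,
    cost: (single ? 1 : 3) + escapes, // "a" is 1 byte, "a-z" is 3
  };
}

// Greedy selection: best tokens-per-byte ratio first, stop once `needed` is covered.
function pickRanges(ranges, needed) {
  const sorted = ranges
    .map(describe)
    .sort((a, b) => b.tokens / b.cost - a.tokens / a.cost);
  const picked = [];
  let available = 0;
  for (const r of sorted) {
    if (available >= needed) break;
    picked.push(r);
    available += r.tokens;
  }
  return picked;
}

const candidates = [
  { first: 35, last: 38 }, // #-&, 4 tokens for 3 bytes
  { first: 92, last: 99 }, // \-c, 8 tokens for 4 bytes
  { first: 65, last: 66 }, // A-B, 2 tokens for 3 bytes
];
console.log(pickRanges(candidates, 10).map(r => r.first)); // [ 92, 35 ]
```

Under this ratio the escaped range \-c still ranks first, because its 8 tokens outweigh the extra byte, whereas a short escaped range such as \-] (2 tokens for 5 bytes) would fall to the bottom.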

Siorki · 2016-02-26T20:21:19Z

In addition to \ and ] (the two escaped characters in a character class), this is also an opportunity to use quotes (" or ') as tokens. They are currently (up to 4.0.1) considered illegal as tokens.

One of them could be used, if and only if the code contains neither of them. One could be turned into a token, and RegPack would wrap the packed string with the other one (remember: if both are present, instances of the wrapping quote inside the code must be escaped, resulting in lost bytes).
Both can be included in the character class (to increase block size), but only one will be used (suggestion: '), the other one (") being treated the same way as characters LF (10) and CR (13): included in the character class, but absent from the code and not used as tokens.
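The decision above could look roughly like this. `planQuotes` and its return shape are invented for this sketch; falling back to " as the wrapper when both quotes are present is an arbitrary choice made here, not something RegPack is confirmed to do.

```javascript
// Hypothetical helper: decide which quote wraps the packed string and whether
// the other quote is free to serve as a token.
function planQuotes(code) {
  const hasSingle = code.includes("'");
  const hasDouble = code.includes('"');

  if (!hasSingle && !hasDouble) {
    // Neither quote appears: ' becomes a token, " wraps the packed string.
    return { delimiter: '"', quoteToken: "'", escapingNeeded: false };
  }
  if (hasSingle && hasDouble) {
    // Both appear: no quote token, and the wrapping quote must be escaped
    // wherever it occurs in the code (assumption: " is kept as the wrapper).
    return { delimiter: '"', quoteToken: null, escapingNeeded: true };
  }
  // Exactly one appears: wrap with the other one, no quote token available.
  return { delimiter: hasSingle ? '"' : "'", quoteToken: null, escapingNeeded: false };
}

console.log(planQuotes("a=b+1").quoteToken);   // '
console.log(planQuotes('a="x"').delimiter);    // '
```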

=> Added as #55

Siorki · 2016-03-05T10:54:20Z

Extra requirement for the token range choice algorithm: as a tiebreaker, prefer "readable" characters (32-126) over control characters (1-31).
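As a sketch of that tiebreaker, assuming the primary criterion is still range length and that ranges are plain `{first, last}` objects (both are assumptions for the example, as are the names `readableCount` and `compareRanges`):

```javascript
// "Readable" as defined above: character codes 32 to 126.
const isReadable = code => code >= 32 && code <= 126;

// Hypothetical helper: number of readable characters in a {first, last} range.
function readableCount(range) {
  let count = 0;
  for (let code = range.first; code <= range.last; code++) {
    if (isReadable(code)) count++;
  }
  return count;
}

// Comparator: longest range first; on equal length, most readable characters first.
function compareRanges(a, b) {
  const lengthA = a.last - a.first + 1;
  const lengthB = b.last - b.first + 1;
  if (lengthA !== lengthB) return lengthB - lengthA;
  return readableCount(b) - readableCount(a);
}

const ranges = [
  { first: 1, last: 4 },   // control characters
  { first: 35, last: 38 }, // #$%&
  { first: 65, last: 70 }, // A-F
];
console.log(ranges.sort(compareRanges).map(r => r.first)); // [ 65, 35, 1 ]
```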

Siorki · 2016-11-21T21:20:35Z

This turns out to be way more complex than expected:

- Using escaped characters should not degrade the compression rate: they should only be used if there is a net gain in bytes, that is, if there are still strings to pack but no tokens left. Alternate route: use them normally, then replace them with leftover tokens, if any.
- Phasing escaped characters out of ranges can result in a range starting with ^ (94), which must not appear at the beginning of a character class, where it would be interpreted as a negation.
- The crusher only outputs data related to strings actually replaced. With \ and ], the packer has access to one extra token, so it needs information about strings that were not replaced for lack of tokens.

- Replace ] and \ by unescaped characters if the last range has leftovers
- Avoid ^ at the beginning of the first range
- Crusher records strings that were not compressed for lack of a token (packer has one more)
- Side impact on PatternViewer for recorded uncompressed strings
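The ^ pitfall is easy to reproduce, and one way around it is to reorder the ranges so that another one comes first. The sketch below is only an illustration: `buildCharacterClass` is a made-up name, ranges are given as already-formatted strings, and escaping ^ when no other range is available is a fallback chosen here, not necessarily what the packer does.

```javascript
// Hypothetical helper: join range strings into a character class, making sure
// the class does not start with ^ (which would negate it).
function buildCharacterClass(pieces) {
  const ordered = pieces.slice();
  if (ordered.length > 0 && ordered[0].startsWith("^")) {
    const safe = ordered.findIndex(piece => !piece.startsWith("^"));
    if (safe > 0) {
      // Move a range that does not start with ^ to the front.
      [ordered[0], ordered[safe]] = [ordered[safe], ordered[0]];
    } else {
      // No alternative: escape the ^, at the cost of one byte.
      ordered[0] = "\\" + ordered[0];
    }
  }
  return "[" + ordered.join("") + "]";
}

// Naive concatenation gives [^-c], a negated class that rejects "c".
console.log(new RegExp("[^-c]").test("c")); // false

const fixed = buildCharacterClass(["^-c", "A-F"]);
console.log(fixed);                        // [A-F^-c]
console.log(new RegExp(fixed).test("c"));  // true
```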

Siorki · 2016-11-23T22:38:04Z

After running the whole set of benchmarks, it turns out that this feature is rarely triggered, if ever.
Most js1k entries use arrays, and therefore include characters [ and ] in the code, leaving \ alone in the middle. As a single-character range with a cost of 2 bytes, it ranks last when tokens are ordered.

Additionally, a shortage of tokens is an uncommon sight. It happens exactly once (with Flappy Dragon Classic), when an already-compressed string occupies most of the ASCII space, including our special characters \ and ], meaning they won't be turned into tokens either.

So, no improvement on the benchmark so far.

Siorki added the enhancement label Feb 24, 2016

Siorki added this to the 5.0 milestone Feb 24, 2016

Siorki self-assigned this Feb 24, 2016

Siorki mentioned this issue Feb 24, 2016

Closing bracket ] needs to be escaped in character class #45

Closed

Siorki added a commit that referenced this issue Feb 26, 2016

#45 - no ] in character class

6b5e255

] disallowed as token in second stage. Added details to prepare for #47

Siorki closed this as completed Nov 23, 2016

Siorki mentioned this issue Nov 23, 2016

Crusher phase - list patterns that are "almost" gains #48

Closed

kanaka mentioned this issue Feb 20, 2017

Unpacked source has "in" strings in wrong places. #73

Closed

Siorki mentioned this issue Sep 24, 2018

Don't use "\" as a token if avoiding it makes the output smaller #83

Closed
