Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support escaped characters in character class #47

Closed
Siorki opened this issue Feb 24, 2016 · 4 comments
Closed

Support escaped characters in character class #47

Siorki opened this issue Feb 24, 2016 · 4 comments
Assignees
Milestone

Comments

@Siorki
Copy link
Owner

Siorki commented Feb 24, 2016

Characters 92 \ and 93 ] need escaping in character class.
Currently, and following #45, those caracters cannot begin nor end a range in a character class. They are basically ignored, and the range instead ends in 91 [ or begins with 94 ^.

It might however be useful to include them to gain extra tokens, even at the cost of one extra byte: a range with an escaped character such as [\\-c] costs 4, instead of 3 for a non-escaped range [Z-c].

Ranges are current sorted longest to shortest, and tokens taken from them in that order. The sorting criteria will need to be reconsidered to account for the variations in cost (3 or 4 bytes), maybe a gain/cost ratio, or an algorithm to find as many tokens as required at the minimal cost.

@Siorki Siorki added this to the 5.0 milestone Feb 24, 2016
@Siorki Siorki self-assigned this Feb 24, 2016
Siorki added a commit that referenced this issue Feb 26, 2016
] disallowed as token in second stage. Added details to prepare for #47
@Siorki
Copy link
Owner Author

Siorki commented Feb 26, 2016

In addtion to \ and ] (the two escaped characters in a character class), this is also an opportunity to use quotes (" or ') as tokens. They are currently (up to 4.0.1) considered illegal as tokens.

One of them could be used, if and only if the code contains neither of them. One could be turned into a token, and RegPack will wrap the packed string with the other one (remember : if both are present, instances of the wrapping quote inside the code must be escaped, resulting in lost bytes).
Both can be included in the character class (to increase block size), but only one will be used (suggestion : '), the other one (") being considered the same way as character LF(10) and CR(13) : included in the character class, but absent from the code and not used as tokens.

=> Added as #55

@Siorki
Copy link
Owner Author

Siorki commented Mar 5, 2016

Extra requirement for the token range choice algorithm : as a tiebreaker, prefer "readable" characters (32-126) over control characters (1-31)

@Siorki
Copy link
Owner Author

Siorki commented Nov 21, 2016

This turns out to be way more complex than expected :

  • using escaped characters should not degrade the compression rate : they should only be used if there is a net gain in bytes - if there are still strings to pack but no tokens left. Alternate route : use them normally, but replace them with leftover tokens, if any
  • phasing escaped characters out of ranges can result in a range starting with ^(94) which should not be allowed at the beginning of a character class, as it will be interpreted as a negated class
  • the crusher only outputs data related to strings actually replaced. With \ and ], the packer has access to one extra token, so it needs information about strings that were not replaced for a lack of tokens

Siorki added a commit that referenced this issue Nov 23, 2016
 - replace ] and \ by unescaped characters if the last range has leftovers
 - avoid ^ at the beginning of the first range
 - crusher records strings that were not compressed for lack of a token (packer has one more)
 - side impact on PatternViewer for recorded uncompressed strings
@Siorki
Copy link
Owner Author

Siorki commented Nov 23, 2016

After running the whole set of benchmarks, it turns out that this feature is rarely triggered, if ever.
Most js1k entries use arrays, and therefore include characters [ and ] in the code, leaving \ alone in the middle. As a single-character range with a cost of 2 bytes, it ranks last when tokens are ordered.

Additionnally, a shortage in tokens is an uncommon sight. It happens exactly once (with Flappy Dragon Classic) when an already-compressed string occupies most of the ASCII-space, including our special characters \ and ], meaning they won't be turned into tokens either.

So, no improvement on the benchmark so far.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant