-
Notifications
You must be signed in to change notification settings - Fork 14
Crusher
#Crusher
##Principle
The crusher performs lossless compression, building a custom dictionary from the input string, then replacing its entries with tokens. Its implementation is based on the algorithms developed by Aivo Paas for JsCrush and Tim Down for First Crush that behave like byte-pair encoding, but with strings of any size.
##Inner clockwork
###List all tokens
An initial token list is built, containing all ASCII characters in the range [1-126] that are not present in the input string, with the exception of the following characters that are banned from becoming tokens :
- 10
<LF>
and 13<CR>
that cannot appear unescaped in a string literal - among 34
"
, 39'
and 96`
, the one [reserved to wrap the packed string](Module - quote strings) is excluded, the other two are allowed as tokens. - 92
\
requires escaping, so while it could be used, it would be a 2-byte token
###Build the dictionary
Two loops iterate on all characters in the string, and on all possible substring lengths. Each substring is recorded in the dictionary, along with its number of occurrences (if 2 or more). The algorithm is in O(n²), so is the memory footprint, which causes issue #46.
This search is performed only once at the beginning. Then each time a dictionary entry is assigned to a token, all remaining entries are checked, and removed if they are no longer present inside the modified input.
###Replace dictionary entries with tokens
The main loop of the crusher iterates until the token list is exhausted, or nothing can be compressed further. Each dictionary entry has its score and gain computed :
Gain is the number of bytes gained by replacing it with a token. All occurrences of the dictionary entry are replaced by a token (1 byte), then the token and the string of the entry are appended. The token will be listed again in the unpacking routine. Thus the gain is :
gain = nb_occurrences * (entry_length - 1) /* replacement */ - (1 + entry_length) /* replacement table */ - 1 /* token list */
= nb_occurrences * entry_length - nb_occurrences - entry_length - 2
Score is computed from the gain, the entry length in bytes, the number of occurrences and factors given by the user as parameters.
score = gain_factor * gain + length_factor * entry_length + copies_factor * nb_occurrences
The dictionary entry with the highest score is then assigned to the next token and replaced in the string. The token and entry are then appended to the string. In case of a tie for the score, gains are compared. If gains are identical, the tiebreaker defined by the user (most occurrences or longest string) applies.
###Add the unpacking routine
The unpacking routine is appended to the string. It performs the inverse operations :
- loops on all the tokens, explicitely listed in the string. If ES6 is enabled, a
for .. of
loop is used instead offor .. in
, gaining 6 bytes. - breaks the string on each occurrence of the token. Since the token and dictionary entry were appended to the string after the replacement, the last substring is the entry
- rebuilds the string with the entry replacing the token
Once all replacements are done, the original string is rebuilt. The routine then performs
eval()
on it. (or a variant thereof, if a preprocessor module has modified the behavior).
##Options
Parameter name | Type | Default | Effect |
---|---|---|---|
crushGainFactor |
integer | 1 |
gain factor in score computation |
crushLengthFactor |
integer | 0 |
length factor in score computation |
crushCopiesFactor |
integer | 0 |
occurrence count factor in score computation |
crushTiebreakerFactor |
integer | 1 |
tiebreaker : >0 for occurrences, <0 for length |
useES6 |
boolean | true |
enable or disable ES6 features |
##Implementation
The crusher is implemented by the method RegPack.findRedundancies()
.
The code is derived from the original implementation of JsCrush by Aivo Paas.