Improve Token detection #80

neophob · 2017-12-17T22:06:45Z

token detection can be improved, here's a sample output of my project:

--- One extra occurrence needed for a gain --
   val=7, gain=0->1 (+1), N=4, str = 16
   val=7, gain=0->1 (+1), N=4, str = (g
   val=7, gain=0->1 (+1), N=4, str = +d
   val=7, gain=0->1 (+1), N=4, str = ]=
   val=7, gain=0->1 (+1), N=4, str = ,u
   val=7, gain=0->1 (+1), N=4, str = *s
   val=8, gain=-1->1 (+2), N=2, str = var
   val=8, gain=-1->1 (+2), N=2, str = ar 
   val=8, gain=-1->1 (+2), N=2, str = if(
   val=8, gain=-1->1 (+2), N=2, str = /51
   val=8, gain=-1->1 (+2), N=2, str = )%5
   val=8, gain=-1->1 (+2), N=2, str = 16,
   val=8, gain=-1->1 (+2), N=2, str = set
   val=8, gain=-1->1 (+2), N=2, str = 2e3
   val=8, gain=-1->1 (+2), N=2, str = *s,
   val=19, gain=0->3 (+3), N=2, str = var

this means:

var and var exist -> should be combined as var - maybe combine it even with ar
16 and 16, exist -> should be combined as 16
ditto for *s, and *s

The text was updated successfully, but these errors were encountered:

Siorki · 2017-12-20T22:27:31Z

That output is produced by feature #48 introduced in v5.0.
Those substrings are candidates patterns- meaning that they were considered to be replaced by 1-character tokens, but not selected since that would come at a loss, or just break even.

This section is intended as a help for the user to improve compression. For instance, if you have s* somewhere in your code, you could swap the operands to have an extra instance of *s which would then get compressed and gain 1 byte.

The line giving you this information is :
val=7, gain=0->1 (+1), N=4, str = *s
In detail, it means that there are 4 instances of *s, and that replacing them with a token is useless (gain = 0). However if you could have an extra instance of the same substring, the compression would then bring a net gain of 1 byte.

As for merging 16 and 16,, your mileage may vary : you have 4 instances of the former and only 2 of the latter, so keeping the shortest one may or may not be the optimal solution. The gain upon packing depend on both string length and occurrence count, you can influence the algorithm by tweaking the score formula (default 1/0/0).

neophob · 2017-12-20T22:31:17Z

thanks @Siorki now that make sense!

about the formula - today i use a script which produce several regpack outputs with different settings. what about a brute force attempt to find the best compression? think thats feasible?

Siorki · 2017-12-22T22:50:10Z

Feasible yet difficult - that's exactly the point of issue #7

neophob · 2017-12-23T12:14:43Z

thanks!

neophob closed this as completed Dec 23, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve Token detection #80

Improve Token detection #80

neophob commented Dec 17, 2017

Siorki commented Dec 20, 2017

neophob commented Dec 20, 2017

Siorki commented Dec 22, 2017

neophob commented Dec 23, 2017

Improve Token detection #80

Improve Token detection #80

Comments

neophob commented Dec 17, 2017

Siorki commented Dec 20, 2017

neophob commented Dec 20, 2017

Siorki commented Dec 22, 2017

neophob commented Dec 23, 2017