Improve Token detection #80

Closed
neophob opened this issue Dec 17, 2017 · 4 comments

Comments

@neophob

neophob commented Dec 17, 2017

token detection can be improved, here's a sample output of my project:

--- One extra occurrence needed for a gain --
   val=7, gain=0->1 (+1), N=4, str = 16
   val=7, gain=0->1 (+1), N=4, str = (g
   val=7, gain=0->1 (+1), N=4, str = +d
   val=7, gain=0->1 (+1), N=4, str = ]=
   val=7, gain=0->1 (+1), N=4, str = ,u
   val=7, gain=0->1 (+1), N=4, str = *s
   val=8, gain=-1->1 (+2), N=2, str = var
   val=8, gain=-1->1 (+2), N=2, str = ar 
   val=8, gain=-1->1 (+2), N=2, str = if(
   val=8, gain=-1->1 (+2), N=2, str = /51
   val=8, gain=-1->1 (+2), N=2, str = )%5
   val=8, gain=-1->1 (+2), N=2, str = 16,
   val=8, gain=-1->1 (+2), N=2, str = set
   val=8, gain=-1->1 (+2), N=2, str = 2e3
   val=8, gain=-1->1 (+2), N=2, str = *s,
   val=19, gain=0->3 (+3), N=2, str = var 

this means:

  • "var" and "var " (with a trailing space) both exist -> they should be combined as "var", maybe even together with "ar "
  • "16" and "16," both exist -> they should be combined as "16"
  • ditto for "*s," and "*s"
@Siorki
Owner

Siorki commented Dec 20, 2017

That output is produced by feature #48 introduced in v5.0.
Those substrings are candidate patterns, meaning that they were considered for replacement by 1-character tokens but not selected, since that would come at a loss or just break even.

This section is intended to help the user improve compression. For instance, if you have s* somewhere in your code, you could swap the operands to get an extra instance of *s, which would then get compressed and gain 1 byte.

The line giving you this information is:
val=7, gain=0->1 (+1), N=4, str = *s
In detail, it means that there are 4 instances of *s, and that replacing them with a token brings no benefit (gain = 0). However, if you could have one extra instance of the same substring, the compression would then bring a net gain of 1 byte.
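The arithmetic behind those gain figures can be sketched as follows. This is a minimal sketch under an assumed cost model, not a quote from the packer's source: each copy of the string shrinks to a single token character, while the packed output has to store the string once plus a separator and the token itself. The function name `token_gain` is hypothetical. With that assumption, the numbers match the gain columns in the sample output above.

```python
def token_gain(length: int, count: int) -> int:
    """Net bytes saved by replacing `count` copies of a `length`-char string
    with a 1-character token.

    Assumed cost model (not confirmed by the thread):
      saved: count * (length - 1)  -> each copy shrinks to one character
      spent: length + 2            -> one stored copy, a separator, the token
    """
    return count * (length - 1) - (length + 2)


# (string, occurrence count) pairs taken from the sample output
samples = [("*s", 4), ("var", 2), ("var ", 2)]
for text, n in samples:
    now = token_gain(len(text), n)
    with_extra = token_gain(len(text), n + 1)
    print(f"str={text!r} N={n} gain={now}->{with_extra} ({with_extra - now:+d})")
```

Under this model one extra occurrence is always worth `length - 1` bytes, which is why longer candidates such as "var " jump by +3 while 2-character ones only gain +1.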

As for merging 16 and 16,, your mileage may vary: you have 4 instances of the former and only 2 of the latter, so keeping the shorter one may or may not be the optimal solution. The gain from packing depends on both string length and occurrence count; you can influence the algorithm by tweaking the score formula (default 1/0/0).
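To illustrate how the score formula can tip the choice between "16" and "16,", here is a rough sketch. It assumes that the three numbers in 1/0/0 are weights applied to gain, string length and occurrence count respectively, and it reuses the same assumed gain model (each copy becomes one character, at a fixed cost of `length + 2`). Both the weight interpretation and the `score` function are assumptions for illustration, not the tool's actual code.

```python
def score(length: int, count: int, weights=(1, 0, 0)) -> int:
    """Weighted score of a candidate string (assumed formula)."""
    gain_w, length_w, copies_w = weights
    # Assumed gain model: bytes saved minus the cost of storing the pattern
    gain = count * (length - 1) - (length + 2)
    return gain_w * gain + length_w * length + copies_w * count


# Occurrence counts from the thread: 4 instances of "16", 2 of "16,"
candidates = {"16": 4, "16,": 2}

for weights in [(1, 0, 0), (1, 2, 0), (1, 0, 1)]:
    best = max(candidates, key=lambda s: score(len(s), candidates[s], weights))
    print(weights, "->", repr(best))
```

With the default weights the shorter, more frequent string scores higher here; putting weight on length can favour the longer one instead, which is the kind of trade-off the comment above describes.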

@neophob
Author

neophob commented Dec 20, 2017

thanks @Siorki, now that makes sense!

about the formula - today i use a script which produces several regpack outputs with different settings. what about a brute force attempt to find the best compression? think that's feasible?
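A brute-force search of the kind described could look roughly like the sketch below. It tries every combination of three small integer weights and keeps the shortest packed output. The `pack` callable is a placeholder for however the packer is actually invoked (its real interface is not shown in this thread), and `fake_pack` is only a toy stand-in so the example runs on its own.

```python
from itertools import product
from typing import Callable, Iterable, Tuple

Weights = Tuple[int, int, int]


def best_settings(
    source: str,
    pack: Callable[[str, Weights], str],
    values: Iterable[int] = range(0, 3),
) -> Tuple[Weights, str]:
    """Try every weight triple and return the one giving the shortest output."""
    values = list(values)
    best = None
    for weights in product(values, repeat=3):
        packed = pack(source, weights)
        # Keep the first strictly shorter result seen so far
        if best is None or len(packed) < len(best[1]):
            best = (weights, packed)
    return best


def fake_pack(source: str, weights: Weights) -> str:
    # Toy stand-in for a real packer call: pretends larger weights shrink output
    return source[: max(1, len(source) - sum(weights))]


weights, packed = best_settings("x" * 10, fake_pack)
print(weights, len(packed))
```

The search space grows as the cube of the number of values tried per weight, so even a modest range means many packer runs; that cost is one reason an exhaustive search gets expensive quickly.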

@Siorki
Owner

Siorki commented Dec 22, 2017

Feasible yet difficult - that's exactly the point of issue #7

@neophob
Author

neophob commented Dec 23, 2017

thanks!

@neophob neophob closed this as completed Dec 23, 2017