Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redundancy lookup : keep only maximal patterns #93

Open
Siorki opened this issue Nov 25, 2018 · 9 comments
Open

Redundancy lookup : keep only maximal patterns #93

Siorki opened this issue Nov 25, 2018 · 9 comments
Assignees
Milestone

Comments

@Siorki
Copy link
Owner

Siorki commented Nov 25, 2018

Issue spun off from first point of #7

selection of the best substrings only, not all redundant strings. "Best" means that any string contained within another one, with the same redundancy count is discarded. To achieve this, always try to extend the string on each side until the copies count goes down

For instance, when performing forward lookup on getData(0,0) :

  • ge : 20 matches
  • get : 20 matches
  • getD : 6 matches
  • getDa : 6 matches
  • getDat : 6 matches
  • getData : 6 matches
  • getData( : 6 matches
  • getData(0 : 2 matches
  • getData(0, : 2 matches
  • getData(0,0 : 1 match (i.e. no other occurrence)

All occurrences of getD (6 of them) are also occurrences of getData(. Said otherwise, all occurrences of getD are followed by ata(. Thus it makes sense to pack, for the same cost of one token, getData( instead of getD. Going further, it makes sense not to keep getD in the candidate patterns. Same goes for ge and a few others and we shall only retain :

  • get : 20 matches
  • getData( : 6 matches
  • getData(0, : 2 matches

This should improve packing speed by testing fewer candidates, and also take care of #46 by lowering the memory footprint. Further testing shall make sure it does not worsen the compression.

@da0ka
Copy link

da0ka commented Dec 16, 2018

below(base64 coded) is test program without hash table. It's fast and compression ratio is almost the same as v5.0.2.

ZnVuY3Rpb24gaW5zZXJ0U29ydChBLGksbCl7CiBmb3IodmFyIGE9aSxjLGosazsrK2k8bDtBW2pdPWMpCglpZigoYz1BW2o9az1pXSk8QVthXSlmb3IoO2o+YTspQVtqXT1BWy0tal07CgllbHNlIGZvcig7YzxBWy0ta107KUFbal09QVtqPWtdfQoKZnVuY3Rpb24gdHFzb3J0KEEsYSxiKXtpZihhPGIpewogaWYoYi1hPDEwKXJldHVybiBpbnNlcnRTb3J0KEEsYSxiKzEpOwogdmFyIGM9QVthXSxkLGk9QVtiXSxqPWEsaz1iLHA9QVthK2I+PjFdOwogYz5pP2M8PXA/cD1jOnA+aXx8KHA9aSk6aTw9cD9wPWk6Yz5wJiYocD1jKTsKIGZvcihpPWE7ajw9azspCglpZigoYz1BW2pdKTxwKWQ9QVtpXSxBW2krK109YyxBW2orK109ZDsKCWVsc2UgaWYoYz5wKUFbal09QVtrXSxBW2stLV09YzsKCWVsc2UgaisrOwogdHFzb3J0KEEsYSwtLWkpO3Rxc29ydChBLCsrayxiKX19CgovLyBleGNsdWRlIG92ZXJsYXBwaW5nIHN1YnN0cmluZwpmdW5jdGlvbiBub3JtYWxpemUoSSxpLGMsZCl7CglpZihjPDMpewoJCWk9SVtpXS1JW2krMV07aWYoaTwwKWk9LWk7aT5kfHxjLS19CgllbHNlewoJCWZvcih2YXIgZj0wLG49YyxIPVtdO2M7KUhbLS1jXT1JW2krY107CgkJaWYobjw0KXsKCQkJaWYoSFsxXT5IWzJdKWk9SFsxXSxIWzFdPUhbMl0sSFsyXT1pOwoJCQlpZihIWzBdPkhbMV0pe2k9SFswXSxIWzBdPUhbMV0sSFsxXT1pOwoJCQlpZihIWzFdPkhbMl0paT1IWzFdLEhbMV09SFsyXSxIWzJdPWl9fQoJCWVsc2UgbjwxMD9pbnNlcnRTb3J0KEgsMCxuKTp0cXNvcnQoSCwwLG4tMSk7CgkJZm9yKGQrKztmPG47YysrKQoJCQlmb3IoaT1kK0hbZl07aT5IWysrZl07KTsKCX1yZXR1cm4gYwp9CnZhciBtaW5zdXA9MjsJLy8gbWluaW11bSBmcmVxdWVuY3kgb2Ygc3Vic3RyaW5nCgovLyBlbnVtZXJhdGluZyBhbGwgc3Vic3RyaW5ncwovLyBAUzogVW5pY29kZSBBcnJheQovLyBASTogSW5kZXggQXJyYXkKLy8gQGk6IHN0YXJ0Ci8vIEBuOiBlbmQKLy8gQGx2OiBsZW5ndGgKZnVuY3Rpb24gc3NtaW5lKFMsSSxpLG4sbHYpewoJaWYobi1pPG1pbnN1cClyZXR1cm47Cgl2YXIgcD1pK01hdGgucmFuZG9tKCkqKG4taSl8MCxyPUlbaV07CglJW2ldPUlbcF0sSVtwXT1yOwoJZm9yKHZhciB0PVNbSVtpXStsdl0sYT1pKzEsYz1uLTEsYj1hLGQ9Yzs7SVtjLS1dPXIpewoJCWZvcig7Yjw9YyYmKHI9U1tJW2JdK2x2XS10KTwxO2IrKykKCQkJcnx8KHI9SVthXSxJW2ErK109SVtiXSxJW2JdPXIpOwoJCWZvcig7Yjw9YyYmKHI9U1tJW2NdK2x2XS10KT49MDtjLS0pCgkJCXJ8fChyPUlbY10sSVtjXT1JW2RdLElbZC0tXT1yKTsKCQlpZihiPmMpYnJlYWs7CgkJcj1JW2JdLElbYisrXT1JW2NdCgl9CglwPWEtaSxyPWItYTtpZihwPHIpcj1wOwoJZm9yKHA9aSxiLT1yO3ItLTtJW2IrK109dCl0PUlbcF0sSVtwKytdPUlbYl07CglwPWQtYyxyPW4rfmQ7aWYocDxyKXI9cDsKCWZvcihwPWIsbi09cjtyLS07SVtuKytdPXQpdD1JW3BdLElbcCsrXT1JW25dOwoJc3NtaW5lKFMsSSxpLGkrPWItYSxsdiksYis9bit+ZDsKCWlmKGItaT49bWluc3VwJiZ+dClsdiYmUy5nZXQoUyxJLGksYi1pLGx2KSxzc21pbmUoUyxJLGksYixsdisxKQoJc3NtaW5lKFMsSSxuLWQrYyxuLGx2KQp9CgpmdW5jdGlvbiBzaXplRnVuYyhzKXsKCXZhciBhPS9bXHg4MC1cdTA3ZmZdLy50ZXN0KHMpLGI9L1tcdTA4MDAtXHVGRkZGXS8udGVzdChzKTsKCWlmKGEmJmIpYT1mdW5jdGlvbihBLGksbCxiLGMpewoJCWZvcihiPTA7bDtiKz1jPDEyOD8xOmM8MjA0OD8yOjMpYz1BWy0tbCtpXTsKCQlyZXR1cm4gYn07CgllbHNlIGlmKGEpYT1mdW5jdGlvbihBLGksbCxiKXsKCQlmb3IoYj0wO2w7KWIrPUFbLS1sK2ldPDEyOD8xOjI7CgkJcmV0dXJuIGJ9OwoJZWxzZSBpZihiKWE9ZnVuY3Rpb24oQSxpLGwsYil7CgkJZm9yKGI9MDtsOyliKz1BWy0tbCtpXTwxMjg/MTozOwoJCXJldHVybiBifTsKCWVsc2UgYT1mdW5jdGlvbihBLGksbCl7cmV0dXJuIGx9OwoJcmV0dXJuIGEKfQoKLy8gY29tcHJlc3NvcgovLyBAUzogIHRleHQKLy8gQGxvdzpsb3dlc3QgZ2Fpbi4gMSBvciAyCi8vIEBjZjogY29waWVzIGZhY3RvcgovLyBAZ2Y6IGdhaW4gZmFjdG9yCi8vIEBsZjogbGVuZ3RoIGZhY3RvcgovLyBAdGI6IFRpZWJyZWFrZXIuIDEgaXMgbG9uZ2VzdCBzdWJzdHJpbmcgZmlyc3QsIC0xIGlzIG1vc3QgY29waWVzIGZpcnN0LgpmdW5jdGlvbiBDRkcoUyxsb3csY2YsZ2YsbGYsdGIpewoJdmFyIGE9Uy5sZW5ndGgsIGI9MCwgYywgcT1TLnNwbGl0KCciJykubGVuZ3RoPlMuc3BsaXQoIiciKS5sZW5ndGg/IiciOiciJywgej1hLCBBPUFycmF5KGEpLCBCPVtdLCBDPVtdLCBJPUFycmF5KGEpLCBVPXt9LCBieXRlcz1zaXplRnVuYyhTKTsKCWZvcig7YTtVW1MuY2hhckF0KGEpXT0xKUFbSVstLWFdPWFdPVMuY2hhckNvZGVBdChhKTsKCWZvcihVW3FdPTE7KythPDEyNzspVVtjPVN0cmluZy5mcm9tQ2hhckNvZGUoYSldfHxhXjEwJiZhXjEzJiZhXjkyJiZDLnB1c2goYyk7CglBLmdldD1mdW5jdGlvbihfLEksaSxjLGQpewoJCWM9bm9ybWFsaXplKEksaSxjLGQpOwoJCXZhciBiPUFbaT1JW2ldXSwgeD1BW2RdLCB2OwoJCWlmKGM+MSYmKGI8MHhkYzAwfHxiPjB4ZGZmZikmJih4PDB4ZDgwMHx8eD4weGRiZmYpKXsJLy8gYXZvaWQgc3Vycm9nYXRlIHBhaXIKCQkJYj1ieXRlcyhfLGksKytkKSwgeD1iKmMtYy1iLWxvdywgdj1nZip4K2xmKmIrY2YqYzsKCQkJaWYoeD4wJiYodj5iZXN0fHxiZXN0PT09diYmKHg+d3x8eD09PXcmJnRiKmM+dGIqZikpKQoJCQkJdz14LCBmPWMsIGw9ZCwgcD1pLCBiZXN0PXYsIHNpemU9YjsKCQl9Cgl9Cglmb3IoO2M9Qy5wb3AoKTtCW2IrK109Yyl7CgkJdmFyIGJlc3Q9MCwgc2l6ZT0wLCBmPTAsIHA9MCwgbD0wLCB3PTA7CgkJQVt6XT0tMSwgc3NtaW5lKEEsSSwwLHosMCk7CgkJaWYodzwxKWJyZWFrOwoJCVM9Uy5zcGxpdCh3PVMuc3Vic3RyKHAsbCkpLmpvaW4oYykrYyt3OwoJCWZvcihhPXo9QVtsPSJsZW5ndGgiXT1JW2xdPVNbbF07YTspQVtJWy0tYV09YV09Uy5jaGFyQ29kZUF0KGEpCgl9CglhPVMuc3BsaXQoJyInKS5sZW5ndGgsIGI9Uy5zcGxpdCgiJyIpLmxlbmd0aDsKCWlmKFVbJ1xcJ10pUz1TLnJlcGxhY2UoL1xcL2csIlxcXFwiKTsKCWlmKFVbJ1xuJ10pUz1TLnJlcGxhY2UoL1xuL2csIlxcbiIpOwoJaWYoVVsnXHInXSlTPVMucmVwbGFjZSgvXHIvZywiXFxyIik7CglpZihhPmIpewoJCWlmKGI+MSlTPVMucmVwbGFjZSgvJy9nLCJcXCciKTthPSInIjsKCX1lbHNlewoJCWlmKGE+MSlTPVMucmVwbGFjZSgvIi9nLCdcXCInKTthPSciJwoJfQoJcmV0dXJuIl89IithK1MrYSsiO2ZvcihpIGluIEc9IitxK0IucmV2ZXJzZSgpLmpvaW4oIiIpK3ErIil3aXRoKF8uc3BsaXQoR1tpXSkpXz1qb2luKHBvcCgpKTtldmFsKF8pIgp9

@Siorki
Copy link
Owner Author

Siorki commented Jan 13, 2019

I'm having difficulties reading your code, too much minification.
Can you please elaborate on the actions performed by the functions normalize() and ssmine() ?

@da0ka
Copy link

da0ka commented Jan 14, 2019

normalize() correct pattern count. I is suffix array(almost), i is start position, c is count, d is pattern length-1.

input:ababab
not normalized:
"ab" x 3 (position:0,2,4)
"aba" x 2 (position:0,2)
"abab" x 2 (position:0,2)
"ba" x 2 (position:1,3)
"bab" x 2 (position:1,3)

normalized:
"ab" x 3
"ba" x 2
other 1

ssmine looks like quick sort(almost), enumerating all substrings by suffix array.
below is original source.
http://2boy.org/~yuta/publications/substring_mining.pdf

@da0ka
Copy link

da0ka commented Jan 14, 2019

function normalize(I,i,c,d){ if(c<3){ // occured 2 times. if distance is lower than length, c is 1 i=I[i]-I[i+1];if(i<0)i=-i;i>d||c-- } else{ for(var f=0,n=c,H=[];c;)H[--c]=I[i+c]; // copy all position of pattern if(n<4){ // occured 3 times, sort without loop if(H[1]>H[2])i=H[1],H[1]=H[2],H[2]=i; if(H[0]>H[1]){i=H[0],H[0]=H[1],H[1]=i; if(H[1]>H[2])i=H[1],H[1]=H[2],H[2]=i}} else n<10?insertSort(H,0,n):tqsort(H,0,n-1); // occured many times for(d++;f<n;c++) // check all distance(>=length) for(i=d+H[f];i>H[++f];); }return c }

@da0ka
Copy link

da0ka commented Jan 14, 2019

formatted and base64encoded
ZnVuY3Rpb24gbm9ybWFsaXplKEksaSxjLGQpewoJaWYoYzwzKXsJLy8gb2NjdXJlZCAyIHRpbWVzLiBpZiBkaXN0YW5jZSBpcyBsb3dlciB0aGFuIGxlbmd0aCwgYyBpcyAxCgkJaT1JW2ldLUlbaSsxXTtpZihpPDApaT0taTtpPmR8fGMtLQoJfQoJZWxzZXsKCQlmb3IodmFyIGY9MCxuPWMsSD1bXTtjOylIWy0tY109SVtpK2NdOwkvLyBjb3B5IGFsbCBwb3NpdGlvbiBvZiBwYXR0ZXJuCgkJaWYobjw0KXsJLy8gb2NjdXJlZCAzIHRpbWVzLCBzb3J0IHdpdGhvdXQgbG9vcAoJCQlpZihIWzFdPkhbMl0paT1IWzFdLEhbMV09SFsyXSxIWzJdPWk7CgkJCWlmKEhbMF0+SFsxXSl7aT1IWzBdLEhbMF09SFsxXSxIWzFdPWk7CgkJCWlmKEhbMV0+SFsyXSlpPUhbMV0sSFsxXT1IWzJdLEhbMl09aX19CgkJZWxzZSBuPDEwP2luc2VydFNvcnQoSCwwLG4pOnRxc29ydChILDAsbi0xKTsJLy8gb2NjdXJlZCBtYW55IHRpbWVzCgkJZm9yKGQrKztmPG47YysrKQkvLyBjaGVjayBhbGwgZGlzdGFuY2UoPj1sZW5ndGgpCgkJCWZvcihpPWQrSFtmXTtpPkhbKytmXTspOwoJfXJldHVybiBjCn0=

@da0ka
Copy link

da0ka commented Jan 14, 2019

sample code.
var text="mississippi";
var U=[], Idx=[], i=text.length, len=i;
U[i]=-1; // set End of symbol
U.get=function(U,Idx,i,count,len){ // set callback
console.log(++len,count,text.substr(Idx[i],len))
};
for(;i;)U[Idx[--i]=i]=text.charCodeAt(i); // set index and Unicode
ssmine(U,Idx,0,len,0) // first call, arguments[2] and arguments[4] must be 0

@Siorki
Copy link
Owner Author

Siorki commented Jan 14, 2019

OK, thanks for the explanation. I realize now I never needed an equivalent to normalize(), as I resumed the search after the first match, implicitely making sure there was no overlap.

I'll run your implementation along with my current dev and compare the results.

@Siorki
Copy link
Owner Author

Siorki commented Jan 20, 2019

Here are the benchmarks (updated with latest improvement, see next post)

Sample RegPack 5.0.2 your algo my current dev
2010 - Christmas Tree.js 980 980 981 980
2012 - A rose is a rose.js 957 957 957
2012 - Autumn Evening.js 978 978 983 979
2012 - Mine(Love)craft.js 1013 1013 1024 1013
2013 - 3D City tour.js 1015 1013 1015
2013 - Color Factors.js 994 994 1010 993
2013 - Comanche.js 1004 1004 1024 1004
2013 - Furbee.js 1006 1007 1008
2013 - Pointillism.js 1018 1018 1019 1018
2013 - Space Time Fracture.js 995 997 995
2013 - StrangeCrystals II.js 1033 1035 1036 1033
2013 - Synth Sphere.js 1012 1012 1012
2013 - Winter Wrap up.js 1019 1019 1019
2014 - Dragon Drop.js 1011 1008 1026 1011
2014 - Flappy Dragon Classic.js 931 936 935 931
2014 - Highway at night.js 1036 1038 1039
2014 - Minecraft.js 1051 1051 1052 1051
2015 - Defender.js 1034 1034 1041 1035
2015 - Mysterious Monorail.js 1041 1041 1044 1041
2015 - Impossible road.js 1017 1016 1021 1018
2016 - Romanesco 2.0.js 1041 1041 1067 1042
2016 - Voxeling.js 1050 1050 1054 1050
2016 - Firewatch.js 1062 1063 1089 1061
jscrush.js 1011 1021 1011
Total time spent 168s 21s 23s

The speed gain is considerable (x8), however there are variations in the compression scheme, sometimes for the better, sometimes for the worse. My implementation tends to err on the worst side, yours is balanced and 10% faster.

I went to analyse those differences and realized that the initial hypothesis "only keep maximal patterns" does not always hold. Specifically, it fails when two patterns overlap by a single character.

Example from 2010 - Christmas tree : in the string fillStyle=k;strokeStyle=r;p=k;, both Style= and =k are patterns. In the former algorithm (the one present in 5.0.2), =k is packed first, then Style (without the final =).
Following the hypothesis, Style is not maximal, thus it is not kept in the candidate matches. Style= is present, but gets discarded once =k is packed as it no longer registers a single occurrence.

#7 is supposed to improve this by comparing options when there is an overlap (two sandboxes, one packs Style=, the other one =k), however this will not solve completely the issue. We need a more thorough solution, back to the drawing board.

@Siorki
Copy link
Owner Author

Siorki commented Jan 26, 2019

I figured out a workaround for the overlap (shown in the table above by striking out former results).

After each round of packing, all remaining patterns are reevaluated. If there are less than two occurrences in the string, the pattern is not immediately discarded. Instead, the first character is removed, and occurrences counted again, until there are at least two. The same is done again by removing characters from the end. Thus, patterns that were not maximal initially are kept until packed or invalidated.

The score is still different from the original algorithm : with exact ties (same length and copies count), the picking order may vary, leading to a different compression. #7 will solve this by testing all solutions and keeping only the best.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants