-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redundancy lookup : keep only maximal patterns #93
Comments
below(base64 coded) is test program without hash table. It's fast and compression ratio is almost the same as v5.0.2. ZnVuY3Rpb24gaW5zZXJ0U29ydChBLGksbCl7CiBmb3IodmFyIGE9aSxjLGosazsrK2k8bDtBW2pdPWMpCglpZigoYz1BW2o9az1pXSk8QVthXSlmb3IoO2o+YTspQVtqXT1BWy0tal07CgllbHNlIGZvcig7YzxBWy0ta107KUFbal09QVtqPWtdfQoKZnVuY3Rpb24gdHFzb3J0KEEsYSxiKXtpZihhPGIpewogaWYoYi1hPDEwKXJldHVybiBpbnNlcnRTb3J0KEEsYSxiKzEpOwogdmFyIGM9QVthXSxkLGk9QVtiXSxqPWEsaz1iLHA9QVthK2I+PjFdOwogYz5pP2M8PXA/cD1jOnA+aXx8KHA9aSk6aTw9cD9wPWk6Yz5wJiYocD1jKTsKIGZvcihpPWE7ajw9azspCglpZigoYz1BW2pdKTxwKWQ9QVtpXSxBW2krK109YyxBW2orK109ZDsKCWVsc2UgaWYoYz5wKUFbal09QVtrXSxBW2stLV09YzsKCWVsc2UgaisrOwogdHFzb3J0KEEsYSwtLWkpO3Rxc29ydChBLCsrayxiKX19CgovLyBleGNsdWRlIG92ZXJsYXBwaW5nIHN1YnN0cmluZwpmdW5jdGlvbiBub3JtYWxpemUoSSxpLGMsZCl7CglpZihjPDMpewoJCWk9SVtpXS1JW2krMV07aWYoaTwwKWk9LWk7aT5kfHxjLS19CgllbHNlewoJCWZvcih2YXIgZj0wLG49YyxIPVtdO2M7KUhbLS1jXT1JW2krY107CgkJaWYobjw0KXsKCQkJaWYoSFsxXT5IWzJdKWk9SFsxXSxIWzFdPUhbMl0sSFsyXT1pOwoJCQlpZihIWzBdPkhbMV0pe2k9SFswXSxIWzBdPUhbMV0sSFsxXT1pOwoJCQlpZihIWzFdPkhbMl0paT1IWzFdLEhbMV09SFsyXSxIWzJdPWl9fQoJCWVsc2UgbjwxMD9pbnNlcnRTb3J0KEgsMCxuKTp0cXNvcnQoSCwwLG4tMSk7CgkJZm9yKGQrKztmPG47YysrKQoJCQlmb3IoaT1kK0hbZl07aT5IWysrZl07KTsKCX1yZXR1cm4gYwp9CnZhciBtaW5zdXA9MjsJLy8gbWluaW11bSBmcmVxdWVuY3kgb2Ygc3Vic3RyaW5nCgovLyBlbnVtZXJhdGluZyBhbGwgc3Vic3RyaW5ncwovLyBAUzogVW5pY29kZSBBcnJheQovLyBASTogSW5kZXggQXJyYXkKLy8gQGk6IHN0YXJ0Ci8vIEBuOiBlbmQKLy8gQGx2OiBsZW5ndGgKZnVuY3Rpb24gc3NtaW5lKFMsSSxpLG4sbHYpewoJaWYobi1pPG1pbnN1cClyZXR1cm47Cgl2YXIgcD1pK01hdGgucmFuZG9tKCkqKG4taSl8MCxyPUlbaV07CglJW2ldPUlbcF0sSVtwXT1yOwoJZm9yKHZhciB0PVNbSVtpXStsdl0sYT1pKzEsYz1uLTEsYj1hLGQ9Yzs7SVtjLS1dPXIpewoJCWZvcig7Yjw9YyYmKHI9U1tJW2JdK2x2XS10KTwxO2IrKykKCQkJcnx8KHI9SVthXSxJW2ErK109SVtiXSxJW2JdPXIpOwoJCWZvcig7Yjw9YyYmKHI9U1tJW2NdK2x2XS10KT49MDtjLS0pCgkJCXJ8fChyPUlbY10sSVtjXT1JW2RdLElbZC0tXT1yKTsKCQlpZihiPmMpYnJlYWs7CgkJcj1JW2JdLElbYisrXT1JW2NdCgl9CglwPWEtaSxyPWItYTtpZihwPHIpcj1wOwoJZm9yKHA9aSxiLT1yO3ItLTtJW2IrK109dCl0PUlbcF0sSVtwKytdPUlbYl07CglwPWQtYyxyPW4rfmQ7aWYocDxyKXI9cDsKCWZvcihwPWIsbi09cjtyLS07SVtuKytdPXQpdD1JW3BdLElbcCsrXT1JW25dOwoJc3NtaW5lKFMsSSxpLGkrPWItYSxsdiksYis9bit+ZDsKCWlmKGItaT49bWluc3VwJiZ+dClsdiYmUy5nZXQoUyxJLGksYi1pLGx2KSxzc21pbmUoUyxJLGksYixsdisxKQoJc3NtaW5lKFMsSSxuLWQrYyxuLGx2KQp9CgpmdW5jdGlvbiBzaXplRnVuYyhzKXsKCXZhciBhPS9bXHg4MC1cdTA3ZmZdLy50ZXN0KHMpLGI9L1tcdTA4MDAtXHVGRkZGXS8udGVzdChzKTsKCWlmKGEmJmIpYT1mdW5jdGlvbihBLGksbCxiLGMpewoJCWZvcihiPTA7bDtiKz1jPDEyOD8xOmM8MjA0OD8yOjMpYz1BWy0tbCtpXTsKCQlyZXR1cm4gYn07CgllbHNlIGlmKGEpYT1mdW5jdGlvbihBLGksbCxiKXsKCQlmb3IoYj0wO2w7KWIrPUFbLS1sK2ldPDEyOD8xOjI7CgkJcmV0dXJuIGJ9OwoJZWxzZSBpZihiKWE9ZnVuY3Rpb24oQSxpLGwsYil7CgkJZm9yKGI9MDtsOyliKz1BWy0tbCtpXTwxMjg/MTozOwoJCXJldHVybiBifTsKCWVsc2UgYT1mdW5jdGlvbihBLGksbCl7cmV0dXJuIGx9OwoJcmV0dXJuIGEKfQoKLy8gY29tcHJlc3NvcgovLyBAUzogIHRleHQKLy8gQGxvdzpsb3dlc3QgZ2Fpbi4gMSBvciAyCi8vIEBjZjogY29waWVzIGZhY3RvcgovLyBAZ2Y6IGdhaW4gZmFjdG9yCi8vIEBsZjogbGVuZ3RoIGZhY3RvcgovLyBAdGI6IFRpZWJyZWFrZXIuIDEgaXMgbG9uZ2VzdCBzdWJzdHJpbmcgZmlyc3QsIC0xIGlzIG1vc3QgY29waWVzIGZpcnN0LgpmdW5jdGlvbiBDRkcoUyxsb3csY2YsZ2YsbGYsdGIpewoJdmFyIGE9Uy5sZW5ndGgsIGI9MCwgYywgcT1TLnNwbGl0KCciJykubGVuZ3RoPlMuc3BsaXQoIiciKS5sZW5ndGg/IiciOiciJywgej1hLCBBPUFycmF5KGEpLCBCPVtdLCBDPVtdLCBJPUFycmF5KGEpLCBVPXt9LCBieXRlcz1zaXplRnVuYyhTKTsKCWZvcig7YTtVW1MuY2hhckF0KGEpXT0xKUFbSVstLWFdPWFdPVMuY2hhckNvZGVBdChhKTsKCWZvcihVW3FdPTE7KythPDEyNzspVVtjPVN0cmluZy5mcm9tQ2hhckNvZGUoYSldfHxhXjEwJiZhXjEzJiZhXjkyJiZDLnB1c2goYyk7CglBLmdldD1mdW5jdGlvbihfLEksaSxjLGQpewoJCWM9bm9ybWFsaXplKEksaSxjLGQpOwoJCXZhciBiPUFbaT1JW2ldXSwgeD1BW2RdLCB2OwoJCWlmKGM+MSYmKGI8MHhkYzAwfHxiPjB4ZGZmZikmJih4PDB4ZDgwMHx8eD4weGRiZmYpKXsJLy8gYXZvaWQgc3Vycm9nYXRlIHBhaXIKCQkJYj1ieXRlcyhfLGksKytkKSwgeD1iKmMtYy1iLWxvdywgdj1nZip4K2xmKmIrY2YqYzsKCQkJaWYoeD4wJiYodj5iZXN0fHxiZXN0PT09diYmKHg+d3x8eD09PXcmJnRiKmM+dGIqZikpKQoJCQkJdz14LCBmPWMsIGw9ZCwgcD1pLCBiZXN0PXYsIHNpemU9YjsKCQl9Cgl9Cglmb3IoO2M9Qy5wb3AoKTtCW2IrK109Yyl7CgkJdmFyIGJlc3Q9MCwgc2l6ZT0wLCBmPTAsIHA9MCwgbD0wLCB3PTA7CgkJQVt6XT0tMSwgc3NtaW5lKEEsSSwwLHosMCk7CgkJaWYodzwxKWJyZWFrOwoJCVM9Uy5zcGxpdCh3PVMuc3Vic3RyKHAsbCkpLmpvaW4oYykrYyt3OwoJCWZvcihhPXo9QVtsPSJsZW5ndGgiXT1JW2xdPVNbbF07YTspQVtJWy0tYV09YV09Uy5jaGFyQ29kZUF0KGEpCgl9CglhPVMuc3BsaXQoJyInKS5sZW5ndGgsIGI9Uy5zcGxpdCgiJyIpLmxlbmd0aDsKCWlmKFVbJ1xcJ10pUz1TLnJlcGxhY2UoL1xcL2csIlxcXFwiKTsKCWlmKFVbJ1xuJ10pUz1TLnJlcGxhY2UoL1xuL2csIlxcbiIpOwoJaWYoVVsnXHInXSlTPVMucmVwbGFjZSgvXHIvZywiXFxyIik7CglpZihhPmIpewoJCWlmKGI+MSlTPVMucmVwbGFjZSgvJy9nLCJcXCciKTthPSInIjsKCX1lbHNlewoJCWlmKGE+MSlTPVMucmVwbGFjZSgvIi9nLCdcXCInKTthPSciJwoJfQoJcmV0dXJuIl89IithK1MrYSsiO2ZvcihpIGluIEc9IitxK0IucmV2ZXJzZSgpLmpvaW4oIiIpK3ErIil3aXRoKF8uc3BsaXQoR1tpXSkpXz1qb2luKHBvcCgpKTtldmFsKF8pIgp9 |
I'm having difficulties reading your code, too much minification. |
normalize() correct pattern count. I is suffix array(almost), i is start position, c is count, d is pattern length-1. input:ababab normalized: ssmine looks like quick sort(almost), enumerating all substrings by suffix array. |
|
formatted and base64encoded |
sample code. |
OK, thanks for the explanation. I realize now I never needed an equivalent to I'll run your implementation along with my current dev and compare the results. |
Here are the benchmarks (updated with latest improvement, see next post)
The speed gain is considerable (x8), however there are variations in the compression scheme, sometimes for the better, sometimes for the worse. My implementation tends to err on the worst side, yours is balanced and 10% faster. I went to analyse those differences and realized that the initial hypothesis "only keep maximal patterns" does not always hold. Specifically, it fails when two patterns overlap by a single character. Example from 2010 - Christmas tree : in the string #7 is supposed to improve this by comparing options when there is an overlap (two sandboxes, one packs |
I figured out a workaround for the overlap (shown in the table above by striking out former results). After each round of packing, all remaining patterns are reevaluated. If there are less than two occurrences in the string, the pattern is not immediately discarded. Instead, the first character is removed, and occurrences counted again, until there are at least two. The same is done again by removing characters from the end. Thus, patterns that were not maximal initially are kept until packed or invalidated. The score is still different from the original algorithm : with exact ties (same length and copies count), the picking order may vary, leading to a different compression. #7 will solve this by testing all solutions and keeping only the best. |
Issue spun off from first point of #7
For instance, when performing forward lookup on
getData(0,0)
:ge
: 20 matchesget
: 20 matchesgetD
: 6 matchesgetDa
: 6 matchesgetDat
: 6 matchesgetData
: 6 matchesgetData(
: 6 matchesgetData(0
: 2 matchesgetData(0,
: 2 matchesgetData(0,0
: 1 match (i.e. no other occurrence)All occurrences of
getD
(6 of them) are also occurrences ofgetData(
. Said otherwise, all occurrences ofgetD
are followed byata(
. Thus it makes sense to pack, for the same cost of one token,getData(
instead ofgetD
. Going further, it makes sense not to keepgetD
in the candidate patterns. Same goes forge
and a few others and we shall only retain :get
: 20 matchesgetData(
: 6 matchesgetData(0,
: 2 matchesThis should improve packing speed by testing fewer candidates, and also take care of #46 by lowering the memory footprint. Further testing shall make sure it does not worsen the compression.
The text was updated successfully, but these errors were encountered: