Add named custom dictionary support #200

Rycochet · 2024-02-05T22:54:12Z

The code in the custom dictionary encoders that actually encodes into the dictionary doesn't behave as expected. I've not had enough time to find a solution, but with a known dictionary it should encode the same regardless of the method used. Most specifically, if you use the base64 dictionary (exported from the base64 folder), the output should be identical to direct base64 encoding.

The difficulty is that only some dictionaries have a bit-aligned size (i.e. take dictionary.length and check whether it is an exact power of two, such as base64 with 64 entries: binary 0b100 0000, or 6 bits of data per character).

For these it should be relatively easy to use the in-built _compress code and pass that bit length.
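The size check described above could be sketched roughly as follows. This is only an illustration: `bitsPerChar` and `BASE64_ALPHABET` are hypothetical names, not part of this branch, and the alphabet is the standard 64-character base64 set without the = padding character.

```typescript
// Standard base64 alphabet without "=" (64 characters).
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: returns how many bits each character carries when the
// dictionary length is an exact power of two, or null when it is not.
function bitsPerChar(dictionary: string): number | null {
  const n = dictionary.length;
  // A power of two has exactly one bit set, so n & (n - 1) is zero.
  if (n < 2 || (n & (n - 1)) !== 0) {
    return null;
  }
  // Count how many doublings are needed to reach n.
  let bits = 0;
  while ((1 << bits) < n) {
    bits++;
  }
  return bits;
}

console.log(bitsPerChar(BASE64_ALPHABET));       // 6
console.log(bitsPerChar(BASE64_ALPHABET + "=")); // null (65 characters)
```

A dictionary that includes the padding character has 65 entries, so a check like this would send it down the non-power-of-two path.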

When the size is not a power of two, it instead needs to use a rolling value to find the next bit used (as now), but, like Pythagoras to trigonometry, it should still be compatible.

This adds all the wrappers for it, and we need final code that makes these two outputs match (ignoring any = padding characters) -

$ bin/cli.cjs -v -e custom -c base64 test/data/tattoo/data.bin
KZlJ/qHFaHLwNmVC1PIsuOGCO8aAEJB=sCvPEs4GQ1EBTBGDMiiMCYIElHB4IDOBeFBQxMozAI5Cx6AGdK+IEkTD9...

$ bin/cli.cjs -v -e base64 test/data/tattoo/data.bin
q4iksnTmdoLgSC6Gj2LolvZkAQ0g7S0ATWhnS0JYEgrO0MamhTF0AITZGCQUwtBnAkEtrQGjtNBLO0EsXKZQAnNA7Q0...

Which will also match the content of test/data/tattoo/base64.bin -

q4iksnTmdoLgSC6Gj2LolvZkAQ0g7S0ATWhnS0JYEgrO0MamhTF0AITZGCQUwtBnAkEtrQGjtNBLO0EsXKZQAnNA7Q00FtbQO1dACU0CmToAdnQHb2gU2dE7QoFNjQDY2gAmdAhsazAJqaAiAkCGtoCdjQA72gQwtApuaAiZ0C2BIDd7PiXKBDAgGACQymBTIiiBXAkBEDuUyBrO0AEdIAsVPm19AyVAJydAAk9Ae2dAU1tAewJAAjNACxdLVFIyQCMCQBN7QHJPeUBPG0AHW0ACC0B7O0BTE0AaZ3TAF1cKQFt7PkSoW0AzAkAbS0BTG0JAS3dE7wpAQxdwwFcHQBoXQFMCQCNbFTLicmpAOjtAFQI9QHIXeUALdygTQGtDQGdTcmgYQAJ7QEs3QEtnQBsjPh4ywAInZMBrZ2QfQEMRQAsbQFd7c75AS1VAUzdAM5OgBdLXKAG0FcIAxAkANoaAU2tABZ2gFtDTDyMr/QD2NoBLBzA1ggOkAxpbQwDGBAJERA9IACF0A9u6Aayd5IBPIwggCdDAyMSBkEyAY0NAPZOgAsHQDmhgdKXp/oAnAkAtqaAS0NAKZ2hCg82BfEAZvaHWCAOntAHIEgHdTLl8wXmSwQY6Ae2NAM52gFtXRp8QBODoA3U0A9paAZ1c8oAHAkA5pbRQAudoAzWk6hCChnkUkSgDtTExuwCGTvgjclCqVKuN/p7AIausnCAyoONk5EAJgSAU09AO4EwKKiXFPkA6kRkQAEdsaCky2MCjYBDGzEgFonQCGRmJCANAC4ugHt7CVmaEoAuASwtABbGgE8CT4UQCmpuR0oAnV1NiGSgGNnJngVd8S7ofCAlAQUdKCSAU3dAG7kfDK50AduYqaw5wCuTqwmjnAKb2gDtnQAu5KLAIamaZFExAGNXIolUAWjMBkAcydABdTcZADMzXBRkAAiJKElQA6EzUQBTdUAJ3I+BKCkoOgth2HCFUyEAOwJACd7DRAAIbZ4iUAG1NAGcbRBImnAsRC/OdrEAGwJABd7BJkEAGzNEQ0RgRB8QA7SRaTlRnSQBnQxUWIq2Sa1AGN7PZ5CaAtAHJTARrEAF2NU3CR1kEAGnMCz4QB7E0AcjsKUSNBEkAJ1M+G6WR0j4EjhNE8TrApaEaKkYEC0+QATG0M4zAFMjaxPUAWktvI0GBAE8nQAAOiAA=
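A comparison that ignores padding, as the description asks for, might look like the sketch below. `stripPadding` and `matchesIgnoringPadding` are hypothetical helpers for a test, not functions from the library. Only trailing = characters are removed, so a = in the middle of the output (as in the custom example above) still counts as a mismatch.

```typescript
// Hypothetical test helper: removes trailing "=" padding only.
function stripPadding(encoded: string): string {
  return encoded.replace(/=+$/, "");
}

// Two encodings "match" when they are identical up to trailing padding.
function matchesIgnoringPadding(a: string, b: string): boolean {
  return stripPadding(a) === stripPadding(b);
}

console.log(matchesIgnoringPadding("OiAA=", "OiAA")); // true
console.log(matchesIgnoringPadding("Oi=AA", "OiAA")); // false
```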

github-actions · 2024-02-05T22:54:42Z

Package	Line Rate	Branch Rate	Health
src	69%	89%	➖
src.UTF16	100%	100%	✔
src.Uint8Array	100%	100%	✔
src.base64	17%	88%	❌
src.custom	100%	95%	✔
src.encodedURIComponent	100%	100%	✔
src.raw	100%	100%	✔
Summary	65% (901 / 1396)	93% (161 / 174)	➖

HelloLudger · 2024-02-05T23:25:45Z

Are you overcomplicating this? The named dictionary for base64 has 65 characters because it includes the =.
The custom function can't produce the same output for that input; it will produce a string with = in numerous places.

Also, it cannot replicate the padding with =, even if you provide only the real 64 characters, so it will also be different for the provided example.

Even with the correct dictionary and when no padding is needed, I am not sure if it would produce the same output, since it works differently...

With 64 characters it's simply an encoding in base 64, but not base64.

Rycochet · 2024-02-05T23:44:28Z

Good catch regarding the dictionary itself - I've fixed and pushed - but that doesn't make any appreciable difference to the output!

The padding itself is a non-issue. It's fairly standard in a lot of baseXX encodings, but decoders are generally happy to ignore it. So while strictly correct base64 requires it, the decoder should work without it (and the output should look identical right up to those padding characters) :-)

And I agree about the "working differently" part, but that's not how people will expect it to work. At least, I feel that using a bit-safe dictionary should make the output look identical (so maybe an option to pass that bit size and use _compress if possible, otherwise the other way).

I added the -e custom -c <dict> option to the cli command in this branch which makes it easier to look at.

When using base16 with tattoo (chosen for its length) you get 1789 characters, and base64 gives 1195 characters, close to 2/3, so it's definitely valid in that respect. Both the -v validation option and manual testing show that it encodes and decodes correctly, so there are no issues with the general form of this. The only problem is user expectation: a power-of-two sized dictionary should encode in a way that is compatible with the standard that uses the same dictionary.
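The "close to 2/3" figure follows from the bits carried per character (4 for base16, 6 for base64). A quick arithmetic sketch, where `expectedLength` is a hypothetical helper and the only input from this thread is the 1789-character base16 result:

```typescript
// Hypothetical helper: estimates the output length for a dictionary carrying
// targetBits per character, given a known length at knownBits per character.
function expectedLength(
  knownLength: number,
  knownBits: number,
  targetBits: number
): number {
  const totalBits = knownLength * knownBits;
  // Round up, since a partly filled final character still has to be emitted.
  return Math.ceil(totalBits / targetBits);
}

// 1789 base16 characters carry 1789 * 4 = 7156 bits.
console.log(expectedLength(1789, 4, 6)); // 1193
```

The estimate of 1193 is within a couple of characters of the 1195 reported above.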

I hope that makes sense! :-D

HelloLudger · 2024-02-06T07:55:48Z

So your plan is to use _compress/_decompress if the dictionary length is 2^X and use the current custom... function if not?

That doesn't sound too complicated, I could do that, maybe at the weekend.
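The two-path plan proposed here could be sketched roughly like this. The names are hypothetical, and both paths are passed in as callbacks rather than calling the real _compress or custom functions, whose signatures are not shown in this thread.

```typescript
// Hypothetical shapes for the two existing code paths.
type BitPath = (input: string, dictionary: string, bits: number) => string;
type RollingPath = (input: string, dictionary: string) => string;

// Builds an encoder that picks the bit-aligned path for power-of-two
// dictionaries and falls back to the rolling-value path otherwise.
function makeCustomEncoder(bitPath: BitPath, rollingPath: RollingPath) {
  return (input: string, dictionary: string): string => {
    const n = dictionary.length;
    const isPowerOfTwo = n >= 2 && (n & (n - 1)) === 0;
    if (!isPowerOfTwo) {
      return rollingPath(input, dictionary);
    }
    // Number of bits per character, e.g. 6 for a 64-character dictionary.
    let bits = 0;
    while ((1 << bits) < n) {
      bits++;
    }
    return bitPath(input, dictionary, bits);
  };
}

// Usage with stub paths that only report which branch was taken.
const encode = makeCustomEncoder(
  (_input, _dict, bits) => `bit path, ${bits} bits`,
  () => "rolling path"
);
console.log(encode("data", "0123456789abcdef")); // bit path, 4 bits
console.log(encode("data", "0123456789"));       // rolling path
```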

Rycochet · 2024-02-06T09:21:13Z

That's sort of the fallback plan. It will definitely work, but it introduces two code paths (which it sort of has already). If I get the time, I want to see if anything can be done to tweak the code to make it more "natural": either yours (I want to look at whether there's a simple transpose of the data that would make it "correct", which would then be a mathematical solution), or potentially _compress itself, to see if instead of taking only a numeric bit-length it could take either that or a string dictionary (I think that might be more complicated, but it might also be the best future-proof solution!) :-P

Add named custom dictionary support

95142a1

Rycochet mentioned this pull request Feb 5, 2024

Added compressToCustom and decompressFromCustom function #183

Merged

Correct base64 dictionary

f3735c7

Rycochet mentioned this pull request Feb 20, 2024

Can utf-8 codes be avoided in the output? #182

Closed
