Add named custom dictionary support #200

Draft
wants to merge 2 commits into
base: master

Rycochet
Collaborator

@Rycochet Rycochet commented Feb 5, 2024

The code in the custom dictionary encoders that actually encodes data into the dictionary doesn't behave as expected, and I've not had enough time to find a solution. With a known dictionary it should encode the same regardless of the method used. Most specifically, if you use the base64 dictionary (exported from the base64 folder), the output should be identical to direct base64 encoding.

The difficulty is that only some dictionaries have a valid bit-size (i.e. dictionary.length is an exact power of two, such as base64 with 64 = 0b100 0000 characters, which gives 6 bits of data per character).

For these it should be relatively easy to use the in-built _compress code and pass that bit length.
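
As a rough illustration of the "valid bit-size" check described above, here is a minimal sketch. `bitsPerChar` is a hypothetical helper, not part of this library, and the 64-character dictionary below is the standard base64 alphabet written out by hand rather than the one exported from the base64 folder.

```javascript
// Hypothetical helper (an assumption, not the library's API): returns the
// number of bits each character can carry when the dictionary length is an
// exact power of two, or null otherwise.
function bitsPerChar(dictionary) {
  const n = dictionary.length;
  // A power of two has exactly one bit set, so n & (n - 1) is 0.
  if (n < 2 || (n & (n - 1)) !== 0) {
    return null;
  }
  // Position of the single set bit, e.g. 64 = 0b1000000 -> 6.
  return 31 - Math.clz32(n);
}

// Standard base64 alphabet without the "=" padding character (64 entries).
const base64Dict =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

console.log(bitsPerChar(base64Dict)); // 6
console.log(bitsPerChar(base64Dict + "=")); // null (65 characters)
```

A result of `null` is what would send a dictionary down the rolling-value path instead of the fixed bit-length one.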

When not using a perfect bit size it instead needs to use a rolling value to find the next bit used (as now) - but just like Pythagoras to Trigonometry, it should still be compatible.

This adds all the wrappers for it, but we still need final code that makes these two outputs match (ignoring any = padding characters):

$ bin/cli.cjs -v -e custom -c base64 test/data/tattoo/data.bin
KZlJ/qHFaHLwNmVC1PIsuOGCO8aAEJB=sCvPEs4GQ1EBTBGDMiiMCYIElHB4IDOBeFBQxMozAI5Cx6AGdK+IEkTD9...
$ bin/cli.cjs -v -e base64 test/data/tattoo/data.bin
q4iksnTmdoLgSC6Gj2LolvZkAQ0g7S0ATWhnS0JYEgrO0MamhTF0AITZGCQUwtBnAkEtrQGjtNBLO0EsXKZQAnNA7Q0...

The direct base64 output also matches the content of test/data/tattoo/base64.bin:

q4iksnTmdoLgSC6Gj2LolvZkAQ0g7S0ATWhnS0JYEgrO0MamhTF0AITZGCQUwtBnAkEtrQGjtNBLO0EsXKZQAnNA7Q00FtbQO1dACU0CmToAdnQHb2gU2dE7QoFNjQDY2gAmdAhsazAJqaAiAkCGtoCdjQA72gQwtApuaAiZ0C2BIDd7PiXKBDAgGACQymBTIiiBXAkBEDuUyBrO0AEdIAsVPm19AyVAJydAAk9Ae2dAU1tAewJAAjNACxdLVFIyQCMCQBN7QHJPeUBPG0AHW0ACC0B7O0BTE0AaZ3TAF1cKQFt7PkSoW0AzAkAbS0BTG0JAS3dE7wpAQxdwwFcHQBoXQFMCQCNbFTLicmpAOjtAFQI9QHIXeUALdygTQGtDQGdTcmgYQAJ7QEs3QEtnQBsjPh4ywAInZMBrZ2QfQEMRQAsbQFd7c75AS1VAUzdAM5OgBdLXKAG0FcIAxAkANoaAU2tABZ2gFtDTDyMr/QD2NoBLBzA1ggOkAxpbQwDGBAJERA9IACF0A9u6Aayd5IBPIwggCdDAyMSBkEyAY0NAPZOgAsHQDmhgdKXp/oAnAkAtqaAS0NAKZ2hCg82BfEAZvaHWCAOntAHIEgHdTLl8wXmSwQY6Ae2NAM52gFtXRp8QBODoA3U0A9paAZ1c8oAHAkA5pbRQAudoAzWk6hCChnkUkSgDtTExuwCGTvgjclCqVKuN/p7AIausnCAyoONk5EAJgSAU09AO4EwKKiXFPkA6kRkQAEdsaCky2MCjYBDGzEgFonQCGRmJCANAC4ugHt7CVmaEoAuASwtABbGgE8CT4UQCmpuR0oAnV1NiGSgGNnJngVd8S7ofCAlAQUdKCSAU3dAG7kfDK50AduYqaw5wCuTqwmjnAKb2gDtnQAu5KLAIamaZFExAGNXIolUAWjMBkAcydABdTcZADMzXBRkAAiJKElQA6EzUQBTdUAJ3I+BKCkoOgth2HCFUyEAOwJACd7DRAAIbZ4iUAG1NAGcbRBImnAsRC/OdrEAGwJABd7BJkEAGzNEQ0RgRB8QA7SRaTlRnSQBnQxUWIq2Sa1AGN7PZ5CaAtAHJTARrEAF2NU3CR1kEAGnMCz4QB7E0AcjsKUSNBEkAJ1M+G6WR0j4EjhNE8TrApaEaKkYEC0+QATG0M4zAFMjaxPUAWktvI0GBAE8nQAAOiAA=
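
The "match, ignoring any = padding" comparison could be checked with something like the sketch below. Both function names are hypothetical and only trailing `=` characters are treated as padding, so a string with `=` in the middle (as in the custom output above) would still count as a mismatch.

```javascript
// Hypothetical helpers (assumptions for illustration, not library code).

// Remove only trailing "=" characters, which is where base64 padding lives.
function stripPadding(encoded) {
  return encoded.replace(/=+$/, "");
}

// Two encodings are considered compatible if they are identical once the
// trailing padding has been removed from both.
function sameIgnoringPadding(a, b) {
  return stripPadding(a) === stripPadding(b);
}

console.log(sameIgnoringPadding("TWE=", "TWE")); // true
console.log(sameIgnoringPadding("TW=E", "TWE")); // false
```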


github-actions bot commented Feb 5, 2024

Code Coverage

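| Package | Line Rate | Branch Rate | Complexity | Health |
| --- | --- | --- | --- | --- |
| src | 69% | 89% | 0 | |
| src.UTF16 | 100% | 100% | 0 | |
| src.Uint8Array | 100% | 100% | 0 | |
| src.base64 | 17% | 88% | 0 | |
| src.custom | 100% | 95% | 0 | |
| src.encodedURIComponent | 100% | 100% | 0 | |
| src.raw | 100% | 100% | 0 | |
| Summary | 65% (901 / 1396) | 93% (161 / 174) | 0 | |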

@HelloLudger
Contributor

Are you overcomplicating this? The named dictionary for base64 has 65 characters, since it includes the =.
With that dictionary the custom function can't produce the same output for the given input; it will produce a string with = in numerous places.

Also, it cannot replicate the padding with =, even if you provide only the real 64 characters, so it will also be different for the provided example.

Even with the correct dictionary and when no padding is needed, I am not sure if it would produce the same output, since it works differently...

With 64 characters it's simply an encoding in base 64, not base64 itself.
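
To make the distinction concrete, here is one generic way an "encoding in base 64" can differ from base64. This is only a sketch under assumptions: `radixEncode` is a hypothetical positional encoder and is not claimed to be how the custom encoder in this PR works, and `bitGroupEncode` is a hand-written version of base64's 6-bit grouping without the = padding.

```javascript
const DICT =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// base64-style grouping: read the bytes as a bit stream, 6 bits at a time,
// most significant bit first, zero-filling the last group (no "=" added).
function bitGroupEncode(bytes) {
  let out = "";
  let buffer = 0;
  let bits = 0;
  for (const byte of bytes) {
    buffer = (buffer << 8) | byte;
    bits += 8;
    while (bits >= 6) {
      bits -= 6;
      out += DICT[(buffer >> bits) & 63];
    }
    buffer &= (1 << bits) - 1; // keep only the bits not yet emitted
  }
  if (bits > 0) {
    out += DICT[(buffer << (6 - bits)) & 63];
  }
  return out;
}

// Hypothetical positional encoding: treat the bytes as one big integer and
// write that integer in base 64 using the same dictionary.
function radixEncode(bytes) {
  let value = 0n;
  for (const byte of bytes) {
    value = (value << 8n) | BigInt(byte);
  }
  let out = "";
  do {
    out = DICT[Number(value % 64n)] + out;
    value /= 64n;
  } while (value > 0n);
  return out;
}

const input = [0x4d, 0x61]; // the bytes of "Ma"
console.log(bitGroupEncode(input)); // "TWE" (base64 gives "TWE=")
console.log(radixEncode(input)); // "E1h"
```

For a 3-byte input such as `[0x4d, 0x61, 0x6e]` the two functions agree (`"TWFu"`), because 24 bits divide evenly into 6-bit groups. The outputs diverge once the bit count is not a multiple of 6, which is where alignment and padding decisions start to matter.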

@Rycochet
Collaborator Author

Rycochet commented Feb 5, 2024

Good catch regarding the dictionary itself - I've fixed and pushed - but that doesn't make any appreciable difference to the output!

The padding itself is a non-issue. It's sort of standard with a lot of baseXX encodings, but decoders are generally happy to ignore it, so while strictly correct base64 requires it, the decoder should work without it (and the output should look identical right up to those padding characters) :-)

And I agree about the "working differently" part - but that's not how people will expect it to work. At the least I feel that using a bit-safe dictionary should make it look identical (so maybe an option to pass that bit-size and use _compress if possible, otherwise the other way).

I added the -e custom -c <dict> option to the cli command in this branch which makes it easier to look at.

When using base16 with tattoo (chosen for the length) you get 1789 characters, and base64 gives 1195 characters - close to 2/3, so it's definitely valid in that respect. Both the -v validation option and manual testing show that it encodes and decodes correctly, so there are no issues with the general form of this. The only issue is the user expectation that power-of-two sized dictionaries should encode in a way that is compatible with the standards when using the same dictionary.
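
The "close to 2/3" figure can be sanity-checked with a little arithmetic: a base16 character carries 4 bits and a base64 character carries 6, so the same payload should need about 4/6 as many base64 characters. A quick sketch using the two lengths quoted above:

```javascript
// Bits carried by one character of each dictionary.
const bitsBase16 = Math.log2(16); // 4
const bitsBase64 = Math.log2(64); // 6

// Expected length ratio (base64 length / base16 length) for the same data.
const expectedRatio = bitsBase16 / bitsBase64;

// Ratio of the character counts reported for the tattoo test data.
const observedRatio = 1195 / 1789;

console.log(expectedRatio.toFixed(3)); // "0.667"
console.log(observedRatio.toFixed(3)); // "0.668"
```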

I hope that makes sense! :-D

@HelloLudger
Contributor

So your plan is to use _compress/_decompress if the dictionary length is 2^X and use the current custom... function if not?

That doesn't sound too complicated, I could do that, maybe at the weekend.

@Rycochet
Collaborator Author

Rycochet commented Feb 6, 2024

That's sort of the fallback plan. It will definitely work, but it introduces two code paths (which it sort of has already). If I get the time, I want to see if anything can be done to tweak the code to make it more "natural": either yours (I want to check whether a simple transpose of the data would make it "correct", which would then be a mathematical solution), or potentially _compress itself, to see if instead of taking a numeric bit-length it could take either that or a string dictionary (though I think that might be more complicated, it may well be the best future-proof solution!) :-P

2 participants