Encodes binary data as valid unassigned Unicode code points, also known as tofu.

joakim/entofu


Note: Work in progress. This is a draft, not a finished spec.

□ Entofu

A binary-to-text encoding that encodes binary data as valid unassigned Unicode code points, also known as tofu.

Binary data is stuffed into the 524,288 code points of the unassigned Unicode planes 4 to 11 [1]. That's the 8 empty planes in the middle of the Unicode codespace, or almost half of Unicode.

Each code point takes 4 bytes (32 bits) in UTF-8 and holds 19 bits of binary data: 1 in its first byte and 3 × 6 in its continuation bytes [2].

| | 1st byte | 2nd byte | 3rd byte | 4th byte |
| --- | --- | --- | --- | --- |
| Hex | F1 | 8F | BF | 80 |
| Binary | `11110001` | `10001111` | `10111111` | `10000000` |
| Mask | `111100__` | `10______` | `10______` | `10______` |
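The byte layout above can be sketched in a few lines. This is only an illustration, not the project's reference implementation: the helper names `valueToCodePoint` and `utf8Bytes` are hypothetical, and mapping a 19-bit value to a code point by adding the offset 0x40000 is an assumption about bit order that the draft spec may define differently.

```typescript
// Hypothetical helpers, for illustration only.

// Assumption: a 19-bit value maps to planes 4 to 11 by a plain offset.
function valueToCodePoint(value: number): number {
  if (value < 0 || value > 0x7ffff || Math.floor(value) !== value) {
    throw new RangeError("value must be a 19-bit unsigned integer");
  }
  return 0x40000 + value; // U+40000 is the first code point of plane 4
}

// Standard 4-byte UTF-8 encoding of a code point in U+10000..U+10FFFF.
function utf8Bytes(codePoint: number): number[] {
  return [
    0xf0 | (codePoint >> 18),          // 11110xxx
    0x80 | ((codePoint >> 12) & 0x3f), // 10xxxxxx
    0x80 | ((codePoint >> 6) & 0x3f),  // 10xxxxxx
    0x80 | (codePoint & 0x3f),         // 10xxxxxx
  ];
}
```

Under that assumption, `utf8Bytes(valueToCodePoint(0xffc0))` gives the bytes F1 8F BF 80 from the table, and every value in the upper half of the range produces a first byte of F2 instead of F1.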

This lets you embed binary data inside valid Unicode text [3].

Put differently, it's a Base524288 encoding that uses Unicode planes 4 to 11 as its alphabet.
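Seen as a base conversion, encoding amounts to cutting the input into 19-bit digits and offsetting each into planes 4 to 11. The sketch below is not the code in src/index.ts. The function name `toCodePoints` is hypothetical, and it assumes bits are consumed most significant first and that the last digit is padded with zero bits on the right, neither of which is stated in this draft. It also leaves out the noncharacter substitution described further down.

```typescript
// Hypothetical sketch; bit order and padding are assumptions.
const BITS_PER_TOFU = 19;
const FIRST_CODE_POINT = 0x40000; // start of plane 4

function toCodePoints(bytes: Uint8Array): number[] {
  const out: number[] = [];
  let acc = 0;     // bit accumulator, never holds more than 26 bits
  let accBits = 0; // number of valid bits currently in acc

  for (let i = 0; i < bytes.length; i++) {
    acc = (acc << 8) | bytes[i];
    accBits += 8;
    if (accBits >= BITS_PER_TOFU) {
      const rest = accBits - BITS_PER_TOFU;
      out.push(FIRST_CODE_POINT + (acc >>> rest)); // top 19 bits form a digit
      acc &= (1 << rest) - 1;                      // keep the leftover bits
      accBits = rest;
    }
  }

  if (accBits > 0) {
    // Assumed padding: fill the final digit with zero bits on the right.
    out.push(FIRST_CODE_POINT + (acc << (BITS_PER_TOFU - accBits)));
  }
  return out; // may still contain noncharacters such as U+4FFFE
}
```

With these assumptions, a 16-byte input such as a UUID yields 7 code points: six full digits plus one digit carrying the remaining 14 bits.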

Reference implementation

See src/index.ts for a simple reference implementation of the encoder in only 40 lines of code.

Efficiency

Each character carries about 3× as much data as a Base64 character (19 bits versus 6), making the encoded text visually much shorter. This comes at the cost of roughly twice the size overhead of Base64, relative to the original binary data, when stored in memory or on disk.

In other words, this is not suitable for large binaries if size matters (nor is Base64). But it's useful for encoding smaller binary data, such as UUIDs and hashes, where length matters, or even larger binaries if the inflation in size is an acceptable tradeoff.

Some theoretical numbers

| | Base64 | Base524288 |
| --- | --- | --- |
| Efficiency | 75% | 59.375% |
| Size | 133.333% | 168.4210526% |
| Length | 133.333% | 42.1052632% |

Actual numbers will vary depending on the amount of padding.
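The theoretical figures follow from two numbers per encoding: the payload bits each character carries and the bits it occupies in UTF-8. A small sketch of that arithmetic, where the `Alphabet` type and the `theoretical` function are names made up for this illustration:

```typescript
// Illustrative arithmetic only; names are hypothetical.
interface Alphabet {
  payloadBits: number; // bits of input carried by one character
  storedBits: number;  // bits one character occupies in UTF-8
}

const base64: Alphabet = { payloadBits: 6, storedBits: 8 };
const base524288: Alphabet = { payloadBits: 19, storedBits: 32 };

function theoretical(a: Alphabet) {
  return {
    efficiency: a.payloadBits / a.storedBits, // payload share of stored bits
    size: a.storedBits / a.payloadBits,       // stored size relative to input
    length: 8 / a.payloadBits,                // characters per input byte
  };
}
```

For Base524288 this gives 19/32 = 59.375% efficiency, 32/19 ≈ 168.42% size and 8/19 ≈ 42.11% length. For Base64 it gives 75%, 133.33% and 133.33%.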

Textual representation

Each unassigned code point will be displayed as a missing glyph – that is, a tofu – which differs by system and font [4].

Unlike many base encodings, the encoded text doesn't contain characters that have special meaning in code and protocols. And unlike the related Base122, it doesn't contain characters that make keyboard navigation, selection and copy/paste difficult. Tofus are unproblematic.

That said, tofus aren't exactly typable. They're only vaguely readable if they show their code points, like in Firefox, otherwise it's all tofu.

Examples

| Input | Output | Length | Size in UTF-8 |
| --- | --- | --- | --- |
| 128-bit | 򂓧򒳫񴮕񯐨򼶘񅼍򈦠 | 7 tofus | 224 bits (175%) |
| 256-bit | 򏘲񭯸򡋒񅉚񈭼򛬚񛊌򡡴񛕱򥕩򯿖򞞨񂔜򰠀 | 14 tofus | 448 bits (175%) |
| 512-bit | 򱞂򶭼񰈶򫺬򞗅򧤝򵿕򊓱񎳱񭾡񁿄򮚗񳶂򞥵񰈣񼸇򱟆򐗑񍰒򠂸򵣬񆢱񙂙񇍁񙧠񥬷񫛞 | 27 tofus | 864 bits (168.75%) |
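The lengths and sizes in this table can be reproduced by rounding the input up to a whole number of 19-bit groups, which is where the padding overhead comes from. A minimal sketch, with `tofuStats` as a hypothetical helper name:

```typescript
// Illustrative arithmetic; tofuStats is a made-up name.
function tofuStats(inputBits: number) {
  const tofus = Math.ceil(inputBits / 19); // last tofu is padded if needed
  const sizeBits = tofus * 32;             // 4 UTF-8 bytes per tofu
  return { tofus, sizeBits, ratio: sizeBits / inputBits };
}
```

For 128, 256 and 512 input bits this gives 7, 14 and 27 tofus, with size ratios of 1.75, 1.75 and 1.6875. The ratio approaches the theoretical 32/19 as the padding becomes a smaller share of the input.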

Compared to popular binary-to-text encodings of UUIDs…

| Encoding | Output | Length | Size in UTF-8 |
| --- | --- | --- | --- |
| Base16 | 90f119cf-9fc4-4090-acc1-0000bc711dc3 | 36 chars | 288 bits (225%) |
| Base64 | kPEZz5/EQJCswgAAvHEdww | 22 chars | 176 bits (137.5%) |
| Base524288 | 򩦠򄢧򮨲񞌶񒧼񳓜񶄠 | 7 tofus | 224 bits (175%) |

Base524288 encoded UUIDs are:

  • About ¾ the size of the standard UUID format (224 vs. 288 bits)
  • About 1.27× the size of the Base64 encoding (224 vs. 176 bits)

In a monospaced typeface, they are:

  • Less than ⅕ the length of the standard UUID format (7 vs. 36 characters)
  • Less than ⅓ the length of the Base64 encoding (7 vs. 22 characters)

In a proportional typeface, they are:

  • About the length of the standard UUID format
  • About ½ the length of the Base64 encoding

Noncharacters

Unicode reserves the last two code points of each plane as noncharacters – characters that don't want to be characters.

When encoding, any noncharacters that appear must therefore be replaced with their respective substitute code points.

Special tofu:

  • U+4FFFE → U+C03FE
  • U+4FFFF → U+C03FF
  • U+5FFFE → U+C07FE
  • U+5FFFF → U+C07FF
  • U+6FFFE → U+C0BFE
  • U+6FFFF → U+C0BFF
  • U+7FFFE → U+C0FFE ☕️
  • U+7FFFF → U+C0FFF
  • U+8FFFE → U+C13FE
  • U+8FFFF → U+C13FF
  • U+9FFFE → U+C17FE
  • U+9FFFF → U+C17FF
  • U+AFFFE → U+C1BFE
  • U+AFFFF → U+C1BFF
  • U+BFFFE → U+C1FFE
  • U+BFFFF → U+C1FFF

When decoding, any substitute code points encountered must be replaced with their respective noncharacters before reading their binary data.
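The 16 pairs in the list follow a regular pattern: the substitute keeps the low 10 bits of the noncharacter and stores the plane number, minus 4, in the three bits above them, inside plane 12. The sketch below relies on that observation. The closed-form formula is inferred from the list rather than stated in the spec, and the names `substitute` and `restore` are hypothetical.

```typescript
// Hypothetical helpers; the formula is inferred from the substitution list.

// Encoding side: replace a plane 4 to 11 noncharacter with its substitute.
function substitute(codePoint: number): number {
  const inRange = codePoint >= 0x40000 && codePoint <= 0xbffff;
  const isNonchar = (codePoint & 0xfffe) === 0xfffe; // ends in FFFE or FFFF
  if (!inRange || !isNonchar) return codePoint;
  const plane = codePoint >> 16; // 4..11
  return 0xc0000 | ((plane - 4) << 10) | (codePoint & 0x3ff);
}

// Decoding side: map a substitute back to the noncharacter it stands for.
function restore(codePoint: number): number {
  const inRange = codePoint >= 0xc0000 && codePoint <= 0xc1fff;
  const isSubstitute = (codePoint & 0x3fe) === 0x3fe; // ends in 3FE or 3FF
  if (!inRange || !isSubstitute) return codePoint;
  const plane = ((codePoint >> 10) & 0x7) + 4;
  return (plane << 16) | 0xfc00 | (codePoint & 0x3ff);
}
```

For example, `substitute(0x7fffe)` returns 0xC0FFE and `restore(0xc0ffe)` returns 0x7FFFE, while any other code point passes through both functions unchanged.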

Inspiration

License


Footnotes

  1. Plus 16 substitute code points in plane 12, see Noncharacters.

  2. It looks like 20 bits, but the last two bits of the first byte together represent a single bit, taking the value 01 or 10, so that the code point stays within planes 4 to 11. The value 11 is used for noncharacter substitutes.

  3. Making vegan tofu omelette without breaking any eggs, so to speak.

  4. I like the glyph used by Firefox, a rectangle displaying the code point in hex. It has that binary feel to it. I also like GitHub's glyph. It looks like a block of tofu that has been sliced into 6 pieces.
