[UFO4] Separating unicode from glif #77
Comments
I think this is a good idea. This wasn't a problem in the single layer days, but it is now. Glyphs with the same name may have separate Unicode values across layers. There's probably a use case for that, but the general use case is to have one value per name. So, a universal mapping makes sense now. Here's a quick sketch of how I think we could make this work without suddenly breaking lots of code:
Thoughts? |
Strongly disagree. It should be |
Oops, yeah, I had a brain fart. I was thinking about glyphs with multiple Unicode values. This is a much cleaner way to handle that. |
A main design goal also is to make it impossible to have multiple glyphs using the same code point. |
I see two options to format the keys in
I'm leaning towards the second option. |
I prefer the hexified int. |
The formatting should be strictly specified to avoid |
Here's the relevant part of the GLIF spec. This is refreshing my memory on the development of the If a glyph has > 1 code points, how do we indicate the primary one? We handled this in GLIF by saying that the first appearance of a I don't know how we'd handle this in We struggled (aka "were annoyed with") fonts with > 1 glyph mapped to the same code point. I think Verdana had this situation and we were worried about round tripping. Should we worry about this now? It's such an odd edge case. |
It's not an edge case, and by mapping
f["Omega"].unicodes = [0x2126, 0x03A9]  # OHM SIGN, GREEK CAPITAL LETTER OMEGA
vs
cmap = {
    0x2126: "Omega",
    0x03A9: "Omega",
} |
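To make the difference concrete, here is a minimal sketch (not taken from any UFO library) that inverts per-glyph `unicodes` lists into a cmap-style dict. The helper name `buildCmap` and the sample glyph `Ohm` are made up for illustration. It shows that one glyph with several code points converts cleanly, while two glyphs claiming the same code point cannot both be kept once the code point is the key.

```python
def buildCmap(glyphUnicodes):
    """Invert {glyphName: [codePoints]} into {codePoint: glyphName}.

    Returns the cmap plus a dict of code points claimed by more than
    one glyph, which a cmap-style mapping cannot represent.
    """
    cmap = {}
    conflicts = {}
    # Iterate in sorted glyph-name order so the result is deterministic.
    for glyphName in sorted(glyphUnicodes):
        for codePoint in glyphUnicodes[glyphName]:
            if codePoint in cmap:
                conflicts.setdefault(codePoint, [cmap[codePoint]]).append(glyphName)
            else:
                cmap[codePoint] = glyphName
    return cmap, conflicts


glyphUnicodes = {
    "Omega": [0x2126, 0x03A9],  # one glyph, two code points: fine
    "Ohm": [0x2126],            # hypothetical second glyph claiming U+2126
}
cmap, conflicts = buildCmap(glyphUnicodes)
```

With a code-point-keyed mapping the conflict has to be resolved when the data is written, rather than being discovered later by whoever reads it.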
If you mean how to determine from a cmap which is the primary unicode, then yes, that cannot be done unambiguously. The concept of "primary unicode value" is flawed, though, and not really needed. |
Yes. That's what I mean. |
Ok, that is then indeed a bw compat issue we can't easily solve. |
But again, that's only a problem if the concept "primary unicode value" has any value. I think it only becomes problematic in code that is too lazy to properly deal with |
If we structured the plist as |
I really think that code not dealing with glyph.unicodes properly is broken, and that the order should not have semantic meaning. |
It would be nice if UVS (Unicode Variation Sequences) were supported (see https://docs.microsoft.com/en-us/typography/opentype/spec/cmap#format-14-unicode-variation-sequences or https://en.wikipedia.org/wiki/Variant_form_(Unicode)). These require sequences of two unicodes (base character and variation selector character) instead of one unicode at a time. These are useful for CJK, Mongolian, mathematical symbols, emojis and other things. |
Do you have any suggestions for how to do this? I don't know much about these. |
@moyogo: It seems a format 14 cmap subtable is always used together with a regular cmap subtable. So I guess we would be talking about an additional mapping, next to Semantically and practically, I think a structure like this would be most appropriate:
uvs = {
    unicodeVariationSelector1: ({default1, default2, ...}, {nonDefault1: glyphName1, nonDefault2: glyphName2, ...}),
}
The first example in the spec would then look like this:
cmap = {
    0x82A6: "cid7961",
}
uvs = {
    0xE0100: (set(), {0x82A6: "cid1142"}),
    0xE0101: ({0x82A6}, {}),
}
The
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>E0100</key>
<array>
<array/>
<dict>
<key>82A6</key>
<string>cid1142</string>
</dict>
</array>
<key>E0101</key>
<array>
<array>
<string>82A6</string>
</array>
<dict/>
</array>
</dict>
</plist> |
Or, looking at the internals of the fonttools format 14 implementation, perhaps this is better:
uvs = {
    0xE0100: {0x82A6: "cid1142"},  # non-default
    0xE0101: {0x82A6: None},  # default, refer to cmap
}
Slightly nicer plist, too (too bad plist doesn't support None...):
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>E0100</key>
<dict>
<key>82A6</key>
<string>cid1142</string>
</dict>
<key>E0101</key>
<dict>
<key>82A6</key>
<string></string>
</dict>
</dict>
</plist> |
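For reference, here is a rough sketch of how the second structure could be written to and read from a plist with Python's standard `plistlib`. The function names `uvsToPlist` and `uvsFromPlist` are made up for this example. Encoding keys as uppercase hex strings and `None` as an empty string are assumptions that follow the plist shown above, not a settled format.

```python
import plistlib


def uvsToPlist(uvs):
    """Serialize {selector: {base: glyphName or None}} to plist bytes."""
    data = {
        "%04X" % selector: {
            # plist has no None, so a default mapping becomes "".
            "%04X" % base: (glyphName or "")
            for base, glyphName in mapping.items()
        }
        for selector, mapping in uvs.items()
    }
    return plistlib.dumps(data)


def uvsFromPlist(raw):
    """Parse plist bytes back into the Python structure."""
    data = plistlib.loads(raw)
    return {
        int(selector, 16): {
            # An empty string means "default, refer to cmap".
            int(base, 16): (glyphName or None)
            for base, glyphName in mapping.items()
        }
        for selector, mapping in data.items()
    }


uvs = {
    0xE0100: {0x82A6: "cid1142"},  # non-default
    0xE0101: {0x82A6: None},       # default, refer to cmap
}
roundTripped = uvsFromPlist(uvsToPlist(uvs))
```

The hex encoding and decoding on both sides is the extra layer on top of plist that any reader and writer would have to agree on.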
The TTX dump writes a glyph name of "None" for a default variation. I like an empty string better, as nobody will stop you from naming a glyph "None" :) |
UVS are a mechanism to have glyph variants at the Unicode character level. |
Given a Unicode Variation Sequences subtable, converted to Python as in my last comment, the following code is my best guess of how the variation selection process works:

def isVariationSelector(c):
    return 0xFE00 <= c <= 0xFE0F or 0xE0100 <= c <= 0xE01EF

def glyphsFromText(text, cmap, uvs):
    text = [ord(c) for c in text]
    glyphs = []
    for i, c in enumerate(text):
        if isVariationSelector(c):
            if i > 0:
                # Look up the (selector, base character) pair; None means
                # "default", so the glyph already chosen via cmap is kept.
                glyphName = uvs.get(c, {}).get(text[i-1], None)
                if glyphName is not None:
                    glyphs[-1] = glyphName
        else:
            glyphs.append(cmap.get(c, ".notdef"))
    return glyphs

cmap = {
    0x82A6: "cid7961",
}
uvs = {
    0xE0100: {0x82A6: "cid1142"},  # non-default
    0xE0101: {0x82A6: None},  # default, refer to cmap
}
print(glyphsFromText("\u82A6\U000E0100", cmap, uvs))
print(glyphsFromText("\u82A6\U000E0101", cmap, uvs)) |
Alright, more thoughts on how to add a cmap file to the UFO format, as well as support for Unicode Variation Sequences. Cmap:
Unicode Variation Sequences:
|
See also googlefonts/ufo2ft#162 |
I like the idea of simple space-separated text files. The format is simple enough that the parser can be a one-liner. As you already noted, if this cmap.txt file maps from unicode values to glyph names, then the "primary" unicode value for a glyph can no longer be defined, so APIs like this I agree the idea of a primary unicode value for a glyph is flawed, but if we really wished to keep it around, we could say this cmap.txt is not required to be sorted by unicode value, and the first mapping that appears for a given glyph name in this ordered cmap list is considered the "primary" unicode value for that glyph. I don't know if it's worth it, though. |
I think the long term consequence is that glyphs will eventually neither have a To loosen the sort requirement is a nice idea if we indeed must hold on to the notion of "primary unicode", but can some other sorting requirement be invented that ensures a deterministic order? I'd hate it if various tools would output equivalent but differently sorted cmap files. |
how about sorting cmap.txt by the glyph name instead of the unicode value, then within the group of mappings that share the same glyph name, the order is user-defined? (note: i'm still leaning towards simplicity [sorting by unicode] and deprecating the notion of "primary" unicode value) |
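A quick sketch of what that ordering rule could look like in practice, assuming the mappings are held as a list of `(codePoint, glyphName)` pairs. The function names are made up for illustration. It relies on Python's sort being stable, so entries that share a glyph name keep the order the user gave them, and the first code point listed for a glyph is treated as its "primary" value.

```python
def sortByGlyphName(mappings):
    """Sort (codePoint, glyphName) pairs by glyph name only.

    Python's sort is stable, so pairs sharing a glyph name keep the
    order in which the user listed them.
    """
    return sorted(mappings, key=lambda pair: pair[1])


def primaryUnicodes(mappings):
    """Return {glyphName: first code point listed for that glyph}."""
    primary = {}
    for codePoint, glyphName in mappings:
        primary.setdefault(glyphName, codePoint)
    return primary


mappings = [
    (0x2126, "Omega"),  # listed first, so treated as "primary"
    (0x0041, "A"),
    (0x03A9, "Omega"),
]
ordered = sortByGlyphName(mappings)
primary = primaryUnicodes(ordered)
```

The file order is deterministic across glyph names, but within one glyph it still depends on user input, so two tools could write equivalent files that differ.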
That could work.
Yeah. I'm curious to hear others about this issue. As I wrote before, I think any breakage can only come from code using |
Could someone open another issue for Unicode Variation Sequence support? I don't want the discussion of that to be hard to find in the future. |
Note how I wrote "a font" and not "an OpenType font". The way a cmap works for any kind of font engine is that it maps unicode values to glyphs, and never the other way around. It's such a fundamental thing.
This is my weakest argument, so let me give in on that one :)
The fact that ambiguities exist elsewhere in the spec is no reason to not try and avoid it here, especially if the solution is so trivial and obviously correct. Inherent correctness is better than correctness-that-needs-to-be-verified.
That was half in jest. On the one hand we can easily make a tab separated text work even when tab chars can occur in glyph names, on the other hand I don't think it's all that reasonable to allow such invisibles to occur in glyph names. How about NUL characters? Return/newline? Anything < 0x20, really. |
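If glyph names were restricted that way, the check itself would be tiny. This is only a sketch of the "anything < 0x20" idea. The function name is hypothetical, and rejecting empty names is an extra assumption of mine.

```python
def isValidGlyphName(name):
    """Reject empty names and names containing any character below U+0020."""
    # Assumption: an empty glyph name is also invalid.
    return bool(name) and all(ord(ch) >= 0x20 for ch in name)


results = {
    "z z z z": isValidGlyphName("z z z z"),  # spaces are U+0020, allowed
    "tab": isValidGlyphName("a\tb"),         # tab is U+0009, rejected
    "newline": isValidGlyphName("a\nb"),     # newline is U+000A, rejected
    "nul": isValidGlyphName("a\x00b"),       # NUL is U+0000, rejected
}
```

With names validated like this, a tab-separated line can always be split on its first tab without escaping.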
Ambiguities can be created in a plain text file:
Any spec is going to have to deal with these issues. Don't get me wrong. I love plain text files and my first thought when "we need a cmap" came up was
I thought there was something in the contents.plist spec about excluded characters, but it looks like it is only in the example name to file name algorithm. Yikes. The spec should be changed. I'll open an issue for that. |
True, but we were arguing about mapping Even with a text file, it's easy to say (and verify) "the first column must be a unique value", but it's a lot harder if we spec it the other way around. Sure, not impossible, just less elegant and less logical. For years we've been thinking like "glyphs have unicode values". I'm arguing that it's time we should change our thinking towards "unicode code points map to glyphs", as that's a more realistic model of how fonts actually work. Your point about custom formats is well taken. The data we're talking about here is quite flat, and apart from the dictionary aspect that guarantees keys to be unique, the nested plist structures don't buy us much. Sure, it can be made to work by encoding unicode keys as hex strings, but I'm arguing that that additional layer of encoding on top of plist reduces the benefit of the plist standard. But either way, let's first focus on the next point:
Yes. This is probably the most important question in this discussion: what breaks if we stop guaranteeing the order of |
That's a very good point.
I don't know for sure, but I've been thinking about it. In my own work, I tend to use
is simpler and easier to understand than:
Ease of input aside, I looked through some of my code and it looks like I use the "primary Unicode" assumption mostly in interface stuff. (Here's a place in defcon that gets used for this.) The impact of a change to this behavior will only potentially apply to double mapped glyphs and even then it won't be a mission critical change. So, I can't speak for everyone, but I think the impact on my code will be minor. A point that I've been waiting for someone to bring up is that the first item in the UFO Design Philosophy is "The data must be human readable and human editable." and I'd like to see what a Python reader and writer (that assumes that #80 will be put in place) would look like for the proposed format. |
Here's a super minimal dumper/loader. It assumes glyph names don't contain control chars.

from io import StringIO

def cmapdump(cmap, f):
    for uni, glyphName in sorted(cmap.items()):
        f.write("%04X\t%s\n" % (uni, glyphName))

def cmapload(f):
    cmap = {}
    for line in f:
        if line and line[-1] == "\n":
            line = line[:-1]
        uni, glyphName = line.split("\t", 1)
        uni = int(uni, 16)
        cmap[uni] = glyphName
    return cmap

cmap = {
    0x30: "zero",
    ord("a"): "a",
    ord("b"): "b",
    ord("z"): "z z z z",
    0x1e0000: "å ß é"
}
f = StringIO()
cmapdump(cmap, f)
tabSepData = f.getvalue()
print(tabSepData)
f.seek(0)
cmap2 = cmapload(f)
assert cmap == cmap2 |
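The loader above silently lets a later line overwrite an earlier one for the same code point. A stricter variant that enforces the "first column must be unique" rule discussed earlier could look roughly like this. It is a sketch. The name `cmapLoadStrict` is made up, and skipping blank lines is an assumption.

```python
def cmapLoadStrict(lines):
    """Parse tab-separated "HEX<TAB>glyphName" lines into a dict.

    Raises ValueError if a code point appears more than once.
    """
    cmap = {}
    for lineNumber, line in enumerate(lines, 1):
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: blank lines are ignored
        uni, glyphName = line.split("\t", 1)
        codePoint = int(uni, 16)
        if codePoint in cmap:
            raise ValueError(
                "line %d: U+%04X is already mapped to %r"
                % (lineNumber, codePoint, cmap[codePoint])
            )
        cmap[codePoint] = glyphName
    return cmap


# Two code points mapping to one glyph is fine.
good = cmapLoadStrict(["0041\tA\n", "03A9\tOmega\n", "2126\tOmega\n"])

# One code point mapped twice is rejected.
try:
    cmapLoadStrict(["0041\tA\n", "0041\tA.alt\n"])
    duplicateRejected = False
except ValueError:
    duplicateRejected = True
```

Checking uniqueness in this direction is a single dict lookup per line.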
Could someone write this up and PR it? Would be good to look at wording to comment on, as I think the general consensus is that this should happen. |
Yes, I think that's a good set of things for 3.1. |
I’m coming from the public.skipExportGlyphs discussion on the glyphsLib repo and was pointed to #77. You have a very long discussion about a very specific problem that is caused by a structural weakness of the file format. If that were solved properly, we would not need that big change in the first place. I think any information should be stored as close to all other related information as possible, and if something is changed, it should result in the fewest possible changes elsewhere in the data structure. So if a glyph is deleted, it shouldn’t leave info in too many places (cmap, kerning classes) (there is a weak point in my argument with components, I know). There are more properties that have the same problem; the export state is one of them. You are thinking about changing the structure quite a bit, so why not allow discussion about the structure? I suggested that before, but if we are speaking about a new version I’ll try again. I think there are several layers of information needed.
This solves quite a lot of the ambiguities that are in the current spec. You were concerned about the overhead of producing a unicode to glyph mapping. The current structure has a much bigger overhead for producing a single glyph from a designspace. One needs to go through all layer folders in the .ufo and then go through all possible extra .ufos to find all intermediate masters (and again all their layers). So if you have a font with a bunch of extra layers and intermediate masters, you need to read the content of a couple thousand folders just to compile one glyph. |
|
But without some serious changes we will be stuck. |
That is simply not true, unless you're exaggerating to new levels of hyperbole :) To get the data needed for one glyph you don't need to look up more glyph data items (files) than there are masters. It's simply O(N) for N masters. I don't think N will ever go into the thousands, let alone that there will be thousands of folders involved. To get a cmap-like data structure so I can typeset something (anything) I need to parse ALL glyphs from the default layer. And that's very expensive if the font is large. It's O(N) for N number of glyphs in the font. One of the cool properties of the UFO format is that you can read most of it lazily. Unicode values being stored in the glyph data limits this ability, hence this proposal. |
I do not exaggerate. Take a very typical setup from a designer working in Glyphs, where each glyph has a main glyph and a background and maybe some extra layers (copies or brace/bracket layers). If that is stored in a ufo3, it ends up with a couple hundred .glif folders (most layers have individual names). All of those folders have to be parsed to collect all .glifs that belong to one glyph. |
In the designspace it is specified which layers are used for which masters, so 99.9% of those layers are not needed to build a glyph, and don't need to be parsed. |
How much of a use case is it to use a ufo to typeset something? The sfnt format is optimised for that. How often do you need a glyph from a .ufo by unicode lookup? Not during design time (the designer likes to see all of them) and not during production (where the glyphs are probably accessed by index or name). |
I need it all the time, otherwise I wouldn't have posted this proposal. |
I’m interested in what you need it for. |
For quickly previewing text. I don't want to parse 65000 .glif files to be able to preview "The quick brown fox" or whatever sample string. |
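As a rough illustration of that preview use case: with a font-wide cmap already loaded from one file, finding out which glyphs have to be read is a handful of dict lookups. The function names and the tiny sample cmap below are hypothetical. Falling back to ".notdef" for unmapped characters is an assumption.

```python
def glyphNamesForText(text, cmap):
    """Map each character of text to a glyph name via the cmap."""
    return [cmap.get(ord(ch), ".notdef") for ch in text]


def glyphsToLoad(text, cmap):
    """Return the distinct glyph names needed to preview text, sorted."""
    return sorted(set(glyphNamesForText(text, cmap)))


# Tiny sample cmap; a real one would come from the font-wide mapping file.
cmap = {0x54: "T", 0x68: "h", 0x65: "e", 0x20: "space"}
needed = glyphsToLoad("The The", cmap)
```

Only the glyphs named in `needed` would then have to be parsed, however many glyphs the font contains.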
For UFO4 we should consider removing the <unicode> element from the glif format, in favor of a ufo-global character mapping file, say font.ufo/cmap.plist, which would map unicode values to glyph names.