Wrong LaTeX-Unicode mapping of \varepsilon #14751

Closed
hmarthinsen opened this issue Jan 21, 2016 · 20 comments
Labels
unicode Related to unicode characters and encodings

Comments

@hmarthinsen

\varepsilon is currently mapped to ɛ (U+025B latin small letter open e). This is wrong. The correct mapping is ε (U+03B5 greek small letter epsilon).
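For reference, a minimal sketch (in Python, using only the standard `unicodedata` module) that prints the code point and official Unicode name of each of the two characters. The variable names are illustrative only.

```python
import unicodedata

# The two look-alike characters discussed in this issue.
open_e = "\u025b"         # what \varepsilon currently produces
greek_epsilon = "\u03b5"  # what the report says it should produce

for ch in (open_e, greek_epsilon):
    print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch)}")

# Despite looking alike, the two strings compare unequal.
print(open_e == greek_epsilon)
```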

@ivarne
Member

ivarne commented Jan 22, 2016

As you can see in /base/latex_symbols.jl this mapping is autogenerated from https://www.w3.org/Math/characters/unicode.xml, and as far as I can tell the script faithfully copies the mapping from the source. There might be a bug in the w3 mappings though.

cc: @stevengj

@ivarne ivarne added the unicode Related to unicode characters and encodings label Jan 22, 2016
@hmarthinsen
Author

I think this may be because \varepsilon is assigned to several Unicode characters in https://www.w3.org/Math/characters/unicode.xml, but the wrong Unicode character is selected to represent it in https://github.com/JuliaLang/julia/blob/master/base/latex_symbols.jl. Does this trigger println("# duplicated symbol $L ($id)") on line 29? Maybe an exception should be added as has been done for \perp and \bot on line 32.
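The idea of an explicit exception for a duplicated command could look roughly like the sketch below. This is not the actual generation script: the `entries` list, the `preferred` table, and `build_table` are all hypothetical stand-ins, written in Python purely for illustration.

```python
# Hypothetical (LaTeX command, character) pairs standing in for entries
# parsed from unicode.xml; the real file and script are not reproduced here.
entries = [
    ("\\varepsilon", "\u025b"),
    ("\\varepsilon", "\u03b5"),
    ("\\alpha", "\u03b1"),
]

# Hand-written exceptions: for a duplicated command, prefer this character.
preferred = {"\\varepsilon": "\u03b5"}

def build_table(entries, preferred):
    table = {}
    for latex, char in entries:
        if latex in table and table[latex] != char:
            print(f"# duplicated symbol {latex} (U+{ord(char):04X})")
            # Keep the earlier entry unless an exception selects this one.
            if preferred.get(latex) != char:
                continue
        table[latex] = char
    return table

table = build_table(entries, preferred)
```

Without the `preferred` entry, the first mapping encountered wins, which is one way an unintended code point could end up in the table.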

This bug affects JunoLab/atom-latex-completions#3

@stevengj
Member

Seems reasonable to add an exception here and update the table. For \epsilon, we are using U+03F5, and that has compatibility decomposition U+03B5, reinforcing that the latter is the natural choice for \varepsilon.
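The compatibility decomposition of U+03F5 can be checked against the Unicode character database. A small sketch using Python's standard `unicodedata` module (variable names are illustrative):

```python
import unicodedata

lunate = "\u03f5"  # GREEK LUNATE EPSILON SYMBOL, used for \epsilon

# decomposition() returns the raw mapping string from the Unicode database.
print(unicodedata.decomposition(lunate))

# NFKC normalization applies compatibility decompositions.
folded = unicodedata.normalize("NFKC", lunate)
print(f"U+{ord(folded):04X} {unicodedata.name(folded)}")
```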

Also affects ipython/ipython#6380, as well as other Julia editor plug-ins.

@jiahao
Member

jiahao commented Jan 26, 2016

Our choice is not so much wrong as a reflection of historical inconsistencies over the proper mapping of epsilon.
The W3C's XML Entity Definitions even has a special entry documenting the mess that is code point mappings for epsilon.

In this case it looks like we just happen to pick MathML's mappings, which are at variance with other standard definitions for STIX and XML/MathML2.

Digging further:

  1. 10/2003 - "SGML Public entity sets for mathematics and sciences" (pdf), p.80: \varepsilon is explicitly mapped to U+025B. Furthermore they note on p. 9 a discrepancy between MathML and Stix consortium's glyph tables:

    Entity: [epsi][ISOGRK3]
    MathML [U003B5][GREEK SMALL LETTER EPSILON]
    Stix [U003F5][GREEK LUNATE EPSILON SYMBOL]
    epsilon variants. MathML wrong?
    Entity: [epsiv][ISOGRK3]
    MathML [U0025B][LATIN SMALL LETTER OPEN E]
    Stix [U003B5][GREEK SMALL LETTER EPSILON]
    epsilon variants. MathML wrong?
    
  2. 10/2003 - W3C MathML2 and ISOGRK3 instead recommend U+03B5 for \varepsilon.

  3. 11/2010 - Unicode Technical Note 28 also recommends U+03B5 for \varepsilon.

@wsshin
Contributor

wsshin commented Mar 26, 2016

I definitely agree with other people on making an exception here for a few reasons.

If we compile a LaTeX document containing $\varepsilon$ and check the Unicode of the generated character, it is U+03B5. Considering that most Julia users input Greek letters using LaTeX commands, it is reasonable to expect Julia to produce the same Unicode character as LaTeX for \varepsilon, but currently this is not the case. Julia produces U+025B as other people mentioned.

Personally, I am inputting Greek letters in Julia directly using the Greek keyboard layout instead of using the LaTeX command because that is faster. The "e" key in the Greek keyboard also generates U+03B5 rather than U+025B.

This inconsistency could be a source of errors that are hard to track down, as I just experienced. I was extending someone else's code who used \varepsilon to define the variable ɛ. When extending his code, I used the Greek keyboard key "e" to access this variable. Unfortunately, Julia's \varepsilon and the Greek keyboard's "e" generated different Unicode characters, so I was getting an UndefVarError. Because these two different characters looked the same, it took a while to figure out what was going on.
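The failure mode can be reproduced in miniature. The sketch below uses a plain Python dict as a stand-in for a variable namespace (an assumption for illustration, not how Julia resolves names): the name is stored under U+025B and looked up under U+03B5.

```python
# A stand-in namespace where the variable was defined via \varepsilon (U+025B).
namespace = {"\u025b": 1e-6}

typed_from_greek_keyboard = "\u03b5"  # U+03B5, visually near-identical

try:
    value = namespace[typed_from_greek_keyboard]
except KeyError:
    value = None
    # ascii() makes the otherwise invisible difference explicit.
    print("undefined:", ascii(typed_from_greek_keyboard))
```

Printing the escaped form of a suspicious identifier is a quick way to tell the two characters apart when they render identically.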

@nalimilan
Member

> This inconsistency could be a source of errors that are hard to catch, as I just experienced. I was extending someone else's code who used \varepsilon as a variable. When extending his code, I used the Greek keyboard to access this variable. Unfortunately, Julia's \varepsilon and the Greek keyboard's "e" generated different Unicode characters, so I was getting an UndefVarError. Because these two different characters looked the same in the Jupyter notebook, it took a while to figure out what was going on.

This problem will still happen even if we change the mapping (though it will be less frequent). This is the same situation as with mu vs. micro. See #5903.

@Godisemo

+1

@StefanKarpinski
Member

@Godisemo: it's unclear what you're +1 ing here.

@Godisemo

@StefanKarpinski Yeah, I realised that now when you pointed it out. I'm +1 ing the fact that this really is an issue and that I support the proposition to change the \varepsilon expansion from ɛ (https://en.wikipedia.org/wiki/Open-mid_front_unrounded_vowel) to ε (https://en.wikipedia.org/wiki/Epsilon).

@Godisemo

The only thing I see that complicates things is that we have to make the same change in all editor plugins that people use. Personally, I only use vim and atom, so I don't know what other plugins are available for other editors.
https://github.com/JuliaEditorSupport/julia-vim/blob/master/autoload/julia_latex_symbols.vim
https://github.com/JunoLab/atom-latex-completions/blob/master/completions/completions.json

@stevengj
Member

I think the best solution would be to first implement a custom normalization so that ɛ (U+025B latin small letter open e) and ε (U+03B5 greek small letter epsilon) are treated as equivalent in identifiers. Once that is done, we can gradually migrate editor plugins without breaking code. See JuliaStrings/utf8proc#11

(My main concern is that this opens a can of worms, since there are potentially a lot of custom normalizations we might want.)
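A custom normalization of this kind could be sketched as a small fold table applied on top of a standard normalization form. The Python below is only an illustration of the idea; `CUSTOM_FOLDS` and `normalize_identifier` are hypothetical names, and the table contents are an assumption rather than an agreed list.

```python
import unicodedata

# Hypothetical table of extra equivalences applied on top of NFC.
CUSTOM_FOLDS = str.maketrans({
    "\u025b": "\u03b5",  # LATIN SMALL LETTER OPEN E -> GREEK SMALL LETTER EPSILON
    "\u00b5": "\u03bc",  # MICRO SIGN -> GREEK SMALL LETTER MU
})

def normalize_identifier(name: str) -> str:
    """NFC-normalize, then apply the custom folds."""
    return unicodedata.normalize("NFC", name).translate(CUSTOM_FOLDS)

# Both spellings now resolve to the same identifier.
print(normalize_identifier("\u025b_tol") == normalize_identifier("\u03b5_tol"))
```

Keeping the table explicit and short makes it easy to review each added pair individually.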

@StefanKarpinski
Member

I think if we're conservative and take the custom normalizations on a case-by-case basis, it should be ok. The only major danger of each normalization is that someone might be using both letters of a pair that we start to normalize in names that are otherwise indistinguishable, which would break their code. However, any code that does that is either accidentally broken and would be fixed by the normalization, or intentionally obfuscated, which I don't think is a major concern. So the criterion for custom normalization should be at least: would it be crazy to use these two characters in otherwise indistinguishable ways?

@Godisemo

Maybe we could issue a warning if both versions are detected in the same code?
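Such a check might look like the following sketch, which simply tests whether both members of a look-alike pair occur in the same source text. `CONFUSABLE_PAIRS` and `mixed_confusables` are hypothetical names, and the Python is for illustration only.

```python
# Hypothetical pairs of look-alike characters to check for.
CONFUSABLE_PAIRS = [("\u025b", "\u03b5")]

def mixed_confusables(source: str):
    """Return the pairs for which both characters occur in `source`."""
    return [(a, b) for a, b in CONFUSABLE_PAIRS if a in source and b in source]

# Example source text: defined with U+025B, used with U+03B5.
code = "\u025b = 1e-6\nprint(\u03b5)\n"
for a, b in mixed_confusables(code):
    print(f"warning: both U+{ord(a):04X} and U+{ord(b):04X} appear in this file")
```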

@stevengj
Member

If we go the normalization route, I think we would just have a list of codepoints that we treat as (permanently) equivalent, with no warning. i.e. different ways of inputting "ε" should all be equally valid.

@Godisemo

I don't think treating them the same as a long-term plan is a good idea. What if we start doing this for characters that look the same but are totally different, for example Α (capital alpha) and A? What if they look different in other fonts? I just think we should use the correct characters for the specific LaTeX expansions.

@stevengj
Member

Normalization of confusable characters is pretty well established in Unicode. Python 3 does much more aggressive (NFKC) normalization than us, for example.

@Godisemo

Godisemo commented Nov 29, 2016

It gets a bit funny though when you do a search and/or replace in your source file, since no editor I know of treats visually similar characters as equal. Epsilon and varepsilon, though, are treated as equal since they are the same character.

@stevengj
Member

We already do NFC normalization, so probably that bridge has already been crossed. And, as I said, Python 3 already does NFKC normalization and I don't see people complaining.
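To make the difference between the two forms concrete, here is a small Python sketch comparing NFC and NFKC on a few of the characters discussed in this thread, followed by a check of how Python itself stores an identifier written with the micro sign. The sample labels and variable names are illustrative.

```python
import unicodedata

samples = {
    "open e": "\u025b",
    "lunate epsilon": "\u03f5",
    "micro sign": "\u00b5",
    "e + combining acute": "e\u0301",
}

for label, s in samples.items():
    nfc = unicodedata.normalize("NFC", s)
    nfkc = unicodedata.normalize("NFKC", s)
    print(label,
          [f"U+{ord(c):04X}" for c in nfc],
          [f"U+{ord(c):04X}" for c in nfkc])

# Python normalizes identifiers with NFKC while parsing (PEP 3131), so a
# variable written with the micro sign is stored under Greek mu.
ns = {}
exec("\u00b5 = 1", ns)
print("\u03bc" in ns)
```

Note that the open e has no decomposition at all, so neither standard form maps it to the Greek epsilon; that pair would need a custom rule.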

@Godisemo

Yeah, maybe you are right. It would definitely be convenient to treat the visually ambiguous characters as the same. If normalization is a thing then maybe the editors should change instead.

stevengj added a commit to stevengj/julia that referenced this issue Dec 1, 2016
stevengj added a commit to stevengj/julia that referenced this issue Dec 26, 2016
stevengj added a commit to stevengj/julia that referenced this issue Dec 29, 2016
stevengj added a commit to stevengj/julia that referenced this issue Jan 4, 2017
@tkelman tkelman closed this as completed in 62c423b Jan 6, 2017
@stevengj
Member

stevengj commented Jan 6, 2017

At some point after the 0.6 release, we should push this change to the various editor plugins.

8 participants