Make 𝟏, 𝟎, 𝟙, 𝟘 into valid identifiers for DSLs #26808

Closed
jlperla opened this issue Apr 14, 2018 · 16 comments
Labels: help wanted, unicode

@jlperla
Contributor

jlperla commented Apr 14, 2018

Looking at https://docs.julialang.org/en/latest/manual/unicode-input/#Unicode-Input-1, there are a few characters that would make excellent identifiers for linear algebra and probability DSLs:

U+1D7CE 𝟎 \bfzero Mathematical Bold Digit Zero
U+1D7CF 𝟏 \bfone Mathematical Bold Digit One
U+1D7D8 𝟘 \bbzero Mathematical Double-Struck Digit Zero
U+1D7D9 𝟙 \bbone Mathematical Double-Struck Digit One

Note that this is conservative: it leaves as many of the other Unicode digits as possible as invalid identifiers. In particular, \bsanszero and \bsansone look similar, but are left as invalid identifiers for now.

The main use case is to add automatically reshaping matrices/vectors of 1s and 0s to https://github.com/JuliaArrays/FillArrays.jl, in the spirit of the UniformScaling operator, currently denoted by I. Of course, this library would not intend to lay claim to that notation, but would want to use it. The 𝟘 and 𝟙 might be useful for people who wish to use const 𝟙 = 𝟏 to match their LaTeX notation, or could allow writing new indicator functions, etc. I know I would use 𝟙(a > b) for that to match the algebra.
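A minimal sketch of the indicator-function use case, assuming a Julia version in which these characters are accepted as identifiers (the thread puts the change on the 1.3 milestone). The function body is a hypothetical illustration and is not taken from FillArrays.jl or any other package:

```julia
# Hypothetical indicator function named with U+1D7D9 (\bbone).
# Returns 1 when the condition holds and 0 otherwise.
𝟙(condition::Bool) = condition ? 1 : 0

a, b = 3, 2
indicator = 𝟙(a > b)                            # 1, since 3 > 2

# Count the positive entries by summing indicators.
n_positive = sum(𝟙(x > 0) for x in (-1, 2, 5))  # 2
```

On a parser that still rejects digit-variant characters, the first line fails to parse, which is the limitation this issue is about.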

@ararslan added the unicode label Apr 14, 2018
@digital-carver
Contributor

I can see the appeal of the idea, but I think the benefit is too small for the potential readability and maintenance costs. Between font variations, (anti-)aliasing, rendering choices, and syntax highlighting, the distinctions between the different zeros (0 𝟎 𝟘) or ones (𝟙 1 𝟏) can get pretty blurry. The idea of potential gotchas in such basic entities as 0s and 1s (and the confused Stack Overflow questions resulting from them) is not an appealing prospect.

@jlperla
Contributor Author

jlperla commented Apr 14, 2018

I think you can make that case about almost all Unicode characters that have a similar ASCII character. Whether it makes sense in a particular case or not is a very reasonable question, and library specific.

Libraries like ApproxFun.jl use symbols like 𝒟, which looks like a D and matches the math notation of using script letters to denote differential operators.

The only difference with what I am suggesting is that (right now) library writers don't have the option to make an alias to variable names that start with the fancy number-looking characters.

If this were changed, then the discussion could turn to what you are bringing up: is introducing those aliases a good idea (since they never should be required)? Your perspective is reasonable, but it may be domain specific.

@StefanKarpinski
Member

StefanKarpinski commented Apr 14, 2018

There are three choices here:

  1. Disallow all digit-variant characters entirely (what we do now).
  2. Allow digit-variant characters to be used as letters, distinct from the digits they correspond to.
  3. Allow digit-variant characters to be used as if they were simply the plain digit, i.e. make 𝟘 another way of writing 0.

The last option seems confusing and fairly pointless to me. Unlike characters like μ and µ, which are different Unicode characters that look exactly alike, these are not likely to be accidentally input when plain digits were intended. Why allow weird digit variants when literally every keyboard ever created has plain digits directly on it? The only way 𝟘 is likely to end up in a program is if someone intended to enter it.

The current behavior of disallowing digit variants entirely seems like a waste of potentially nice syntax. I have yet to encounter a font where these digit variants render and are not visually distinguishable from the corresponding digits.

That leaves option 2: allowing digit-variant characters to be used as letters, which is what this issue proposes. I can understand that people might not want to use these bindings, which is fine; in that case, don't use them. But why should we prevent people who want to from doing so? Especially given that the only other potential use for them is not really sensible.
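To make option 2 concrete, here is a small sketch of what "used as letters, distinct from the digits" means at the parser level. It assumes a Julia version that includes the fix referenced at the end of this thread; on older versions the `Meta.parse` call raises a parse error instead:

```julia
# Parse an expression that mixes a digit-variant character with a plain digit.
ex = Meta.parse("𝟘 + 0")

# Under option 2, 𝟘 parses as an identifier (a Symbol) and 0 as an Int literal.
lhs = ex.args[2]
rhs = ex.args[3]

lhs_is_identifier = lhs isa Symbol
rhs_is_literal    = rhs isa Int
```

Under option 3, by contrast, both arguments would have been the integer literal 0.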

@digital-carver
Contributor

> I think you can make that case about almost all unicode characters that have a similar ascii character.

True, that's why I mentioned "such basic entities as 0s and 1s". 'Is this identifier a 𝒟 or a D' is a very different sort of question from 'is this thing here a literal or an identifier'. It's a small mental cost when going through a codebase, but such costs add up pretty quickly.

> If this were changed, then the discussion could turn to what you are bringing up: is introducing those aliases a good idea (since they never should be required)? Your perspective is reasonable, but it may be domain specific.

I'm a fan of DSLs and would in theory love to have custom infix operators (#16985) and even custom infix named functions, hoping the users use them wisely. But sometimes the guardrails have to be in the language, and in my opinion this is one of those cases.

> I can understand that people might not want to use these, in which case, simply don't. But why should we prevent people who want to from doing so?

The same reason the codepoints were restricted in the first place (#5936) - code gets passed down and across teams and people, and sometimes it's more important to prevent "crazy things" being introduced by someone, than to provide a minor nicety.

@dlfivefifty
Contributor

As far as I can tell, any argument that this is confusing applies equally to ℯ (\euler). So whatever discussion led to changing e to ℯ in Base applies here.

@JeffBezanson
Member

Agreed; we're way past the point of having any sort of policy against potentially-confusable characters. I agree with Stefan that when fonts have 𝟘 and 𝟙 they tend to be more distinguishable than some other examples like e and ℯ.

@StefanKarpinski
Member

> The same reason the codepoints were restricted in the first place

The reason to restrict code points was to allow for implementing sane uses of code points in the future without breaking code, not to prevent people from doing silly things. If people want to write unreadable code, they will, no matter what we do to try to prevent it.

I think the de facto policy with potentially-confusable characters is that we identify characters that are easily confused both on input and appearance so there's a real chance that someone may input one when they intended to input the other and not be easily able to tell that this is what has happened. The normal "e" versus Euler's "ℯ" fails this test on both counts: there's little chance that anyone will have input "ℯ" by accident when they meant "e" since "e" is on every keyboard and "ℯ" is on none; they also look fairly distinct in most fonts so even if someone managed to do this somehow, they'd be able to notice what's going on. The case of "μ" and "µ" satisfies this criterion since neither character is on a standard keyboard and some input methods give you one while others give you the other and they look identical so it's extremely hard to discover that this is what's going on after the fact. Applying this test to the "1" versus "𝟙" case leads to the same conclusion as "e" versus "ℯ"—i.e. that they should be considered distinct characters.
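The two sides of this test can be checked from the REPL. The sketch below relies on the behavior described in the Julia manual, where the micro sign (U+00B5) is treated as equivalent to Greek mu (U+03BC) in identifiers, while script ℯ (U+212F) stays distinct from the letter e. The variable names are only for illustration:

```julia
# "μ" versus "µ": different code points, folded into one identifier by the parser.
micro = Meta.parse("\u00b5")   # MICRO SIGN
mu    = Meta.parse("\u03bc")   # GREEK SMALL LETTER MU
same_mu = (micro === mu)       # true

# "e" versus "ℯ": kept as two distinct identifiers.
e_plain = Meta.parse("e")
e_euler = Meta.parse("\u212f") # SCRIPT SMALL E
same_e = (e_plain === e_euler) # false
```

Treating "1" and "𝟙" like the second pair means they stay distinct: one a literal, the other a name.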

@digital-carver
Contributor

> we identify characters that are easily confused both on input and appearance so there's a real chance that someone may input one when they intended to input the other and not be easily able to tell that this is what has happened

My concern was more about later readability than about ambiguity during input, "code is read a lot more than it's written" and all that. But since this is probably going in, can we have it so that there's one canonical identifier zero (not multiple) to go alongside the one canonical literal 0 (and similarly for 1)? My vote is for \bbzero and \bbone to be the allowed identifiers, since they're easier to distinguish visually from 0 and 1 (especially in the presence of syntax highlighting, which often makes a bold vs non-bold distinction not so clear).

@dlfivefifty
Contributor

I see no reason to limit this to just one, when so many of the "1"s are easily distinguished. No one is going to confuse any of the following for each other or for 1 and so at the very least they all should be legitimate identifiers: 𝟙, ₁, ❶, ⓵, ①, 1️⃣
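Which of these the parser currently accepts can be checked with `Base.isidentifier`. This is only a probe: the results depend on the Julia version, so none are asserted here. The multi-code-point keycap emoji is left out because it begins with an ASCII 1:

```julia
# Candidate "ones" from the list above, each a single code point.
candidates = ["𝟙", "₁", "❶", "⓵", "①"]

for s in candidates
    # Show the code point in hex next to the parser's verdict.
    cp = uppercase(string(codepoint(first(s)), base = 16))
    println("U+", cp, "  ", s, "  isidentifier = ", Base.isidentifier(s))
end
```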

@sbromberger
Contributor

ref: #10762

@jlperla
Contributor Author

jlperla commented Aug 6, 2019

@JeffBezanson @StefanKarpinski (cc @dlfivefifty) I realized that a feature freeze is coming soon and was wondering whether you would still support a PR that implements this. It would be very nice to sneak it into the 1.3 release.

@dlfivefifty
Contributor

For the record, 1.3 has a lot of exciting stuff in it already, and so postponing this to 1.4+ makes sense to me.

@jlperla
Contributor Author

jlperla commented Aug 6, 2019

Oh for sure. This would not be the highlight of the release by any means! But if it is a low "cost" and low probability of side effect issue, it would mean I can write some cool DSLs 6 months earlier.

@JeffBezanson added the triage label Aug 7, 2019
@JeffBezanson
Member

Triage is ok with this.

@JeffBezanson added this to the 1.3 milestone Aug 8, 2019
@JeffBezanson removed the triage label Aug 8, 2019
@StefanKarpinski
Member

Explicitly, triage is ok with option 2: Allow digit-variant characters to be used as letters, distinct from the digits they correspond to. Now it merely needs an implementation.

@StefanKarpinski added the help wanted label Aug 8, 2019
@JeffBezanson
Member

Fixed by #32838


7 participants