Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Char no longer a subtype of Integer #8816

Merged
merged 1 commit into from
Oct 27, 2014
Merged

Conversation

jakebolewski
Copy link
Member

I feel that this is surprising behavior, and this issue has come up in #5844 and #8569.

All the tests pass except for readlm and parallel (which hangs forever). I guess I'm looking for a up / down vote on this (breaking) behavior before fixing the last remaining issues.

@jakebolewski jakebolewski added the kind:breaking This change will break code label Oct 26, 2014
@johnmyleswhite
Copy link
Member

+1 to this idea

@johnmyleswhite
Copy link
Member

If it helps to argue for this idea, I'd say that codepoints are a convenience numeric encoding for what is clearly a categorical variable that doesn't possess any meaningful numerical properties.

@StefanKarpinski
Copy link
Sponsor Member

I agree but I also don't see why at least comparisons with integers can't work.

@jakebolewski
Copy link
Member Author

If it helps to argue for this idea, I'd say that codepoints are a convenience numeric encoding for what is clearly a categorical variable that doesn't possess any meaningful numerical properties.

I agree but we don't really have a well defined interface to support this concept. @JeffBezanson has pointed out that Ptr's are similar in this regard.

I also don't see why at least comparisons with integers can't work.

I wanted to be conservative, integer comparison can be trivially added back now.

@StefanKarpinski
Copy link
Sponsor Member

So having thought about this a bit more I think that chars should not be numbers but support these very specific operations:

  • comparisons with numbers
  • Char - Char = Int
  • Char + Int = Char

That covers all the useful operations you generally want to do with characters as integers but is specific enough to avoid too much confusion.

@StefanKarpinski
Copy link
Sponsor Member

Oh, in see that you have exactly those in here.

@jakebolewski
Copy link
Member Author

I left out comparisons in the beginning as it was useful for finding what parts of Base this change touched.

I'll point out thatPtr does not allow comparisons with Integers (the other two are defined). From my experience, character equality to integers (==) is a source of bugs, but we have egal as a fallback in this case.

@StefanKarpinski
Copy link
Sponsor Member

Why do you think it's a source of bugs? There's far less use for comparisons of pointers than for characters.

@jakebolewski
Copy link
Member Author

Off the top of my head, I can't think of any two types in Base that define an equality relationship and do not share a common abstract super-type.

@StefanKarpinski
Copy link
Sponsor Member

They do – Any!

@quinnj
Copy link
Member

quinnj commented Oct 26, 2014

The grisu code uses Char == Int comparisons. I wouldn't be surprised if other printing/showing code utilized it somehow.

Also Period types in the Dates module can be equal to Real.

Just a couple of data points.

@jakebolewski
Copy link
Member Author

Ok fair enough :-)

@quinnj are you sure? If you turn off character / integer comparison then all the grisu tests still pass. That might signal that are parts that are not being tested at the moment but I find that hard to believe.

@quinnj
Copy link
Member

quinnj commented Oct 26, 2014

Ah, perhaps they don't. They originally did to mirror the C++ implementation, but I did end up changing over to using raw Array{Uint8,1} buffers instead. I still would have thought there'd have been a few comparisons in there at some point. I guess if not, all the better, right?

@jakebolewski jakebolewski force-pushed the jcb/char_not_integer branch 2 times, most recently from 5de9a78 to eed3724 Compare October 27, 2014 16:08
Char now supports a limited set of "integer like" behavior.

* comparisons with integers
* Char - Char = Int
* Char + Int = Char

update docs and add a NEWS.md entry
@jakebolewski jakebolewski changed the title WIP: Make Char no longer a subtype of Integer Make Char no longer a subtype of Integer Oct 27, 2014
@jakebolewski
Copy link
Member Author

Travis is passing now and I've updated the docs. Should comparison / equality operations be defined for all Numbers or just Integers?

@ivarne
Copy link
Sponsor Member

ivarne commented Oct 27, 2014

Nice to get rid of +(::Char, ::Char) with this PR!

We'll get bug reports if people think '\n' == 10.0 makes sense. Char isn't a Number, so it shouldn't behave like one if people don't complain about missing methods that they need for some reason. #7512 will also help.

@StefanKarpinski
Copy link
Sponsor Member

I agree with @ivarne's take – let's stick with just allowing comparisons to integers for now.

@jakebolewski
Copy link
Member Author

That's what I have now. Good to merge then?

@nalimilan
Copy link
Member

I support @jakebolewski's position that comparing ints and chars is more a source of bugs than anything. Inequalities like '0' != 0 can be particularly confusing. Note that @quinnj stated above he used it, before realizing he had found a better way of doing that: Uint8. I think it may be better to force users to choose between char or int behavior, by using Char or Uint8.

@StefanKarpinski
Copy link
Sponsor Member

Perhaps, but we can make this less disruptive change first and then consider whether to do that. I'm good with merging this now.

@jakebolewski
Copy link
Member Author

Actually, no code in Base relies on Char / Integer comparison to be defined after this PR, so we can trivially switch to that behavior in the future if we so choose.

jakebolewski added a commit that referenced this pull request Oct 27, 2014
Make Char no longer a subtype of Integer
@jakebolewski jakebolewski merged commit 47dac56 into master Oct 27, 2014
@jakebolewski jakebolewski deleted the jcb/char_not_integer branch October 27, 2014 22:08
@@ -40,14 +56,10 @@ promote_rule(::Type{Char}, ::Type{Uint128}) = Uint128

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to keep these bitwise operations for Char?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be useful for masking.

@IainNZ
Copy link
Member

IainNZ commented Oct 28, 2014

PkgEval: This unsurprisingly broke a few things, including people doing something involving getindex, Char and an Array{Bool}

@IainNZ
Copy link
Member

IainNZ commented Oct 28, 2014

TextPlot's error is ERROR:|has no method matching |(::Char, ::Int64), for reference.

@IainNZ
Copy link
Member

IainNZ commented Oct 28, 2014

DataStructures:

ERROR: test error in expression: sort(collect(keys(d))) == ['a':'z']
`isless` has no method matching isless(::Char, ::Char)

(Does this mean you can't sort strings?)

@kmsquire
Copy link
Member

No, you just can't sort Chars.

@IainNZ
Copy link
Member

IainNZ commented Oct 28, 2014

If you can write 'a':'z' and it produces the letters 'a','b',...,'z', it implies (to me) there is some sort of sensible ordering to apply

@Jutho
Copy link
Contributor

Jutho commented Oct 28, 2014

if Char - `Char produces an integer, I would also argue that there is some order, given by the sign of that integer.

On 28 Oct 2014, at 15:47, Iain Dunning notifications@github.com wrote:

If you can write 'a':'z' and it produces the letters 'a','b',...,'z', it implies (to me) there is some sort of sensible ordering to apply


Reply to this email directly or view it on GitHub #8816 (comment).

@jiahao
Copy link
Member

jiahao commented Oct 28, 2014

julia> isless('a', 'c')
ERROR: `isless` has no method matching isless(::Char, ::Char)

julia> 'a' < 'c'
true

I forget how isless and < are supposed to work again (is one a partial order and the other a total order?), but both comparisons probably should work.

@ivarne
Copy link
Sponsor Member

ivarne commented Oct 28, 2014

According to the documentation we should rather implement isless(::Char, ::Char) than <(::Char, ::Char) now that Char isn't a numeric type anymore. There is also a default for < that calls isless.

help?> isless
Base.isless(x, y)

   Test whether "x" is less than "y", according to a canonical
   total order. Values that are normally unordered, such as "NaN",
   are ordered in an arbitrary but consistent fashion. This is the
   default comparison used by "sort". Non-numeric types with a
   canonical total order should implement this function. Numeric types
   only need to implement it if they have special values such as
   "NaN".

ivarne added a commit that referenced this pull request Oct 28, 2014
@andreasnoack
Copy link
Member

Is "canonical total order" CS 101? I know what a total order is, but the canonical part is not obvious to me and Wikipedia's definition:

in the context of sorting, a standard order, typically a total order, obeying certain rules, e.g. lexicographic order for strings

didn't really help as "standard order" seems to be a non-standard term and "obeying certain rules" is rather vague.

@jakebolewski
Copy link
Member Author

@ivarne these errors are gaps in test coverage, so any fix should include tests.

@ivarne
Copy link
Sponsor Member

ivarne commented Oct 28, 2014

Definitely true that. Seems like we have a pretty glaring hole in test coverage for Char. Char only occurs 10 times in /tests/*.

asci = true
d = sbuff.data
for idx in 1:length(d)
(d[idx] < 0x80) ? continue : (asci = false; break)
(char(d[idx]) < char(0x80)) ? continue : (asci = false; break)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't have thought to do this, but now I'm wondering if I should -- what's the benefit?

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this doesn't make sense. This code is really comparing bytes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is because for UTF32 strings this is a Char / Integer comparison, and I removed those definitions some intermediate point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see -- good to know that's what UTF32String looks like under the hood. Thanks.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I got confused because this code is kind of wrong. If it's supposed to work for arbitrary Strings it shouldn't be accessing .data.

@tonyhffong
Copy link

What should be the replacement for the logic for using ncurses where we need to test

c == char(-1)

It's now giving me InexactError()

@ivarne
Copy link
Sponsor Member

ivarne commented Nov 5, 2014

Maybe c == '\Uffffffff' or the somewhat convoluted c == reinterpret(Char,Int32(-1))?

I also have trouble understanding why we think of Char as unsigned. Go makes interprets its Char a signed quantity and all the negative numbers are invalid code points anyway.

@tonyhffong
Copy link

yeah just successfully tested that 5 secs ago. It's not a big deal. Thanks.

@ivarne
Copy link
Sponsor Member

ivarne commented Nov 5, 2014

typemax(Char) will also work (as long as Char is unsigned).

@tkelman
Copy link
Contributor

tkelman commented Nov 5, 2014

Re: signedness of char, I'll bring up #7472 and #7303 in the spirit of unfun, do-not-want, things to watch out for, where char being (sometimes? depending on platform?) a signed fixed-width type caused nastiness like introducing unicode into the parser resulting in the windows binaries no longer working on XP...

@ivarne
Copy link
Sponsor Member

ivarne commented Nov 5, 2014

@tkelman How is issues with the signdness of the C 8 bit datatype char relevant for how we should treat integer conversion to the Julia 32 bit type Char?

I don't say that the signdness matters, I just want there to be reason why we behave differently than Go, which have very similar Unicode semantics with Julia.

@tkelman
Copy link
Contributor

tkelman commented Nov 5, 2014

Just precedent for unforeseen consequences of a related, but not identical, issue.

@StefanKarpinski
Copy link
Sponsor Member

I'd want to know what the sign bit would mean. One option would be to serve as an "invalid" flag while allowing some payload.

@pao
Copy link
Member

pao commented Nov 5, 2014

Not sure if it would come into play here, but Java rejects (valid) byte patterns expressed as hex literals because it treats its byte type as signed:

private static final byte[] sync = {0xFF, 0x12};

is not permitted. This is annoying and I'd appreciate its Julia equivalent working independent of char's signedness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:breaking This change will break code
Projects
None yet
Development

Successfully merging this pull request may close these issues.