Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add AbstractChar supertype of Char #26286

Merged
merged 8 commits into from
Mar 7, 2018

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Mar 1, 2018

Fixes #25302. Wherever practical, functions that took ::Char arguments are changed to accept any ::AbstractChar, with fallbacks defined as necessary.

For T<:AbstractChar, UInt32(c::T) is defined to return the Unicode codepoint represented by c (or throw an error if c lies outside of Unicode), and T(x::UInt32) should create a T from the Unicode codepoint x (or throw an error … T may represent only a subset of Unicode). This makes it possible to define generic fallbacks for comparison to Char, output, etcetera. Even ancient character sets like EBCDIC have a well-defined injective mapping into Unicode, so I don't think we sacrifice any useful generality this way.

@stevengj stevengj added the unicode Related to unicode characters and encodings label Mar 1, 2018
@stevengj stevengj requested review from StefanKarpinski and Keno March 1, 2018 21:54
@stevengj
Copy link
Member Author

stevengj commented Mar 1, 2018

An exception to my comment above about injectively mapping other character sets into Unicode is the Han unification, in which different character variants in non-Unicode Japanese encodings are mapped to the same codepoint in Unicode (which viewed them as "font" variations rather than semantically distinct characters).

If you want to define an AbstractChar type representing TRON or similar, however, I still don't see a real problem. Conversion to UInt32 will still produce a Unicode codepoint for people who need Unicode data. You will presumably need to implement custom methods for comparison of one TRON character to another, I/O to non-Unicode streams, or any other operation where conversion to Unicode would be lossy, but you probably would have implemented those methods anyway in order to avoid the performance cost of going through Char.

@Keno
Copy link
Member

Keno commented Mar 1, 2018

One thing we had discussed was whether continuing to use UInt32 to represent a unicode codepoint makes sense or whether, with the introduction of AbstractChar, we shouldn't also have Codepoint <: AbstractChar for that.

@StefanKarpinski
Copy link
Member

CodePoint not Codepoint – it's two words.

@Keno
Copy link
Member

Keno commented Mar 1, 2018

SuRe ;)

base/char.jl Outdated
@@ -24,6 +61,11 @@ function isoverlong(c::Char)
is_overlong_enc(u)
end

# fallback: other AbstractChar types, by default, are assumed
# not to support malformed or overlong encodings.
ismalformed(c::AbstractChar) = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really a good idea to define these fallbacks? They can be wrong for some types, and they are easy to implement. If you implement your own AbstractChar type, it's good to think about this question precisely.

Copy link
Member Author

@stevengj stevengj Mar 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the proposed AbstractChar types I've ever seen, other than Char, don't support malformed characters. So it makes sense to me to make this the default. You can always override it.

Another reason to make this the default is that if ismalformed(c) returns true, you need to define additional methods to decode the character.

base/char.jl Outdated
isless(x::AbstractChar, y::AbstractChar) = isless(Char(x), Char(y))
==(x::AbstractChar, y::AbstractChar) = Char(x) == Char(y)
hash(x::AbstractChar, h::UInt) =
hash_uint64(((UInt32(x) + UInt64(0xd060fad0)) << 32) ⊻ UInt64(h))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use reinterpret, so that invalid codepoints can be hashed (just like for Char)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AbstractChar need not be reinterpretable to UInt32.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you could convert to Char first...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's a good point I guess, since equality is defined in terms of comparison, to Char, hashing should match.

* Only the index of the first code unit of a `Char` is a valid index
* The encoding of a `Char` is independent of what precedes or follows it
* Each `AbstractChar` in a string is encoded by one or more code units
* Only the index of the first code unit of a `AbstractChar` is a valid index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"an". Same below.

"try map(f, collect(s)) or a comprehension instead"))
write(out, c′::Char)
write(out, c′::AbstractChar)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we require all AbstractChar implementations to implement write so that it uses UTF-8? This code relies on this assumption.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a fallback write method that works via conversion to Char.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. But I guess it should be mentioned in the docstring for AbstractChar too? It could sound natural to somebody implementing e.g. a Latin1Char to write it in ISO-8859-1, which wouldn't be correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we could define print to always output UTF-8, and write to output a raw encoded value.

Copy link
Member Author

@stevengj stevengj Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I tend to think that the I/O encoding should be a property of the stream, not the string or character type, because usually all strings in a given stream should use the same encoding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agree: letting the value that's being printed determine the encoding would be a mess. We do need external support for encoded streams that handle this appropriately, but print(io, '∀') where io is a UTF-16 encoded stream it should definitely not write UTF-8 data to io because '∀' is a UTF-8 encode character type. If you have a CharU16 value and a UTF-16 encoded stream it should know that it can output code units directly, of course, but that's just a matter of providing the right print specializations.

@@ -410,7 +410,7 @@ If `count` is provided, replace at most `count` occurrences.
or a regular expression.
If `r` is a function, each occurrence is replaced with `r(s)`
where `s` is the matched substring (when `pat`is a `Regex` or `AbstractString`) or
character (when `pat` is a `Char` or a collection of `Char`).
character (when `pat` is a `AbstractChar` or a collection of `AbstractChar`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"an"

@@ -29,8 +29,7 @@ There are a few noteworthy high-level features about Julia's strings:
a string argument, you should declare the type as `AbstractString` in order to accept any string
type.
* Like C and Java, but unlike most dynamic languages, Julia has a first-class type representing
a single character, called `Char`. This is just a special kind of 32-bit primitive type whose numeric
value represents a Unicode code point.
a single character, called `AbstractChar`. This is just a special kind of 32-bit primitive type whose numeric value represents a Unicode code point.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily 32-bit now. Maybe mention Char or say "by default"?

base/char.jl Outdated
"""
Char

struct InvalidCharError{T<:AbstractChar} <: Exception
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe InexactError could be used here? Just an idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but I'm not sure if that change belongs in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I hadn't realized this type already existed in master.

@stevengj
Copy link
Member Author

stevengj commented Mar 1, 2018

I'm not sure what CodePoint accomplishes?

@Keno
Copy link
Member

Keno commented Mar 2, 2018

The idea of codepoint is to make clear what the conversion is about. convert(UInt32, c) works fine technically of course, but the question is do you really want that? E.g. do you want to be able to push!(UInt32[1], 'a')? I'd argue that that operation doesn't really make much sense. One is an integer the other is a codepoint. They don't really have the same operations (e.g. you can do arithmetic on integers, but not on codepoints). It just seems like much larger of a semantic bridge than we usually ascribe to convert. A related, but separate argument is that packages will want a UTF32-style string anyway, for which CodePoint is the natural character type, so it makes sense to pre-define it for operations that operate on codepoints.

@stevengj
Copy link
Member Author

stevengj commented Mar 2, 2018

@Keno, we could define the constructors Char(int) and UInt32(char) but not convert to avoid accidental conversions. That seems like a sensible thing to me, is consistent with e.g. #16024, and would address your objection without introducing a new type.

(UTF32String in LegacyStrings might want to define a CodePointChar or UTF32Char or something, but I don't see a need for this to be in Base.)

Rather than defining a single new type, it makes more sense to me to define a new function codepoint(c::AbstractChar)::Integer return the codepoint of c in the most appropriate integer type for c (e.g. possibly UInt8 for an ASCII char type), and have codepoint(::Type{<:AbstractChar}) return the corresponding integer type. reinterpret(codepoint(typeof(x)), x) would then return the raw encoded bytes.

Copy link
Member

@StefanKarpinski StefanKarpinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice. I approve modulo comments.

base/char.jl Outdated

rem(x::Char, ::Type{T}) where {T<:Number} = rem(UInt32(x), T)
rem(x::AbstractChar, ::Type{T}) where {T<:Number} = rem(UInt32(x), T)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a bit off that this supports any kind of Number rather than just Integer. Pre-existing issue, I know.

+(x::Integer, y::Char) = y + x
# fallbacks:
isless(x::AbstractChar, y::AbstractChar) = isless(Char(x), Char(y))
==(x::AbstractChar, y::AbstractChar) = Char(x) == Char(y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems better for the fallback comparisons to be done in UInt32.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had it that way, but this way is more general. The issue is that all Unicode codepoints can be represented by Char, but not vice versa.

Converting to UInt32 would mean that ASCIIChar('a') == typemax(Char) would throw an InvalidCharError rather than returning false.

Copy link
Member

@StefanKarpinski StefanKarpinski Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you're saying. Perhaps this then:

isless(x::Char, y::AbstractChar) = isless(x, Char(y))
isless(x::AbstractChar, y::Char) = isless(Char(x), y)
isless(x::AbstractChar, y::AbstractChar) = isless(UInt32(x), UInt32(y))

==(x::Char, y::AbstractChar) = x == Char(y)
==(x::AbstractChar, y::Char) = Char(x) == y
==(x::AbstractChar, y::AbstractChar) = UInt32(x) == UInt32(y)

base/char.jl Outdated

const hex_chars = UInt8['0':'9';'a':'z']

function show(io::IO, c::Char)
function show(io::IO, c::AbstractChar)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is generic because of the call to reinterpret(UInt32, c) – we have no idea what the representation of an abstract character type is – it may not even be 32-bits or a bits type at all. We could extract show_invalid(c::AbstractChar) as a method that characters have to implement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted in another thread, we could use reinterpret(UInt32, Char(c)).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the reinterpret call is only for invalid chars. I agree that refactoring this seems like a good choice.

@@ -140,7 +140,7 @@ See also: [`getindex`](@ref), [`start`](@ref), [`done`](@ref),

start(s::AbstractString) = 1
done(s::AbstractString, i::Integer) = i > ncodeunits(s)
eltype(::Type{<:AbstractString}) = Char
eltype(::Type{<:AbstractString}) = Char # some string types may use another AbstractChar
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some way we can put an ::eltype(s) type assert somewhere to catch this if it's wrong?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure…where would it go?

"try map(f, collect(s)) or a comprehension instead"))
write(out, c′::Char)
write(out, c′::AbstractChar)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agree: letting the value that's being printed determine the encoding would be a mess. We do need external support for encoded streams that handle this appropriately, but print(io, '∀') where io is a UTF-16 encoded stream it should definitely not write UTF-8 data to io because '∀' is a UTF-8 encode character type. If you have a CharU16 value and a UTF-16 encoded stream it should know that it can output code units directly, of course, but that's just a matter of providing the right print specializations.

function length(g::GraphemeIterator)
c0 = typemax(Char)
function length(g::GraphemeIterator{S}) where {S}
c0 = eltype(S)(0x00000000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is '\0' a safe character for this? It's not a malformed character – does the Unicode spec ensure that it can't ever combine with anything?

Copy link
Member Author

@stevengj stevengj Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, \0 has the Grapheme_Cluster_Break property.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I used \0 here since hopefully every conceivable AbstractChar type can encode this.)

value represents a Unicode code point.
* Like C and Java, but unlike most dynamic languages, Julia has a first-class type for representing
a single character, called `AbstractChar`. The built-in `Char` subtype of `AbstractChar`
is a 32-bit primitive type that can represent any Unicode character.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps mention that Char represents Unicode characters as zero-padded UTF-8 bytes rather than as code point values? This seems like as good a place as any to mention that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that we don't want to document the Char representation, lest people rely on it not changing?

@@ -41,9 +41,10 @@ There are a few noteworthy high-level features about Julia's strings:

## [Characters](@id man-characters)

A `Char` value represents a single character: it is just a 32-bit primitive type with a special literal
An `AbstractChar` value represents a single character: it is just a 32-bit primitive type with a special literal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer accurate since an AbstractChar could have any representation at all. The simplest fix is just to leave this as Char.

@stevengj
Copy link
Member Author

stevengj commented Mar 2, 2018

print and write have different semantics for other types, so why not for AbstractChar?

write(io, 0x00) and write(io, 0x0000) output the same (isequal) value in different "encodings", whereas print in both cases outputs 0. So maybe it should be the same for AbstractChar. (This is also how UTF16String was read and written by read/write in LegacyStrings, and it seemed pretty convenient).

print(io, x) is defined to output text, and in that case I agree that the encoding should be determined by io (defaulting to UTF-8) and not by x.

(That is, there would be a fallback print(io, x::AbstractChar) = print(io, Char(x)), but no fallback for write or read: implementations would be responsible for providing the latter, and they would be understood to use a type-specific encoding.)

@StefanKarpinski
Copy link
Member

re: print/write behavior, that seems reasonable to me 👍. @JeffBezanson may have thoughts.

@stevengj
Copy link
Member Author

stevengj commented Mar 2, 2018

I updated the PR to make the print/write distinction discussed above.

I also did a pass over the source code and I found lots of write calls that should be print calls if we want the latter to (potentially) have an io-specific encoding. I fixed most of these, since there should be no performance difference. As an exception, there are various cases (e.g. in printing numbers) where we output raw ASCII bytes to the stream with write or unsafe_write. I left these as-is so as not to affect performance, but future revisions should think about how to make this more generic. Improving this in the future shouldn't require breaking changes, though, so it can wait until 1.1 or later.

@stevengj
Copy link
Member Author

stevengj commented Mar 2, 2018

@Keno, I tried getting rid of the convert functions and ran into trouble right away: base/chars.jl uses const hex_chars = UInt8['0':'9';'a':'z'], which implicitly relies on convert(UInt8, ::Char). Alternatives like UInt8.(['0':'9';'a':'z']) seem much more annoying. So, I'd prefer to leave that change to another PR, if any.

But, as mentioned by @StefanKarpinski above, I think it should be safe to deprecate the implicit Number conversions, leaving only implicit Integer conversions for now.

@Keno
Copy link
Member

Keno commented Mar 2, 2018

I'm actually much more comfortable with an implicit conversion to UInt8, as long as that conversion is only defined where convert(UInt8, c::Char) == reinterpret(UInt32, c::Char) % UInt8.

@Keno
Copy link
Member

Keno commented Mar 2, 2018

But fine to do in a different PR.

@stevengj
Copy link
Member Author

stevengj commented Mar 2, 2018

Added the codepoint(c::Char) function as discussed above.

base/char.jl Outdated
their own implementations of `write` and `read`.
via `codepoint(char)` will not reveal this encoding because it always returns the
Unicode value of the character. `print(io, c)` of any `c::AbstractChar`
produces UTF-8 output by default (via conversion to `Char` if necessary).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps clarify that it's not necessary that print(io, c) produces UTF-8 but rather that the output encoding is determined by io – and the built-in IO types are all UTF-8.

@stevengj
Copy link
Member Author

stevengj commented Mar 2, 2018

Note that to really support non-UTF8 IO streams, we'll eventually want to have some kind of trait(io) that gives us information about the encoding.

Not only do we want to know whether it is UTF-8 (so that e.g. the fast path for print(io, ::String) can be used), but we also want to know whether it is any superset of ASCII (so that e.g. the fast path for writing ASCII numeric digits can be used).

That will involve a separate design process, and should be non-breaking as far as I can tell, so it can happen outside of this PR.

@stevengj
Copy link
Member Author

stevengj commented Mar 7, 2018

AppVeyor CI failure seems to be an unrelated timeout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants