-
Notifications
You must be signed in to change notification settings - Fork 13.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add u16::is_utf16_surrogate #94713
Add u16::is_utf16_surrogate #94713
Conversation
r? @scottmcm (rust-highfive has picked a reviewer for you, use r? to override) |
39c0ae4
to
3b3117b
Compare
This comment has been minimized.
This comment has been minimized.
Is that guaranteed or could that change in the future? |
I agree with @ChrisDenton here -- this doesn't seem like a safe assumption to bake into an API. People can convert it to a But it doesn't need to block this PR. |
library/core/src/num/mod.rs
Outdated
#[unstable(feature = "is_char_surrogate", issue = "none")] | ||
#[rustc_const_unstable(feature = "is_char_surrogate", issue = "none")] | ||
#[inline] | ||
pub const fn is_char_surrogate(self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given all the _ascii_
methods on u8
, this seems like a reasonable thing to add.
However, I think it needs a different name. A char
can never be a surrogate, so mentioning it seems wrong.
I think the _ascii_
names and from_utf8
and such mean that it should mention the encoding it's using, not a datatype.
So how about this?
pub const fn is_char_surrogate(self) -> bool { | |
pub const fn is_utf16_surrogate(self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right; I was originally thinking of going with is_unicode_surrogate
but decided it was too long.
My main apprehension with calling it utf16
surrogate is it implies that it's specific to UTF-16, which it actually isn't; it's specific to Unicode, even though its inclusion in Unicode is specific for UTF-16. I will agree that the name is better than the original, though.
library/core/src/char/decode.rs
Outdated
@@ -91,7 +91,7 @@ impl<I: Iterator<Item = u16>> Iterator for DecodeUtf16<I> { | |||
None => self.iter.next()?, | |||
}; | |||
|
|||
if u < 0xD800 || 0xDFFF < u { | |||
if !u.is_char_surrogate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this makes me think that the API might want to be -> Option<bool>
or something to distinguish high/low surrogates.
This code looks like it would be better written with exhaustive range patterns today (they almost certainly didn't exist when it was written, though). So the nice code change, to me, would be one that could match c.tell_me_about_surrogate_ness()
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I was debating on how exactly to introduce these APIs, but what I was thinking is a good logical step is some form of check whether a surrogate is a low or high surrogate, or a way to combine surrogates into a code point. But at least that last point is mostly covered by decode_utf16
, so, I wasn't sure what extent was useful.
Either way, there is definitely room for expansion.
So, I was partially right; although the |
3b3117b
to
1608ed3
Compare
1608ed3
to
52023d1
Compare
Renamed, rebased, fixed build errors (hopefully), and opened a tracking issue since this seems desired. |
This comment has been minimized.
This comment has been minimized.
52023d1
to
66705d6
Compare
This comment has been minimized.
This comment has been minimized.
66705d6
to
e0e8d33
Compare
This comment has been minimized.
This comment has been minimized.
e0e8d33
to
d580367
Compare
@rustbot ready |
This looks good for nightly to me! @bors r+ |
📌 Commit d580367 has been approved by |
Rollup of 6 pull requests Successful merges: - rust-lang#91608 (Fold aarch64 feature +fp into +neon) - rust-lang#92955 (add perf side effect docs to `Iterator::cloned()`) - rust-lang#94713 (Add u16::is_utf16_surrogate) - rust-lang#95212 (Replace `this.clone()` with `this.create_snapshot_for_diagnostic()`) - rust-lang#95219 (Modernize `alloc-no-oom-handling` test) - rust-lang#95222 (interpret/validity: improve clarity) Failed merges: r? `@ghost` `@rustbot` modify labels: rollup
Right now, there are methods in the standard library for encoding and decoding UTF-16, but at least for the moment, there aren't any methods specifically for
u16
to help work with UTF-16 data. Since the full logic already exists, this wouldn't really add any code, just expose what's already there.This method in particular is useful for working with the data returned by Windows
OsStrExt::encode_wide
. Initially, I was planning to also offer aTryFrom<u16> for char
, but decided against it for now. There is plenty of code in rustc that could be rewritten to use this method, but I only checked within the standard library to replace them.I think that offering more UTF-16-related methods to u16 would be useful, but I think this one is a good start. For example, one useful method might be
u16::is_pattern_whitespace
, which would check if something is the UnicodePattern_Whitespace
category. We can get away with this because all of thePattern_Whitespace
characters are in the basic multilingual plane, and hence we don't need to check for surrogates.