Add u16::is_utf16_surrogate #94713

clarfonthey · 2022-03-07T21:48:33Z

Right now, there are methods in the standard library for encoding and decoding UTF-16, but at least for the moment, there aren't any methods specifically for u16 to help work with UTF-16 data. Since the full logic already exists, this wouldn't really add any code, just expose what's already there.

This method in particular is useful for working with the data returned by Windows OsStrExt::encode_wide. Initially, I was planning to also offer a TryFrom<u16> for char, but decided against it for now. There is plenty of code in rustc that could be rewritten to use this method, but I only checked within the standard library to replace them.
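As a rough illustration of the check being proposed, here is a minimal sketch written as a free function so it does not depend on the unstable method. The function `is_utf16_surrogate` below is a stand-in that mirrors the surrogate range test (`0xD800..=0xDFFF`), and `count_surrogate_units` is a hypothetical helper. It uses `str::encode_utf16` rather than the Windows-only `OsStrExt::encode_wide` so that it runs on any platform.

```rust
/// Stand-in for the proposed `u16` method (not the std implementation):
/// returns true if `u` lies in the surrogate range U+D800..=U+DFFF.
const fn is_utf16_surrogate(u: u16) -> bool {
    matches!(u, 0xD800..=0xDFFF)
}

/// Hypothetical helper: counts the surrogate code units that appear
/// in the UTF-16 encoding of `s`.
fn count_surrogate_units(s: &str) -> usize {
    s.encode_utf16().filter(|&u| is_utf16_surrogate(u)).count()
}
```

A string containing only BMP characters yields zero, while each character outside the BMP contributes two surrogate code units.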

I think that offering more UTF-16-related methods on u16 would be useful, but I think this one is a good start. For example, one useful method might be u16::is_pattern_whitespace, which would check whether a code unit has the Unicode Pattern_White_Space property. We can get away with this because all of the Pattern_White_Space characters are in the basic multilingual plane, and hence we don't need to check for surrogates.

rust-highfive · 2022-03-07T21:48:36Z

r? @scottmcm

(rust-highfive has picked a reviewer for you, use r? to override)

ChrisDenton · 2022-03-07T23:00:41Z

> We can get away with this because all of the whitespace characters are in the basic multilingual plane, and hence we don't need to check for surrogates

Is that guaranteed or could that change in the future?

scottmcm · 2022-03-08T00:41:49Z

> We can get away with this because all of the whitespace characters are in the basic multilingual plane, and hence we don't need to check for surrogates

I agree with @ChrisDenton here -- this doesn't seem like a safe assumption to bake into an API. People can convert it to a char themselves if they want to encode that assumption.

But it doesn't need to block this PR.

scottmcm · 2022-03-08T00:44:32Z

library/core/src/num/mod.rs

+    #[unstable(feature = "is_char_surrogate", issue = "none")]
+    #[rustc_const_unstable(feature = "is_char_surrogate", issue = "none")]
+    #[inline]
+    pub const fn is_char_surrogate(self) -> bool {


Given all the _ascii_ methods on u8, this seems like a reasonable thing to add.

However, I think it needs a different name. A char can never be a surrogate, so mentioning it seems wrong.

I think the _ascii_ names and from_utf8 and such mean that it should mention the encoding it's using, not a datatype.

So how about this?

Suggested change

-    pub const fn is_char_surrogate(self) -> bool {
+    pub const fn is_utf16_surrogate(self) -> bool {

clarfonthey · 2022-03-08

You're right; I was originally thinking of going with is_unicode_surrogate but decided it was too long.

My main apprehension with calling it utf16 surrogate is it implies that it's specific to UTF-16, which it actually isn't; it's specific to Unicode, even though its inclusion in Unicode is specific for UTF-16. I will agree that the name is better than the original, though.
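The point that a char can never be a surrogate can be checked directly with `char::from_u32`, which rejects the surrogate range. The sketch below is only an illustration; `unit_to_char` is a hypothetical helper name, not something from this PR.

```rust
/// Hypothetical helper: converts a single UTF-16 code unit to a `char`.
/// Returns `None` for surrogates, because `char` only holds Unicode
/// scalar values and surrogates are excluded from those.
fn unit_to_char(u: u16) -> Option<char> {
    char::from_u32(u32::from(u))
}
```

Every `u16` value outside `0xD800..=0xDFFF` converts successfully, so for a lone code unit this helper returns `None` exactly when the surrogate check would return true.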

scottmcm · 2022-03-08T00:52:55Z

library/core/src/char/decode.rs

@@ -91,7 +91,7 @@ impl<I: Iterator<Item = u16>> Iterator for DecodeUtf16<I> {
            None => self.iter.next()?,
        };

-        if u < 0xD800 || 0xDFFF < u {
+        if !u.is_char_surrogate() {


Hmm, this makes me think that the API might want to be -> Option<bool> or something to distinguish high/low surrogates.

This code looks like it would be better written with exhaustive range patterns today (they almost certainly didn't exist when it was written, though). So the nice code change, to me, would be one that could match c.tell_me_about_surrogate_ness() instead.

clarfonthey · 2022-03-08

So, I was debating on how exactly to introduce these APIs, but what I was thinking is a good logical step is some form of check whether a surrogate is a low or high surrogate, or a way to combine surrogates into a code point. But at least that last point is mostly covered by decode_utf16, so, I wasn't sure what extent was useful.

Either way, there is definitely room for expansion.
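To make the high/low distinction discussed in this thread concrete, here is one possible shape for it, using range patterns in a `match`. This is a sketch only: the `Surrogate` enum and the `classify` and `combine` functions are hypothetical names invented for illustration and are not part of this PR or of the standard library.

```rust
/// Hypothetical classification of a surrogate code unit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Surrogate {
    High,
    Low,
}

/// Returns which kind of surrogate `u` is, or `None` if it is not one.
fn classify(u: u16) -> Option<Surrogate> {
    match u {
        0xD800..=0xDBFF => Some(Surrogate::High),
        0xDC00..=0xDFFF => Some(Surrogate::Low),
        _ => None,
    }
}

/// Combines a high and a low surrogate into the code point they encode.
/// Returns `None` unless the pair is a high surrogate followed by a low one.
fn combine(high: u16, low: u16) -> Option<char> {
    match (classify(high), classify(low)) {
        (Some(Surrogate::High), Some(Surrogate::Low)) => {
            let offset = ((u32::from(high) - 0xD800) << 10) | (u32::from(low) - 0xDC00);
            char::from_u32(0x1_0000 + offset)
        }
        _ => None,
    }
}
```

As noted above, `char::decode_utf16` already covers the combining step for whole sequences; the value of a classifier like this would mainly be in code that inspects individual code units.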

clarfonthey · 2022-03-08T16:00:46Z

> > We can get away with this because all of the whitespace characters are in the basic multilingual plane, and hence we don't need to check for surrogates
>
> Is that guaranteed or could that change in the future?

So, I was partially right; although the White_Space property is open to expansion despite how unlikely this is, Pattern_White_Space (which is larger than just ASCII whitespace, but smaller than general White_Space) is guaranteed by the standard to never include future characters, and is strictly within the BMP. So, this could be a reasonable addition in the future.
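For reference, a check along these lines could look like the following sketch. The function is hypothetical (it is not added by this PR), and the code point list is my understanding of the Pattern_White_Space set: U+0009 to U+000D, U+0020, U+0085, U+200E, U+200F, U+2028, and U+2029. All of them fit in a single `u16`, which is why no surrogate handling is needed.

```rust
/// Hypothetical helper, not a std API: true if the code unit `u` is one of
/// the Pattern_White_Space code points, all of which lie in the BMP.
const fn is_pattern_whitespace(u: u16) -> bool {
    matches!(
        u,
        0x0009..=0x000D | 0x0020 | 0x0085 | 0x200E | 0x200F | 0x2028 | 0x2029
    )
}
```

Note that this is narrower than the general White_Space property: for example, U+00A0 (no-break space) and U+3000 (ideographic space) are not matched.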

clarfonthey · 2022-03-13T22:34:17Z

Renamed, rebased, fixed build errors (hopefully), and opened a tracking issue since this seems desired.

clarfonthey · 2022-03-22T03:41:56Z

@rustbot ready

scottmcm · 2022-03-22T20:26:53Z

This looks good for nightly to me!

@bors r+

bors · 2022-03-22T20:26:54Z

📌 Commit d580367 has been approved by scottmcm

Rollup of 6 pull requests

Successful merges:

- rust-lang#91608 (Fold aarch64 feature +fp into +neon)
- rust-lang#92955 (add perf side effect docs to `Iterator::cloned()`)
- rust-lang#94713 (Add u16::is_utf16_surrogate)
- rust-lang#95212 (Replace `this.clone()` with `this.create_snapshot_for_diagnostic()`)
- rust-lang#95219 (Modernize `alloc-no-oom-handling` test)
- rust-lang#95222 (interpret/validity: improve clarity)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup

rust-highfive assigned scottmcm Mar 7, 2022

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 7, 2022

clarfonthey force-pushed the is_char_surrogate branch 2 times, most recently from 39c0ae4 to 3b3117b Compare March 7, 2022 21:53

scottmcm requested changes Mar 8, 2022

scottmcm reviewed Mar 8, 2022

Dylan-DPC added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 8, 2022

clarfonthey force-pushed the is_char_surrogate branch from 3b3117b to 1608ed3 Compare March 13, 2022 22:26

clarfonthey changed the title ~~Add u16::is_char_surrogate~~ Add u16::is_utf16_surrogate Mar 13, 2022

clarfonthey mentioned this pull request Mar 13, 2022

Tracking Issue for extra UTF-16 methods #94919

Open

3 tasks

clarfonthey force-pushed the is_char_surrogate branch from 1608ed3 to 52023d1 Compare March 13, 2022 22:33

clarfonthey force-pushed the is_char_surrogate branch from 52023d1 to 66705d6 Compare March 22, 2022 01:12

clarfonthey force-pushed the is_char_surrogate branch from 66705d6 to e0e8d33 Compare March 22, 2022 02:26

Add u16::is_utf16_surrogate

d580367

clarfonthey force-pushed the is_char_surrogate branch from e0e8d33 to d580367 Compare March 22, 2022 02:51

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 22, 2022

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 22, 2022

Dylan-DPC mentioned this pull request Mar 23, 2022

Rollup of 6 pull requests #95223

Merged

bors merged commit 25acd93 into rust-lang:master Mar 23, 2022

rustbot added this to the 1.61.0 milestone Mar 23, 2022

clarfonthey deleted the is_char_surrogate branch April 16, 2022 00:35
