Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document that OsString and OsStr are bytes; provide conversions to bytes #95290

Closed
wants to merge 2 commits into from

Conversation

joshtriplett
Copy link
Member

The OsString and OsStr types are wrappers around bytes, and are required by existing stable APIs to be byte-based encodings that are supersets of UTF-8 (since a &str can become an &OsStr without allocation). However, both types are opaque, and do not provide access to raw bytes, primarily because on Windows they use the WTF-8 encoding, the standard for which recommends against exposing it.

Nonetheless, many crates do rely on OsStr and OsString being bytes. We already provide the OsStrExt and OsStringExt extension traits, which get used reasonably widely despite being non-portable. And the os_str_bytes crate has millions of downloads, and tens of thousands of new downloads everyday.

This PR is an effort to seek consensus for making the contents of an OsStr or OsString available as bytes, inspired by recent discussions within the libs-api team.

This is a minimal PR to acheive that goal. This PR consists almost entirely of documentation, updating the module documentation in std::ffi and the type documentation on OsString. This PR only adds two methods: OsStr::as_bytes and OsString::into_vec, and plumbs those methods down to the existing implementations for both UNIX and Windows. Both methods are unstable (and I'll file a tracking issue for them as soon as I take this PR out of draft status).

In particular, this PR does not add any new From or TryFrom or Into or TryInto impls, nor does it add methods to construct OsStr or OsString from bytes (which would require error handling), nor does it attempt to do any further unification of implementations between Windows and UNIX, nor does it extend these types to have more of the functionality of bstr. All of those things could happen in potential future changes, and I'd be happy to make some of those changes myself, but I'd like to focus on the initial consensus for making the change before further work that depends on the change.

As one further note of the implications of this change: there has been a proposal to change OsStr/OsString on Windows from WTF-8 to a more complex encoding that can potentially speed up matching for use with the Pattern API. Since that proposal, we've decided to seal the Pattern API and avoid stabilizing it. I would propose that we not switch the encoding of OsStr/OsString, and proceed with exposing bytes using the current encoding. Another option would be to expose bytes but not specify the exact encoding beyond "superset of UTF-8".

r? @ghost

@joshtriplett joshtriplett added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. I-libs-api-nominated Nominated for discussion during a libs-api team meeting. labels Mar 24, 2022
@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@Mark-Simulacrum
Copy link
Member

Mark-Simulacrum commented Mar 25, 2022

are required by existing stable APIs to be byte-based encodings that are supersets of UTF-8 (since a &str can become an &OsStr without allocation).

It seems like this only requires that there is some variant of OsStr that is byte-based encoded as a superset of UTF-8, right? It would be possible for OsStr to wrap something like enum OsStrInternal { Utf8(&str), Other(???) }, presuming we used some kind of pointer tricks or whatever that are compatible with &str's representation to encode the state -- i.e., not actually an enum. (For example, Other might be encoded by moving the length of the str +isize::MAX, since such lengths aren't valid &str).

I mostly ask/note this because I want to make sure that this kind of option is explicitly mentioned in the writeup (and potentially rejected, either for complexity or infeasibility if (for example) we can't think of an encoding that will work), so that we can point back to it when thinking about this in the future.

@joshtriplett
Copy link
Member Author

are required by existing stable APIs to be byte-based encodings that are supersets of UTF-8 (since a &str can become an &OsStr without allocation).

It seems like this only requires that there is some variant of OsStr that is byte-based encoded as a superset of UTF-8, right? It would be possible for OsStr to wrap something like enum OsStrInternal { Utf8(&str), Other(???) }, presuming we used some kind of pointer tricks or whatever that are compatible with &str's representation to encode the state -- i.e., not actually an enum. (For example, Other might be encoded by moving the length of the str +isize::MAX, since such lengths aren't valid &str).

I mostly ask/note this because I want to make sure that this kind of option is explicitly mentioned in the writeup (and potentially rejected, either for complexity or infeasibility if (for example) we can't think of an encoding that will work), so that we can point back to it when thinking about this in the future.

Fair enough; strictly speaking, the requirement would be that OsStr must be a superset of a UTF-8 str, which is not quite the same as saying that the bytes must be a superset of UTF-8. I think in practice the documentation should not actually make that distinction as it would be confusing to the reader, but libs-api should absolutely allow for that possibility when considering options.

We could, for example, use some encoding that has a second variant for a native [u16] on Windows. I don't think we should do that here, even though we might be able to.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

…o bytes

This commit consists almost entirely of documentation, updating the
module documentation in `std::ffi` and the type documentation on
`OsString`. This only adds two methods: `OsStr::as_bytes` and
`OsString::into_vec`, and plumbs those methods down to the existing
implementations for both UNIX and Windows.
@dylni
Copy link
Contributor

dylni commented Mar 26, 2022

And the os_str_bytes crate has millions of downloads, and tens of thousands of new downloads everyday.

Note that this crate doesn't rely on the internal representation at all. It decodes the encoded u16s from OsStrExt::encode_wide. So, this crate would not make sense to be a reason that the internal encoding could not change. However, the current libstd API does only allow limited changes.

Also, the preferred way to use the crate is to use RawOsStr and RawOsString, which make it rare to need the raw bytes. For example, clap only uses those structs.

I think this PR would still be a good idea, since many people rely on the representation anyway. However, I wanted to clarify those points.

on Windows they use the WTF-8 encoding, the standard for which recommends against exposing it.

One option would be to do the same as os_str_bytes and document limited invariants about the encoding. This would allow many of the same changes that could be made now to the internal representation.

@dylni
Copy link
Contributor

dylni commented Mar 26, 2022

There is one problem if this PR gets merged and starts getting used. Without a wrapper, it becomes very easy to mix WTF-8 and UTF-8 bytes. For searching and concatenation, that would be a problem, since both need special handling of WTF-8 bytes.

Also, if storage is permitted, the encoding could never change.

@joshtriplett
Copy link
Member Author

joshtriplett commented Mar 30, 2022

One option would be to do the same as os_str_bytes and document limited invariants about the encoding.

I have no objections to documenting a lesser guarantee and making the encoding less specified, if that's needed in order to get consensus to make this change.

bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 7, 2022
rustdoc: Early doc link resolution fixes and refactorings

A subset of rust-lang#94857 that shouldn't cause perf regressions, but should fix some issues like https://rust-lang.zulipchat.com/#narrow/stream/266220-rustdoc/topic/ICE.20in.20collect_intra_doc_links.2Ers rust-lang#95290 and improve performance in cases like rust-lang#95694.
@petrochenkov
Copy link
Contributor

#95706 has landed, so af7b7da should no longer be necessary.

@epage
Copy link
Contributor

epage commented Apr 21, 2022

In particular, this PR does not add any new From or TryFrom or Into or TryInto impls, nor does it add methods to construct OsStr or OsString from bytes (which would require error handling), nor does it attempt to do any further unification of implementations between Windows and UNIX, nor does it extend these types to have more of the functionality of bstr. All of those things could happen in potential future changes, and I'd be happy to make some of those changes myself, but I'd like to focus on the initial consensus for making the change before further work that depends on the change.

I appreciate the progress on this and recognize that other crates can benefit from this but I figure it'd be worth documenting where this is deficient for at least one use case, clap. At a high level, clap needs

  • starts_with which we emulate with this PR at the byte-slice level
  • strip_prefix which can't be emulated
  • split_once which can't be emulated

There are a couple of ways to unblock clap:

If there is anything I can do to help move along any of the parts to unblock clap, let me know!

epage added a commit to epage/clap that referenced this pull request Mar 27, 2023
We are doing direct transmutes between `OsStr` and `[u8]`.
rust-lang/rust#95290 would make this natively
supported but I got tired of waitin for it.

This only saves about 1/4s off of `cargo build`.

This took 2.9 KiB off of `cargo bloat --release --example git`
@mortie
Copy link

mortie commented Apr 18, 2023

I'm very interested in the status of this. I currently use unsafe mem::transmute calls (inspired by Clap) in my command line parsing function, and I frankly don't think it's acceptable that correct argument handling (whether it's options parsing or the only uses people have mentioned here) is impossible in portable and safe rust. Has there been any progress?

@epage
Copy link
Contributor

epage commented Apr 18, 2023

@mortie #109698 is under discussion, focusing on codifying that its proper to use transmute

epage added a commit to epage/rust that referenced this pull request May 29, 2023
`OsStr` has historically kept its implementation details private out of
concern for locking us into a specific encoding on Windows.

This is an alternative to rust-lang#95290 which proposed specifying the encoding on Windows.  Instead, this
only specifies that for cross-platform code, `OsStr`'s encoding is a superset of UTF-8 and defines
rules for safely interacting with it

At minimum, this can greatly simplify the `os_str_bytes` crate and every
arg parser that interacts with `OsStr` directly (which is most of those
that support invalid UTF-8).
Dylan-DPC added a commit to Dylan-DPC/rust that referenced this pull request May 30, 2023
Allow limited access to `OsStr` bytes

`OsStr` has historically kept its implementation details private out of
concern for locking us into a specific encoding on Windows.

This is an alternative to rust-lang#95290 which proposed specifying the encoding on Windows.  Instead, this
only specifies that for cross-platform code, `OsStr`'s encoding is a superset of UTF-8 and defines
rules for safely interacting with it

At minimum, this can greatly simplify the `os_str_bytes` crate and every
arg parser that interacts with `OsStr` directly (which is most of those
that support invalid UTF-8).

Tracking issue: rust-lang#111544
bors added a commit to rust-lang-ci/rust that referenced this pull request May 30, 2023
Allow limited access to `OsStr` bytes

`OsStr` has historically kept its implementation details private out of
concern for locking us into a specific encoding on Windows.

This is an alternative to rust-lang#95290 which proposed specifying the encoding on Windows.  Instead, this
only specifies that for cross-platform code, `OsStr`'s encoding is a superset of UTF-8 and defines
rules for safely interacting with it

At minimum, this can greatly simplify the `os_str_bytes` crate and every
arg parser that interacts with `OsStr` directly (which is most of those
that support invalid UTF-8).

Tracking issue: rust-lang#111544
Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this has been superseded by #109698.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.