Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Display for invalid UTF-8 in OsStr/Path #136677

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

thaliaarchi
Copy link
Contributor

@thaliaarchi thaliaarchi commented Feb 7, 2025

Fix the Display implementations for OsStr, OsString, Path, and PathBuf, which incorrectly handle formatter flags. This is visible on stable (for paths) or with #![feature(os_str_display)] (for OS strings).

I do the formatting by scanning in two passes, similarly to the the Display impl for str. The first pass counts the number of Unicode scalars to render. Since there's no escaping, this is relatively inexpensive. Then the left padding is written. Then the second pass writes the string with lossy substitutions. Finally, the right padding is written.

I have based it on the logic of Formatter::pad, which I optimized and refactored in #136662. That PR should be reviewed first.

Fixes #136617. See there for more details.

@rustbot
Copy link
Collaborator

rustbot commented Feb 7, 2025

r? @cuviper

rustbot has assigned @cuviper.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Feb 7, 2025
@thaliaarchi thaliaarchi changed the title Handle formatter flags in Display for OsStr/Path Fix Display for invalid UTF-8 in OsStr/Path Feb 7, 2025
@thaliaarchi thaliaarchi force-pushed the os_str-formatter-flags branch from beff1c1 to 4471500 Compare February 7, 2025 20:08
@workingjubilee
Copy link
Member

cuviper will be busy for a bit longer probably.
r? libs

@rustbot rustbot assigned Noratrieb and unassigned cuviper Feb 13, 2025
@workingjubilee
Copy link
Member

also cc @ChrisDenton?

@ChrisDenton
Copy link
Member

I'll review this properly when I have a moment (probably tomorrow) but I would note there are a number of incidental changes, like formatter arguments changed to f or things moved around. Which is fine but does make picking out the consequential changes more work. Also there are a lot of new uses of #[inline]. Is this something you've benchmarked or seen a difference in codegen?

@thaliaarchi
Copy link
Contributor Author

I would note there are a number of incidental changes, like formatter arguments changed to f or things moved around. Which is fine but does make picking out the consequential changes more work. Also there are a lot of new uses of #[inline]. Is this something you've benchmarked or seen a difference in codegen?

Those changes came out of wanting to synchronize the bytes.rs and wtf8.rs versions of OsString/OsStr so they're easier to diff between each other. Those changes are all confined to a3cf76c (Synchronize platform adaptors for OsString/OsStr), so they'd be easier to selectively or wholesale revert. I may have been overzealous in normalization, but I tried to minimize reordering.

I added #[inline] to all inherent methods of the OsString/OsStr shims, because it seemed that was already the rough pattern. bytes.rs has more inlining than wtf8.rs, so I added the corresponding ones to wtf8.rs. Then, the common missing ones have no discernible pattern to me. They're not divided by non-allocating/allocating. Perhaps the pattern is that UTF-8 validation isn't inlined? Since these types are merely the inner values in OsStr/OsString, I put inline on all methods and let those public types dictate inlining. I have not inspected codegen or run benchmarks.

If it would be better, I could split it into two PRs, one for the Display fixes and one for organization.

@ChrisDenton
Copy link
Member

ChrisDenton commented Feb 14, 2025

Those changes came out of wanting to synchronize the bytes.rs and wtf8.rs versions of OsString/OsStr so they're easier to diff between each other. Those changes are all confined to a3cf76c (Synchronize platform adaptors for OsString/OsStr), so they'd be easier to selectively or wholesale revert. I may have been overzealous in normalization, but I tried to minimize reordering.

That seems like a fair motivation so I'll say it's fine as-is.

I added #[inline] to all inherent methods of the OsString/OsStr shims, because it seemed that was already the rough pattern. bytes.rs has more inlining than wtf8.rs, so I added the corresponding ones to wtf8.rs. Then, the common missing ones have no discernible pattern to me. They're not divided by non-allocating/allocating. Perhaps the pattern is that UTF-8 validation isn't inlined? Since these types are merely the inner values in OsStr/OsString, I put inline on all methods and let those public types dictate inlining. I have not inspected codegen or run benchmarks.

I feel #[inline] has historically gotten added on "vibes" so I was wondering if there was a motivation here. As above, I'm not against aligning the two modules so I'll not probe further. Maybe the use of #[inline] can be investigated separately but it needn't be done for this PR.

@Noratrieb Noratrieb removed their assignment Feb 16, 2025
@thaliaarchi thaliaarchi force-pushed the os_str-formatter-flags branch from 4471500 to d52e2b2 Compare February 16, 2025 23:16
@rust-log-analyzer

This comment has been minimized.

@thaliaarchi
Copy link
Contributor Author

thaliaarchi commented Feb 16, 2025

Thanks everyone for bearing with me, since I'm new contributing to Rust. Since this PR tries to do too much at once, I'm splitting it into three PRs for more focused review. They do not depend on each other and can be reviewed in any order. The feedback so far has only been on the refactors, which are now in #137154.

@thaliaarchi thaliaarchi force-pushed the os_str-formatter-flags branch 2 times, most recently from 84af8a4 to 6f01d7d Compare February 20, 2025 03:09
@thaliaarchi
Copy link
Contributor Author

thaliaarchi commented Feb 20, 2025

@rustbot blocked

We're waiting on #136662, since feedback there affects this, and possibly also #136693, since we're both touching lossy UTF-8 validation.

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. S-blocked Status: Blocked on something else such as an RFC or other implementation work. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 20, 2025
@bors
Copy link
Contributor

bors commented Feb 26, 2025

☔ The latest upstream changes (presumably #137036) made this pull request unmergeable. Please resolve the merge conflicts.

The Display implementation for `OsStr` and `Path` on Windows (the WTF-8
version) only handles formatter flags when the entire string is valid
UTF-8. As most paths are valid UTF-8, the common case is formatted like
`str`; however, flags are ignored when they contain an unpaired
surrogate. Implement its Display with the same logic as that of `str`.

Fixes rust-lang#136617 for Windows.
Benchmark shows that performance is unaffected
(`x bench library/alloc --stage 0 --test-args from_utf8_lossy`).
The Display implementation for `OsStr` and `Path` on non-Windows systems
(the byte-oriented version) only handles formatter flags when the entire
string is valid UTF-8. As most paths are valid UTF-8, the common case is
formatted like `str`; however, flags are ignored when they contain an
invalid UTF-8 sequence. Implement its Display with the same logic as
that of `str`.

Additionally, it forwards flags when formatting the last chunk of valid
UTF-8, so it erroneously inserts padding within the string when it has
invalid UTF-8. Fix this.

Fixes rust-lang#136617 for non-Windows systems.
@thaliaarchi thaliaarchi force-pushed the os_str-formatter-flags branch from 6f01d7d to bf5f172 Compare February 26, 2025 19:02
@thaliaarchi
Copy link
Contributor Author

thaliaarchi commented Feb 28, 2025

@joshtriplett I just noticed the new bstr feature (🎉!). Its Display impl does roughly the same thing as this PR, just without reusing Formatter internals, and I think it would be worth unifying them. I aim to do less char counting, so I'm partial to my approach, but it's worth discussing.

@thaliaarchi
Copy link
Contributor Author

Here's an update on where this PR is at. Its scope has the potential to increase a lot.

First of all, my goal is to make the Display impls of OsStr/Wtf8/ByteStr function the same as for str and, when possible, be just as performant. Currently, those are only true when the entire text is valid UTF-8.

I submitted #136662 (Count char width at most once in Formatter::pad) to improve the performance of Display for str and want to bring the others up to parity. This PR is currently designed based on an older version of #136662. The final version now uses both Chars::count and CharIndices::advance_by, which both have efficient overrides (see #137761 for advance_by).

This motivates similar types for OsStr and ByteStr. We already have Wtf8CodePoints, but would also need Wtf8CodePointIndices, CharsLossy, and CharIndicesLossy. Wtf8CodePointIndices is trivial, as it can reuse CharIndices internals. The lossy ones would become part of the bstr API and would need to go through a formal proposal. CharsLossy::advance_by would require a new algorithm, since the chunking strategy used by Chars::advance_by would need adaptations to work with invalid UTF-8. I think taking the time to build these is better than just inlining what they would have been or using an existing, less efficient API.

The addition of ByteStr/ByteString means it would be ideal to unify it with the bytes version of OsStr/OsString. They do exactly the same thing, just with slightly different APIs. Instead of std::sys::os_str::bytes::{Buf, Slice} being wrappers over [u8]/Vec<u8>, they should wrap ByteStr/ByteString. This would mean fleshing out the bstr APIs to support everything they need. The only large methods to be moved are Display and check_public_boundary. As an intermediate step, moving Display would be sufficient for this fix.

If we're going to move the bytes OsStr/OsString, then the WTF-8 version would make sense to also move. I propose it be moved to core::wtf8 and alloc::wtf8. These types share most of their code with str/String, because most algorithms which work with Unicode scalar values also work with Unicode code points without modification. Moving them to core/alloc would make this reuse more seamless. However, this move is not required for this fix.

What I propose is necessary for this PR:

  • Create Wtf8CodePointIndices.
    • Optimize {Wtf8CodePoints, Wtf8CodePointIndices}::advance_by.
  • Create core::bstr::{CharsLossy, CharIndicesLossy}.
    • Optimize Iterator::advance_by for them.
  • Defer to ByteStr as Display in bytes OsStr as Display.

Nice to have:

  • Move std::sys_common::wtf8::{Wtf8, Wtf8Buf, Wtf8CodePoints} to core::wtf8::{Wtf8, CodePoints, CodePointIndices} and alloc::wtf8::Wtf8Buf.
  • Move std::sys::os_str::bytes::{Buf, Slice} functionality into core::bstr::ByteStr and alloc::bstr::ByteString.
    • Move Display and check_public_boundary (the larger methods).
    • Implement the smaller methods on bstr types.

I'll be opening separate PRs for each component, so we can discuss them more focused. However, this PR will cherry pick those commits until they merge.

How does this strategy sound?

@thaliaarchi
Copy link
Contributor Author

Since this will be blocked for a while, I intend to open a tiny PR which only fixes the correctness bug for bytes OsStr/Path when aligning invalid UTF-8 sequences (see #136617):

#![cfg(unix)]
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;
let s = OsStr::from_bytes(b"b\xFFd");
assert_eq!(format!("a{:^10}e", s.display()), "ab�    d     e"); // BUG

@thaliaarchi
Copy link
Contributor Author

@rustbot author

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-blocked Status: Blocked on something else such as an RFC or other implementation work. labels Mar 10, 2025
@ChrisDenton
Copy link
Member

How does this strategy sound?

That sounds good to me. Maybe you could open a tracking issue for this? It seems like it might be good to nominate it for t-libs discussion if only to raise awareness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Path/OsStr incorrectly handle Formatter flags
8 participants