<format>: Assume UTF-8 format strings when execution charset is UTF-8 #1824

Merged

statementreply
Contributor

When execution charset is UTF-8, assume that format strings are encoded in UTF-8, not in the active code page.

This PR only attempts to detect UTF-8.

  • UTF-8 is the modern and most important text encoding.
  • A Windows program that uses a non-Unicode encoding has always been doomed whenever its developers and its users assume different code pages.

Fixes #1820.
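
As a rough illustration of what "detecting UTF-8" can mean at compile time, here is a minimal sketch. It assumes the check is done by inspecting how the compiler encodes a non-ASCII narrow literal; the names `is_utf8_section_sign` and `execution_charset_looks_utf8` are hypothetical and are not taken from this PR or from the STL sources.

```cpp
#include <cstddef>

// Hypothetical helper (not the STL's implementation): reports whether a
// null-terminated narrow literal holds exactly the UTF-8 encoding of
// U+00A7 SECTION SIGN, which is the two bytes C2 A7.
template <std::size_t N>
constexpr bool is_utf8_section_sign(const char (&probe)[N]) noexcept {
    // N counts the null terminator, so a UTF-8 encoded probe has N == 3.
    return N == 3
        && static_cast<unsigned char>(probe[0]) == 0xC2
        && static_cast<unsigned char>(probe[1]) == 0xA7;
}

// The compiler encodes "\u00A7" in the execution charset, so this constant is
// true only when that charset produces the UTF-8 bytes for the character.
inline constexpr bool execution_charset_looks_utf8 = is_utf8_section_sign("\u00A7");
```

A single probe character cannot prove that the whole charset is UTF-8, but a legacy code page would have to map U+00A7 to the same two-byte sequence for the check to give a false positive.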

@statementreply statementreply requested a review from a team as a code owner April 11, 2021 09:27
@statementreply statementreply changed the title from "<format>: Assume UTF-8 format strings when encoding charset is UTF-8" to "<format>: Assume UTF-8 format strings when execution charset is UTF-8" Apr 11, 2021
(void) format("{:\x9f\x8f\x88<10}"sv, 42); // Bad fill character encoding: missing lead byte before \x9f
assert(false);
} catch (const format_error&) {
}
Contributor

We could check the error message here

Member

eagh, the value of checking the error messages is very low.

@StephanTLavavej StephanTLavavej added the format C++20/23 format label Apr 11, 2021
@StephanTLavavej StephanTLavavej added the bug Something isn't working label Apr 11, 2021
@statementreply statementreply marked this pull request as draft April 12, 2021 04:17
stl/inc/format Outdated
#pragma warning(pop)
}();

_NODISCARD inline int _Utf8_code_units_in_next_character(const char* const _First, const char* const _Last) noexcept {
Contributor

From what I see, nothing in this function prevents it from being constexpr.

However, I believe it only makes sense at runtime. So should we add a comment that this is intentionally not constexpr?

Contributor Author

Made it constexpr
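
For readers following the constexpr discussion, here is a minimal sketch of a constexpr function that counts the code units in the next UTF-8 character. It is an assumption-laden illustration, not the code from this PR: the name `utf8_code_units_in_next_character`, the `-1` error convention, and the level of validation are all hypothetical, and the sketch does not reject overlong three- or four-byte forms or encoded surrogates.

```cpp
// Hypothetical sketch, not the STL's _Utf8_code_units_in_next_character.
// Returns the number of code units (bytes) in the UTF-8 sequence that starts
// at first, or -1 if the range is empty, the lead byte is invalid, the
// sequence is truncated, or a trailing byte is not a continuation byte.
constexpr int utf8_code_units_in_next_character(const char* first, const char* last) noexcept {
    if (first == last) {
        return -1;
    }

    const auto lead = static_cast<unsigned char>(*first);
    int length = 0;
    if (lead < 0x80) {
        length = 1; // ASCII
    } else if (lead < 0xC2) {
        return -1; // stray continuation byte (0x80-0xBF) or overlong lead (0xC0, 0xC1)
    } else if (lead < 0xE0) {
        length = 2;
    } else if (lead < 0xF0) {
        length = 3;
    } else if (lead < 0xF5) {
        length = 4;
    } else {
        return -1; // 0xF5-0xFF never appear in UTF-8
    }

    if (last - first < length) {
        return -1; // truncated sequence
    }

    for (int i = 1; i < length; ++i) {
        const auto trail = static_cast<unsigned char>(first[i]);
        if ((trail & 0xC0) != 0x80) {
            return -1; // expected a continuation byte of the form 10xxxxxx
        }
    }
    return length;
}

// Because the function is constexpr, it can be checked at compile time.
constexpr char euro_sign[] = "\xE2\x82\xAC"; // U+20AC
static_assert(utf8_code_units_in_next_character(euro_sign, euro_sign + 3) == 3);

// The fill character from the test above starts with a continuation byte.
constexpr char stray_trail[] = "\x9f\x8f\x88";
static_assert(utf8_code_units_in_next_character(stray_trail, stray_trail + 3) == -1);
```

Making such a function constexpr costs nothing at runtime and allows compile-time checks like the `static_assert` lines above.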

Contributor

@miscco miscco left a comment

From my inexperienced point of view, this looks good. I would like to clean up _Code_units_in_next_character so that it only defers to subfunctions, but this is purely a style thing which can / should be disregarded.

@statementreply statementreply marked this pull request as ready for review April 12, 2021 17:02
Member

@StephanTLavavej StephanTLavavej left a comment

Looks good - I'll push changes for my minor comments here, and a couple more to mitigate merge conflicts with the commits I recently pushed to merging_format.

@barcharcraz barcharcraz merged commit b81d9eb into microsoft:feature/format Apr 13, 2021
@StephanTLavavej StephanTLavavej mentioned this pull request Apr 13, 2021