Clang 14 rejects certain Unicode characters in identifiers that are accepted by Clang 13 and the C++ Standard #54732
Comments
@llvm/issue-subscribers-clang-frontend |
This is a deliberate change. I do wonder if we could make that clearer in the diagnostic though |
Thanks for the quick reply! That's unfortunate, these subscripts were really handy to use the mathematical notation from papers, formulas and pseudocode in the C++ implementation. Is there an option to get back the old behavior? (I didn't find one in https://clang.llvm.org/docs/ClangCommandLineReference.html but I might be searching for the wrong keywords.) |
For some inexplicable reason ∂, 𝜕 are now not allowed. Dear Steve Downey, Zach Laine, Tom Honermann, Peter Bindels, and Jens Maurer why do you hate derivatives so much? Dear Clang, please, don't adopt P1949R7 because it is ridiculous, unnecessary, and breaks existing code. |
I posted a comment on the corresponding review page: https://reviews.llvm.org/D104975#3486313 |
I agree that this behavior is intentional and some amount of broken code is expected as a result. I'm sorry you've been caught by that!
There is not. We could perhaps elect to not implement this paper in older language modes (so it only happens in -std=c++2b and later) and we could elect to add a feature flag so you can opt into a non-conforming mode in C++23 and later. However, such a change is somewhat risky and something I'd like to avoid unless we see significant code breakage in the wild (system headers, major third-party library headers, a ton of individual user projects, etc). Previously, neither the C nor the C++ committee had a principled reason for what was or wasn't a valid character in an identifier when it came to Unicode characters. This caused real problems (including a high-score CVE in the same space) and so the committees both decided to defer to the Unicode consortium as to what is and isn't a valid character for an identifier (with one exception for That said, the fact that this code was broken and it's causing you pain is helpful for the standards bodies to understand the impact of the changes. We'll make sure this information is fed back to the standards bodies (it's already generated some discussion from the original report). And if it starts to look like more people are getting caught by this, we'll certainly consider what changes we can make to ease the burdens.
While you might be frustrated by the situation, please do not disparage the hard work of others as being ridiculous or unnecessary, and please follow our Code of Conduct: https://llvm.org/docs/CodeOfConduct.html |
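The model described above, where the language defers to Unicode properties to decide which characters may start or continue an identifier, can be sketched roughly as follows (C++17). This is only an illustration: `is_identifier`, `ascii_xid_start`, and `ascii_xid_continue` are hypothetical names, and the two predicates are ASCII-only stand-ins for the real XID_Start and XID_Continue tables, which a compiler would generate from the Unicode Character Database.

```cpp
#include <string_view>

// ASCII-only stand-in for Unicode's XID_Start property (illustration only;
// the real property covers far more code points).
constexpr bool ascii_xid_start(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
}

// ASCII-only stand-in for Unicode's XID_Continue property.
constexpr bool ascii_xid_continue(char32_t c) {
    return ascii_xid_start(c) || (c >= U'0' && c <= U'9') || c == U'_';
}

// Checks the general shape of an identifier: one "start" character followed
// by zero or more "continue" characters. The predicates are parameters so
// that a real table lookup could be plugged in.
template <typename StartPred, typename ContinuePred>
constexpr bool is_identifier(std::u32string_view text,
                             StartPred is_start,
                             ContinuePred is_continue) {
    if (text.empty()) {
        return false;
    }
    // C++ also accepts '_' as the first character of an identifier.
    if (text.front() != U'_' && !is_start(text.front())) {
        return false;
    }
    for (char32_t c : text.substr(1)) {
        if (!is_continue(c)) {
            return false;
        }
    }
    return true;
}
```

With the ASCII stand-ins, `is_identifier(U"_tmp1", ascii_xid_start, ascii_xid_continue)` holds, while a name containing ₊ (U+208A) is rejected simply because the stand-in tables contain no non-ASCII characters at all.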
I insist that excluding math symbols like the partial derivative is, in my opinion, unnecessary and strange. I would argue that supporting 𝜕 makes more sense than supporting emoji in identifiers. I am really sorry and regret that the hard work of others was directed at something unnecessary and arbitrary, but the hard work of others can hardly be a reason to change my opinion. GCC supports 𝜕, and I hope that's not going to change. |
I guess the reason why 𝜕 was excluded is somebody confused a mathematical operator, which is a function that acts on other functions or on some structured objects, with a programming language operator, which is professional lingo for a mathematical operation, or for some other language-specific weird operation. 𝜕 is a mathematical operator, for any purpose of a programmer it's just a letter. However, the existence of an explanation doesn't make this decision good, logical, sane, or necessary. 𝜕 is much more useful than emoji. If I implement an algorithm from a paper, and the paper says 𝜕Ω, this is what would be absolutely natural to use in the code. There is neither benefit nor sense in removing 𝜕. |
I'd like to add some notes. Generally speaking, a mathematical operator is an operator in the sense of programming languages, which derive the rules from math systems. The design to make operators always in specific syntactic categories is language-specific. For example, operators are punctuators (rather than identifiers) in C and C++, while many Lisp dialects just treat operators the same as elements in the head subform of a function application (i.e. operators = functions being applied to) and such operators can be named by identifiers. (To be accurate, punctuators here are the "operator-or-punctuator" category in C++; operators and punctuators will be in different categories later.) I don't mean to change C/C++, but I'd argue the latter ("the Lisp style") is better than the former ("the ALGOL style") in contexts of language-agnostic meaning of the notion "operator". (I use "ALGOL" to suggest that whether an operator can be defined by the user is irrelevant here.) Traditionally, math systems do not distinguish syntactic and semantic forms of elements like operators, as they are always self-evaluating. That is, they will not change to anything else during the deduction in the system, unless combined with something else (the operands). This is OK in traditional uses (just interested in getting the results of some computations), but formally problematic when you want to describe the underlying system in more detailed ways (say, operational semantics). To describe the precise behavior of the system, you have to differentiate whether an element (not necessarily a self-evaluating one) is evaluated (more formally, reduced to the normal form) or not, and mixing different contextual meanings of such elements would be a mess. Languages like C and C++ are not totally formally defined. However, they still imply rules like deduction systems in their formal grammars.
Specifically, C and C++ have the notion of phases of translation, so lexically identical elements (even "self-evaluating" during the translation) can actually be different: identifiers of preprocessing tokens are not the same as identifiers of tokens. This example is important because it is exactly what is at stake in the meaning of "identifiers" handled here. I don't think it is the Unicode consortium's work to clarify the mapping from the meaning in specific programming languages to the definition in the Unicode specification. In particular, not all programming languages need the distinction. (C and C++ need it, because some of the identifiers of preprocessing tokens would be converted to keywords of tokens instead of identifiers of tokens.) So, it would hardly come true "to establish conventions that will be followed by most/all programming languages" without further efforts with more careful analyses (which are closer to "the Lisp style" than to "the ALGOL style"). The syntactic element However, in some extended calculi, |
It is not. What we call a math operator in a programming language, in real math, is called an operation. Never in my life I heard anyone calling addition a "plus operator" in a calculus class. Wikipedia even has two separate articles for math operators and for programming language operators.
Yes, I agree. I think that P1949 helps nobody.
Not necessarily. It's common to denote a boundary of Ω as 𝜕Ω. With some effort, I guess you can develop a theory where 𝜕 in this case will be an operator, but in most papers it's just a syntax sugar to denote a boundary. |
@AaronBallman, thank you for the detailed reply. While I agree that serious issues and security vulnerabilities should be addressed and fixed retroactively for older standards, I feel this change goes way beyond that, for two reasons:
I'm in favor of ironing out some of the inconsistencies in the ranges of allowed characters, and addressing things like normalization may be important, but in my opinion, by suddenly disallowing perfectly reasonable characters, the current implementation of P1949 creates many more problems than it actually solves. Aside from the issue of backwards compatibility, I'd like to motivate the use of certain Unicode characters: I'd argue that it is perfectly reasonable to use variable names such as As an example from the code I'm currently working on:
// Compute forward-backward envelope
φₖ₊₁ = ψₖ₊₁ + 1 / (2 * γₖ₊₁) * pₖ₊₁ᵀpₖ₊₁ + grad_ψₖ₊₁ᵀpₖ₊₁;
With limited Unicode support, I have to write something like this:
// Compute forward-backward envelope
φ_k_plus_1 = ψ_k_plus_1 + 1 / (2 * γ_k_plus_1) * p_k_plus_1ᵀp_k_plus_1
           + grad_ψ_k_plus_1ᵀp_k_plus_1;
What could previously be effortlessly parsed and easily matched to the formula in the paper has now become an unreadable mess of letters and underscores. In the first expression, the variable names are distinct and concise, and can be recognized at a glance. As a result, you can easily focus on the operations that are actually carried out on them. Thanks to the subscripts, you automatically focus on the actual names rather than on the In the second expression, variable names share the Searching the code base of my current project, I found over a thousand matches for the subscripts 0 through 9. I noticed that P1949R7 states (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1949r7.html#what-will-this-proposal-not-change):
Does this leave room for Clang and other compilers to still allow these characters even though they might be outside of XID_*?
I appreciate that, thank you. |
@tttapa I totally agree with everything you said. Btw, GCC has also implemented P1949 in version 12, but it still allows math symbols like Here is a demonstration with a snapshot of GCC from the release 12 branch:
#include <iostream>
int main(int argc, char* argv[]) {
auto 𝜕Ω = 4;
auto φₖ₊₁ = 5;
std::cout << "𝜕Ω = " << 𝜕Ω << std::endl;
std::cout << "φₖ₊₁ = " << φₖ₊₁ << std::endl;
}
Btw, before you, I didn't know there is Unicode for subscripts and superscripts. Now I am totally going to use it. |
@llvm/issue-subscribers-c-2b |
Yes, we have the wiggle room to do this (for example, with a feature flag to let users opt out of the P1949 behavior), but there's some space between "can" and "should" we need to be careful to consider. That feature flag makes it far more likely you'll run into portability issues with your code, but if it's something you explicitly opt into, then that's your decision to make. I'm coming around to the idea of giving a feature flag for this -- if for no other reason, than because there was no deprecation period before this change broke code for some folks. Giving people an ability to upgrade to newer Clang versions while transitioning their code base to valid identifiers (or whatever other portability measure they want to take, if any) has value that may be worth the maintenance costs. I'm still not certain of the shape of the flag though -- does it allow any Unicode character no matter how dangerous it is in an identifier, does it allow only math symbols, something else? We'd have to figure out what the right behavior is, which sort of brings us right back to the challenge of "what's the principle behind whether a character is in or out?". As a strawman, I think we could say "allows math symbols too", but I'd definitely want input from @cor3ntin @tahonermann when deciding whether to add such a flag and what its behavior should be. (Note, one of the difficulties I hope we can avoid with the flag is compile time performance impact given that this involves lexing every character from a translation unit.) |
I would be strongly against that, as it seems perfectly reasonable for C, C++, or another C-derived language to want to support math symbols as operators or as syntax elements of some sort in the future, and so it would be a pretty big grab. And as you hinted, for Clang to decide what else to allow would take a lot of resources for an unsatisfactory result, as any effort not involving actual Unicode experts would just be extremely opinionated. Identifiers are elements usable in words. Math symbols are not. But what about electronic symbols, engineering symbols, music notation, etc? All of these can be equally justified by "someone might use them" |
Agreed, which is why I've been resistant to this as much as I have been. However:
This point is still valid -- there's no deprecation period and we're adding constraints that are impacting our users. The constraints added are mildly related to security (see trojan source as an example), so on the one hand, no deprecation period is understandable. This happened for other things in C and C++ as well (implicit function declarations and gets() both immediately come to mind). On the other hand, it's not so strongly related to security that we shouldn't consider a transition period as we have for other hard breaking changes.
I think the motivation here is less "I like this character" and more "I want to upgrade to the latest Clang but can't because none of my code compiles.", which is reasonably strong motivation depending on how many folks are in that situation. We've gotten reports from two different users against Clang 14, which suggests this is causing more problems than anticipated. |
Nope, they are not. Identifiers are lexical tokens that name objects, period. Other requirements can be either technical (terminal in this system doesn't support Unicode), or personal preferences. "Elements usable in words" is a personal preference. |
This is infinitesimally small compared to the other phases of compilation. That holds for any language, because building the AST takes much more time, and it holds tenfold for C++. |
Given the lack of a deprecation period and the number of reports we've already seen, I am leaning in the same direction as Aaron. I am not in favor of the approach gcc took of allowing previously allowed (and not disallowed) characters to be used in non-pedantic modes by default, but I think an option that achieves the same result would be fine. This would ensure opt-in backward compatibility without having to make difficult (probably poor) choices regarding character allowances and ensure Clang has the ability to match gcc behavior. I don't have a great suggestion for a new option name. Perhaps |
I think that's fairly reasonable for an option name. Do you envision it going back to the old Clang behavior pre-P1949, or do you envision it being P1949 + additional allowances for only some characters (or class of characters)? |
@AaronBallman, I envision it matching what gcc does now; that it allows the union of pre-P1949 identifiers and P1949 identifiers. I don't think it should behave as though that set includes |
I don't think this is very plausible, given the committee's reluctance to introduce keywords. However, if they did decide to assign special meanings to some symbols, it would be similar to adding new keywords like
True, but I could easily turn this around: there is no strong motivation for suddenly disallowing harmless characters such as mathematical symbols and subscripts, after over a decade of allowing them. Don't get me wrong, I believe that problematic control characters should be disallowed, but I don't think this should necessarily mean that unrelated characters have to be removed as well, especially if this is a breaking change.
I strongly disagree. And this doesn't match the definition for Unicode's XID_Start and XID_Continue either: they contain 131974 and 135072 code points respectively, they are certainly not all usable in words, there are punctuation signs, characters from the phonetic alphabet, iteration marks, Arabic mathematical characters, etc. Even though they might not be usable in words, symbols like 𝜕 are useful as identifiers. E.g. one could argue that locally defining a function 𝛛 eases notation and is a sensible thing to do in some mathematical contexts:
const auto 𝛛 = [](auto expression, auto variable) {
return partial_derivative(expression, variable);
};
I am not asking to add arbitrary characters to the allowed set, I'm solely requesting not to break code by suddenly removing harmless characters from the set. Regarding security: it should be noted that P1949 does not solve Trojan source problems, and does not address homoglyph attacks. Clang 14 is still vulnerable to CVE-2021-42574, because control characters are still allowed in other contexts like string literals. E.g. the original example from https://www.openwall.com/lists/oss-security/2021/11/01/1 compiles without warnings: https://godbolt.org/z/njvoEGd38
#include <string>
#include <iostream>
int main() {
std::string access_level = "user";
if (access_level != "user // Check if admin ") {
std::cout << "not a user\n";
}
}
|
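The point about control characters in string literals can be made concrete with a small scanner. The sketch below (C++17) is not how Clang or any other compiler handles this; `contains_bidi_control` is a hypothetical helper that flags the bidirectional formatting characters U+202A..U+202E and U+2066..U+2069, the characters at the heart of the Trojan Source technique, anywhere in a UTF-8 buffer, including inside comments and string literals. It assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Returns true if the UTF-8 text contains a bidirectional formatting
// character: U+202A..U+202E (encoded E2 80 AA..AE) or
// U+2066..U+2069 (encoded E2 81 A6..A9).
inline bool contains_bidi_control(std::string_view utf8) {
    for (std::size_t i = 0; i + 2 < utf8.size(); ++i) {
        const auto b0 = static_cast<unsigned char>(utf8[i]);
        const auto b1 = static_cast<unsigned char>(utf8[i + 1]);
        const auto b2 = static_cast<unsigned char>(utf8[i + 2]);
        if (b0 != 0xE2) {
            continue;  // every code point of interest starts with byte E2
        }
        if (b1 == 0x80 && b2 >= 0xAA && b2 <= 0xAE) {
            return true;  // U+202A..U+202E
        }
        if (b1 == 0x81 && b2 >= 0xA6 && b2 <= 0xA9) {
            return true;  // U+2066..U+2069
        }
    }
    return false;
}
```

A check like this could run as a pre-commit or review step, independent of which identifier characters the compiler accepts.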
Hi @tttapa. I don't think this is the right forum for some of the issues that you are raising. This issue is best used to focus on mitigating the impact of the P1949 changes. Concerns about what characters should or should not be allowed in identifiers would be best directed to WG21. I recommend you share your concerns with SG16. WG21 is now following Unicode guidance (pre-P1949, the character allowances included ranges of unassigned code points; that isn't a good strategy). If Unicode guidance changes (and it may as a result of the impact P1949 has had; there is a working group meeting regularly), then I'm sure WG21 will follow along. You are correct regarding the Trojan Source concerns and UAX #36. There is on-going work to address those concerns though I don't expect to have actionable guidance for quite some time. |
Any update on how users should deal with this in short term? Clang 15 release is on the way and I could not find any patch introducing a flag to either allow pre-P1949 identifiers or equalize the allowed characters with those in gcc. |
Unfortunately, nobody proposed a patch for Clang 15 introducing the option discussed. Further, WG14 adopted the same restrictions from https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2836.pdf at our Feb 2022 meeting. It's worth noting that https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2932.htm was discussed at our May 2022 meeting and there was consensus against adoption but some weak sentiment to consider it for a TS. |
This sort of hyperbole is not productive or appreciated, please re-familiarize yourself with our Code of Conduct. |
Thanks for this feedback! If there was a tool, such as a clang-tidy check, which would allow you to automatically modify problematic identifiers, would that automation be sufficient for you to migrate your code base? (I'm imagining something that would do simple renames, like replacing the problematic characters with a placeholder such as
The other issue is when removing the flag, how many people's build systems break as a result. (Flag deprecation is tricky but certainly not impossible.)
That's why I still am okay exploring the idea of adding a flag. Removal without a deprecation period is rather harsh and we have plenty of precedent for flags to allow people to migrate. However, our experience with quite a few of those flags is that they're detrimental in the long run unless we are aggressive about removing the flag (implicit int and implicit function declarations both come to mind as recent examples). |
It's still a mighty stretch to compare the breakage of something that was more or less accidentally working in a field as messy as human text (due to intentional clean-ups based on the work of bodies charged with producing guidance on that very subject), with the removal of a central & clearly delineated feature such as |
It's not about intelligence or straightforwardness, so I'm very sorry if I've given you that impression! It's about "Be careful in the words that you choose and be kind to others" and "Be respectful" specifically. Calling an open source tool "low-quality software" or saying we "don't care your code doesn't work anymore" when you disagree with a behavior mandated by the standards comes across as denigrating a lot of people's hard work, including the people interacting with you on this thread in efforts to find a positive way forward. |
Thanks for the response Aaron.
That is basically what I did for some of the code bases to explore migration paths, but I am not a huge fan of this, because from my perspective it sometimes hurts readability. Ω₀ and Ω_0 just look different and parse differently. I know, minor issues. I hope I am not giving the impression that migrating such code bases manually is impossible. While I absolutely like providing automated migration paths via clang-tidy, I think it is not the right tool for this task.
Let me quickly comment on this, too, because I think this is wrong. From my perspective, it is not a stretch to compare this to removing goto statements. Developing clear code that is easy to understand should be a high priority for any project, because it helps in the long term by making the code more maintainable. In the scientific computing community we have, over the last decade, moved towards using Unicode in identifiers to write code that is as close as possible to the theoretical formulas, and it seems to help developers and students better understand what is going on in less time, so it has become a good practice. Luckily, larger C++ projects seem not to have caught up with this trend yet, so I think the impact (globally speaking, not on my code bases) is not as severe as I initially expected. Goto, on the other hand, is just bad practice in 99% of the cases where it is deployed. There are legitimate use cases, but even in such cases we can always rewrite the control flow to eliminate all gotos with higher-level constructs. Still, I believe this breakage is more on the accidental side and I am looking forward to use cases for which we want to block the removed characters, as mentioned previously in the thread. |
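The kind of simple rename discussed above (Ω₀ becoming Ω_0) can be sketched as a plain text transformation. The code below is a minimal illustration (C++17), not a clang-tidy check: `replace_subscript_digits` is a hypothetical helper that only handles the subscript digits U+2080..U+2089, assumes valid UTF-8 input, and does nothing to detect collisions with names that already exist.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Replaces each run of subscript digits U+2080..U+2089 (UTF-8: E2 82 80..89)
// with '_' followed by the corresponding ASCII digits, so "Ω₀" becomes "Ω_0"
// and "x₁₂" becomes "x_12". All other bytes are copied unchanged.
inline std::string replace_subscript_digits(std::string_view utf8) {
    std::string out;
    out.reserve(utf8.size());
    bool previous_was_subscript = false;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const bool is_subscript_digit =
            i + 2 < utf8.size() &&
            static_cast<unsigned char>(utf8[i]) == 0xE2 &&
            static_cast<unsigned char>(utf8[i + 1]) == 0x82 &&
            static_cast<unsigned char>(utf8[i + 2]) >= 0x80 &&
            static_cast<unsigned char>(utf8[i + 2]) <= 0x89;
        if (is_subscript_digit) {
            if (!previous_was_subscript) {
                out += '_';  // one underscore per run of subscript digits
            }
            const int digit = static_cast<unsigned char>(utf8[i + 2]) - 0x80;
            out += static_cast<char>('0' + digit);
            previous_was_subscript = true;
            i += 3;
        } else {
            out += utf8[i];
            previous_was_subscript = false;
            ++i;
        }
    }
    return out;
}
```

Because this works on raw text, it would also rewrite subscripts inside comments and string literals; a real migration tool would need to operate on tokens and check that each new name is unused.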
Thanks! I was kind of thinking the same thing, but confirmation is helpful. :-) My current thinking on this is that we don't want to expose a feature flag like |
For me this seems to be a great compromise. Thanks for taking the time! |
I spent some time reading back through this discussion and would like to correct a possible misconception that readers might have come away with. The Unicode standard, via UAX#31, specifies, and will continue to specify, three models of identifier syntax for language designers to follow. One of these (hashtag identifiers) is not relevant for C++. The other two are. The change made for C23 and C++23 was to migrate C++ from immutable identifiers to default identifiers. This change was made solely by the C and C++ standardization committees and not at the recommendation of the Unicode Consortium; as stated earlier, the Unicode standard will continue to specify multiple models of identifier syntax for language designers to use at their discretion. The possible misconception that I want to correct is that the change was the result of a Unicode Consortium recommendation; it wasn't. That all being said, the Unicode maintainers are aware of the issues we are encountering in migrating from immutable identifier syntax to default identifier syntax and will be reviewing the default identifier syntax character allowances for a future Unicode standard. |
@AaronBallman what you suggest sounds okay to me. I'm thinking about whether we want to allow people to disable the error in C++23/C23. I think we do because discouraging upgrade over this feature sounds like a net negative. But I would like us to have a long term plan to make sure there is no confusion that this is a temporary solution and not an allowance to deviate from the standard ad aeternam. |
I think that a long-term plan at least partially depends on whether UAX#31 expands the default identifiers tables sufficiently for the folks running into problems (or WG14/WG21 change the identifier set back to immutable, etc). I suspect the long-term plan will be to eventually speculatively turn the diagnostic back into an error-only diagnostic either after UAX#31 has been modified or after some number of Clang releases (whichever comes first), and see how much pain that causes folks in practice with pre-release testing. If there's still significant pain, we'd revert back to warning-defaults-to-error for a while longer. |
I've posted https://reviews.llvm.org/D132877 as the review for implementing what I proposed above. If it lands and there aren't concerns about backporting, I will try to get this backported to Clang 15. No promises about it making Clang 15 though, as the release is set to go out next Monday (so there are no more release candidates planned). |
I understand clang wanting to follow (draft) standards, but the rationale behind it is completely wrong; Unicode defining code points as being "identifier characters" or "mathematical characters" should bear no impact on any programming language; it's the programming language that is deciding what is a valid identifier character. As function names are identifiers, having so-called "mathematical characters" in them is not only sane, it's the better thing to do. Luckily other programming languages still understand this perfectly fine. Also breaking existing code without a true necessity is always a bad idea. I think clang should pull some weight and stand for their user base. |
Does Clang 15 have any option to accept invalid characters? |
Not currently as of 15.0.4.
I'll leave it to @tahonermann and @cor3ntin to correct me if I'm wrong, but |
Yes, there is probably a bug in the width estimation code, I'll look into it. |
I think I'm a bit confused, @rayfalling.
So you're getting an error about use of that character in an identifier.
But this is about a comment and not an identifier. Can you attach a reduced test case that reproduces the issue for you so I can be sure we're considering the same situation? |
@AaronBallman
Our build system will use clang visitor to analyze the reflection macro In addition, the macro expansion should be empty in minimal code. |
I think the case @rayfalling is reporting is a lexing defect. Consider the following example (https://godbolt.org/z/j71Kjdr1P):
Clang issues the following diagnostics:
According to [lex.pptoken], the U+FF0C character should become its own preprocessing-token since it doesn't combine with any of the other token kinds. The initializers for |
Thank you for the example @rayfalling and thank you for the analysis @tahonermann! Tom, doesn't Clang's behavior match this: https://eel.is/c++draft/lex.pptoken#2.sentence-5 ? By my reading, <U+FF0C> runs into: and it qualifies as a single non-whitespace character that does not lexically match the other preprocessing token categories. Then we skip "If a U+0027 APOSTROPHE or a U+0022 QUOTATION MARK character matches the last category, the behavior is undefined." as it does not apply, bringing us to: "If any character not in the basic character set matches the last category, the program is ill-formed." <U+FF0C> is not in the basic character set, so the program is ill-formed. So to me, I think it's a case where the diagnostic is misleading and low-quality, but is actually correct. However, you have far more expertise on how to interpret this part of the standard. What am I misunderstanding? |
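The reading of [lex.pptoken] given above can be written down as a small decision function. This is a sketch of the quoted rule only, not of Clang's lexer: `classify_stray_character` and `in_basic_character_set` are hypothetical names, the function assumes its argument has already failed to match every other preprocessing-token category, and the basic-character-set test is an approximation based on my recollection of the C++23 draft (printable ASCII without `$`, `@`, and the backtick, plus tab, newline, vertical tab, and form feed).

```cpp
enum class StrayCharOutcome { Accepted, Undefined, IllFormed };

// Approximation of the C++23 basic character set (an assumption, see above).
constexpr bool in_basic_character_set(char32_t c) {
    if (c == U'\t' || c == U'\n' || c == U'\v' || c == U'\f') {
        return true;
    }
    if (c < 0x20 || c > 0x7E) {
        return false;
    }
    // 0x24 '$', 0x40 '@', and 0x60 '`' are excluded in this approximation.
    return c != 0x24 && c != 0x40 && c != 0x60;
}

// Applies the two quoted sentences, in order, to a character that matched the
// "single non-whitespace character" category of preprocessing tokens.
constexpr StrayCharOutcome classify_stray_character(char32_t c) {
    if (c == U'\'' || c == U'"') {
        return StrayCharOutcome::Undefined;  // stray apostrophe or quotation mark
    }
    if (!in_basic_character_set(c)) {
        return StrayCharOutcome::IllFormed;  // e.g. U+FF0C FULLWIDTH COMMA
    }
    return StrayCharOutcome::Accepted;
}
```

Under this reading, `classify_stray_character(U'\uFF0C')` yields `IllFormed`, which matches the conclusion that the program is ill-formed even if the wording of the diagnostic is misleading.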
@AaronBallman This matches my understanding. |
Thanks @AaronBallman and @cor3ntin, it looks like you are right. A couple of interesting observations:
|
It looks like Clang doesn't change its behavior when preprocessing full-width characters, although this is not consistent with MSVC's behavior. I will change our code and build system to match Clang's preprocessing. Thanks for your answers. |
@rayfalling I realized I accidentally edited your reply instead of quoting it. I blame my phone. Very sorry about that. I've looked into that caret bug you reported and Clang actually behaves correctly, but it will render incorrectly if the terminal you are using is not a proper column terminal. |
Clang now supports additional mathematical symbols in identifiers, as an extension for backward portability. Thanks for reporting this issue! |
Implement the proposed UAX Profile "Mathematical notation profile for default identifiers". This implements a not-yet approved Unicode for a vetted UAX31 identifier profile https://www.unicode.org/L2/L2022/22230-math-profile.pdf This change mitigates the reported disruption caused by the implementation of UAX31 in C++ and C2x, as these mathematical symbols are commonly used in the scientific community. Fixes llvm#54732 Reviewed By: tahonermann, #clang-language-wg Differential Revision: https://reviews.llvm.org/D137051
Any plan to implement the Emoji profile as well? See pg. 23 in https://www.unicode.org/L2/L2022/22229-prop-changes.pdf It's as much of a standard profile as the Math profile, only the Math profile has a separate PDF (https://www.unicode.org/L2/L2022/22230-math-profile.pdf ) explaining the rationale. Also, from the current version of UAX31 (https://www.unicode.org/reports/tr31/#Standard_Profiles):
Also see this document from the same author: https://www.unicode.org/L2/L2022/22102-non-xid-ident-usage.pdf I might also mention the following considerations in favor of making room for emojis somehow:
Thanks for considering! |
Some Unicode characters like ₊ (U+208A) and other subscripts are rejected by Clang 14. These characters are in the allowed ranges for identifiers in the
[lex.name]
section of the C++ Standard. Recent versions of GCC and older versions of Clang do not raise any errors. For example:
Is this a deliberate change or a regression bug from Clang 13 to 14?