-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[API Proposal] Add cultureName constructors to GeneratedRegex #59492
Comments
Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions Issue DetailsIf a developer specifies RegexOptions.IgnoreCase to a regular expression, the implementation(s) then need to do casing operations to determine whether things match. That's implemented today in two parts:
If the developer specifies RegexOptions.CultureInvariant, that's fine, as both (1) and (2) will be done with CultureInfo.InvariantCulture. But if CultureInvariant isn't specified, it's possible that the CultureInfo.CurrentCulture at the time of (1) will differ from the time of (2), in which case the lowercasing can be performed with different cultures, leading to problems. Note that this really only affects the instance methods on Regex. The static methods on Regex access a cache of Regex instances, and the culture used to create the Regex instances is included in the key to that cache. Also note that the docs talk about the current culture being used, but they don't specify current "when". We have a few options to fix this:
We should investigate whether the discrepancy actually matters, e.g. in practice, do all regexes where this could matter:
My gut is we should do (2). If we do (2), there are then additional complications for the regex source generator. All the computation that normally happens in the Regex constructor happens at build time with the regex source generator, which means CultureInfo.CurrentCulture is determined at the time of build, whatever the culture is for the process doing the build / generating the source code. We have a few options there:
I'm tempted to say we should do (4), and if you use a ctor that doesn't take a culture, fall back to (3).
|
With all of the IgnoreCase work that happened during this release, some of the behavior here changed, so some of the statements in the original description of the issue are no longer accurate. Here is the current state of things:
I think the only action item here we may want to consider (and for which it is probably already late in case of 7.0) would be to add an option to the RegexGenerator attribute so it could optionally also take a particular culture to be used for computing the case equivalences. I'm punting this to Future for now, as having this extra option would be considered new API, and we are currently not taking any more new-APIs for 7.0. |
I think we need to decide now, in .NET 7, whether the source generator defaults to invariant. Additional API could be used to override that. Having the generated code be based on whatever ambient culture the roslyn process is running as is unfortunate. |
(And if we need to get in an additional ctor to the attribute in 7 in order to ensure a good story, we can do that.) |
TBH, the case mappings we use are basically the same across all cultures except for a handful of characters (all related to |
i is a fairly important letter. :-) My main concern, though, is determinism, reproducibility, and expectations. If you compile the exact same C# project on two different machines and they produce binaries with different semantics without having opted in to that, that seems concerning. |
that's a fair point. I'm fine with defaulting to |
To be clear, I'm not dictating we must change it... I just want us to have the conversation in .NET 7 and explicitly decide what to do rather than punting, at which point the decision is made for us. |
Document it on docs.microsoft.com for Regular Expressions RegexOptions flags as an FYI for the IgnoreCase and CultureInvariant flags. Done. In general, I think that table needs some sprucing up. Sometimes its better to focus efforts on documentation than writing more code. If people can read the docs, they can suggest enhancements. At present, there is the defensible argument that it has always worked this way (at worst). It doesn't sound like there are any explicit user stories for this. That said, it would be nice if there was a nuget package that contained StringTestData and contained some example double wide characters that would cause a string reader that is using per-byte logic to fail on a double wide character across one encoding but succeed on another. I have had use for such test data construction in the past, and it surprises me there isn't really a great library for that anywhere, especially as an English-only speaker. I know ParseCombiningCharacters, etc. exists but the API is also very low-level and most engineers I ask about this are unaware how to use that API, even. - My own need for this was unrelated to regular expressions but rather handling Oracle long SQL text concatenations that would otherwise fail at 4,000 chars - Hope I'm making sense. |
That the culture employed by the regex depends on the culture the Roslyn compiler server process was running with when the user's assembly was compiled in whatever build lab or cloud build environment was used? |
Oof. I did not consider that. Defaulting to InvariantCulture makes sense, but there should be a way to procedurally embed a different culture. I had discussed in a separate issue a way to pass serializable "Cargo" via C# 10 generic attributes, but I don't think generic attributes are planned as part of .NET 7? I heard they were a preview and pushed to C# 11 and whatever .NET major version that coincides with. Either way, you could pass a culture Func via some higher order means, and call it once per generator. The .NET Framework Design Guidelines may not have strong opinions on this, but it might be better than coupling the generator to the user's build machine. EDIT: this was my hacky idea: #4525 (comment) |
As opposed to declaratively that the additional overload accepting a culture would support? Why? What's the scenario for that? |
I was thinking it would be more discoverable to the API end-user. If the RegexGeneratorAttribute has a parameter to control that, they would be more aware of the potential to embed the build machine's culture info at time of generation. But, when you ask the question the way you did, I do feel a bit silly proposing that. So, to sum it up, I saw this issue as two issues:
|
@stephentoub I re-read this whole thread and I can see why you didn't follow my logic. My poignant feedback is:
I don't like passing in a string for the culture name. You can customize cultures. Just my opinion, but if this is the API for the next 10 years, it would be ideal if the Regex constructor took in an actual I suppose the problem with passing in a Hopefully this is a bit more constructive, albeit you're much smarter than me, so probably already thought of all this. |
We're talking about the source generator here. The problem with passing in a CultureInfo is this is all at build time; there's no user code being executed. |
In addition to @stephentoub reply:
The documentation can be a little bit misleading but what you mentioned here is not true. The CurrentCulture is not tied to specific thread and can be used in any other threads. |
I see. (4) with fallback to (3) is the ideal, in my opinion. If in the future .NET makes Attribute metadata more robust, the RegexGenerator can be extended to add a CultureInfo parameter. Just wondering, is the full proposal for (4) to:
Assuming documentation explicitly calls out cultureName must be a BCP-47 tag. and, if not already considered:
|
Ok so after thinking a bit more about this I agree with @stephentoub that we should change to default to InvariantCulture. I don't think that it is a good experience to have the exact same source code produce two different behaviors depending on where it gets built, especially when a lot of times you don't control where is your code getting built (for example when you use build servers that may be geo-distributed). I'll push a PR that makes this change because we do want this to be part of .NET 7. As for the GeneratedRegex constructor that allows to pass in a culture, do we still think there is time to consider that API, or would you be ok with pushing this consideration for vNext? |
I guess we are not sure yet as we haven't designed that API just yet, but the high-level idea would be that the source generated regex defaults to use InvariantCulture mappings, and we would also provide some version of the attribute in the future where you can select a different one if you wanted to.
Regex actually has it's own globalization casing data, so if you set the AppContext swith for using invariant globalization, we detect that and use our own invariant culture mappings. |
cf - guess the docs should be updated? It does use AsyncLocal 🤷 https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Globalization/CultureInfo.cs#L378-L397
I am always pro hardening code vs. hitting moving targets. I would move it to vNext, and document the default culture change as a breaking change to the library. Given the number of bug fixes .NET 7 addresses with things that actually have globalization casing scenarios and how busted it was before, I think this is probably an acceptable breaking change? What would actually be a real breaking change a customer would complain about? |
The source generator engine is new in .NET 7, so this particular API we are talking about is not a breaking change. The overall changes we did with IgnoreCase during this release do have some breaking changes against different versions, but we have tried keeping them to the minimum, and for cases where we are breaking, we are doing so in order to "correct" behavior (for example, previously we would check the culture twice, once when the regex is created and once at match time and that could lead to inconsistencies, that is now fixed, but it will be breaking for people that relied on that inconsistency.) |
There's still time. We just need to do it asap if we're going to. My suggestion would be:
@GrabYourPitchforks, @tarekgh, any opinions about this direction? |
100% agree, this is my main concern as well. It's a very subtle, but significant break for determinism to do this. Enough so that I've added this API, and a few others, to the list of "banned APIs" we'll be checking generators against with an analyzer we have in the works. Essentially APIs we know to be dangerous in generators / analyzers that we want to push customers to stop using because it impacts these type of scenarios.
This seems very reasonable to me. |
Makes sense, I will send out a PR soon that addresses all @stephentoub's points above. @jaredpar that means there is no need on adding this API to the "banned APIs" list 😄 |
Oh the culture APIs are still being added to the ban list for reference of future customers. 😄 There are a lot of APIs that are effectively banned in the compiler and an analyzer is the best way to communicate that to generator / analyzer authors. Some of the banned APIs are more obvious, like don't do file IO, but others like culture are more subtle. I hadn't even thought of culture as a specific problem until reading this thread. |
This comment was marked as outdated.
This comment was marked as outdated.
namespace System.Text.RegularExpressions;
[AttributeUsage(AttributeTargets.Method, AllowMultiple = false, Inherited = false)]
public sealed class GeneratedRegexAttribute : System.Attribute
{
public GeneratedRegexAttribute([StringSyntax(StringSyntaxAttribute.Regex)] string pattern) { }
public GeneratedRegexAttribute([StringSyntax(StringSyntaxAttribute.Regex, "options")] string pattern, RegexOptions options) { }
+ public GeneratedRegexAttribute([StringSyntax(StringSyntaxAttribute.Regex, "options")] string pattern, RegexOptions options, string cultureName) { }
public GeneratedRegexAttribute([StringSyntax(StringSyntaxAttribute.Regex, "options")] string pattern, RegexOptions options, int matchTimeoutMilliseconds) { }
+ public GeneratedRegexAttribute([StringSyntax(StringSyntaxAttribute.Regex, "options")] string pattern, RegexOptions options, int matchTimeoutMilliseconds, string cultureName) { }
+ public string CultureName { get; }
} |
This issue has now been turned into an API proposal for the missing GeneratedRegex constructors that take in a cultureName. The previous description of the issue has been moved bellow
Background and motivation
Regex engines are able to do case insensitive searches when you use
RegexOptions.IgnoreCase
or when you use a pattern that signals that it is case-insensitive, like:(?i)abc
. For better performance, our engines translate the passed in patterns into their case-sensitive equivalence so that we can benefit of some optimizations like using vectorized search, for example, a pattern(?i)abc
will be translated into the equivalent[Aa][Bb][Cc]
.Case conversions are not standard across all cultures, which means that when making that translation, it matters a lot which culture is used in order to produce the translation. For most of our engines (interpreted, Compiled, nonBacktracking) this translation happens at runtime, and will use the
CultureInfo.CurrentCulture
at runtime in order to perform this translation. The remaining engine, however, is different. Because it is the source generated engine, it means that this translation will happen at build-time of the application, which means that the culture won't necessarily be the same as the one you have at runtime. Today, the source generated engine will use the Culture at build-time to perform the translation, and there is no way to select a different culture, which is problematic, since even if you are able to change the culture of your machine if you are building the application, this won't be possible if the code gets built in a build server. This proposal is to add an extra string parameter to theGeneratedRegex
attribute so that you can optionally select which culture should be used to perform the translation.API Proposal
API Usage
Draft PR with the implementation
#73490
Original Issue Description
If a developer specifies RegexOptions.IgnoreCase to a regular expression (or uses the inline ignore casing option in the expression), the implementation(s) then need to do casing operations to determine whether things match. That's implemented today in two parts:
If the developer specifies RegexOptions.CultureInvariant, that's fine, as both (1) and (2) will be done with CultureInfo.InvariantCulture. But if CultureInvariant isn't specified, it's possible that the CultureInfo.CurrentCulture at the time of (1) will differ from the time of (2), in which case the lowercasing can be performed with different cultures, leading to problems. Note that this really only affects the instance methods on Regex. The static methods on Regex access a cache of Regex instances, and the culture used to create the Regex instances is included in the key to that cache. Also note that the docs talk about the current culture being used, but they don't specify current "when".
We have a few options to fix this:
We should investigate whether the discrepancy actually matters, e.g. in practice, do all regexes where this could matter:
i
s)My gut is we should do (2), primarily as it's explainable and simple. We're already doing some such computation at construction time, so it's unlikely a robust dependency could have been taken on the use of the culture only at match time.
If we do (2), there are then additional complications for the regex source generator. All the computation that normally happens in the Regex constructor happens at build time with the regex source generator, which means CultureInfo.CurrentCulture is determined at the time of build, whatever the culture is for the process doing the build / generating the source code. We have a few options there:
i
option is specified somewhere in the expression) and RegexOptions.CultureInvariant wasn't specifiedstring cultureName
, which is then used. Either get rid of the other overloads, or have them do one of the other options.I'm tempted to say we should do (4), and if you use a ctor that doesn't take a culture, fall back to (3).
cc: @GrabYourPitchforks, @tarekgh
Related:
#36147
#58958
The text was updated successfully, but these errors were encountered: