Regex Match, Split and Matches should support RegexOptions.AnyNewLine as (?=\r\z|\n\z|\r\n\z|\z) #25598

jzabroski · 2018-03-23T15:39:59Z

From MSDN:

By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.

This is by far one of the biggest gotchas with using .NET Regex class.

I suggest adding a RegexOptions.AnyNewLine which treats $ as matching both Windows' Environment.NewLine and UNIX' Environment.NewLine, regardless of the Environment running corefx.

Portability concerns
According to Wikipedia, there are a ton of different operating systems, all with different line ending settings. The current implementation hardcodes Unix line-ending style. RegexOptions.AnyNewLine, defined as (?=[\r\n]|\z), would add support for Windows line-ending style.

The advise written in the current docs is actually not portable on Unix, which is becoming a more popular option. As it is suggested, \r?$ will capture one or two lines on Unix, and one on Windows. If you try running Windows assemblies with this hack on Linux, you will change the semantics of programs.

Backward compatibility concerns
Fully backward compatible: This RegexOptions enum extension would not be a default, and so it would not break any clients with reasonably written code. The only existing code that might display different behavior would be reflection code that sets every option on RegexOptions enum variable. I really can't envision anyone doing this on purpose.

Here is Petr Onderka (@svick)'s summary:

OS	Line-ending style	Current	Environment.NewLine	AnyNewLine
Windows	Windows	✗	✓	✓
Windows	Unix	✓	✗	✓
Unix	Windows	✗	✗	✓
Unix	Unix	✓	✓	✓

Api Proposal

edit by @ViktorHofer

namespace System.Text.RegularExpressions
{
    [Flags]
    public enum RegexOptions
    {
        None                    = 0x0000,
        IgnoreCase              = 0x0001, // "i"
        Multiline               = 0x0002, // "m"
        ExplicitCapture         = 0x0004, // "n"
        Compiled                = 0x0008, // "c"
        Singleline              = 0x0010, // "s"
        IgnorePatternWhitespace = 0x0020, // "x"
        RightToLeft             = 0x0040, // "r"

#if DEBUG
        Debug                   = 0x0080, // "d"
#endif

        ECMAScript              = 0x0100, // "e"
        CultureInvariant        = 0x0200,
+       AnyNewLine              = 0x0400 // Treat "$" as (?=[\r\n]|\z)
    }
}

API Review Notes

Video

Looks good. A few comments:

~~We cannot use the proposed value of 128 because it's already taken (see #if DBG in code)~~ Spec updated so that AnyNewLine = 0x400 (1024).
~~The table looks wrong (Windows on Windows on the Current should work IMHO)~~ The table is correct. The fact this trips up experts just speaks to why this is a profound GOTCHA in the Core SDK.
~~May be AcceptAllLineEndings?~~ Some hallway testing I've done indicates AnyNewLine is a good name. Plus, (argument after the final name was chosen) this enumeration will be transliterated into a checkbox on Regular Expression Visualization tools like Regex Hero, so it is preferable to have a concise explanation for the feature to avoid excessive screen space.

PR Review Notes

After work had started on the approved proposal, @danmosemsft asked if the scope of this feature should be changed to also adjust the meaning \Z anchor. @jzabroski suggested writing how the end user documentation will look after this change, as good docs will determine if it is a function step improvement in usability and reducing gotchas.

Also, during the PR, it seems @shishirchawla also proposed AnyEndZ as a way to use AnyNewLine as an "Anchor Modifier", which will alter the meaning of '\Z' anchor in addition to altering the meaning of '$' anchor. The intent of this improvement appears to be to remove all platform-specific language from the Anchors documentation, which seems like a great improvement.

AnyNewLine as Anchor Modifier to \Z and $ Anchors

flags	$ is treated as	$ documentation	\Z is treated as	\Z documentation
neither	`(?=\n\z\|\z)`	The match must occur at the end of the string or before `\n` at the end of the string.	(Same as `$` with this option.)	(Same as `$` with this option.)
`RegexOptions.Multiline`	`(?=\n\|\n\z\|\z)`	The match must occur at the end of the string or before `\n` anywhere in the string.	`(?=\n\z\|\z)`	The match must occur at the end of the string or before `\n` at the end of the string.
`RegexOptions.Multiline \| RegexOptions.AnyNewLine`	`(?=\r\n\|\r\|\n\|\r\n\z\|\r\z\|\n\z\|\z)`	The match must occur at the end of the string or before `\r\n`, `\n` or `\r` anywhere in the string.	`(?=\r\n\z\|\r\z\|\n\z\|\z)`	The match must occur at the end of the string or before `\r\n`, `\n` or `\r` at the end of the string.
`RegexOptions.AnyNewLine`	`(?=\r\n\z\|\r\z\|\n\z\|\z)`	The match must occur at the end of the string or before `\r\n`, `\n` or `\r` at the end of the string.	(Same as `$` with this option.)	(Same as `$` with this option.)

The text was updated successfully, but these errors were encountered:

svick · 2018-03-24T12:31:32Z

I don't like that this would depend on Environment.NewLine. Transferring files between different OSes is common, so I think there is a good chance you will need to process files with different line endings than those that are native to your OS. Instead, I think a setting that would match any line ending (independent of the current OS) would be better. It could be called something like AnyNewLine.

Here is a table comparing the options (✓ means processing files with the given line-ending style works well on the given OS, ✗ means it does not):

OS	Line-ending style	Current	`EnvironmentNewLine`	`AnyNewLine`
Windows	Windows	✗	✓	✓
Windows	Unix	✓	✗	✓
Unix	Windows	✗	✗	✓
Unix	Unix	✓	✓	✓

jzabroski · 2018-03-24T16:10:44Z

Petr, that is exactly what I meant if you looked at how I defined it: (?=[\r\n|\z) can match any mix of line endings, even files trashed by mixed line endings. AnyNewline is a MUCH better name than what I came up with. Love teamwork! Adding a portable sigil for Environment.NewLine probably isn't a bad idea, but the only pragmatic way I could think of doing it would be to export all Regex sigils and let the consumer of the API reprogram the sigil mapping. I thought about this, but felt it was too complicated and also ruin the tooling around current .NET regular expressions. Sites like RegexHero make writing Regex so easy that I wanted a solution that would be easy for tools to adopt. The only gotchas I typically encounter are forgetting to escape a pipe, and then forgetting about how line ending semantics works. Tools help me with the former, but the latter always trips me up.

…

On Sat, Mar 24, 2018, 8:32 AM Petr Onderka ***@***.***> wrote: I don't like that this would depend on Environment.NewLine. Transferring files between different OSes is common, so I think there is a good chance you will need to process files with different line endings than those that are native to your OS. Instead, I think a setting that would match any line ending (independent of the current OS) would be better. It could be called something like AnyNewLine. Here is a table comparing the options (✓ means processing files with the given line-ending style works well on the given OS, ✗ means it does not): OS Line-ending style Current EnvironmentNewLine AnyNewLine Windows Windows ✗ ✓ ✓ Windows Unix ✓ ✗ ✓ Unix Windows ✗ ✗ ✓ Unix Unix ✓ ✓ ✓ — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <https://github.com/dotnet/corefx/issues/28410#issuecomment-375884537>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAbT_WOjuiqujVh1zPVyNau26LElEe-aks5thjzMgaJpZM4S473p> .

jzabroski · 2018-03-24T16:15:04Z

Actually, I see now that in my original description my description of the behavior is inconsistent. Thank you for that wonderful table. I will update the first post with it.

jzabroski · 2018-03-26T15:38:39Z

@jnm2 @svick I updated the original description. Please give feedback.

ViktorHofer · 2018-03-27T10:01:42Z

I like the idea and its cross-platform benefit.

I think a setting that would match any line ending

How is any line ending defined? The name suggests that it can handle any line ending, are there more than Unix and Windows line ending chars? We could probably refine the name.

svick · 2018-03-27T17:39:32Z

@ViktorHofer

How is any line ending defined?

According to Wikipedia's summary of the Unicode standard, there are 8 character sequences that form a line break (7 single characters, including CR and LF, and the sequence CR+LF).

Though I think those other characters are not widely used, so supporting only either LF and CR+LF (Unix and Windows) or CR, LF and CR+LF (old Mac, Unix and Windows) would be fine, especially if it lead to better performance than fully supporting the Unicode definition of line break.

@jzabroski

(?=[\r\n]|\z)

I don't understand this regex, as far as I can tell, it wouldn't match \r\n, and so \r\n would be reported as two line breaks.

jzabroski · 2018-03-27T19:06:06Z

I'll write some test cases with XUnit and use that as a specification. We can then tweak the test cases to come up with a final specification/acceptance criteria. Fare could also be used to confirm.

ViktorHofer · 2018-05-24T18:05:17Z

potential hackathon candidate if we review the API addition in time.

cc @karelz

terrajobst · 2018-05-29T17:21:20Z

Video

Looks good. A few comments:

We cannot use the proposed value of 128 because it's already taken (see #if DBG in code)
The table looks wrong (Windows on Windows on the Current should work IMHO)
May be AcceptAllLineEndings?

svick · 2018-05-29T18:25:23Z

@terrajobst

The table looks wrong (Windows on Windows on the Current should work IMHO)

I don't think the table is wrong, see the documentation linked in the top post, where it says that you should use \r?$ to match Windows-style line endings in multiline mode.

You can also try running code like the following:

var match = Regex.Match("foo\r\nbar", ".*$", RegexOptions.Multiline);

foreach (char c in match.Value)
    Console.WriteLine($"{c} (0x{(int)c:X2})");

For me, running it on .Net Core 2.1 RTM on Windows 10 prints:

f (0x66)
o (0x6F)
o (0x6F)
 (0x0D)

While if this was "working well", I would expect to get just the three characters of "foo" as the match, not "foo\r".

jzabroski · 2018-05-29T19:23:05Z

@terrajobst I agree with @svick . Intuitively, all his table is saying is that neither $ nor Environment.NewLine is fully portable solution for the two ubiquitous platforms. AnyNewLine therefore proposes to fill that gap.

jzabroski · 2018-05-30T00:45:56Z

@terrajobst

I'll update the original post to modify the enum options to be 0x0400 (1024)

Mpdreamz · 2018-06-02T12:01:59Z

Taking a stab at this for the Hackaton

ViktorHofer · 2018-06-02T12:06:32Z

Ok, assigned you.

Mpdreamz · 2018-06-02T14:52:26Z

Had some fun this afternoon going through the internals of System.Text.RegularExpressions

I pushed what I have so far here: Mpdreamz/corefx@4c4d688

I might pick it up for fun at some point again, feel free to unassign me 😄

ViktorHofer · 2018-06-02T14:57:03Z

wow that's great progress! it's up to you if you want to continue working on it or pass it on to me / other community members.

danmoseley · 2019-09-10T15:50:48Z

Unassigning for someone else to look. Thanks for the start!

shishirchawla · 2019-09-12T05:47:42Z

Hi @danmosemsft, I would like to have a go at this. Can you please get this assigned to me. Thanks.

danmoseley · 2019-09-12T15:43:06Z

@shishirchawla sounds great, go ahead! BTW I sent you a collaborator invite. When you accept LMK and I will assign this formally.

danmoseley · 2019-09-12T20:08:17Z

Assigned, it's all yours @shishirchawla work on any cadence you want and just post here if you have any questions or issues.

shishirchawla · 2019-09-19T07:28:35Z

Hi @danmosemsft / @jzabroski ,

Heres my PR for the new API - dotnet/corefx#41195 . This is my first PR on the project, so please bare with me.

Looking forward to your feedback.

Shishir

jzabroski · 2019-10-07T17:23:20Z

@danmosemsft @terrajobst @ViktorHofer @svick It seems that during the process of implementing this feature, @shishirchawla discovered an area to improve the API a tiny bit further. If you go to the Anchors documentation, there are two anchors, $ and \Z, which mention Unix-specific \n line ending. I've updated the top post of this issue with the full notes, including @terrajobst API review notes, so that everything is consolidated in one post.

I like @shishirchawla 's improvement, but I think the final focus should be on how this is communicated in the Docs, such as https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions and ensuring full test coverage.

danmoseley · 2019-10-15T03:17:23Z

@terrajobst could you please give your take on the \Z change? Does this need to go back to review or is it arguably what was intended?

jzabroski · 2019-10-15T13:09:40Z

@danmosemsft So sorry I'm a foot dragger on this. Been trying to keep up with my own open source project, life, work, etc.

Relevant examples / test cases:

https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions#end-of-string-or-line-
https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions#end-of-string-or-before-ending-newline-z
https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions#end-of-string-only-z

The examples are a bit idiosyncratic in that they reference Environment.NewLine, and it probably isn't immediately clear to the documentation writer that the example output will change depending on the runtime OS environment.

danmoseley · 2019-10-27T16:34:02Z

ping @terrajobst

terrajobst · 2019-11-04T23:30:33Z

The name AnyNewLine works for me. Unless someone of @dotnet/fxdc disagress, I wouldn't bring this back to API review. Our feedback was mostly around semantics, and that seems to have been take care of.

jzabroski · 2020-07-31T15:02:45Z

@danmosemsft Thanks for putting this together. This issue has been on my mind for some time, but life keeps getting in the way!

The main reason I did not put this on my plate is precisely the risk of causing a regression in behavior that you highlight by the thirteen different places that need to be considered. Perhaps I am too detail-oriented, but as we are nearing .NET Core 5, my thought was, "Maybe this should wait for the next minor version?"

In terms of your design proposal, I agree that making the code long-term easier to understand and maintain is a big win. Having the parser handle the phrase as a token makes sense, and also cuts down the risk of a major regression due to unforeseen feature combinations. It would also be easier to lift back up to the interpreter / compiler the token. Plus, you could have a compiler that analyzes the parse tree to determine if it is an "extended Regular Expression" or actual "regular language Regular Expression." Similar to how .NET Expression tree API allows preferInterpretation. In most cases, it would be convenient to have the high-level language services just transparently do what's optimal.

jzabroski · 2020-10-01T18:49:50Z

@stephentoub For .NET vNext, can you comment on @danmosemsft ideas to lift this: #25598 (comment)

stephentoub · 2020-10-02T21:06:51Z

I like the simplicity and isolated nature of that approach, and I think it's reasonable to start, at least for a prototype, and see a) whether it's functionally correct with all the various new tests we throw at it, and b) what the performance hit is without further optimization. Assuming it's functionally sound and the perf hit is reasonable, it sounds like a good way to go for now; if in the future the perf of it became more of an issue, we could address that in a variety of ways, either by adding more optimizations for the patterns it employs, or by switching away from doing it purely in the parser as a lowering step.

danmoseley · 2020-10-02T21:23:00Z

Now we just need a volunteer...

jzabroski · 2020-10-19T16:25:04Z

(Just got back from vacation.) I'm interested in doing it, and think I almost got my two open source projects in a state where they're ready for v5. Anyway, I think it would be a lot of fun to do - if I use that word. :)

danmoseley · 2020-10-19T17:31:53Z

Great, I assigned you @jzabroski . As @stephentoub mentions, you'll probably want to start by prototyping and getting perf measurements. Thanks!

danmoseley · 2022-03-18T17:38:56Z

@jzabroski I cleared your assignment.

jzabroski · 2022-03-18T18:07:17Z

@danmoseley Thanks. Sorry about not finishing this. I caught Covid shortly after you assigned this and my time for additional work outside my normal hours diminished (long recovery).

danmoseley · 2022-03-18T18:32:34Z

no problem! sorry to hear about that 😿 you're always welcome to contribute here or to anything else in the repo.

danmoseley · 2023-06-17T22:49:07Z

Returning to this briefly, I wanted to point out that the behavior described in the top post is not what we want. It says that "The match must occur at the end of the string or before \r\n, \n or \r at the end of the string."

That is not desirable, because \r\n is logically a single line break. This how it is treated by all text editors, that is what is intended when code (usually on Windows) writes it. Also, the Unicode regex standard TR18 explicitly indicates that they go together indivisibly. And it is the behavior of PCRE, Perl, Java etc when you use \R.

That's why in my table above, I have $ equivalent to (effectively) (?=\r\n\z|\r?\z)|(?<!\r)(?=\n\z) (in non multiline) and not (?=\r\n\z|\r\z|\n\z|\z) as in the topmost post.

danmoseley · 2023-06-18T06:54:10Z

@stephentoub I hacked up what lowering might look like -- see https://github.com/dotnet/runtime/compare/main...danmoseley:runtime:anynewline?expand=1#diff-548465bb4dbdd05024f0676c944e3c2b89677c48257bf28dfb0afccde31ded0aR19

I wrapped the pattern in an iterator with the idea that it could show "real" locations in the pattern in error messages, but not sure whether it's worth that abstraction.

Do you think this is a worthwhile direction?

Edit: The approach makes those handful of tests pass. I'd need to do more tests/inspection to see whether there's other positional state that would need correction on inserting into the pattern like this, and update that too.

I suspect perf with this lowering approach may be more of a worry than the parser. Once the flag exists, folks may update existing patterns to use it, which may then be a perf trap if they are "hot". Possibly adding \R would be less of a problem in this respect as it would require modifying the pattern, so less "easy", and by putting \R in a non capturing group or an assertion as appropriate you can achieve some of the same things as AnyNewLine albeit potentially in a clumsy way (but a standard way).

So a possible approach might be to implement \R for .NET 8 then at a future point add AnyNewLine but properly, as it's own node type that the engines accommodate.

(In this model \R would match any new line out of the gate, but that's the same as PCRE: by default in PCRE, . $ \Z recognize only \n but \R recognizes all newline flavors.)

danmoseley · 2023-08-13T21:19:16Z

I didn't like the parser approach -- it's clumsy and will run slow.
I'm working on supporting this properly, but in the generator only -- that way we can do it efficiently plus avoid adding any runtime cost for the regular case.

working in https://github.com/dotnet/runtime/compare/main...danmoseley:anynewline2?expand=1

stephentoub · 2023-08-13T21:22:35Z

I'm working on supporting this properly, but in the generator only

You mean someone else would add it for the other engines?

danmoseley · 2023-08-13T22:05:55Z

Potentially, in future, with care to avoid impacting perf in existing cases. As AnyNewLine would be new, I think it's OK if only the new engine supports it (meaning it will only work for static patterns) initially.

Just checking for more than \n is easy, the complexity is dealing with \r\n atomically

danmoseley · 2023-08-13T22:21:55Z

What are your thoughts, would it be acceptable to only support for static patterns (generator)

stephentoub · 2023-08-13T22:23:04Z

I think it's OK if only the new engine supports it (meaning it will only work for static patterns)

I'm not a fan of that. At a minimum, when we support it, it should be in all but the NonBacktracking engine, but ideally all of them. There are currently zero differences in supported functionality between the interpreter, compiler, and source generator; I'm not excited to add any. Plus, there are cases where the source generator will fall back to the other engines, eg case-insensitive back references, so whether this new option would work would be very confusing.

jzabroski changed the title ~~Regex Match, Split and Matches should support RegexOptions.EnvironmentNewLine as (?=[\r\n]|\z)~~ Regex Match, Split and Matches should support RegexOptions.AnyNewLine as (?=[\r\n]|\z) Mar 26, 2018

ViktorHofer assigned Mpdreamz Jun 2, 2018

danmoseley unassigned Mpdreamz Sep 10, 2019

danmoseley assigned shishirchawla Sep 12, 2019

VS-ux mentioned this issue Aug 8, 2020

RegexOptions.Multiline should be same as line by line #40566

Closed

danmoseley mentioned this issue Aug 9, 2020

RegexOptions.Multiline not matching text #40577

Closed

danmoseley mentioned this issue Aug 18, 2020

Make enum RegexParseError and RegexParseException public #38872

Closed

danmoseley assigned jzabroski Oct 19, 2020

jzabroski mentioned this issue May 10, 2021

$ doesn't match CRLF rust-lang/regex#244

Closed

danmoseley mentioned this issue Mar 18, 2022

Regex engines only look for 'LF' line endings when using RegexOptions.Multiline mode as opposed to looking for 'CRLF' as well #66845

Closed

danmoseley unassigned jzabroski Mar 18, 2022

jzabroski mentioned this issue Jun 1, 2022

Regex examples should use "\z" not "$" for end-of-string dotnet/dotnet-api-docs#5422

Open

This was referenced Jun 10, 2023

Strange behavior with multiline Regex #87368

Closed

clarify newline behavior of regex dotnet/docs#35751

Merged

danmoseley self-assigned this Aug 13, 2023

danmoseley removed the help wanted [up-for-grabs] Good issue for external contributors label Aug 13, 2023

Regex Match, Split and Matches should support RegexOptions.AnyNewLine as (?=\r\z|\n\z|\r\n\z|\z) #25598

Regex Match, Split and Matches should support RegexOptions.AnyNewLine as (?=\r\z|\n\z|\r\n\z|\z) #25598

Comments

jzabroski commented Mar 23, 2018

Api Proposal

API Review Notes

PR Review Notes

AnyNewLine as Anchor Modifier to \Z and $ Anchors

svick commented Mar 24, 2018

jzabroski commented Mar 24, 2018 via email

jzabroski commented Mar 24, 2018

jzabroski commented Mar 26, 2018

ViktorHofer commented Mar 27, 2018

svick commented Mar 27, 2018 • edited Loading

jzabroski commented Mar 27, 2018

ViktorHofer commented May 24, 2018

terrajobst commented May 29, 2018 • edited Loading

svick commented May 29, 2018

jzabroski commented May 29, 2018

jzabroski commented May 30, 2018

Mpdreamz commented Jun 2, 2018

ViktorHofer commented Jun 2, 2018

Mpdreamz commented Jun 2, 2018

ViktorHofer commented Jun 2, 2018

danmoseley commented Sep 10, 2019

shishirchawla commented Sep 12, 2019

danmoseley commented Sep 12, 2019

danmoseley commented Sep 12, 2019

shishirchawla commented Sep 19, 2019

jzabroski commented Oct 7, 2019 • edited Loading

danmoseley commented Oct 15, 2019

jzabroski commented Oct 15, 2019

danmoseley commented Oct 27, 2019

terrajobst commented Nov 4, 2019

jzabroski commented Jul 31, 2020

jzabroski commented Oct 1, 2020

stephentoub commented Oct 2, 2020

danmoseley commented Oct 2, 2020

jzabroski commented Oct 19, 2020

danmoseley commented Oct 19, 2020

danmoseley commented Mar 18, 2022

jzabroski commented Mar 18, 2022

danmoseley commented Mar 18, 2022

danmoseley commented Jun 17, 2023

danmoseley commented Jun 18, 2023 • edited Loading

danmoseley commented Aug 13, 2023

stephentoub commented Aug 13, 2023

danmoseley commented Aug 13, 2023

danmoseley commented Aug 13, 2023

stephentoub commented Aug 13, 2023 • edited Loading

svick commented Mar 27, 2018 •

edited

Loading

terrajobst commented May 29, 2018 •

edited

Loading

jzabroski commented Oct 7, 2019 •

edited

Loading

danmoseley commented Jun 18, 2023 •

edited

Loading

stephentoub commented Aug 13, 2023 •

edited

Loading