Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex Match, Split and Matches should support RegexOptions.AnyNewLine as (?=\r\z|\n\z|\r\n\z|\z) #25598

Open
jzabroski opened this issue Mar 23, 2018 · 50 comments
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.RegularExpressions
Milestone

Comments

@jzabroski
Copy link
Contributor

From MSDN:

By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.

This is by far one of the biggest gotchas with using .NET Regex class.

I suggest adding a RegexOptions.AnyNewLine which treats $ as matching both Windows' Environment.NewLine and UNIX' Environment.NewLine, regardless of the Environment running corefx.

Portability concerns
According to Wikipedia, there are a ton of different operating systems, all with different line ending settings. The current implementation hardcodes Unix line-ending style. RegexOptions.AnyNewLine, defined as (?=[\r\n]|\z), would add support for Windows line-ending style.

The advise written in the current docs is actually not portable on Unix, which is becoming a more popular option. As it is suggested, \r?$ will capture one or two lines on Unix, and one on Windows. If you try running Windows assemblies with this hack on Linux, you will change the semantics of programs.

Backward compatibility concerns
Fully backward compatible: This RegexOptions enum extension would not be a default, and so it would not break any clients with reasonably written code. The only existing code that might display different behavior would be reflection code that sets every option on RegexOptions enum variable. I really can't envision anyone doing this on purpose.

Here is Petr Onderka (@svick)'s summary:

OS Line-ending style Current Environment.NewLine AnyNewLine
Windows Windows
Windows Unix
Unix Windows
Unix Unix

Api Proposal

edit by @ViktorHofer

namespace System.Text.RegularExpressions
{
    [Flags]
    public enum RegexOptions
    {
        None                    = 0x0000,
        IgnoreCase              = 0x0001, // "i"
        Multiline               = 0x0002, // "m"
        ExplicitCapture         = 0x0004, // "n"
        Compiled                = 0x0008, // "c"
        Singleline              = 0x0010, // "s"
        IgnorePatternWhitespace = 0x0020, // "x"
        RightToLeft             = 0x0040, // "r"

#if DEBUG
        Debug                   = 0x0080, // "d"
#endif

        ECMAScript              = 0x0100, // "e"
        CultureInvariant        = 0x0200,
+       AnyNewLine              = 0x0400 // Treat "$" as (?=[\r\n]|\z)
    }
}

API Review Notes

Video

Looks good. A few comments:

  • We cannot use the proposed value of 128 because it's already taken (see #if DBG in code) Spec updated so that AnyNewLine = 0x400 (1024).
  • The table looks wrong (Windows on Windows on the Current should work IMHO) The table is correct. The fact this trips up experts just speaks to why this is a profound GOTCHA in the Core SDK.
  • May be AcceptAllLineEndings? Some hallway testing I've done indicates AnyNewLine is a good name. Plus, (argument after the final name was chosen) this enumeration will be transliterated into a checkbox on Regular Expression Visualization tools like Regex Hero, so it is preferable to have a concise explanation for the feature to avoid excessive screen space.

PR Review Notes

After work had started on the approved proposal, @danmosemsft asked if the scope of this feature should be changed to also adjust the meaning \Z anchor. @jzabroski suggested writing how the end user documentation will look after this change, as good docs will determine if it is a function step improvement in usability and reducing gotchas.

Also, during the PR, it seems @shishirchawla also proposed AnyEndZ as a way to use AnyNewLine as an "Anchor Modifier", which will alter the meaning of '\Z' anchor in addition to altering the meaning of '$' anchor. The intent of this improvement appears to be to remove all platform-specific language from the Anchors documentation, which seems like a great improvement.

AnyNewLine as Anchor Modifier to \Z and $ Anchors

flags $ is treated as $ documentation \Z is treated as \Z documentation
neither (?=\n\z|\z) The match must occur at the end of the string or before \n at the end of the string. (Same as $ with this option.) (Same as $ with this option.)
RegexOptions.Multiline (?=\n|\n\z|\z) The match must occur at the end of the string or before \n anywhere in the string. (?=\n\z|\z) The match must occur at the end of the string or before \n at the end of the string.
RegexOptions.Multiline | RegexOptions.AnyNewLine (?=\r\n|\r|\n|\r\n\z|\r\z|\n\z|\z) The match must occur at the end of the string or before \r\n, \n or \r anywhere in the string. (?=\r\n\z|\r\z|\n\z|\z) The match must occur at the end of the string or before \r\n, \n or \r at the end of the string.
RegexOptions.AnyNewLine (?=\r\n\z|\r\z|\n\z|\z) The match must occur at the end of the string or before \r\n, \n or \r at the end of the string. (Same as $ with this option.) (Same as $ with this option.)
@svick
Copy link
Contributor

svick commented Mar 24, 2018

I don't like that this would depend on Environment.NewLine. Transferring files between different OSes is common, so I think there is a good chance you will need to process files with different line endings than those that are native to your OS. Instead, I think a setting that would match any line ending (independent of the current OS) would be better. It could be called something like AnyNewLine.

Here is a table comparing the options (✓ means processing files with the given line-ending style works well on the given OS, ✗ means it does not):

OS Line-ending style Current EnvironmentNewLine AnyNewLine
Windows Windows
Windows Unix
Unix Windows
Unix Unix

@jzabroski
Copy link
Contributor Author

jzabroski commented Mar 24, 2018 via email

@jzabroski
Copy link
Contributor Author

Actually, I see now that in my original description my description of the behavior is inconsistent. Thank you for that wonderful table. I will update the first post with it.

@jzabroski jzabroski changed the title Regex Match, Split and Matches should support RegexOptions.EnvironmentNewLine as (?=[\r\n]|\z) Regex Match, Split and Matches should support RegexOptions.AnyNewLine as (?=[\r\n]|\z) Mar 26, 2018
@jzabroski
Copy link
Contributor Author

@jnm2 @svick I updated the original description. Please give feedback.

@ViktorHofer
Copy link
Member

I like the idea and its cross-platform benefit.

I think a setting that would match any line ending

How is any line ending defined? The name suggests that it can handle any line ending, are there more than Unix and Windows line ending chars? We could probably refine the name.

@svick
Copy link
Contributor

svick commented Mar 27, 2018

@ViktorHofer

How is any line ending defined?

According to Wikipedia's summary of the Unicode standard, there are 8 character sequences that form a line break (7 single characters, including CR and LF, and the sequence CR+LF).

Though I think those other characters are not widely used, so supporting only either LF and CR+LF (Unix and Windows) or CR, LF and CR+LF (old Mac, Unix and Windows) would be fine, especially if it lead to better performance than fully supporting the Unicode definition of line break.

@jzabroski

(?=[\r\n]|\z)

I don't understand this regex, as far as I can tell, it wouldn't match \r\n, and so \r\n would be reported as two line breaks.

@jzabroski
Copy link
Contributor Author

I'll write some test cases with XUnit and use that as a specification. We can then tweak the test cases to come up with a final specification/acceptance criteria. Fare could also be used to confirm.

@ViktorHofer
Copy link
Member

potential hackathon candidate if we review the API addition in time.

cc @karelz

@terrajobst
Copy link
Member

terrajobst commented May 29, 2018

Video

Looks good. A few comments:

  • We cannot use the proposed value of 128 because it's already taken (see #if DBG in code)
  • The table looks wrong (Windows on Windows on the Current should work IMHO)
  • May be AcceptAllLineEndings?

@svick
Copy link
Contributor

svick commented May 29, 2018

@terrajobst

The table looks wrong (Windows on Windows on the Current should work IMHO)

I don't think the table is wrong, see the documentation linked in the top post, where it says that you should use \r?$ to match Windows-style line endings in multiline mode.

You can also try running code like the following:

var match = Regex.Match("foo\r\nbar", ".*$", RegexOptions.Multiline);

foreach (char c in match.Value)
    Console.WriteLine($"{c} (0x{(int)c:X2})");

For me, running it on .Net Core 2.1 RTM on Windows 10 prints:

f (0x66)
o (0x6F)
o (0x6F)
 (0x0D)

While if this was "working well", I would expect to get just the three characters of "foo" as the match, not "foo\r".

@jzabroski
Copy link
Contributor Author

@terrajobst I agree with @svick . Intuitively, all his table is saying is that neither $ nor Environment.NewLine is fully portable solution for the two ubiquitous platforms. AnyNewLine therefore proposes to fill that gap.

@jzabroski
Copy link
Contributor Author

@terrajobst

I'll update the original post to modify the enum options to be 0x0400 (1024)

@Mpdreamz
Copy link
Contributor

Mpdreamz commented Jun 2, 2018

Taking a stab at this for the Hackaton

@ViktorHofer
Copy link
Member

Ok, assigned you.

@Mpdreamz
Copy link
Contributor

Mpdreamz commented Jun 2, 2018

Had some fun this afternoon going through the internals of System.Text.RegularExpressions

I pushed what I have so far here: Mpdreamz/corefx@4c4d688

I might pick it up for fun at some point again, feel free to unassign me 😄

@ViktorHofer
Copy link
Member

wow that's great progress! it's up to you if you want to continue working on it or pass it on to me / other community members.

@danmoseley
Copy link
Member

Unassigning for someone else to look. Thanks for the start!

@shishirchawla
Copy link

Hi @danmosemsft, I would like to have a go at this. Can you please get this assigned to me. Thanks.

@danmoseley
Copy link
Member

@shishirchawla sounds great, go ahead! BTW I sent you a collaborator invite. When you accept LMK and I will assign this formally.

@danmoseley
Copy link
Member

Assigned, it's all yours @shishirchawla work on any cadence you want and just post here if you have any questions or issues.

@shishirchawla
Copy link

Hi @danmosemsft / @jzabroski ,

Heres my PR for the new API - dotnet/corefx#41195 . This is my first PR on the project, so please bare with me.

Looking forward to your feedback.

Shishir

@jzabroski
Copy link
Contributor Author

jzabroski commented Oct 7, 2019

@danmosemsft @terrajobst @ViktorHofer @svick It seems that during the process of implementing this feature, @shishirchawla discovered an area to improve the API a tiny bit further. If you go to the Anchors documentation, there are two anchors, $ and \Z, which mention Unix-specific \n line ending. I've updated the top post of this issue with the full notes, including @terrajobst API review notes, so that everything is consolidated in one post.

I like @shishirchawla 's improvement, but I think the final focus should be on how this is communicated in the Docs, such as https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions and ensuring full test coverage.

@danmoseley
Copy link
Member

@terrajobst could you please give your take on the \Z change? Does this need to go back to review or is it arguably what was intended?

@jzabroski
Copy link
Contributor Author

@danmosemsft So sorry I'm a foot dragger on this. Been trying to keep up with my own open source project, life, work, etc.

Relevant examples / test cases:

https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions#end-of-string-or-line-
https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions#end-of-string-or-before-ending-newline-z
https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions#end-of-string-only-z

The examples are a bit idiosyncratic in that they reference Environment.NewLine, and it probably isn't immediately clear to the documentation writer that the example output will change depending on the runtime OS environment.

@danmoseley
Copy link
Member

ping @terrajobst

@terrajobst
Copy link
Member

The name AnyNewLine works for me. Unless someone of @dotnet/fxdc disagress, I wouldn't bring this back to API review. Our feedback was mostly around semantics, and that seems to have been take care of.

@jzabroski
Copy link
Contributor Author

@danmosemsft Thanks for putting this together. This issue has been on my mind for some time, but life keeps getting in the way!

The main reason I did not put this on my plate is precisely the risk of causing a regression in behavior that you highlight by the thirteen different places that need to be considered. Perhaps I am too detail-oriented, but as we are nearing .NET Core 5, my thought was, "Maybe this should wait for the next minor version?"

In terms of your design proposal, I agree that making the code long-term easier to understand and maintain is a big win. Having the parser handle the phrase as a token makes sense, and also cuts down the risk of a major regression due to unforeseen feature combinations. It would also be easier to lift back up to the interpreter / compiler the token. Plus, you could have a compiler that analyzes the parse tree to determine if it is an "extended Regular Expression" or actual "regular language Regular Expression." Similar to how .NET Expression tree API allows preferInterpretation. In most cases, it would be convenient to have the high-level language services just transparently do what's optimal.

@jzabroski
Copy link
Contributor Author

@stephentoub For .NET vNext, can you comment on @danmosemsft ideas to lift this: #25598 (comment)

@stephentoub
Copy link
Member

I like the simplicity and isolated nature of that approach, and I think it's reasonable to start, at least for a prototype, and see a) whether it's functionally correct with all the various new tests we throw at it, and b) what the performance hit is without further optimization. Assuming it's functionally sound and the perf hit is reasonable, it sounds like a good way to go for now; if in the future the perf of it became more of an issue, we could address that in a variety of ways, either by adding more optimizations for the patterns it employs, or by switching away from doing it purely in the parser as a lowering step.

@danmoseley
Copy link
Member

Now we just need a volunteer...

@jzabroski
Copy link
Contributor Author

(Just got back from vacation.) I'm interested in doing it, and think I almost got my two open source projects in a state where they're ready for v5. Anyway, I think it would be a lot of fun to do - if I use that word. :)

@danmoseley
Copy link
Member

Great, I assigned you @jzabroski . As @stephentoub mentions, you'll probably want to start by prototyping and getting perf measurements. Thanks!

@danmoseley
Copy link
Member

@jzabroski I cleared your assignment.

@jzabroski
Copy link
Contributor Author

@danmoseley Thanks. Sorry about not finishing this. I caught Covid shortly after you assigned this and my time for additional work outside my normal hours diminished (long recovery).

@danmoseley
Copy link
Member

no problem! sorry to hear about that 😿 you're always welcome to contribute here or to anything else in the repo.

@danmoseley
Copy link
Member

Returning to this briefly, I wanted to point out that the behavior described in the top post is not what we want. It says that "The match must occur at the end of the string or before \r\n, \n or \r at the end of the string."

That is not desirable, because \r\n is logically a single line break. This how it is treated by all text editors, that is what is intended when code (usually on Windows) writes it. Also, the Unicode regex standard TR18 explicitly indicates that they go together indivisibly. And it is the behavior of PCRE, Perl, Java etc when you use \R.

That's why in my table above, I have $ equivalent to (effectively) (?=\r\n\z|\r?\z)|(?<!\r)(?=\n\z) (in non multiline) and not (?=\r\n\z|\r\z|\n\z|\z) as in the topmost post.

@danmoseley
Copy link
Member

danmoseley commented Jun 18, 2023

@stephentoub I hacked up what lowering might look like -- see https://github.com/dotnet/runtime/compare/main...danmoseley:runtime:anynewline?expand=1#diff-548465bb4dbdd05024f0676c944e3c2b89677c48257bf28dfb0afccde31ded0aR19

I wrapped the pattern in an iterator with the idea that it could show "real" locations in the pattern in error messages, but not sure whether it's worth that abstraction.

Do you think this is a worthwhile direction?

Edit: The approach makes those handful of tests pass. I'd need to do more tests/inspection to see whether there's other positional state that would need correction on inserting into the pattern like this, and update that too.

I suspect perf with this lowering approach may be more of a worry than the parser. Once the flag exists, folks may update existing patterns to use it, which may then be a perf trap if they are "hot". Possibly adding \R would be less of a problem in this respect as it would require modifying the pattern, so less "easy", and by putting \R in a non capturing group or an assertion as appropriate you can achieve some of the same things as AnyNewLine albeit potentially in a clumsy way (but a standard way).

So a possible approach might be to implement \R for .NET 8 then at a future point add AnyNewLine but properly, as it's own node type that the engines accommodate.

(In this model \R would match any new line out of the gate, but that's the same as PCRE: by default in PCRE, . $ \Z recognize only \n but \R recognizes all newline flavors.)

@danmoseley danmoseley self-assigned this Aug 13, 2023
@danmoseley danmoseley removed the help wanted [up-for-grabs] Good issue for external contributors label Aug 13, 2023
@danmoseley
Copy link
Member

I didn't like the parser approach -- it's clumsy and will run slow.
I'm working on supporting this properly, but in the generator only -- that way we can do it efficiently plus avoid adding any runtime cost for the regular case.

working in https://github.com/dotnet/runtime/compare/main...danmoseley:anynewline2?expand=1

@stephentoub
Copy link
Member

I'm working on supporting this properly, but in the generator only

You mean someone else would add it for the other engines?

@danmoseley
Copy link
Member

Potentially, in future, with care to avoid impacting perf in existing cases. As AnyNewLine would be new, I think it's OK if only the new engine supports it (meaning it will only work for static patterns) initially.

Just checking for more than \n is easy, the complexity is dealing with \r\n atomically

@danmoseley
Copy link
Member

What are your thoughts, would it be acceptable to only support for static patterns (generator)

@stephentoub
Copy link
Member

stephentoub commented Aug 13, 2023

I think it's OK if only the new engine supports it (meaning it will only work for static patterns)

I'm not a fan of that. At a minimum, when we support it, it should be in all but the NonBacktracking engine, but ideally all of them. There are currently zero differences in supported functionality between the interpreter, compiler, and source generator; I'm not excited to add any. Plus, there are cases where the source generator will fall back to the other engines, eg case-insensitive back references, so whether this new option would work would be very confusing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.RegularExpressions
Projects
None yet
Development

No branches or pull requests

10 participants