Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added new regex option RegexOptions.AnyNewLine #1449

Closed

Conversation

shishirchawla
Copy link

@shishirchawla shishirchawla commented Jan 8, 2020

@jzabroski
Copy link
Contributor

@shishirchawla I looked into this build failure.

Couple of things:

  1. The warnings about not being able to convert from LC_ALL to US-en.UTF-8 are a bit worrisome. Most likely not related to your check-in, probably a pre-existing warning the .NET build team should investigate on OSX.
  2. The failure indicated by https://helix.dot.net/api/2019-06-17/jobs/90a814ab-72c0-4914-941d-6afaafc4a328/workitems/System.Security.Cryptography.OpenSsl.Tests/console suggests some weird stuff regarding how the .NET Runtime build works - there is https://github.com/dotnet/runtime/blob/3b9abae5aae537258611b5750dc7e2963cfc73ed/src/libraries/System.Security.Cryptography.OpenSsl/src/System/Security/Cryptography/RSAOpenSsl.cs but also https://github.com/dotnet/runtime/blob/3b9abae5aae537258611b5750dc7e2963cfc73ed/src/libraries/Common/src/System/Security/Cryptography/RSAOpenSsl.cs

I

@danmoseley
Copy link
Member

OpenSSL failures are #1129 and can be ignored.

The warnings must be unrelated - I didn't look at them though.

All this needs is a code review signoff.

@@ -299,6 +299,39 @@ public static IEnumerable<object[]> Match_Basic_TestData()

// Surrogate pairs splitted up into UTF-16 code units.
yield return new object[] { @"(\uD82F[\uDCA0-\uDCA3])", "\uD82F\uDCA2", RegexOptions.CultureInvariant, 0, 2, true, "\uD82F\uDCA2" };

// AnyNewLine (with none of the special characters used as line ending)
yield return new object[] { @"line3\nline4$", "line1\nline2\nline3\nline4", RegexOptions.AnyNewLine, 0, 23, true, "line3\nline4" };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add tests with \z and it's combinations here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look at the PR, but one way or another we sohuld have tests that cover the whole of my table. It might be good to include it as a comment even given the amount of noodling we did about it.

dotnet/corefx#41195 (comment)

Subsequent to this PR, one of us should put a PR up against the API docs to include something like this table.

@stephentoub
Copy link
Member

@shishirchawla, thanks for this. I merged some extensive changes to Regex this morning. I don't think it should much impact your changes, but there are some conflicts. Can you rebase? Then I can review.

@shishirchawla
Copy link
Author

@shishirchawla, thanks for this. I merged some extensive changes to Regex this morning. I don't think it should much impact your changes, but there are some conflicts. Can you rebase? Then I can review.

Done.

@stephentoub
Copy link
Member

High-level questions:

  1. It looks like this change doesn't impact the behavior of .. Historically, . (unless SingleLine is set) has basically been the equivalent of [^\n], i.e. it matches everything except for a new line. That's the same new line that's matched by $ and whatnot. This PR is changing the new lines allowed by $ and friends... should it not also change the newlines that . maps to?

  2. This is allowing a new line to be '\n' and \r\n, but also \r. Are there any platforms in use today that use \r as a new line? Do any other regex implementations by any other modern language / framework allow treating \r as a new line? I'm trying to understand the use case for that.

@jzabroski
Copy link
Contributor

High-level questions:

  1. It looks like this change doesn't impact the behavior of .. Historically, . (unless SingleLine is set) has basically been the equivalent of [^\n], i.e. it matches everything except for a new line. That's the same new line that's matched by $ and whatnot. This PR is changing the new lines allowed by $ and friends... should it not also change the newlines that . maps to?

Good point. I have a summary of Jeffrey Freidl's breakdown of anchor modifiers, here: https://github.com/jzabroski/Home/tree/master/RegularExpressions

  1. This is allowing a new line to be '\n' and \r\n, but also \r. Are there any platforms in use today that use \r as a new line? Do any other regex implementations by any other modern language / framework allow treating \r as a new line? I'm trying to understand the use case for that.

I was not a fan of '\r' as new line, as the only popular OS to adopt this in the last ~30 years was "classic Mac OS". Source: Wikipedia's Newline page. However, beggars cannot be choosers, and so if the .NET maintainers wish to allow matching \r, I see no great harm.

The current line anchor mode in .NET has taken its ancestry from Perl, and was copied by Java, Python, Delphi, and became the PCRE (Perl Compatible Regular Expressions) standard. The mistake, in my opinion, is that PCRE is only useful on UNIX systems and so fixing "end of line" to be \n isn't portable.

Background Noise: In his book, Mastering Regular Expressions, Jeffrey Freidl traces the behavior of how \Z works all the way back to Ken Thompson creating ed, and Alfred Aho isolating the search code in that editor to become grep and the introduction of the metacharacters we discuss now. However, grep was not a free-standing library and so until 1986-1987, there was no way for programmers to "call" a regex like a function. It was Larry Wall introducing his conventions in Perl that leave us with where we are today, where Regex's aren't strictly portable across Windows and Linux, ergo why I submitted this issue in the first place.

@stephentoub
Copy link
Member

Thank you for the follow-up, though some of the questions I asked are I think still unanswered:

  • "should it not also change the newlines that . maps to?" I've not looked to see how complicated this would be, but it could have non-trivial perf implications, as it means . would need to map to something like (?:\n|\r\n) (or (?:\n|\r\n|\r) if we allow \r). However, it seems wrong to me to have the meaning of . diverge from the meaning of the anchors.
  • "Do any other regex implementations by any other modern language / framework allow treating \r as a new line?" I know we consume \r as a valid newline in StreamReader, but that was also done in the days where there was such an OS. I'm not convinced it's worth incurring the additional costs for something that's not actually meaningful today, though there's an argument to be made for consistency with StreamReader. @JeremyKuhne, do you have an opinion?

@jzabroski
Copy link
Contributor

  • "should it not also change the newlines that . maps to?"

Yes, it should.

  • "Do any other regex implementations by any other modern language / framework allow treating \r as a new line?"

I think this is a matter of taste that no one person should decide, but I was unaware of StreamReader behaving that way, and it does seem nicer from a testing standpoint to have uniform assumptions.

Keep in mind, this flag is opt-in only. So, while it might be "slower" to match this way, users have to opt-in, and when they do, it's most likely so they can use the metacharacters to compactly express intent, not worry about performance. Frankly, having loaded NYSE tick data into kdb+, mechanical sympathy between mmap, bit splitter and target data store is tops in optimizing data loads. This is about every day usability and portability to get stuff done when you need a handy regex.

@danmoseley
Copy link
Member

danmoseley commented Jan 10, 2020

It may have been me that introduced \r into the original conversation by putting it in the lookup table. I guess I figured "any new line" doesn't suggest a subset and maybe there's some datasets that use \r.

Admittedly Wikipedia tells me there are several other even more obscure newlines so I guess we would still be drawing a line somewhere. On balance given the name of the option I suggest to include \r but I could go either way.

if (UseOptionA())
AddUnitType(UseOptionM() ? RegexNode.AnyEol : RegexNode.AnyEndZ);
else
AddUnitType(UseOptionM() ? RegexNode.Eol : RegexNode.EndZ);
break;

case '.':
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephentoub regarding "should it not also change the newlines that . maps to?", does this look right, assuming that we are allowing '\n', '\r' and '\r\n' ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Unicode spec I linked recommends that . can match \r\n as if it was a single character:

Where the 'arbitrary character pattern' matches a newline sequence, it must match all of the newline sequences, and \u{D A} (CRLF) should match as if it were a single character. (The recommendation that CRLF match as a single character is, however, not required for conformance to RL1.6.)

To achieve that is more complicated than the [^\r\n]. It suggests this (including all the newline characters):

(?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}]

Copy link
Member

@danmoseley danmoseley Jan 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern above is missing a paren. I think in .NET that would be (?:\r\n}|(?!\r\n)[\r-\n\u0085\u2028\u2029])

(edited -- the hyphen means that 0A through 0D are individually all valid newlines)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do other implementations follow this standard in general?

As noted in https://www.regular-expressions.info/dot.html (section "Line Break Characters") -- it varies. According to this, Java is missing \f (formfeed - \u0C) for example. It does seem that Unicode is a safe standard to choose from.

tl;dr: Yes, agree, TR18 seems perfect:

  1. Nobody can say .NET made an arbitrary, quick decision and missed something obvious
  2. Easier for new C# developers to on-board from other languages

Thanks for finding this standard. I was unaware of it, despite reading Freidl's Mastering Regular Expressions. I searched through my PDF copy of the book, and he doesn't mention TR18. I guess his book was last updated in 2006, and TR18 was published in 2000, so maybe it's taken time for it to become popular. Amusingly, Oracle's java.util.regex documentation does not mention TR18 but instead says "Go read Mastering Regular Expressions for help using this library" 🤣

Related Documentation

An excellent tutorial and overview of regular expressions is Mastering Regular Expressions, Jeffrey E. F. Friedl, O'Reilly and Associates, 1997.

Jokes aside, Rust's regex 1.1.0 crate, RogueWave's Internationalization module, and the eponymous ICU Regular Expression Engine all support TR18 Level 1, at minimum. The package in the R CRAN repository, stringi, internally uses C bindings to the ICU C Regex Engine. It looks like Python 3.7 considered replacing the re module with a regex module that supports ICU as well.

In looking through the .NET implementation, it seems like the one feature necessary for full TR18 compliance would be "character class intersection". That should be a separate issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In looking through the .NET implementation, it seems like the one feature necessary for full TR18 compliance would be "character class intersection". That should be a separate issue.

There's more. Eg., we are also missing symmetric difference ([a-g~~b-h] ), and it looks like we are missing some properties as well, e.g I noticed \p{LC} is (accidentally?) missing. I only looked briefly though, I think it would be worth doing a more careful look and opening a separate issue(s) for discrepancies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incidentally (?:\r\n}|(?!\r\n)[\r-\n\u0085\u2028\u2029]) will presumably bypass Stephen's ASCII optimizations to character class matching, even though the pattern might be otherwise ASCII. That might justify some tuning later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will presumably bypass Stephen's ASCII optimizations to character class matching

Most of it will still apply, e.g. we'll still generate a fast lookup table for ASCII, and only fall back to the slower path if the actual character being examined isn't ASCII. And it'll only happen if you opt-in to this AnyNewLine feature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and also: The Level 1 of the specification seems most solid, whereas Level 2 and Level 3 are undergoing change. Level 3 is proposed to be completely removed in the next version, subject to vendor feedback. They're basically asking, "Is this too crazy for anyone to implement?"

They've seemed to have taken feedback from EcmaScript Tc39 as some basic guideline on what's reasonable, as part of JavaScript's major contribution to humanity in the last decade: making sure you can Regex match emojis via RegEx character class:

const reRgiEmoji = /\p{RGI_Emoji}/u;

@danmoseley
Copy link
Member

danmoseley commented Jan 13, 2020

I hadn't realized and this wasn't mentioned before, but there is a Unicode spec specifically regarding regular expressions and has a section on newline treatment:

http://www.unicode.org/reports/tr18/#RL1.6
It says:

RL1.6 Line Boundaries
  To meet this requirement, if an implementation provides for line-boundary testing, it shall recognize not only CRLF, LF, CR, but also NEL (U+0085), PARAGRAPH SEPARATOR (U+2029) and LINE SEPARATOR (U+2028).

Re dot, https://www.regular-expressions.info/dot.html describes some of the varying support of newlines in different engines -- varying combinations of the above list of 6.

(Incidentally it also describes \N which is "dot not affected by single line mode". Normally in single line mode, dot will match line breaks; \N will not. This is not listed in RL1.6 though. It mentions \R)

What this means for me is -- it makes sense to keep \r in since Unicode recommends it -- and at some future point, we may want to add support for \u0085, \u2029 and \u2028 as well. (It might be interesting to read this RL1.6 entirely to compare with our implementation in general.)

@stephentoub
Copy link
Member

at some future point

Do other implementations follow this standard in general?

Seems to me like if we're adding "AnyNewLine" now, we should do our best now to make it as correct as possible. It would be a breaking change in the future to augment the meaning of that with additional characters.

@danmoseley
Copy link
Member

Sounds fine to me.

@danmoseley
Copy link
Member

Do other implementations follow this standard in general?

As noted in https://www.regular-expressions.info/dot.html (section "Line Break Characters") -- it varies. According to this, Java is missing \f (formfeed - \u0C) for example. It does seem that Unicode is a safe standard to choose from.

@danmoseley
Copy link
Member

@shishirchawla I'm guessing @jzabroski might be willing to help push to this branch to help get this PR over the line, if you've set permissions appropriately.

@jzabroski
Copy link
Contributor

@shishirchawla I'm sorry I've tormented you. I hate seeing geniuses wait to see their invention unfold to success. Happy to help bring it to life for you.

@shishirchawla
Copy link
Author

@danmosemsft @jzabroski Sorry about the late response, I am actually out traveling and it would be a bit hard for me to update this in the next couple of days, so please feel free to make any changes and complete the PR. I can pick it up again later if required.

@danmoseley
Copy link
Member

@jzabroski can you successfully push to this branch? If not - I think @shishirchawla has to check the box - or you can continue work in a new PR

@jzabroski
Copy link
Contributor

@jzabroski can you successfully push to this branch? If not - I think @shishirchawla has to check the box - or you can continue work in a new PR

You're just asking me to add:

  1. TR18-style AnyNewLine
  2. Update UseOptionA() documentation to describe it supports TR18-style line endings
  3. Historically, . (unless SingleLine is set) has basically been the equivalent of [^\n]. If AnyNewLine is set, . should be able to match \r\n as a single character. The pattern should be implemented as (?:\r\n}|(?!\r\n)[\r-\n\u0085\u2028\u2029])
  4. New tests to cover TR18-style line endings

Additionally, it is worth disclosing that although the rust regex 1.1.0 crate claims to implement TR18 Level 1, the code suggests otherwise, at least as far as R1.6 "Unicode New Line" is concerned: https://github.com/rust-lang/regex/blob/master/src/compile.rs#L271-L286 - it uses just \n

The only code that likely actually implements TR18 Level 1 R1.6 "Unicode New Line" is maybe ICU's C library. icu4j does not implement this, as far as I can tell. It instead implements a class called UnicodeSet, which is very fast but lacks backreference and doesn't seem to handle line endings at all (if it does, I missed where in the code it does this).

@danmoseley
Copy link
Member

I think so. I'd like @stephentoub to confirm, since he's got regex top of mind at the moment, and I'm sure we'd all like to avoid asking you guys for yet more iterations 😃

@danmoseley
Copy link
Member

although the rust regex 1.1.0 crate claims to implement TR18 Level 1, the code suggests otherwise, at least as far as R1.6 "Unicode New Line" is concerned:

I see in rust-lang/regex#244 they don't have a plan to support \r\n. I guess this would affect ripgrep as well (?). I understand that VS Code uses ripgrep and apparently they work around it by normalizing \n to \r?\n (just speculation, I didn't really look, but I did verify that $ seems to match \r\n successfully)
https://github.com/microsoft/vscode/blob/ee0960b25bb95d129c638e2f2781282ab9e60793/src/vs/workbench/services/search/node/ripgrepTextSearchEngine.ts#L533-L536

@jzabroski
Copy link
Contributor

@@ -243,6 +243,7 @@ public enum RegexOptions
RightToLeft = 64,
ECMAScript = 256,
CultureInvariant = 512,
AnyNewLine = 1024,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shishirchawla Please make sure to get this new Enum documented, either with triple slash comments on top of the implementation (src file) or directly in the dotnet-api-docs repo.

@stephentoub
Copy link
Member

I'd like @stephentoub to confirm

Sounds right.

@jzabroski, should we close this PR and you can open a new one when you're ready?

@shishirchawla, thanks for working on this.

@jzabroski
Copy link
Contributor

@stephentoub Yes, I think that's fair - let's keep things tidy & close it, as I imagine it adds onerous burden to see open PRs not getting closed. I tried to get to it this weekend but spent ~4 hours troubleshooting a build issue I was expecting to resolve in 2 minutes. :(

@stephentoub
Copy link
Member

Thanks. Sorry to hear you're having build issues. Please open an issue if they continue and we can get you unblocked. @ViktorHofer

@jzabroski
Copy link
Contributor

Thanks. Sorry to hear you're having build issues. Please open an issue if they continue and we can get you unblocked. @ViktorHofer

Thanks - issue is not with this project, but with FluentMigrator project I co-maintain. I pinged rainer for help. Works in Visual Studio, not in JetBrains Rider.

@@ -1764,6 +1778,7 @@ private static RegexOptions OptionFromCode(char ch)
'd' => RegexOptions.Debug,
#endif
'e' => RegexOptions.ECMAScript,
'a' => RegexOptions.AnyNewLine,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed that the original proposal didn't suggest we add an inline option for AnyNewLine - it's reasonable to do but we'd need tests for it. It looks like we don't have a great set of tests for existing inline regex options other than just setting them once.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants