RegexOptions.Multiline should be same as line by line #40566

Closed
VS-ux opened this issue Aug 8, 2020 · 8 comments
Labels
area-System.Text.RegularExpressions, untriaged (New issue has not been triaged by the area owner)

Comments


VS-ux commented Aug 8, 2020

I think that using RegexOptions.Multiline should give the same result as reading a file line by line.

That is, reading an entire file and then matching it with RegexOptions.Multiline should produce the same matches as reading the file line by line and matching each line.
Here is a code sample:

```csharp
string path = @"PathToFile";
using (StreamReader reader = new StreamReader(path))
{
    //string s = reader.ReadToEnd();
    //int matchCount = Regex.Matches(s, @"^(.+)/([^/]+)$", RegexOptions.Compiled | RegexOptions.Multiline).Count; //I think this should be the same as below:

    int matchCount = 0;
    string s;
    while ((s = reader.ReadLine()) != null)
    {
        matchCount += Regex.Matches(s, @"^(.+)\/([^/]+)$", RegexOptions.Compiled).Count;
    }

    Console.WriteLine(matchCount);
}
```

This is probably related to #25598.
It may be that I'm just doing it completely wrong, and I apologize if this is a very stupid question.

Dotnet-GitSync-Bot added the area-System.Text.RegularExpressions and untriaged labels Aug 8, 2020

ghost commented Aug 8, 2020

Tagging subscribers to this area: @eerhardt, @pgovind
See info in area-owners.md if you want to be subscribed.

Member

danmoseley commented Aug 8, 2020

What are the line endings in your test file? \n or \r\n? If the latter, in the first (commented-out) code snippet above, the $ will match just before the \n that follows each \r (since we don’t have the AnyNewLine setting yet). This means your pattern will capture the \r. In the second snippet, the reader removes all \r and \n, so the pattern would not capture the \r. However, I wouldn’t expect your pattern to match a different number of times in this case.

What result do you see? Can you share a test file?
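
To make the point about the captured \r concrete, here is a minimal sketch using the pattern from the original post. The class name `CrCaptureDemo`, the helper `LastSegment`, and the two-line sample strings are hypothetical and only serve this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CrCaptureDemo
{
    // Returns the second capture group of the first match,
    // with \r and \n rewritten as visible escape sequences.
    public static string LastSegment(string input)
    {
        Match m = Regex.Match(input, @"^(.+)/([^/]+)$", RegexOptions.Multiline);
        return m.Groups[2].Value.Replace("\r", "\\r").Replace("\n", "\\n");
    }

    public static void Main()
    {
        // With \r\n endings, $ matches just before the \n, so the \r ends up in the capture.
        Console.WriteLine(LastSegment("a/b\r\nc/d")); // prints b\r
        // With \n endings there is no \r to capture.
        Console.WriteLine(LastSegment("a/b\nc/d"));   // prints b
    }
}
```

In both cases the first line still matches, which is consistent with the remark that \r\n endings alone should change the captured text but not the number of matches for this pattern.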

danmoseley reopened this Aug 8, 2020
Author

VS-ux commented Aug 8, 2020

@danmosemsft Thank you for your reply!
Reading the entire file at once with RegexOptions.Multiline | RegexOptions.Compiled, I got 125797 matches. Reading the same file line by line with just RegexOptions.Compiled, I got 176948 matches.

The test file I used is actually a random binary file, but the difference can be reproduced with a simple test case:

const string regex = @"^(?:[a-zA-Z]\:|\\\\[\w\.]+\\[\w.$]+)\\(?:[\w]+\\)*\w([\w.])+$";
string s = "C:\\users\\user\\test.txt\r\nC:\\user2\\test2.txt";
int matchCount = Regex.Matches(s, regex, RegexOptions.Compiled | RegexOptions.Multiline).Count;

int matchCountLineByLine = Regex.Matches("C:\\users\\user\\test.txt", regex, RegexOptions.Compiled).Count;
matchCountLineByLine += Regex.Matches("C:\\users\\user2\\test2.txt", regex, RegexOptions.Compiled).Count;

Console.WriteLine(matchCount); // Results in 1. If you swap to \n, we get 2
Console.WriteLine(matchCountLineByLine); // Results in 2

In the "matchCountLineByLine" section, I try to simulate reading the text file line by line, which contains 2 file paths.
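
One possible workaround for the \r\n case in this test, sketched under the assumption that the only goal is to make the counts agree, is to allow an optional \r in front of the end-of-line anchor. The class name `CrTolerantAnchorDemo` and the `CrTolerant` pattern are hypothetical additions; `Original` is the pattern from the test case above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CrTolerantAnchorDemo
{
    // Pattern from the test case above, unchanged.
    public const string Original =
        @"^(?:[a-zA-Z]\:|\\\\[\w\.]+\\[\w.$]+)\\(?:[\w]+\\)*\w([\w.])+$";

    // Same pattern with an optional \r before $ (an assumption of this sketch).
    public const string CrTolerant =
        @"^(?:[a-zA-Z]\:|\\\\[\w\.]+\\[\w.$]+)\\(?:[\w]+\\)*\w([\w.])+\r?$";

    public static int Count(string input, string pattern) =>
        Regex.Matches(input, pattern, RegexOptions.Multiline).Count;

    public static void Main()
    {
        string s = "C:\\users\\user\\test.txt\r\nC:\\user2\\test2.txt";
        // [\w.] cannot consume the \r, and $ only matches before \n, so line 1 fails.
        Console.WriteLine(Count(s, Original));   // prints 1
        // \r? consumes the carriage return, so both lines match.
        Console.WriteLine(Count(s, CrTolerant)); // prints 2
    }
}
```

Note that with the `\r?` variant the overall match value for the first line includes the trailing \r, so it may need trimming if the matched text is used afterwards.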

Author

VS-ux commented Aug 8, 2020

Hi! OK, so I uploaded a System.Memory.dll binary test file that I did some tests on.
Here is the repo:

Here is the code:


using StreamReader reader = new StreamReader(@"PathTo\System.Memory.dll");

string s = reader.ReadToEnd();
int matchCount = Regex.Matches(s, @"^(.+)\/([^/]+)$", RegexOptions.Compiled | RegexOptions.Multiline).Count;

//int matchCount = 0;
//string s;
//int lineCount = 1;
//while ((s = reader.ReadLine()) != null)
//{
//    matchCount += Regex.Matches(s, @"^(.+)\/([^/]+)$", RegexOptions.Compiled | RegexOptions.Singleline).Count;
//    writer.WriteLine(lineCount.ToString());
//    lineCount++;
//}

Console.WriteLine(matchCount);

For the code that is NOT commented out, I get 148 matches. For the code that is commented out (reading line by line), I get 166 matches.

@danmoseley
Member

I suggest you create a fixed test file, then progressively reduce it until it is as small as possible yet still repros the problem; at that point the reason will become clear. You may need a hex editor to see the line endings.

Author

VS-ux commented Aug 8, 2020

@danmosemsft I think the reason is that a binary file like System.Memory.dll contains some lone \r characters that are not part of a \r\n pair. The regex is then unable to treat them as line endings.

@danmoseley
Member

Aha, yes! I see that StreamReader will treat a lone \r as a line ending. That is unusual in real text; I believe it was used by old iMacs. A lone \r will not be matched by ^ and $ (again, until we have AnyNewLine).

I think this problem is explained, can we close the issue now?
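
The explanation can be reproduced with a small sketch. It uses `StringReader`, whose `ReadLine` follows the same line-ending rules as `StreamReader.ReadLine`, and a made-up three-line sample containing one lone \r. The class name `LoneCrDemo`, its helpers, and the newline normalization step are assumptions added for illustration, not something proposed in this thread.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class LoneCrDemo
{
    private const string Pattern = @"^(.+)/([^/]+)$";

    // Counts matches per line; ReadLine splits on \n, \r\n, and a lone \r.
    public static int CountLineByLine(string text)
    {
        int count = 0;
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                count += Regex.Matches(line, Pattern).Count;
            }
        }
        return count;
    }

    // Counts matches over the whole text; ^ and $ only recognize \n.
    public static int CountMultiline(string text) =>
        Regex.Matches(text, Pattern, RegexOptions.Multiline).Count;

    // Rewrites \r\n and lone \r as \n (a hypothetical pre-processing step).
    public static string NormalizeNewlines(string text) =>
        Regex.Replace(text, @"\r\n?", "\n");

    public static void Main()
    {
        string text = "a/b\rc/d\ne/f"; // one lone \r, one \n
        Console.WriteLine(CountLineByLine(text));                   // prints 3
        Console.WriteLine(CountMultiline(text));                    // prints 2
        Console.WriteLine(CountMultiline(NormalizeNewlines(text))); // prints 3
    }
}
```

Normalizing only removes the line-ending difference. Because `[^/]` can itself match \n, this particular pattern may still let a single match span several lines on other inputs, so the two approaches are not guaranteed to agree in general.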

Author

VS-ux commented Aug 8, 2020

@danmosemsft Yes sure.

ghost locked as resolved and limited conversation to collaborators Dec 7, 2020