Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to make ProcessNameMatchesScriptName test more reliable #46292

Closed

Conversation

adamsitnik
Copy link
Member

This very simple test can sometimes fail: #13757

I was not able to reproduce it, but based on the logs attached to the original issue I believe that the GetTestFileName might in theory be returning some non-ascii characters and Linux might not like it, so I've replaced it with const value
image

According to man docs the file might be empty if the process is a zombie (it has already exited but parent process has not acknowledged that yet) so I've added a check for that.

This read-only file holds the complete command line for the
process, unless the process is a zombie. In the latter case,
there is nothing in this file: that is, a read on this file
will return 0 characters.

I am not a Linux expert, but I guess that sleep 600 could become a zombie right away only if there is a custom alias for the sleep command. I don't know if it's possible, but I think that we can just use full path to be 100% sure we are calling the actual /bin/sleep.

Fixes #13757

@tmds is there any chance you could take a look? It's nothing urgent, I just want to get rid of unstable tests from my areas.

@adamsitnik adamsitnik added area-System.Diagnostics.Process test-bug Problem in test source code (most likely) labels Dec 21, 2020
@adamsitnik adamsitnik added this to the 6.0.0 milestone Dec 21, 2020
@ghost
Copy link

ghost commented Dec 21, 2020

Tagging subscribers to this area: @eiriktsarpalis
See info in area-owners.md if you want to be subscribed.

Issue Details

This very simple test can sometimes fail: #13757

I was not able to reproduce it, but based on the logs attached to the original issue I believe that the GetTestFileName might in theory be returning some non-ascii characters and Linux might not like it, so I've replaced it with const value
image

According to man docs the file might be empty if the process is a zombie (it has already exited but parent process has not acknowledged that yet) so I've added a check for that.

This read-only file holds the complete command line for the
process, unless the process is a zombie. In the latter case,
there is nothing in this file: that is, a read on this file
will return 0 characters.

I am not a Linux expert, but I guess that sleep 600 could become a zombie right away only if there is a custom alias for the sleep command. I don't know if it's possible, but I think that we can just use full path to be 100% sure we are calling the actual /bin/sleep.

Fixes #13757

@tmds is there any chance you could take a look? It's nothing urgent, I just want to get rid of unstable tests from my areas.

Author: adamsitnik
Assignees: -
Labels:

area-System.Diagnostics.Process, test bug

Milestone: 6.0.0

@@ -159,9 +159,9 @@ public void ProcessStart_UseShellExecute_OnUnix_SuccessWhenProgramInstalled(bool
[PlatformSpecific(~TestPlatforms.OSX)] // On OSX, ProcessName returns the script interpreter.
public void ProcessNameMatchesScriptName()
{
string scriptName = GetTestFileName();
const string scriptName = nameof(ProcessNameMatchesScriptName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless the underlying filesystem imposes additional restrictions, Linux filenames can contain any bytes except \0 and /.
.NET assumes the bytes are UTF-8.
So, I think the filename is not an issue. If it were, we should make sure GetTestFileName returns a valid filename on Linux instead of doing something specific in this test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree I do'nt see a problem with GetTestFileName(), and if it has a problem, it will be affecting other tests. Either way no change is needed here.


Assert.Equal(scriptName, process.ProcessName);
// cmdLine can sometimes be null: https://github.com/dotnet/runtime/issues/13757
// according to docs, it's when the process is a zombie: https://man7.org/linux/man-pages/man5/proc.5.html
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Assert that fails was added to understand better why ProcessName doesn't end up with the proper value. It verifies some expectations of the .NET implementation.

Can we verify the failing case has a zombie process? We can check /stat for the Z flag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point, will do

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it a bit more.

Probably this test fails when the Assert happens after the script has already exited. .NET will have reaped the process, and everything under /proc is gone.

This happens when sleep doesn't work, or when the asserts are delayed for 10 min. We can safely increase the 10 min timeout. And add an assert to check !process.HasExited.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens when sleep doesn't work

How would this be possible?

when the asserts are delayed for 10 min

this is unlikely, as we just start the process and execute asserts right away

everything under /proc is gone

Based on the CI logs it looks like not everything is gone. The previous line that asserts the content of /proc/{process.Id}/stat works just fine. Based on the available docs the only explanation I was able to find was the zombie process.

}
finally
{
process.Kill();
process.WaitForExit();
if (!process.HasExited)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you add this condition?

string filename = Path.Combine(TestDirectory, scriptName);
File.WriteAllText(filename, $"#!/bin/sh\nsleep 600\n"); // sleep 10 min.
File.WriteAllText(filename, $"#!/bin/sh\n/bin/sleep 600\n"); // sleep 10 min.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for changing the to explicitly using /bin/sleep? I don't think we care how the sleep is implemented.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a Linux expert, but I guess that sleep 600 could become a zombie right away only if there is a custom alias for the sleep command. I don't know if it's possible, but I think that we can just use full path to be 100% sure we are calling the actual /bin/sleep.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know on Alpine it's busybox, not an actual executable file, but it's symlinked to the same place so I guess that's OK:

LAPTOP-P6UJDVTA:~$ ls -l /bin/sleep
lrwxrwxrwx    1 root     root            12 Oct 21 09:23 /bin/sleep -> /bin/busybox

@tmds
Copy link
Member

tmds commented Jan 4, 2021

Adam, thanks for working through flaky tests.
I think the timeout race (#46292 (comment)) is the most likely cause for occasional failures of this test.

@jeffhandley
Copy link
Member

@adamsitnik It looks like this PR is close to ready with just a little feedback left to address.

@danmoseley
Copy link
Member

@adamsitnik looking at history I only see one failure
(@MattGal can you tell me why the query below does not show the failures reported in #13757 (comment) ? )

https://engsrvprod.kusto.windows.net/engineeringdata?query=H4sIAAAAAAAEAG2OQUvDQBCF74H8h7GXKERiUIqXePBQiFBRG%2fC83X022yS7YWZDEfzxZrWYi6dh3vc%2beA0kvEGmPgilyRcdvXXUWWcq6xyY3j13dcAg5N3fU5t%2fu09%2b%2f1Obb21i4dSCQRvrrLQw9FCR86fL69v1jbla%2bBah9YaqilYv7DVEntWArQq6hew02zHEYLUYv4OjkW2U7bOFPLJyuqWLmTA%2bpBinvi%2fu1uV9WQzgAzIqCtorQyNHaWR%2fhA60C4oDTH72c2o%2bR%2bTnZTm9TpgQN5QxE1GHGc6O7hpWGmmSJt%2bi3mERSAEAAA%3d%3d&web=0

TestResults
| join kind=inner WorkItems on WorkItemId
| join kind=inner Jobs on JobId
| where Finished >= now(-360d)
| where Method == "ProcessNameMatchesScriptName"
| where Result == 'Fail'
| where Branch != 'refs/pull/46181/merge' // bad pr
| project Started, Branch, Type, Method, QueueName1, Message, StackTrace

Started Branch Type Method QueueName1 Message StackTrace
2021-01-06 16:38:19.6450000 refs/pull/46634/merge System.Diagnostics.Tests.ProcessTests ProcessNameMatchesScriptName ubuntu.1804.armarch.open Assert.Contains() Failure\nNot found: (ProcessNameMatc)\nIn value: 374 (dotnet) R 23 1 1 0 -1 256 1 0 0 0 0 0 0 0 20 0 1 0 87610 4096 1 18446744073709551615 0 0 0 0 0 0 0 4096 0 0 0 0 17 5 0 0 0 0 0 0 0 0 0 0 0 0 0\n at System.Diagnostics.Tests.ProcessTests.ProcessNameMatchesScriptName() in //src/libraries/System.Diagnostics.Process/tests/ProcessTests.Unix.cs:line 172 at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture) in //src/mono/netcore/System.Private.CoreLib/src/System/Reflection/RuntimeMethodInfo.cs:line 378

This failure is in the first assert -- it's reading /proc/NNN/stat and finding 'dotnet' instead of 'ProcessNameMatchesScriptName_... something'.

In the original bug, the assert was finding blank when reading /proc/NNN/cmdline.

I don't know exactly how the contents of these files change when the process exits, but I wonder whether these are consistent with it having exited, but not all been cleaned up yet. (The slightly surprising thing is we're not seeing eg FileNotFoundException, but then, there's only one failure in the history)

As you point out, it seems hard to imagine that 10 minutes could have elapsed.

I suggest to just add Assert(!process.HasExited) just before the Process.Kill() and possibly Assert(File.Exists("/bin/sleep")) at the start. And call it fixed. If it's not, we will have better info. I don't think other changes are needed.

@MattGal
Copy link
Member

MattGal commented Feb 16, 2021

(@MattGal can you tell me why the query below does not show the failures reported in #13757 (comment) ? )

Taking a look. We did have a volume issue last week hotfixed by doubling the capacity of machines ingesting to Kusto, which would have resulted in delayed ingestion. Not sure this is that though.

@MattGal
Copy link
Member

MattGal commented Feb 16, 2021

@danmoseley Actually, it's obvious no investigation needed. The thing you linked is from 11/7/2019, 467 days ago. We keep a bit more than 120 days of test results in Kusto, and even that costs a fair bit of money. Once results age out they are naturally reclaimed in the data cluster.

Base automatically changed from master to main March 1, 2021 09:07
@carlossanlop
Copy link
Member

Ping @adamsitnik

@jeffhandley jeffhandley marked this pull request as draft April 18, 2021 06:40
@jeffhandley jeffhandley modified the milestones: 6.0.0, Future May 12, 2021
@adamsitnik
Copy link
Member Author

thank you all for great feedback, I am going to close the PR for now and get back to it when I have some free cycles

@adamsitnik adamsitnik closed this May 12, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jun 11, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Diagnostics.Process test-bug Problem in test source code (most likely)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test failure : System.Diagnostics.Tests.ProcessTests.ProcessNameMatchesScriptName
6 participants