Get rid of the requirement to manually provision `procdump.exe` #2972

KirillOsenkov · 2021-06-30T02:18:39Z

It is absolutely essential that the ability to collect dumps on test host process crashes can be turned on as simple as possible. It's a huge drain on productivity if every user who wants to do this has to go on a wild goose chase to figure out how to deploy it and add it to the path.

Let's find a way to ensure procdump.exe is included in the agent image, and just use that.

The text was updated successfully, but these errors were encountered:

nohwnd · 2021-06-30T10:56:45Z

VSTest task carries procdump with it in both yaml and classic pipeline and allows crash dumps to be collected by selecting "Collect advanced diagnostics in case of catastrophic failures"

(input in yaml: diagnosticsEnabled: true)

Debugging issues with setting up procdump can be difficult. When running your pipeline, select to enable diagnostics. The VSTest task will detect this and enable diagnostic run, and upload your logs.

Later you can download them by clicking on your run, and in upper right corner clicking on ... and selecting Download logs.

All our diag logs have extension .diag, and in datacollector.diag you will find additional information about what exactly happened.

When using dotnet test directly for example for testing locally.

On modern frameworks (.NET 5+) you don't need procdump to collect crash dumps, because that ability is built in natively into the runtime. For older frameworks, e.g. .NET Framework 4.7.2 you do need to install procdump.

To still use dotnet test --blame-crash in azdo, without using the task above you can use chocolatey to install it. It is installing into PATH and we check PATH for procdump.

As you shown in other issue you can install procdump using choco reasonably easy (but make sure to only run it on windows):

#2957 (comment)

- powershell: |
    choco install procdump -y
  displayName: Install Procdump
  condition: and(succeeded(), eq(variables['Agent.OS'], 'Windows_NT'))

Or if you prefer checking OS in powershell directly:

- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: 'if ($IsWindows) { choco install procdump -y }'
    pwsh: true
  displayName: 'Install Procdump'

KirillOsenkov · 2021-07-01T03:22:24Z

On modern frameworks (.NET 5+) you don't need procdump to collect crash dumps, because that ability is built in natively into the runtime. For older frameworks, e.g. .NET Framework 4.7.2 you do need to install procdump.

This is valuable information, it's absolutely not clear from the docs. I shouldn't have messed with procdump since I also updated to 5.0.

Yes I've already figured out the incantations to install Procdump, but unfortunately the process doesn't seem to be crashing when procdump is installed :(

But I understand now that the issue as I filed it doesn't make much sense, because it's unclear whether dotnet test should bring procdump with it. Since the VSTest task already does that, then maybe the right workaround is just to switch to the VSTest task.

KirillOsenkov · 2021-07-01T03:27:32Z

Perhaps the DotNetCoreCLI@2 task should be more equivalent to the VSTest task, including bringing procdump with it.

Maybe that's how we should interpret this issue. Using DotNetCoreCLI@2 with dotnet test should be equivalent to using VSTest.

KirillOsenkov · 2021-07-01T18:54:33Z

The consensus everywhere is to not use procdump, but use the LocalDumps registry key instead:

Don't use procdump to create dumps when a process crashes. It can cause crashes and is less reliable than the built-in windows registry feature for creating dumps when a process crashes.

https://docs.microsoft.com/en-us/visualstudio/ide/how-to-increase-chances-of-performance-issue-being-fixed?view=vs-2019#crashes

https://github.com/dotnet/roslyn/blob/6543c7086f9154477330b01bb43f45d897f36325/eng/test-build-correctness.ps1#L37-L43

https://github.com/dotnet/roslyn/blob/315c2e149ba7889b0937d872274c33fcbfe9af5f/src/Tools/Source/RunTests/ProcDumpUtil.cs#L55-L63

nohwnd · 2021-07-19T10:25:51Z

I think local dumps would work for some scenarios. But what if you have multiple agents on the same system? We definitely don't want to automatically upload dump of testhost and other processes that crashed in a different agent.

There are few pros and cons to localdumps and procdump:

Procdump:

✔ control which process is dumped and where the dump is placed
✔ easy to limit the "scope" of what is dumped
✔ can run in parallel to other test runs
✔ dump on process exit even if there was no exception
✔ good observability because there are logs directly from procdump
❌ must be deployed
❌ has non-trivial performance cost especially with the first chance exception setting
❌ cannot dump all crashing processes

Local dumps:

Imho the inverse of the above.

❌ hard to control which exact process is dumped
❌ hard to limit the scope of what is dumped
❌ affects all parallel runs, and other runs can disable it, or steal our dumps
❌ no option to dump on exit without exception?
❌ difficult to observe without looking into event log
✔ no need to deploy it
✔ small (or 0?) performance cost
✔ can dump any crashing process

Depending on the scenario that is investigated, those pros could be cons, and cons could be pros. When a process is crashing under testhost, and you have a single agent running on the whole system you would be happy to just enable "crash mode", which would collect all dumps, and a lot of stuff from event log.

On the other hand when a build server is running multiple agents, you probably never want to enable all dumps from the system to be collected.

I think the best option here is to continue using procdump, because crash diagnostics are usually enabled when needed, so the more narrow scope is more beneficial here.

In reality there are just a few types of "crashes" that we've seen with testhost:

Critical exceptions, most commonly Access Violation Exception and Stack Overflow exception: Especially when running in the CPP runner, where we call unmanaged code from the managed runner. Those are easy to catch and see from the dump.

The only thing to note is that AVE from unmanaged code gets translated to Null Reference Exception in some cases, which makes it confusing when looking at the log that reports NRE while the dump reports AVE.

.NET (Core) is also much better on spelling out when Stack Overflow happened. The exception shows the whole stack and you don't even need to run it with dumping.

Exits - User asks the process to exit, either via Process.Exit, or via Environment.FailFast. This is harder to diagnose, as we've seen in the case we are investigating. But far from impossible. This definitely needs to be better documented. Useful info is actually in Event log, I know that clould build is collecting that. I am not sure if we should also report it automatically in our pipeline when testhost crash occurs.

The global flags silent exit reporting seem useful for confirming the crash is coming from within the process, but not hugely necessary.

Unhandled exceptions - Managed unhandled exceptions that kill the test host, end up being written into the console, or at least in the log, where we can see them directly.
Hangs - thought these are not crashes, there are a different category of a problem connected to test host run. Those are easy to diagnose with the hang dump collector which does not rely on procdump.

In any case a coprehensive article to describe how to approach this would be nice. And I should write one.

There are also some improvements that could be done:

One improvement that should be done here is that -n 10 (or some other arbitrary number) should be used for the proc dump, because we are looking at first chance exceptions, and procdump only collects the first dump by default. When there is handled AVE in the code, we collect it as dump, but then never collect the actual crashing exception.

Possibly an exit monitor could correlate the exit of process to the dump. Or delete the non-crashing dumps when it sees that the process did not crash after the exception.

A mode for blame data collector, to set procdump to capture unhandled managed exceptions capturing instead of critical errors as we have it now.

nohwnd · 2024-07-08T16:01:22Z

Enabling local dumps via --blame is a new feature and won't be implemented, we are focusing on adding new features to Testing.Platform instead. https://aka.ms/testingplatform

nohwnd mentioned this issue Jun 30, 2021

Document the full arguments and syntax for --blame-crash-dump-type #2957

Closed

KirillOsenkov mentioned this issue Jul 1, 2021

--blame-crash-dump-type is running procdump.exe with wrong arguments #2971

Closed

nohwnd closed this as not planned Won't fix, can't repro, duplicate, stale Jul 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Get rid of the requirement to manually provision `procdump.exe` #2972

Get rid of the requirement to manually provision `procdump.exe` #2972

KirillOsenkov commented Jun 30, 2021

nohwnd commented Jun 30, 2021 •

edited

Loading

KirillOsenkov commented Jul 1, 2021

KirillOsenkov commented Jul 1, 2021

KirillOsenkov commented Jul 1, 2021

nohwnd commented Jul 19, 2021

nohwnd commented Jul 8, 2024 •

edited

Loading

Get rid of the requirement to manually provision procdump.exe #2972

Get rid of the requirement to manually provision procdump.exe #2972

Comments

KirillOsenkov commented Jun 30, 2021

nohwnd commented Jun 30, 2021 • edited Loading

KirillOsenkov commented Jul 1, 2021

KirillOsenkov commented Jul 1, 2021

KirillOsenkov commented Jul 1, 2021

nohwnd commented Jul 19, 2021

nohwnd commented Jul 8, 2024 • edited Loading

Get rid of the requirement to manually provision `procdump.exe` #2972

Get rid of the requirement to manually provision `procdump.exe` #2972

nohwnd commented Jun 30, 2021 •

edited

Loading

nohwnd commented Jul 8, 2024 •

edited

Loading