-
Notifications
You must be signed in to change notification settings - Fork 543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Aspire dashboard sometimes doesn't start but the app host boots successfully #2198
Comments
It looks like after CMD + C on a If I wait the 15 seconds, I can't seem to repro the issue. I'm using a MacBook with an M2 Max so this delay seems excessive. CleanShot.2024-02-13.at.11.51.16.mp4Figure: I am running |
@karolz-ms are there any protections in DCP against DCP still being in the process of shutting down child processes whilst a new one is coming up (leading to unexpected failures)? |
... and note that whilst this appears to be dashboard related, it might just be a general issue that the dashboard happens to suffer from because we actively ping the endpoint to make sure it is alive before continuing. It should eventually timeout. |
cc @karolz-ms |
I've managed to repro this in our repository by switching to using the workload code path for launching the dashboard (as opposed to launching the dashboard as a project resource). When you hit CTRL-F5 the old AppHost process gets shut down and relaunched immediately, and it seems to take some time for the old DCP process to shutdown the old dashboard. In the meantime the new process comes along and fails to start, and the old one eventually dies too. Here is the error that the dashboard is throwing
|
There is another issue related to this. If we CTRL-F5, the startup logic will take the previous instance as being the dashboard it was launching and return that it started up successfully. VS will launch the browser on the previous instance and establish a Blazor circuit. Then the underlying process will disappear and the new endpoint will become available - but the old circuit could not be re-established ... so you then have to refresh the page for the UI to work again instead of it automatically reconnecting. |
@BrennanConroy can you take this one on? I'm not sure what the solution is going to be whether it makes sense for us to take over process launch. I think the criteria for success here probably looks like:
We'll need to figure out which Aspire.Dashboard process to kill because there might be multiple instances open on a machine and we only want to kill the one that is blocking our new dashboard process starting. |
cc @drewnoakes for awareness on the dashboard side. |
A hacky fix is to put a retry in the dashboard startup. If the port is in use, then catch the error, wait a few seconds, then try again. But if the dashboard has this problem, then other apps could as well. When DCP is prompted to start, it could check whether there are left over processes. Processes are stopped completely, bringing the machine to a blank slate, then DCP starts the new app. |
@mitchdenny what about:
This doesn't solve all the problems, but it would mean the app host won't get confused about which dashboard it's talking to. |
@drewnoakes YES we should do this. The problem isn't just the apphost, it's VS connecting to the dashboard port (when the old dashboard is running) and failing. That's also a big problem. We should be able to distinguish 2 things:
That means we need to be able to tell if a shutdown signal was ever sent to the dashboard to begin with. We may have to do more work to be explicit here with dcp and wait for the shutdown signal to be sent to the dashboard (or something similar). To fix VS hitting the same port but using the wrong dashboard, I don't have a solution. Maybe we can punt it... The only thing that comes to mind is having the apphost own the configured dashboard port, but then having the actual dashboard port be random (proxy). That would mean the apphost has another server or endpoint forwarding traffic to the "right" dashboard instance once we resolved it (we would hold the request in the proxy until the dashboard was ready). In this scenario, the dashboard never fails to launch (or rarely does) as its a random port , the apphost is the one that owns the port so when it dies the browser connection dies. Maybe it's overkill? Thoughts? |
@karolz-ms This is why I wanted us to add restart unless stopped behavior. This is something that dcp should just do and aspire.hosting would configure. |
In the video provided by @wicksipedia, DCP takes a LONNNG time to kill the dashboard. Maybe we should get logs to see if graceful shutdown in the dashboard is too graceful in some cases 😄 |
I think this is a pretty good idea. |
I thought this liveliness check was a temporary measure until opentelemetry-dotnet supports retries. It was added to prevent resource telemetry being lost because the dashboard isn't ready. Are we going to keep it after OTEL supports retries? The problem with the liveliness check is it delays other resources starting until after the dashboard starts. |
Shouldn't Kestrel stop accepting new connections on shutdown start? The graceful part is completing existing requests. |
Lets say we decide not to keep this logic. The browser launch problem gets worse.
It does, but we don't know what state the dashboard is in and we can't prevent the race. |
Which problem is that? The dashboard liveliness check is used to prevent other apps starting without the dashboard.
1 & 2 are the same. Is 3 an improvement? It seems like the two things to do are:
|
OK lets explore 1. What does the code change look like? Do we have a try catch around main in the dashboard in a loop? How many times, how many delays? |
Yeah, a try/catch in a loop in main. Delay could start a 0.5 seconds and back off to 5 seconds. Either keep retrying there forever or give up. Aspire host should provide a nice warning in the console that the dashboard isn't starting. The host already has the dashboard console log so maybe that could be printed to the host console to indicate the problem. |
Classic approach to detecting multiple copies of an app running would be to write a pidfile to a known location (a text file with the process ID). If you find an existing pidfile at startup you can (after checking that the process with that PID matches what you expect) terminate it before writing your own pidfile. (Or warn that there’s already a running instance, etc.). |
A lot of this drama would be avoided if the dashboard used a dynamically allocated port. This not only solves the "which dashboard should I be talking to" problem here, but also the inability to run more than one Aspire app at the same time. Would that be something acceptable from UX perspective? Regarding slow shutdown of the previous app instance, it is worth exploring in detail. On Windows DCP pretty much calls We have a work item to add the ability to automatically restart Executables and Containers to DCP, currently scheduled for P4. @dbreshears FYI |
We're merging some PRs to refactor the resource cleanup code upstream in the orchestrator which should help improve how quickly the dashboard goes away on shutdown. |
The dashboard port comes from launchSettings.json and I believe it is randomly generated when the project is created. That is standard behavior for aspnetcore projects. Launching the dashboard with a random port each time wouldn't be a good user experience. |
Fair enough. Somehow I was under impression the dashboard gets the same port for all apps, glad it is not the case |
#2216 has the orchestrator shutdown resources cleanup improvements I was mentioning. |
Given we are talking about the CTRL-F5 scenario here, is it possible that the thing that is stopping the dashboard shutting down be the browser blazor circuit being open to the old dashboard process? Perhaps in the dashboard we should more aggressively poll the resource endpoint and if it doesn't respond (i.e. the AppHost has been killed) then we shutdown? |
One definite issue is that we were likely waiting for containers to stop before we tried shutting down the executable, so depending on the size of the project, that could cause quite a bit of delay. Latest DCP that just merged makes resources types clean up much more in parallel than they did before. |
I wonder if we should have an annotation that allows us to specify a max delay for graceful shutdown before you just yank the process away. For Dashboard we should yank the process away immediately. But this might be useful per resource. |
We've got a work item on the backlog to make the shutdown process configurable on a per-app basis (what signal(s) to try sending, how long to wait, etc.). Basically the same thing you're proposing. |
SignalR kills connections on shutdown. It doesn’t wait for graceful closure (i.e. the browser closing). The dashboard has a long running gRPC call to hosting for resources. It doesn't poll. But losing that connection shouldn’t be a trigger to shutdown. Hosting should tell the dashboard to stop. |
I'd like to make sure we have a clear path here. I see lots of great ideas but we to classify the short, medium and longer term fixes. Before I do that, let me add another "solution". Today we create a service in DCP and then start making HTTP requests which can lead to us hitting the old endpoint. Instead of doing fancy correlation, we can ask the DCP instance for the status of the executable we just created (using watch). We can re-create (restart) without doing it inside of the dashboard code itself. Only after we know the service has shifted to the "running" state we can stop this loop. This is what I assume DCP will do longer term. As for shutdown, we can get the pid back from DCP on startup and do an explicit kill of the dashboard when shutting down Aspire.Hosting (process kill). |
We (@danegsta specifically) did a pass over DCP code and identified a few places where we were slowing down the DCP shutdown process unnecessarily. These changes have been already merged and the Aspire workload should have them. David is still investigating another slow shutdown case that affects non-Windows platforms and that one hopefully also makes it into Aspire P4. All these changes constitute a "short-term" fix but I am fairly confident will go a long way towards addressing this issue. If it turns out more work is necessary, then we will re-assess, but so far I haven't seen anything that would fundamentally prevent us from shutting down the dashboard very quickly when it is time to do so. |
Thanks @karolz-ms! I have a local branch that tries to do what I suggesting using dcp to resolve the health of the dashboard and auto restart if it dies during the startup timeout |
Sweet! Add me to the PR if you decide to go ahead with that change. |
On a related note, I've found that Aspire's own resource cleanup logic can get in the way of successfully cleaning up all the Executables on shutdown: #2324 |
@davidfowl the dashboard shutdown experience is significantly improved with the latest orchestrator insertions and the shutdown fixes from #2324 the the point where I can't reproduce this behavior anymore on my system (either on a simple web app repo or with eShop clobbered into running preview 4 bits). I can spam ctrl+F5 over and over and haven't had any failures on subsequent launches. |
Is this also why I can't reproduce it anymore 😄 |
Quite possibly; #2324 also fixed the locked files error on shutdown. I think most of our shutdown bugs are fixed at this point. |
Nice. I'd still like to take a look at end-to-end shutdown times myself. Maybe add a stopwatch + logging so there is a simple way to observe how long hosting and the dashboard take to shut down. |
Easy option may be to turn on orchestrator diagnostics logging with |
There is more than DCP. For example, Kestrel attempts to gracefully close outstanding connections. And DisposeAsync runs on some services. Want to measure everything. How long between shutdown start and shutdown finished in these apps. |
Moving this to p5 since we made some progress here. I still want to make sure the communication between the apphost and the dashboard and dcp are very reliable. |
Closing in favor of #2422 |
@wicksipedia found this in the .NET Aspire workshop in NDC Sydney. I can reproduce the same with repeated Ctrl+F5 attempts on the same project. It feels like a race condition. This showed up in 2 ways that I've seen:
My theory is this:
The text was updated successfully, but these errors were encountered: