Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aspire dashboard sometimes doesn't start but the app host boots successfully #2198

Closed
davidfowl opened this issue Feb 13, 2024 · 45 comments
Closed
Assignees
Labels
area-app-model Issues pertaining to the APIs in Aspire.Hosting, e.g. DistributedApplication area-dashboard

Comments

@davidfowl
Copy link
Member

davidfowl commented Feb 13, 2024

@wicksipedia found this in the .NET Aspire workshop in NDC Sydney. I can reproduce the same with repeated Ctrl+F5 attempts on the same project. It feels like a race condition. This showed up in 2 ways that I've seen:

  • The apphost is running normally, but the dashboard doesn't run
  • The apphost doesn't start because the dashboard failed to start

My theory is this:

  • App host boots up, launches the dcp and the dashboard (via dcp) on a fixed port P
  • Kill the app host which kills, dcp, which starts to kill the services (including the dashboard)
  • boot up a new app host, it boots the new dashboard but fails because port P is in use
  • The app host it pings the dashboard url on port P, successfully returns because the old dcp didn't kill the dashboard yet
  • The apphost boots up assuming the dashboard is fine
@wicksipedia
Copy link

wicksipedia commented Feb 13, 2024

Update: looks like the windows to repro is more like <= 5 seconds... regardless of whether the Aspire processes are running

It looks like after CMD + C on a dotnet run command - it can take about 15 seconds for the Aspire dashboard + orchestrator processes to end!

If I wait the 15 seconds, I can't seem to repro the issue.

I'm using a MacBook with an M2 Max so this delay seems excessive.

CleanShot.2024-02-13.at.11.51.16.mp4

Figure: I am running clear && ps auxww | grep Aspire to check if Aspire is running

@mitchdenny
Copy link
Member

I wonder if this might explain some of the flakiness, we've seen in the test runs.

/cc @radical @eerhardt

@mitchdenny
Copy link
Member

@karolz-ms are there any protections in DCP against DCP still being in the process of shutting down child processes whilst a new one is coming up (leading to unexpected failures)?

@mitchdenny
Copy link
Member

... and note that whilst this appears to be dashboard related, it might just be a general issue that the dashboard happens to suffer from because we actively ping the endpoint to make sure it is alive before continuing.

It should eventually timeout.

@davidfowl davidfowl added area-app-model Issues pertaining to the APIs in Aspire.Hosting, e.g. DistributedApplication and removed untriaged labels Feb 13, 2024
@davidfowl davidfowl added this to the preview 4 (Mar) milestone Feb 13, 2024
@davidfowl
Copy link
Member Author

cc @karolz-ms

@mitchdenny
Copy link
Member

I've managed to repro this in our repository by switching to using the workload code path for launching the dashboard (as opposed to launching the dashboard as a project resource). When you hit CTRL-F5 the old AppHost process gets shut down and relaunched immediately, and it seems to take some time for the old DCP process to shutdown the old dashboard.

In the meantime the new process comes along and fails to start, and the old one eventually dies too. Here is the error that the dashboard is throwing

System.IO.IOException: Failed to bind to address https://127.0.0.1:15887: address already in use.
 ---> Microsoft.AspNetCore.Connections.AddressInUseException: Only one usage of each socket address (protocol/network address/port) is normally permitted.
 ---> System.Net.Sockets.SocketException (10048): Only one usage of each socket address (protocol/network address/port) is normally permitted.
   at System.Net.Sockets.Socket.UpdateStatusAfterSocketErrorAndThrowException(SocketError error, Boolean disconnectOnFailure, String callerName)
   at System.Net.Sockets.Socket.DoBind(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.Bind(EndPoint localEP)
   at Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.SocketTransportOptions.CreateDefaultBoundListenSocket(EndPoint endpoint)
   at Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.SocketConnectionListener.Bind()
   --- End of inner exception stack trace ---
   at Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.SocketConnectionListener.Bind()
   at Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.SocketTransportFactory.BindAsync(EndPoint endpoint, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.TransportManager.BindAsync(EndPoint endPoint, ConnectionDelegate connectionDelegate, EndpointConfig endpointConfig, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.KestrelServerImpl.<>c__DisplayClass28_0`1.<<StartAsync>g__OnBind|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.AddressBinder.BindEndpointAsync(ListenOptions endpoint, AddressBindContext context, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.AddressBinder.BindEndpointAsync(ListenOptions endpoint, AddressBindContext context, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.LocalhostListenOptions.BindAsync(AddressBindContext context, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.AddressBinder.EndpointsStrategy.BindAsync(AddressBindContext context, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.AddressBinder.BindAsync(ListenOptions[] listenOptions, AddressBindContext context, Func`2 useHttps, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.KestrelServerImpl.BindAsync(CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.KestrelServerImpl.StartAsync[TContext](IHttpApplication`1 application, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Hosting.GenericWebHostService.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>b__15_1(IHostedService service, CancellationToken token)
   at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)
   at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
   at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
   at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.Run(IHost host)
   at Aspire.Dashboard.DashboardWebApplication.Run() in C:\Code\aspire\src\Aspire.Dashboard\DashboardWebApplication.cs:line 137
   at Program.<Main>$(String[] args) in C:\Code\aspire\src\Aspire.Dashboard\Program.cs:line 8

@mitchdenny
Copy link
Member

There is another issue related to this. If we CTRL-F5, the startup logic will take the previous instance as being the dashboard it was launching and return that it started up successfully. VS will launch the browser on the previous instance and establish a Blazor circuit. Then the underlying process will disappear and the new endpoint will become available - but the old circuit could not be re-established ... so you then have to refresh the page for the UI to work again instead of it automatically reconnecting.

@mitchdenny
Copy link
Member

@BrennanConroy can you take this one on? I'm not sure what the solution is going to be whether it makes sense for us to take over process launch. I think the criteria for success here probably looks like:

  1. The new Aspire.Dashboard process launches successfully and starts listening on the assigned port.
  2. The old Aspire.Dashboard process dies.
  3. Any browser tabs that are open successfully re-establish a blazor circuit and aren't in a failed state.

We'll need to figure out which Aspire.Dashboard process to kill because there might be multiple instances open on a machine and we only want to kill the one that is blocking our new dashboard process starting.

@davidfowl
Copy link
Member Author

cc @drewnoakes for awareness on the dashboard side.

@JamesNK
Copy link
Member

JamesNK commented Feb 13, 2024

A hacky fix is to put a retry in the dashboard startup. If the port is in use, then catch the error, wait a few seconds, then try again.

But if the dashboard has this problem, then other apps could as well.

When DCP is prompted to start, it could check whether there are left over processes. Processes are stopped completely, bringing the machine to a blank slate, then DCP starts the new app.

@drewnoakes
Copy link
Member

There is another issue related to this. If we CTRL-F5, the startup logic will take the previous instance as being the dashboard it was launching and return that it started up successfully.

@mitchdenny what about:

  1. App host creates a session GUID when it launches.
  2. When the app host launches the dashboard, it passes the session GUID as an environment variable.
  3. We create a new HTTP endpoint on the dashboard that returns this session ID.
  4. When the app host is checking for the dashboard to become ready, it checks for the expected session ID rather than just testing for HTTP readiness.

This doesn't solve all the problems, but it would mean the app host won't get confused about which dashboard it's talking to.

@davidfowl
Copy link
Member Author

davidfowl commented Feb 13, 2024

@drewnoakes YES we should do this.

The problem isn't just the apphost, it's VS connecting to the dashboard port (when the old dashboard is running) and failing. That's also a big problem.

We should be able to distinguish 2 things:

  1. If you run the same apphost project multiple times without killing the previous, the new one should fail (port conflict).
  2. If you run a new apphost instance after shutting down the old one but there's a delay, the new one wins.

That means we need to be able to tell if a shutdown signal was ever sent to the dashboard to begin with. We may have to do more work to be explicit here with dcp and wait for the shutdown signal to be sent to the dashboard (or something similar).

To fix VS hitting the same port but using the wrong dashboard, I don't have a solution. Maybe we can punt it... The only thing that comes to mind is having the apphost own the configured dashboard port, but then having the actual dashboard port be random (proxy). That would mean the apphost has another server or endpoint forwarding traffic to the "right" dashboard instance once we resolved it (we would hold the request in the proxy until the dashboard was ready).

image

In this scenario, the dashboard never fails to launch (or rarely does) as its a random port , the apphost is the one that owns the port so when it dies the browser connection dies.

Maybe it's overkill?

Thoughts?

@davidfowl
Copy link
Member Author

@JamesNK

A hacky fix is to put a retry in the dashboard startup. If the port is in use, then catch the error, wait a few seconds, then try again.

@karolz-ms This is why I wanted us to add restart unless stopped behavior. This is something that dcp should just do and aspire.hosting would configure.

@davidfowl
Copy link
Member Author

In the video provided by @wicksipedia, DCP takes a LONNNG time to kill the dashboard. Maybe we should get logs to see if graceful shutdown in the dashboard is too graceful in some cases 😄

@mitchdenny
Copy link
Member

I think this is a pretty good idea.

@JamesNK
Copy link
Member

JamesNK commented Feb 13, 2024

When the app host is checking for the dashboard to become ready, it checks for the expected session ID rather than just testing for HTTP readiness.

I thought this liveliness check was a temporary measure until opentelemetry-dotnet supports retries. It was added to prevent resource telemetry being lost because the dashboard isn't ready.

Are we going to keep it after OTEL supports retries? The problem with the liveliness check is it delays other resources starting until after the dashboard starts.

@JamesNK
Copy link
Member

JamesNK commented Feb 13, 2024

In the video provided by @wicksipedia, DCP takes a LONNNG time to kill the dashboard. Maybe we should get logs to see if graceful shutdown in the dashboard is too graceful in some cases 😄

Shouldn't Kestrel stop accepting new connections on shutdown start? The graceful part is completing existing requests.

@davidfowl
Copy link
Member Author

davidfowl commented Feb 13, 2024

I thought this liveliness check was a temporary measure until opentelemetry-dotnet supports retries. It was added to prevent resource telemetry being lost because the dashboard isn't ready.
Are we going to keep it after OTEL supports retries? The problem with the liveliness check is it delays other resources starting until after the dashboard starts.

Lets say we decide not to keep this logic. The browser launch problem gets worse.

Shouldn't Kestrel stop accepting new connections on shutdown start? The graceful part is completing existing requests.

It does, but we don't know what state the dashboard is in and we can't prevent the race.

@JamesNK
Copy link
Member

JamesNK commented Feb 13, 2024

The dashboard problem gets worse.

Which problem is that? The dashboard liveliness check is used to prevent other apps starting without the dashboard.

  1. With the liveliness check: Dashboard error but DCP thinks the dashboard has started. Resources start but there is no dashboard.
  2. Without the liveliness check: Dashboard error. Resources start but there is no dashboard.
  3. With the liveliness check + GUID: The dashboard can't start so nothing starts.

1 & 2 are the same. Is 3 an improvement?

It seems like the two things to do are:

  1. Add a retry mechanism to the dashboard. I agree that doing it in DCP would be best. If the dashboard ever stops then DCP should bring it back up. But we can hack it into the dashboard project in the short term.
  2. Figure out what is taking a long time to shutdown. Be less graceful.

@davidfowl
Copy link
Member Author

OK lets explore 1. What does the code change look like? Do we have a try catch around main in the dashboard in a loop? How many times, how many delays?

@JamesNK
Copy link
Member

JamesNK commented Feb 13, 2024

Yeah, a try/catch in a loop in main. Delay could start a 0.5 seconds and back off to 5 seconds. Either keep retrying there forever or give up.

Aspire host should provide a nice warning in the console that the dashboard isn't starting. The host already has the dashboard console log so maybe that could be printed to the host console to indicate the problem.

@danegsta
Copy link
Member

Classic approach to detecting multiple copies of an app running would be to write a pidfile to a known location (a text file with the process ID). If you find an existing pidfile at startup you can (after checking that the process with that PID matches what you expect) terminate it before writing your own pidfile. (Or warn that there’s already a running instance, etc.).

@karolz-ms
Copy link
Member

A lot of this drama would be avoided if the dashboard used a dynamically allocated port. This not only solves the "which dashboard should I be talking to" problem here, but also the inability to run more than one Aspire app at the same time. Would that be something acceptable from UX perspective?

Regarding slow shutdown of the previous app instance, it is worth exploring in detail. On Windows DCP pretty much calls TerminateProcess() right away on an Executable, so that should be pretty quick, but is sounds like something else is blocking that from happening.

We have a work item to add the ability to automatically restart Executables and Containers to DCP, currently scheduled for P4.

@dbreshears FYI

@danegsta
Copy link
Member

We're merging some PRs to refactor the resource cleanup code upstream in the orchestrator which should help improve how quickly the dashboard goes away on shutdown.

@JamesNK
Copy link
Member

JamesNK commented Feb 13, 2024

A lot of this drama would be avoided if the dashboard used a dynamically allocated port.

The dashboard port comes from launchSettings.json and I believe it is randomly generated when the project is created. That is standard behavior for aspnetcore projects.

Launching the dashboard with a random port each time wouldn't be a good user experience.

@karolz-ms
Copy link
Member

The dashboard port comes from launchSettings.json and I believe it is randomly generated when the project is created. That is standard behavior for aspnetcore projects.

Launching the dashboard with a random port each time wouldn't be a good user experience.

Fair enough. Somehow I was under impression the dashboard gets the same port for all apps, glad it is not the case

@danegsta
Copy link
Member

#2216 has the orchestrator shutdown resources cleanup improvements I was mentioning.

@mitchdenny
Copy link
Member

mitchdenny commented Feb 14, 2024

Given we are talking about the CTRL-F5 scenario here, is it possible that the thing that is stopping the dashboard shutting down be the browser blazor circuit being open to the old dashboard process? Perhaps in the dashboard we should more aggressively poll the resource endpoint and if it doesn't respond (i.e. the AppHost has been killed) then we shutdown?

@danegsta
Copy link
Member

One definite issue is that we were likely waiting for containers to stop before we tried shutting down the executable, so depending on the size of the project, that could cause quite a bit of delay. Latest DCP that just merged makes resources types clean up much more in parallel than they did before.

@mitchdenny
Copy link
Member

I wonder if we should have an annotation that allows us to specify a max delay for graceful shutdown before you just yank the process away. For Dashboard we should yank the process away immediately. But this might be useful per resource.

@danegsta
Copy link
Member

I wonder if we should have an annotation that allows us to specify a max delay for graceful shutdown before you just yank the process away. For Dashboard we should yank the process away immediately. But this might be useful per resource.

We've got a work item on the backlog to make the shutdown process configurable on a per-app basis (what signal(s) to try sending, how long to wait, etc.). Basically the same thing you're proposing.

@JamesNK
Copy link
Member

JamesNK commented Feb 14, 2024

Given we are talking about the CTRL-F5 scenario here, is it possible that the thing that is stopping the dashboard shutting down be the browser blazor circuit being open to the old dashboard process? Perhaps in the dashboard we should more aggressively poll the resource endpoint and if it doesn't respond (i.e. the AppHost has been killed) then we shutdown?

SignalR kills connections on shutdown. It doesn’t wait for graceful closure (i.e. the browser closing).

The dashboard has a long running gRPC call to hosting for resources. It doesn't poll. But losing that connection shouldn’t be a trigger to shutdown. Hosting should tell the dashboard to stop.

@davidfowl
Copy link
Member Author

I'd like to make sure we have a clear path here. I see lots of great ideas but we to classify the short, medium and longer term fixes.

Before I do that, let me add another "solution". Today we create a service in DCP and then start making HTTP requests which can lead to us hitting the old endpoint. Instead of doing fancy correlation, we can ask the DCP instance for the status of the executable we just created (using watch). We can re-create (restart) without doing it inside of the dashboard code itself.

Only after we know the service has shifted to the "running" state we can stop this loop. This is what I assume DCP will do longer term.

As for shutdown, we can get the pid back from DCP on startup and do an explicit kill of the dashboard when shutting down Aspire.Hosting (process kill).

@davidfowl davidfowl assigned davidfowl and unassigned BrennanConroy Feb 16, 2024
@karolz-ms
Copy link
Member

We (@danegsta specifically) did a pass over DCP code and identified a few places where we were slowing down the DCP shutdown process unnecessarily. These changes have been already merged and the Aspire workload should have them. David is still investigating another slow shutdown case that affects non-Windows platforms and that one hopefully also makes it into Aspire P4.

All these changes constitute a "short-term" fix but I am fairly confident will go a long way towards addressing this issue. If it turns out more work is necessary, then we will re-assess, but so far I haven't seen anything that would fundamentally prevent us from shutting down the dashboard very quickly when it is time to do so.

@davidfowl
Copy link
Member Author

Thanks @karolz-ms! I have a local branch that tries to do what I suggesting using dcp to resolve the health of the dashboard and auto restart if it dies during the startup timeout

@karolz-ms
Copy link
Member

Sweet! Add me to the PR if you decide to go ahead with that change.

@danegsta
Copy link
Member

On a related note, I've found that Aspire's own resource cleanup logic can get in the way of successfully cleaning up all the Executables on shutdown: #2324

@danegsta
Copy link
Member

@davidfowl the dashboard shutdown experience is significantly improved with the latest orchestrator insertions and the shutdown fixes from #2324 the the point where I can't reproduce this behavior anymore on my system (either on a simple web app repo or with eShop clobbered into running preview 4 bits). I can spam ctrl+F5 over and over and haven't had any failures on subsequent launches.

@davidfowl
Copy link
Member Author

davidfowl commented Feb 23, 2024

Is this also why I can't reproduce it anymore 😄

@danegsta
Copy link
Member

Quite possibly; #2324 also fixed the locked files error on shutdown. I think most of our shutdown bugs are fixed at this point.

@JamesNK
Copy link
Member

JamesNK commented Feb 23, 2024

Nice.

I'd still like to take a look at end-to-end shutdown times myself. Maybe add a stopwatch + logging so there is a simple way to observe how long hosting and the dashboard take to shut down.

@danegsta
Copy link
Member

Easy option may be to turn on orchestrator diagnostics logging with DCP_DIAGNOSTICS_LOG_LEVEL=debug and check the timestamps from the logs.

@JamesNK
Copy link
Member

JamesNK commented Feb 23, 2024

There is more than DCP. For example, Kestrel attempts to gracefully close outstanding connections. And DisposeAsync runs on some services.

Want to measure everything. How long between shutdown start and shutdown finished in these apps.

@davidfowl
Copy link
Member Author

Moving this to p5 since we made some progress here. I still want to make sure the communication between the apphost and the dashboard and dcp are very reliable.

@davidfowl
Copy link
Member Author

Closing in favor of #2422

@github-actions github-actions bot locked and limited conversation to collaborators Apr 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-app-model Issues pertaining to the APIs in Aspire.Hosting, e.g. DistributedApplication area-dashboard
Projects
None yet
Development

No branches or pull requests

8 participants