diff --git a/docs/diagnostic-logs.md b/docs/diagnostic-logs.md
index 748ee352e..74634d6f5 100644
--- a/docs/diagnostic-logs.md
+++ b/docs/diagnostic-logs.md
@@ -4,17 +4,24 @@
 - [Prerequisites](#prerequisites)
 - [Set up diagnostic logs for an Azure SignalR Service](#set-up-diagnostic-logs-for-an-azure-signalr-service)
   - [Enable diagnostic logs](#enable-diagnostic-logs)
-  - [Diagnostic logs categories](#diagnostic-logs-categories)
   - [Diagnostic logs types](#diagnostic-logs-types)
+    - [Connectivity Logs](#connectivity-logs)
+    - [Messaging Logs](#messaging-logs)
   - [Diagnostic logs collecting behaviors](#diagnostic-logs-collecting-behaviors)
+    - [Collect all](#collect-all)
+      - [Configuration guide](#configuration-guide)
+    - [Collect partially](#collect-partially)
+      - [Diagnostic client](#diagnostic-client)
+      - [Configuration guide](#configuration-guide-1)
   - [Archive to a storage account](#archive-to-a-storage-account)
   - [Archive logs schema for Log Analytics](#archive-logs-schema-for-log-analytics)
 - [Troubleshooting with diagnostic logs](#troubleshooting-with-diagnostic-logs)
-  - [Unexpected connection number changes](#unexpected-connection-number-changes)
-  - [Unexpected connection dropping](#unexpected-connection-dropping)
-  - [Unexpected connection growing](#unexpected-connection-growing)
-  - [Authorization failure](#authorization-failure)
-  - [Throttling](#throttling)
+  - [Connection related issues](#connection-related-issues)
+    - [Unexpected connection number changes](#unexpected-connection-number-changes)
+    - [Authorization failure](#authorization-failure)
+    - [Throttling](#throttling)
+  - [Message related issues](#message-related-issues)
+    - [Message loss](#message-loss)
 - [Get help](#get-help)

 ## Prerequisites

@@ -25,7 +32,7 @@ To enable diagnostic logs, you'll need somewhere to store your log data. This tu

 ## Set up diagnostic logs for an Azure SignalR Service

-You can view diagnostic logs for Azure SignalR Service. These logs provide richer view of connectivity to your Azure SignalR Service instance. The diagnostic logs provide detailed information of every connection. For example, basic information (user ID, connection ID and transport type, etc.) and event information (connect, disconnect and abort event, etc.) of the connection. Diagnostic logs can be used for issue identification, connection tracking and analysis.
+You can view diagnostic logs for Azure SignalR Service. These logs provide a richer view of the connectivity and messaging information for your Azure SignalR Service instance. The diagnostic logs record detailed information for SignalR hub connections and for SignalR hub messages received and sent via SignalR service: for example, basic connection information (user ID, connection ID, transport type, etc.), connection event information (connect, disconnect, and abort events, etc.), and the tracing ID and type of each message. Diagnostic logs can be used for issue identification, connection tracking, message tracing, and analysis.

 ### Enable diagnostic logs

@@ -35,38 +42,113 @@ Diagnostic logs are disabled by default. To enable diagnostic logs, follow these

     ![Pane navigation to diagnostic settings](./images/diagnostic-logs/diagnostic-settings-menu-item.png)

-1. Then click **Add diagnostic setting**.
+1. Then you will see a full view of the diagnostic settings.

-    ![Add diagnostic logs](./images/diagnostic-logs/add-diagnostic-setting.png)
+    ![Diagnostic settings' full view](./images/diagnostic-logs/azure-signalr-diagnostic-settings.png)

-1. Set the archive target that you want. Currently, we support **Archive to a storage account** and **Send to Log Analytics**.
+1. Configure the log source settings.
+    1. In the **Log Source Settings** section, a table shows the collecting behavior for each log type.
+    1. Check each log type that you want to collect for all connections. Otherwise, that log type is collected only for [diagnostic clients](#diagnostic-client).
+1. Configure the log destination settings.
+    1. In the **Log Destination Settings** section, a table displays the existing diagnostic settings. You can click a link in the table to open the log destination and view the collected diagnostic logs.
+    1. In this section, click the button **Configure Log Destination Settings** to add, update, or delete diagnostic settings.
+    1. Click **Add diagnostic setting** to add a new diagnostic setting, or click **Edit** to modify an existing one.
+    1. Set the archive target that you want. Currently, SignalR service supports **Archive to a storage account** and **Send to Log Analytics**.
+    1. Select the logs you want to archive. Only `AllLogs` is available for diagnostic logs; it controls only whether the logs are archived. To configure which log types are generated in SignalR service, use the **Log Source Settings** section instead.
+    ![Diagnostics settings pane](./images/diagnostic-logs/diagnostics-settings-pane.png)
+    1. Save the new diagnostics setting. The new setting takes effect in about 10 minutes. After that, logs are sent to the configured archival target.
+For more information about configuring log destination settings, see the [overview of Azure diagnostic logs](https://docs.microsoft.com/azure/azure-monitor/platform/platform-logs-overview).
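+
+If you prefer scripting over the portal, the same diagnostic setting can be created with the Azure CLI. The following is a sketch, not a definitive command line: the setting name, resource ID, and storage account ID are placeholders you must fill in.
+
+```
+# Create a diagnostic setting that archives all SignalR diagnostic logs to a storage account
+az monitor diagnostic-settings create \
+  --name signalr-diagnostics \
+  --resource "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.SignalRService/SignalR/<resource-name>" \
+  --storage-account "<storage-account-id>" \
+  --logs '[{"category": "AllLogs", "enabled": true}]'
+```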
-1. Select the logs you want to archive.
+### Diagnostic logs types
-    ![Diagnostics settings pane](./images/diagnostic-logs/diagnostics-settings-pane.png)
+Azure SignalR supports two types of logs: connectivity logs and messaging logs.
+#### Connectivity Logs
-1. Save the new diagnostics settings.
+Connectivity logs provide detailed information for SignalR hub connections. For example, basic information (user ID, connection ID and transport type, etc.) and event information (connect, disconnect and abort events, etc.). Therefore, connectivity logs are helpful for troubleshooting connection related issues. For typical connection related troubleshooting guidance, see [connection related issues](#connection-related-issues).
-New settings take effect in about 10 minutes. After that, logs appear in the configured archival target, in the **Diagnostics logs** pane.
+#### Messaging Logs
-For more information about configuring diagnostics, see the [overview of Azure diagnostic logs](../azure-monitor/platform/resource-logs-overview.md).
+Messaging logs provide tracing information for the SignalR hub messages received and sent via SignalR service: for example, the tracing ID and message type of each message. The tracing ID and message type are also logged on the app server. Typically, a message is recorded when it arrives at or leaves the service or the server. Therefore, messaging logs are helpful for troubleshooting message related issues. For typical message related troubleshooting guidance, see [message related issues](#message-related-issues).
-### Diagnostic logs categories
+> This type of log is generated for every message. If messages are sent frequently, messaging logs might impact the performance of SignalR service. However, you can choose different collecting behaviors to minimize the performance impact. See [diagnostic logs collecting behaviors](#diagnostic-logs-collecting-behaviors) below.
-Azure SignalR Service captures diagnostic logs in one category:
+### Diagnostic logs collecting behaviors
-* **All Logs**: Track connections that connect to Azure SignalR Service. The logs Provide infomation about the connect/disconnect, authentication and throttling. For more information, see the next section.
+There are two typical scenarios for using diagnostic logs, especially messaging logs.
-### Diagnostic logs types
-[TODO]
+Some users care about the quality of each message. For example, they want to know whether a message was sent or received successfully, or they want to record every message that is delivered via SignalR service.
-### Diagnostic logs collecting behaviors
-[TODO]
+Meanwhile, others care about performance. They are sensitive to message latency, and sometimes they need to track messages in a few specific connections instead of all connections.
+
+Therefore, SignalR service provides two kinds of collecting behaviors:
+* **collect all**: collect logs in all connections
+* **collect partially**: collect logs in some specific connections
+
+> To distinguish connections that collect logs from those that don't, SignalR service treats some clients as diagnostic clients, based on the diagnostic client configuration of the server and client. Diagnostic logs are always collected for diagnostic clients, while other clients don't collect them. For more details, see the [collect partially section](#collect-partially).
+
+#### Collect all
+
+Diagnostic logs are collected for all connections. Take messaging logs as an example. When this behavior is enabled, SignalR service sends a notification to the server to start generating a tracing ID for each message. The tracing ID is carried in the message to the service, and the service also logs the message with its tracing ID.
+
+> Note that to ensure the performance of SignalR service, the service doesn't await and parse the whole message sent from clients; therefore, client messages are not logged. However, if a client is marked as a diagnostic client, its messages are logged in SignalR service.
+
+##### Configuration guide
+
+To enable this behavior, check the checkbox for the specific log type in the *Types* section of the *Log Source Settings*.
+
+This behavior doesn't require you to update the server-side configuration. The configuration change is always sent to the server automatically.
+
+#### Collect partially
+
+Diagnostic logs are **only** collected for [diagnostic clients](#diagnostic-client). For diagnostic clients, all messages are logged, including client messages and connectivity events.
+
+> The number of diagnostic clients is limited to 100. If the number of diagnostic clients exceeds 100, the excess diagnostic clients are throttled by SignalR service: new clients beyond the limit fail to connect to SignalR service and throw a `System.Net.Http.HttpRequestException` with the message `Response status code does not indicate success: 429 (Too Many Requests)`, while the already connected ones keep working, unaffected by the throttling policy.
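+
+If a client may be rejected this way, you can handle the failure when starting the connection. The following is only a sketch, assuming the ASP.NET Core C# client and a `connection` built as shown in the client side section below:
+
+```
+try
+{
+    await connection.StartAsync();
+}
+catch (HttpRequestException ex) when (ex.Message.Contains("429"))
+{
+    // exceeded the diagnostic client limit; reconnect without the diagnostic marker instead
+}
+```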
+
+##### Diagnostic client
+
+Diagnostic client is a logical concept; any client can be a diagnostic client. The server controls which clients are diagnostic clients. Once a client is marked as a diagnostic client, all diagnostic logs are enabled for this client. To mark a client as a diagnostic client, see the [configuration guide](#configuration-guide-1) below.
+
+##### Configuration guide
+
+To enable this behavior, you need to configure the service side, server side, and client side.
+
+###### Service side
+
+To enable this behavior, uncheck the checkbox for the specific log type in the *Types* section of the *Log Source Settings*.
+
+###### Server side
+
+Also set up `ServiceOptions.DiagnosticClientFilter` to define a filter for diagnostic clients, based on the HTTP context that comes from each client. For example, have the client connect with `?diag=yes` in the hub URL, then set up `ServiceOptions.DiagnosticClientFilter` to check for that query string. If the filter returns `true`, the client is marked as a diagnostic client; otherwise, it stays a normal client. The `ServiceOptions.DiagnosticClientFilter` can be set in your startup class like this:
+
+```
+// sample: mark a client as a diagnostic client when it has the query string "?diag=yes" in the hub URL
+public IServiceProvider ConfigureServices(IServiceCollection services)
+{
+    services.AddMvc();
+    services
+        .AddSignalR()
+        .AddAzureSignalR(o =>
+        {
+            o.ConnectionString = "<connection-string>";
+            // return true to mark the connecting client as a diagnostic client
+            o.DiagnosticClientFilter = context => context.Request.Query["diag"] == "yes";
+        });
+
+    return services.BuildServiceProvider();
+}
+```
+
+###### Client side
+
+Mark the client as a diagnostic client by configuring the HTTP context. For example, the client below is marked as a diagnostic client by adding the query string `diag=yes` to the hub URL (`<hub-url>` is a placeholder for your hub endpoint):
+
+```
+var connection = new HubConnectionBuilder()
+    .WithUrl("<hub-url>?diag=yes")
+    .Build();
+```
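+
+For the ASP.NET (non-Core) SignalR client, a sketch of the equivalent setup passes the query string through the `HubConnection` constructor (`<app-url>` is a placeholder for your application URL):
+
+```
+// ASP.NET SignalR client from the Microsoft.AspNet.SignalR.Client package
+var connection = new HubConnection("<app-url>", "diag=yes");
+```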

 ### Archive to a storage account

-Logs are stored in the storage account that configured in **Diagnostics logs** pane. A container named `insights-logs-alllogs` is created automatically to store diagnostic logs. Inside the container, logs are stored in the file `resourceId=/SUBSCRIPTIONS/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/RESOURCEGROUPS/XXXX/PROVIDERS/MICROSOFT.SIGNALRSERVICE/SIGNALR/XXX/y=YYYY/m=MM/d=DD/h=HH/m=00/PT1H.json`. Basically, the path is combined by `resource ID` and `Date Time`. The log files are splitted by `hour`. Therefore, the minutes always be `m=00`.
+Logs are stored in the storage account that is configured in the **Diagnostics logs** pane. A container named `insights-logs-alllogs` is created automatically to store diagnostic logs. Inside the container, logs are stored in the file `resourceId=/SUBSCRIPTIONS/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/RESOURCEGROUPS/XXXX/PROVIDERS/MICROSOFT.SIGNALRSERVICE/SIGNALR/XXX/y=YYYY/m=MM/d=DD/h=HH/m=00/PT1H.json`. Basically, the path combines the `resource ID` and the `date time`. The log files are split by `hour`; therefore, the minutes segment is always `m=00`.

 All logs are stored in JavaScript Object Notation (JSON) format. Each entry has string fields that use the format described in the following sections.

 Name | Description
 ------- | -------
 time | Log event time
 level | Log event level
 resourceId | Resource ID of your Azure SignalR Service
 location | Location of your Azure SignalR Service
-category | Catagory of the log event
+category | Category of the log event
 operationName | Operation name of the event
 callerIpAddress | IP address of your server/client
-properties | Detailed properties related to this log event. For more detail, see [`Properties Table`](#properties-table)
+properties | Detailed properties related to this log event. For more detail, see the [**properties tables**](#properties-tables) below

-
-**Properties Table**
+
+#### Properties tables

 Name | Description
 ------- | -------
-type | Type of the log event. Currently, we provide information about connectivity to the Azure SignalR Service. Only `ConnectivityLogs` type is available
-collection | Collection of the log event. Allowed values are: `Connection`, `Authorization` and `Throttling`
-connectionId | Identity of the connection
-transportType | Transport type of the connection. Allowed values are: `Websockets` \| `ServerSentEvents` \| `LongPolling`
-connectionType | Type of the connection. Allowed values are: `Server` \| `Client`. `Server`: connection from server side; `Client`: connection from client side
-userId | Identity of the user
-message | Detailed message of log event
+type | Required. Type of the log event. SignalR service provides information about connectivity and messaging in the Azure SignalR Service. Allowed values are `ConnectivityLogs` and `MessagingLogs`
+collection | Required. Collection of the log event. Allowed values are: `Connection`, `Authorization`, `Throttling` and `Message`
+message | Required. Detailed message of log event
+connectionId | Optional. Identity of the connection
+transportType | Optional. Transport type of the connection. Allowed values are: `Websockets` \| `ServerSentEvents` \| `LongPolling`
+connectionType | Optional. Type of the connection. Allowed values are: `Server` \| `Client`. `Server`: connection from server side; `Client`: connection from client side
+userId | Optional. Identity of the user
+messageType | Optional. Type of the message, only available for messages sent from the server. Allowed values are: `BroadcastDataMessage`, `MultiConnectionDataMessage`, `GroupBroadcastDataMessage`, `MultiGroupBroadcastDataMessage`, `UserDataMessage`, `MultiUserDataMessage`, `JoinGroupWithAckMessage` and `LeaveGroupWithAckMessage`
+messageTracingId | Optional. Tracing ID of the message

 The following code is an example of an archive log JSON string:

@@ -123,11 +207,8 @@ The following code is an example of an archive log JSON string:

 To view diagnostic logs, follow these steps:

-1. Click `Logs` in your target Log Analytics.
-
-    ![Log Analytics menu item](./images/diagnostic-logs/log-analytics-menu-item.png)
-
-1. Enter `SignalRServiceDiagnosticLogs` and select time range to query diagnostic logs. For advanced query, please see [Get started with Log Analytics in Azure Monitor](https://docs.microsoft.com/en-us/azure/azure-monitor/log-query/get-started-portal)
+1. Open the Log Analytics workspace that is selected as a log target.
+1. Click `Logs` in your target Log Analytics workspace, then query the `SignalRServiceDiagnosticLogs` table. For advanced queries, see [Get started with Log Analytics in Azure Monitor](https://docs.microsoft.com/azure/azure-monitor/log-query/get-started-portal).

     ![Query log in Log Analytics](./images/diagnostic-logs/query-log-in-log-analytics.png)
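+
+For example, a minimal query (a sketch; adjust the time range and filters to your needs) that lists recent connectivity events:
+
+```
+SignalRServiceDiagnosticLogs
+| where TimeGenerated > ago(1h)
+| where Collection == "Connection"
+| project TimeGenerated, OperationName, ConnectionId, Message
+```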

 Archive log columns include elements listed in the following table:

 Name | Description
 ------- | -------
-TimeGenerated | Log event time
-Collection | Collection of the log event. Allowed values are: `Connection`, `Authorization` and `Throttling`
-OperationName | Operation name of the event
-Location | Location of your Azure SignalR Service
-Level | Log event level
-CallerIpAddress | IP address of your server/client
-Message | Detailed message of log event
-UserId | Identity of the user
-ConnectionId | Identity of the connection
-ConnectionType | Type of the connection. Allowed values are: `Server` \| `Client`. `Server`: connection from server side; `Client`: connection from client side
-TransportType | Transport type of the connection. Allowed values are: `Websockets` \| `ServerSentEvents` \| `LongPolling`
+TimeGenerated | Required. Log event time
+Collection | Required. Collection of the log event. Allowed values are: `Connection`, `Authorization`, `Throttling` and `Message`
+OperationName | Required. Operation name of the event
+Location | Required. Location of your Azure SignalR Service
+Level | Required. Log event level
+Message | Required. Detailed message of log event
+CallerIpAddress | Required. IP address of your server/client
+UserId | Optional. Identity of the user
+ConnectionId | Optional. Identity of the connection
+ConnectionType | Optional. Type of the connection. Allowed values are: `Server` \| `Client`. `Server`: connection from server side; `Client`: connection from client side
+TransportType | Optional. Transport type of the connection. Allowed values are: `Websockets` \| `ServerSentEvents` \| `LongPolling`

 ### Troubleshooting with diagnostic logs

-To troubleshoot for Azure SignalR Service, you can enable server/client side logs to capture failures. At present, Azure SiganlR Service exposes diagnostic logs, you can also enable logs for service side.
+To troubleshoot Azure SignalR Service, you can enable server/client side logs to capture failures. Because Azure SignalR Service exposes diagnostic logs, you can also enable logs on the service side.

+#### Connection related issues

-When encountering connection unexpected growing or dropping situation, you can take advantage of diagnostic logs to troubleshoot.
+When you encounter unexpected connection growth or drops, you can take advantage of connectivity logs to troubleshoot. Typical issues are unexpected changes in connection quantity, connections reaching the connection limit, and authorization failures. See the next sections for how to troubleshoot them.

-#### Unexpected connection number changes
+##### Unexpected connection number changes

-##### Unexpected connection dropping
+###### Unexpected connection dropping

 If you encounter unexpected connections drop, firstly enable logs in service, server and client sides.

@@ -176,28 +259,115 @@ Service reloading, please reconnect | Azure SignalR Service is reloading. Azure
 Internal server transient error | Transient error occurs in Azure SignalR Service, should be auto-recovered
 Server connection dropped | Server connection drops with unknown error, consider self-troubleshooting with service/server/client side log first. Try to exclude basic issues (e.g Network issue, app server side issue, etc.). If the issue isn't resolved, contact us for further help. For more information, see [Get help](get-help) section.

-##### Unexpected connection growing
+###### Unexpected connection growing

 To troubleshoot about unexpected connection growing, the first thing you need to do is filter out the extra connections. You can add unique test user ID to your test client connection. Then verify it in with diagnostic logs, you see more than one client connections have the same test user ID or IP, then it is likely the client side create and establish more connections than expectation. Check your client side.
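+
+If your logs are sent to Log Analytics, a sketch of a query that surfaces such extra connections, counting logged client connection events per user and caller IP (these are event counts, not live connections):
+
+```
+SignalRServiceDiagnosticLogs
+| where Collection == "Connection" and ConnectionType == "Client"
+| summarize ConnectionEvents = count() by UserId, CallerIpAddress
+| order by ConnectionEvents desc
+```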

-#### Authorization failure
+##### Authorization failure

 If you get 401 Unauthorized returned for client requests, check your diagnostic logs. If you encounter `Failed to validate audience. Expected Audiences: <expected audiences>. Actual Audiences: <actual audiences>`, it means your all audiences in your access token is invalid. Try to use the valid audiences suggested in the log.

-#### Throttling
+##### Throttling

 If you find that you cannot establish SignalR client connections to Azure SignalR Service, check your diagnostic logs. If you encounter `Connection count reaches limit` in diagnostic log, you establish too many connections to SignalR Service, which reach the connection count limit. Consider scaling up your SignalR Service. If you encounter `Message count reaches limit` in diagnostic log, it means you use free tier, and you use up the quota of messages. If you want to send more messages, consider changing your SignalR Service to standard tier to send additional messages. For more details, see [Azure SignalR Service Pricing](https://azure.microsoft.com/en-us/pricing/details/signalr-service/).

+#### Message related issues
+
+When you encounter message related problems, you can take advantage of messaging logs to troubleshoot. First, [enable diagnostic logs](#enable-diagnostic-logs) in the service, and enable logs for the server and client.
+
+> For ASP.NET Core, see [here](https://docs.microsoft.com/aspnet/core/signalr/diagnostics) to enable logging in server and client.
+>
+> For ASP.NET, see [here](https://docs.microsoft.com/aspnet/signalr/overview/testing-and-debugging/enabling-signalr-tracing) to enable logging in server and client.
+
+If you don't mind the potential performance impact and don't need to trace messages in the client-to-server direction, check `Messaging` in `Log Source Settings/Types` to enable the *collect-all* collecting behavior. For more information about this behavior, see the [collect all section](#collect-all).
+
+Otherwise, uncheck `Messaging` to enable the *collect-partially* collecting behavior. This behavior requires configuration on the client and server to enable it. For more information, see the [collect partially section](#collect-partially).
+
+##### Message loss
+
+If you encounter message loss problems, the key is to locate where you lose the message. Basically, you have three components when using SignalR service: SignalR service, server, and client. Both server and client are connected to SignalR service, but they are not connected to each other directly once the negotiation is completed. Therefore, we need to consider two directions for messages, and for each direction, two paths:
+
+* From client to server via SignalR service
+  * Path 1: Client to SignalR service
+  * Path 2: SignalR service to server
+* From server to client via SignalR service
+  * Path 3: Server to SignalR service
+  * Path 4: SignalR service to client
+
+![Message path](./images/diagnostic-logs/message-path.png)
+
+For **collect all** collecting behavior:
+
+SignalR service only traces messages in the direction **from server to client via SignalR service**. The tracing ID is generated on the server, and the message carries the tracing ID to SignalR service.
+
+> If you want to trace messages and [send messages from outside a hub](https://docs.microsoft.com/en-us/aspnet/core/signalr/hubcontext) in your app server, you need to enable the **collect all** collecting behavior to collect message logs for the messages that don't originate from diagnostic clients.
+> Diagnostic clients work for both **collect all** and **collect partially** collecting behaviors, and their logs are collected with higher priority. For more information, see the [diagnostic client section](#diagnostic-client).
+
+By checking the logs on the server and service side, you can easily find out whether the message is sent from the server, arrives at SignalR service, and leaves SignalR service. Basically, by checking whether the *received* and *sent* messages match based on the message tracing ID, you can tell whether the message loss issue is in the server or in SignalR service in this direction. For more information, see the details below.
+
+For **collect partially** collecting behavior:
+
+Once you mark a client as a diagnostic client, SignalR service traces messages in both directions.
+
+By checking the logs on the server and service side, you can easily find out whether the message passes the server or SignalR service successfully. Basically, by checking whether the *received* and *sent* messages match based on the message tracing ID, you can tell whether the message loss issue is in the server or in SignalR service. For more information, see the details below.
+
+**Details of the message flow**
+
+For the direction **from client to server via SignalR service**, SignalR service **only** considers invocations originated from diagnostic clients, that is, messages generated directly in a diagnostic client, or service messages generated indirectly by the invocation of a diagnostic client.
+
+The tracing ID is generated in SignalR service once the message arrives at SignalR service in **Path 1**. SignalR service generates a log `Received a message from client connection <connection ID>.` for each message from a diagnostic client. Once the message leaves SignalR service for the server, SignalR service generates a log `Sent a message to server connection <connection ID> successfully.` If you see these two logs, you can be sure that the message passed through SignalR service successfully.
+
+> Due to a limitation of ASP.NET Core SignalR, messages coming from the client don't contain any message-level ID. But ASP.NET SignalR generates an *invocation ID* for each message; you can use it to map to the tracing ID.
+
+Then the message carries the tracing ID to the server in **Path 2**. The server generates a log `Received message from client connection <connection ID>` once the message arrives.
+
+Once the message invokes the hub method on the server, a new service message is generated with a *new tracing ID*. Once the service message is generated, the server generates a log following the template `Start to broadcast/send message ...`; the actual log depends on your scenario. Then the message is delivered to SignalR service in **Path 3**; once the service message leaves the server, a log `Succeeded to send message <message tracing ID>` is generated.
+
+> The tracing ID of the message from the client cannot be mapped to the tracing ID of the service message sent to SignalR service.
+
+Once the service message arrives at SignalR service, a log `Received a message from server connection <connection ID>.` is generated. Then SignalR service processes the service message and delivers it to the target client(s). Once the message is sent to the client(s) in **Path 4**, a log `Sent a message to client connection <connection ID> successfully.` is generated.
+
+In summary, message logs are generated when a message goes in and out of SignalR service and the server. You can use these logs to validate whether the message is lost in these components or not.
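+
+Once you have a message tracing ID from the server log, a sketch of a Log Analytics query that finds the matching service-side entries (this assumes the tracing ID appears in the `Message` text; `<message tracing ID>` is a placeholder):
+
+```
+SignalRServiceDiagnosticLogs
+| where Collection == "Message"
+| where Message contains "<message tracing ID>"
+| project TimeGenerated, OperationName, ConnectionId, Message
+| order by TimeGenerated asc
+```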
+
+Below is a typical message loss issue.
+
+###### A client fails to receive messages in a group
+
+The typical story in this issue is that the client joins a group **after** sending a group message, as in the hub below (the fix is shown after the troubleshooting steps):
+
+```
+class Chat : Hub
+{
+    public void JoinAndSendGroup(string name, string groupName)
+    {
+        Groups.AddToGroupAsync(Context.ConnectionId, groupName); // join group without awaiting
+        Clients.Group(groupName).SendAsync("ReceiveGroupMessage", name, "I'm in group"); // send group message
+    }
+}
+```
+
+For example, someone may invoke *join group* and *send group message* in the same hub method. The problem here is that `AddToGroupAsync` is an `async` method. Because there's no `await` on `AddToGroupAsync`, the group message can be sent before `AddToGroupAsync` completes. Due to network delay and the delay of joining the client to the group, the join group action may complete later than the group message delivery. If so, the first group message won't have any client as a receiver, since no client has joined the group yet, so it becomes a message loss issue.
+
+Without diagnostic logs, you are unable to find out when the client joins the group and when the group message is sent.
+Once you enable messaging logs, you are able to compare the message arrival times in SignalR service. Follow the steps below to troubleshoot:
+1. Find the message logs in the server to find when the client joined the group and when the group message was sent.
+1. Get the message tracing ID A of joining the group and the message tracing ID B of the group message from the message logs.
+1. Filter these message tracing IDs among the messaging logs in your log archive target, then compare their arrival timestamps. You will find which message arrived first in SignalR service.
+1. If message tracing ID A's arrival time is later than B's, then you must be sending the group message **before** the client joins the group. You need to make sure the client is in the group before sending group messages.
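+
+A minimal fix, assuming the same hub shape as above, is to make the hub method `async` and await the group join so the client is in the group before the message is sent:
+
+```
+class Chat : Hub
+{
+    public async Task JoinAndSendGroup(string name, string groupName)
+    {
+        // wait for the join to complete before sending to the group
+        await Groups.AddToGroupAsync(Context.ConnectionId, groupName);
+        await Clients.Group(groupName).SendAsync("ReceiveGroupMessage", name, "I'm in group");
+    }
+}
+```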
+
+If a message gets lost in SignalR service or the server, try to get the warning logs based on the message tracing ID to find the reason. If you need further help, see the [get help section](#get-help).
+
 ### Get help

-We recommend you troubleshoot by yourself first. Most isssues are caused by app server or network issues. Please follow [troubleshooting guide with diagnostic log](#troubleshooting-with-diagnostic-logs) and [basic trouble shooting guide](./tsg.md) to find the root cause.
-If the issue still can't be resolved, then consider open an issue in github or create ticket in Azure Portal.
+We recommend you troubleshoot by yourself first. Most issues are caused by app server or network issues. Please follow the [troubleshooting guide with diagnostic logs](#troubleshooting-with-diagnostic-logs) and the [basic troubleshooting guide](./tsg.md) to find the root cause.
+If the issue still can't be resolved, then consider opening an issue in GitHub or creating a ticket in the Azure portal.
 Please provide:
 1. Time range about 30 minutes when the issue occurs
-2. Azure SignalR Service's resource ID
-3. Issue details, as specifically as possible: e.g. appserver doesn't send messages, client connection drops, etc.
-4. Logs collected from server/client side, and other material that might be useful
-5. [Optional] Repro code
+1. Azure SignalR Service's resource ID
+1. Issue details, as specifically as possible: e.g. app server doesn't send messages, client connection drops, etc.
+1. Logs collected from server/client side, and other material that might be useful
+1. [Optional] Repro code

-> Note: if you open issue in github, keep your sensitive information (e.g. resource ID, server/client logs) private, only send to members in Microsoft organization privately.
+> Note: if you open an issue in GitHub, keep your sensitive information (e.g. resource ID, server/client logs) private; only send it to members of the Microsoft organization privately.
diff --git a/docs/images/diagnostic-logs/azure-signalr-diagnostic-settings.png b/docs/images/diagnostic-logs/azure-signalr-diagnostic-settings.png
new file mode 100644
index 000000000..8279a08c0
Binary files /dev/null and b/docs/images/diagnostic-logs/azure-signalr-diagnostic-settings.png differ
diff --git a/docs/images/diagnostic-logs/message-path.png b/docs/images/diagnostic-logs/message-path.png
new file mode 100644
index 000000000..574ccc3d0
Binary files /dev/null and b/docs/images/diagnostic-logs/message-path.png differ
diff --git a/docs/sharding.md b/docs/sharding.md
index be4ec07f0..d6cda8f9a 100644
--- a/docs/sharding.md
+++ b/docs/sharding.md
@@ -11,6 +11,7 @@ In latest SDK, we add support for configuring multiple SignalR service instances
 * [How to add multiple endpoints from code](#aspnet-code)
 * [How to customize endpoint router](#aspnet-customize-router)
 * [Configuration in cross-geo scenarios](#cross-geo)
+* [Dynamic Scale ServiceEndpoints](#dynamic-scale)
 * [Failover](#failover)

 ## For ASP.NET Core

@@ -112,6 +113,19 @@ private class CustomRouter : EndpointRouterDecorator
 }

+From version 1.6.0, we expose metrics synced from the service side to help with customized routing, for example for load balancing. In the sample below, you can select the endpoint with the fewest client connections:
+
+```cs
+private class CustomRouter : EndpointRouterDecorator
+{
+    public override ServiceEndpoint GetNegotiateEndpoint(HttpContext context, IEnumerable<ServiceEndpoint> endpoints)
+    {
+        return endpoints.OrderBy(x => x.EndpointMetrics.ClientConnectionCount).FirstOrDefault(x => x.Online) // Get the available endpoint with the fewest clients
+            ?? base.GetNegotiateEndpoint(context, endpoints); // Or fall back to the default behavior: randomly select a primary endpoint, or fall back to secondary when no primary one is online
+    }
+}
+```
+
 And don't forget to register the router to DI container using:

 ```cs

@@ -202,6 +216,20 @@ private class CustomRouter : EndpointRouterDecorator
 }
 }

+Another example showing how to select the endpoint with the fewest client connections, supported from version 1.6.0:
+
+```cs
+private class CustomRouter : EndpointRouterDecorator
+{
+    public override ServiceEndpoint GetNegotiateEndpoint(HttpContext context, IEnumerable<ServiceEndpoint> endpoints)
+    {
+        return endpoints.OrderBy(x => x.EndpointMetrics.ClientConnectionCount).FirstOrDefault(x => x.Online) // Get the available endpoint with the fewest clients
+            ?? base.GetNegotiateEndpoint(context, endpoints); // Or fall back to the default behavior: randomly select a primary endpoint, or fall back to secondary when no primary one is online
+    }
+}
+```
+
 And don't forget to register the router to DI container using:

 ```cs

@@ -235,6 +263,18 @@ In cross-geo scenario, when a client `/negotiate` with the app server hosted in

 ![Normal Negotiate](./images/normal_negotiate.png)

+## Dynamic Scale ServiceEndpoints
+
+From version 1.5.0, we enable dynamically scaling ServiceEndpoints, for the ASP.NET Core version first, so you don't have to restart the app server when you need to add or remove a ServiceEndpoint. As ASP.NET Core supports default configuration sources like `appsettings.json` with `reloadOnChange: true`, you don't need to change any code; it's supported by nature.
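+
+For example, here is a sketch of an `appsettings.json` that can be hot-reloaded, assuming the SDK's `Azure:SignalR:ConnectionString:<name>:<type>` configuration convention for named endpoints (the endpoint names and connection strings are placeholders):
+
+```json
+{
+  "Azure:SignalR:ConnectionString": "<default-connection-string>",
+  "Azure:SignalR:ConnectionString:east-region-a:secondary": "<east-region-a-connection-string>"
+}
+```
+
+Adding, changing, or removing such entries while the app is running triggers a configuration reload, and the SDK picks up the new endpoint list without restarting the server.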
+If you'd like to add customized configuration and work with hot reload, please refer to [this document](https://docs.microsoft.com/en-us/aspnet/core/fundamentals/configuration/?view=aspnetcore-3.1).
+
+> Note
+>
+> Considering that the connection set-up time between server/service and client/service may differ slightly, to ensure no message loss during the scale process, we have a staging period that waits for server connections to be ready before opening the new ServiceEndpoint to clients. Usually it takes seconds to complete, and you'll see a log like `Succeed in adding endpoint: '{endpoint}'`, which indicates that the process has completed. But for some unexpected reasons, like a cross-region network issue or configuration inconsistencies on different app servers, the staging period may not finish correctly. Since little can be done during the dynamic scale process, we choose to promote the scale as it is. It's suggested to restart the app server when you find the scaling process not working correctly.
+>
+> The default timeout period for the scale is 5 minutes, and it can be customized by setting the value in [`ServiceOptions.ServiceScaleTimeout`](https://github.com/Azure/azure-signalr/blob/dev/docs/use-signalr-service.md#servicescaletimeout). If you have a lot of app servers, it's suggested to extend the value a little more.

 ## Failover

diff --git a/docs/tsg.md b/docs/tsg.md
index 04d2a3121..a7ec3bf4f 100644
--- a/docs/tsg.md
+++ b/docs/tsg.md
@@ -1,337 +1,3 @@
 # Troubleshooting Guide

-This guidance is to provide useful troubleshooting guide based on the common issues customers encountered and resolved in the past years.
-
-- [Access token too long](#access_token_too_long)
-- [TLS 1.2 required](#tls_1.2_required)
-- [400 Bad Request returned for client requests](#400_bad_request)
-- [401 Unauthorized returned for client requests](#401_unauthorized_returned_for_client_requests)
-- [404 returned for client requests](#random_404_returned_for_client_requests)
-- [404 returned for ASP.NET SignalR's reconnect request](#reconnect_404)
-- [413 returned for REST API requests](#413_rest)
-- [429 Too Many Requests returned for client requests](#429_too_many_requests)
-- [500 Error when negotiate](#500_error_when_negotiate)
-- [Client connection drops](#client_connection_drop)
-- [Client connection increases constantly](#client_connection_increases_constantly)
-- [Server connection drops](#server_connection_drop)
-
-
-## Access token too long
-
-### Possible errors:
-
-1. Client-side `ERR_CONNECTION_`
-2. 414 URI Too Long
-3. 413 Payload Too Large
-4. Access Token must not be longer than 4K. 413 Request Entity Too Large
-
-### Root cause:
-For HTTP/2, the max length for a single header is **4K**, so if you are using browser to access Azure service, you will encounter this limitation with `ERR_CONNECTION_` error.
-
-For HTTP/1.1, or C# clients, the max URI length is **12K**, the max header length is **16K**.
-
-With SDK version **1.0.6** or higher, `/negotiate` will throw `413 Payload Too Large` when the generated access token is larger than **4K**.
-
-### Solution:
-By default, claims from `context.User.Claims` are included when generating JWT access token to **ASRS**(**A**zure **S**ignal**R** **S**ervice), so that the claims are preserved and can be passed from **ASRS** to the `Hub` when the client connects to the `Hub`.
-
-In some cases, `context.User.Claims` are leveraged to store lots of information for app server, most of which are not used by `Hub`s but by other components.
- -The generated access token is passed through the network, and for WebSocket/SSE connections, access tokens are passed through query strings. So as the best practice, we suggest only passing **necessary** claims from the client through **ASRS** to your app server when the Hub needs. - -There is a `ClaimsProvider` for you to customize the claims passing to **ASRS** inside the access token. - -For ASP.NET Core: -```cs -services.AddSignalR() - .AddAzureSignalR(options => - { - // pick up necessary claims - options.ClaimsProvider = context => context.User.Claims.Where(...); - }); -``` - -For ASP.NET: -```cs -services.MapAzureSignalR(GetType().FullName, options => - { - // pick up necessary claims - options.ClaimsProvider = context.Authentication?.User.Claims.Where(...); - }); -``` - - -## TLS 1.2 required - -### Possible errors: - -1. ASP.Net "No server available" error [#279](https://github.com/Azure/azure-signalr/issues/279) -2. ASP.Net "The connection is not active, data cannot be sent to the service." error [#324](https://github.com/Azure/azure-signalr/issues/324) -3. "An error occurred while making the HTTP request to https://. This could be due to the fact that the server certificate is not configured properly with HTTP.SYS in the HTTPS case. This could also be caused by a mismatch of the security binding between the client and the server." - -### Root cause: -Azure Service only supports TLS1.2 for security concerns. With .NET framework, it is possible that TLS1.2 is not the default protocol. As a result, the server connections to ASRS can not be successfully established. - -### Troubleshooting Guide -1. If this error can be repro-ed locally, uncheck *Just My Code* and throw all CLR exceptions and debug the app server locally to see what exception throws. - * Uncheck *Just My Code* - - ![Uncheck Just My Code](./images/uncheck_just_my_code.png) - * Throw CLR exceptions - - ![Throw CLR exceptions](./images/throw_clr_exceptions.png) - * See the exceptions throw when debugging the app server side code: - - ![Exception throws](./images/tls_throws.png) - -2. For ASP.NET ones, you can also add following code to your `Startup.cs` to enable detailed trace and see the errors from the log. -```cs -app.MapAzureSignalR(this.GetType().FullName); -// Make sure this switch is called after MapAzureSignalR -GlobalHost.TraceManager.Switch.Level = SourceLevels.Information; -``` - -### Solution: - -Add following code to your Startup: -```cs -ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12; -``` - - -## 400 Bad Request returned for client requests -### Root cause -Check if your client request has multiple `hub` query string. `hub` is a preserved query parameter and 400 will throw if the service detects more than one `hub` in the query. - - -## 401 Unauthorized returned for client requests -### Root cause -Currently the default value of JWT token's lifetime is 1 hour. - -For ASP.NET Core SignalR, when it is using WebSocket transport type, it is OK. - -For ASP.NET Core SignalR's other transport type, SSE and long-polling, this means by default the connection can at most persist for 1 hour. - -For ASP.NET SignalR, the client sends a `/ping` KeepAlive request to the service from time to time, when the `/ping` fails, the client **aborts** the connection and never reconnect. This means, for ASP.NET SignalR, the default token lifetime makes the connection lasts for **at most** 1 hour for all the transport type. - -### Solution - -For security concerns, extend TTL is not encouraged. 
We suggest adding reconnect logic from the client to restart the connection when such 401 occurs. When the client restarts the connection, it will negotiate with app server to get the JWT token again and get a renewed token. - -Check [here](#restart_connection) for how to restart client connections. - - -## 404 returned for client requests - -For a SignalR persistent connection, it first `/negotiate` to Azure SignalR service and then establishes the real connection to Azure SignalR service. - -### Troubleshooting Guide -1. Following [How to view outgoing requests](#view_request) to get the request from the client to the service. -1. Check the URL of the request when 404 occurs. If the URL is targeting to your web app, and similar to `{your_web_app}/hubs/{hubName}`, check if the client `SkipNegotiation` is `true`. When using Azure SignalR, the client receives redirect URL when it first negotiates with the app server. The client should **NOT** skip negotiation when using Azure SignalR. -1. Another 404 can happen when the connect request is handled more than **5** seconds after `/negotiate` is called. Check the timestamp of the client request, and open an issue to us if the request to the service has a very slow response. - - -## 404 returned for ASP.NET SignalR's reconnect request -For ASP.NET SignalR, when the [client connection drops](#client_connection_drop), it reconnects using the same `connectionId` for 3 times before stopping the connection. `/reconnect` can help if the connection is dropped due to network intermittent issues that `/reconnect` can reestablish the persistent connection successfully. Under other circumstances, for example, the client connection is dropped due to the routed server connection is dropped, or SignalR Service has some internal errors like instance restart/failover/deployment, the connection no longer exists, thus `/reconnect` returns `404`. It is the expected behavior for `/reconnect` and after 3 times retry the connection stops. We suggest having [connection restart](#restart_connection) logic when connection stops. - - -## 413 returned for REST API requests - -413 returns if your request body is larger than 1MB. - -For REST API see [limitation](rest-api.md#Limitation). - - -## 429(Too Many Requests) returned for client requests - -There are two cases. - -### **Concurrent** connection count exceeds limit. - -* For **Free** instances, **Concurrent** connection count limit is 20. -* For **Standard** instances, **concurrent** connection count limit **per unit** is 1K, which means Unit100 allows 100K **concurrent** connections. - -The connections include both client and server connections. -Check [here](https://docs.microsoft.com/en-us/azure/azure-signalr/signalr-concept-messages-and-connections#how-connections-are-counted) for how connections are counted. - -### Too many negotiate requests at the same time. - -We suggest having a random delay before reconnecting, please check [here](#restart_connection) for retry samples. - - -## 500 Error when negotiate: Azure SignalR Service is not connected yet, please try again later. -### Root cause -This error is reported when there is no server connection to Azure SignalR Service connected. - -### Troubleshooting Guide -Please enable server-side trace to find out the error details when the server tries to connect to Azure SignalR Service. 
- -#### Enable server side logging for ASP.NET Core SignalR -Server side logging for ASP.NET Core SignalR integrates with the `ILogger` based [logging](https://docs.microsoft.com/en-us/aspnet/core/fundamentals/logging/?view=aspnetcore-2.1&tabs=aspnetcore2x) provided in the ASP.NET Core framework. You can enable server side logging by using `ConfigureLogging`, a sample usage as follows: -```cs -.ConfigureLogging((hostingContext, logging) => - { - logging.AddConsole(); - logging.AddDebug(); - }) -``` -Logger categories for Azure SignalR always starts with `Microsoft.Azure.SignalR`. To enable detailed logs from Azure SignalR, configure the preceding prefixes to `Debug` level in your **appsettings.json** file like below: -```JSON -{ - "Logging": { - "LogLevel": { - ... - "Microsoft.Azure.SignalR": "Debug", - ... - } - } -} -``` - -#### Enable server side traces for ASP.NET SignalR -When using SDK version >= `1.0.0`, you can enable traces by adding the following to `web.config`: ([Details](https://github.com/Azure/azure-signalr/issues/452#issuecomment-478858102)) -```xml - - - - - - - - - - - - - - - - - - -``` - -## Client connection drops - -When the client is connected to the Azure SignalR, the persistent connection between the client and Azure SignalR can sometimes drop for different reasons. This section describes several possibilities causing such connection drop and provides some guidance on how to identify the root cause. - -### Possible errors seen from the client-side -1. `The remote party closed the WebSocket connection without completing the close handshake` -2. `Service timeout. 30.00ms elapsed without receiving a message from service.` -3. `{"type":7,"error":"Connection closed with an error."}` -4. `{"type":7,"error":"Internal server error."}` - -### Root cause: -Client connections can drop under various circumstances: -1. When `Hub` throws exceptions with the incoming request. -2. When the server connection the client routed to drops, see below section for details on [server connection drops](#server_connection_drop). -3. When a network connectivity issue happens between client and SignalR Service. -4. When SignalR Service has some internal errors like instance restart, failover, deployment, and so on. - -### Troubleshooting Guide -1. Open app server-side log to see if anything abnormal took place -2. Check app server-side event log to see if the app server restarted -3. Create an issue to us providing the time frame, and email the resource name to us - - - -## Client connection increases constantly -It might be caused by improper usage of client connection. If someone forgets to stop/dispose SignalR client, the connection remains open. - -### Possible errors seen from the SignalR's metrics blade -Client connections rise constantly for a long time in Azure SignalR's metrics blade. -![client_connection_increasing_constantly](./images/client_connection_increasing_constantly.jpg) - -### Root cause: -SignalR client connection's `DisposeAsync` never be called, the connection keeps open. - -### Troubleshooting Guide -1. Check if the SignalR client **never** close. - -### Solution -Check if you close connection. Please manually call `HubConnection.DisposeAsync()` to stop the connection after using it. - -For example: - -```C# -var connection = new HubConnectionBuilder() - .WithUrl(...) 
- .Build(); -try -{ - await connection.StartAsync(); - // Do your stuff - await connection.StopAsync(); -} -finally -{ - await connection.DisposeAsync(); -} -``` - -### Common Improper Client Connection Usage - -#### Azure Function Example -This issue often occurs when someone establishes SignalR client connection in Azure Function method instead of making it a static member to your Function class. You might expect only one client connection is established, but you see client connection count increases constantly in metrics blade, all these connections drop only after the Azure Function or Azure SignalR service restarts. This is because for **each** request, Azure Function creates **one** client connection, if you don't stop client connection in Function method, the client keeps the connections alive to Azure SignalR service. - -#### Solution -1. Remember to close client connection if you use SignalR clients in Azure function or use SignalR client as a singleton. -1. Instead of using SignalR clients in Azure function, you can create SignalR clients anywhere else and use [Azure Functions Bindings for Azure SignalR Service](https://github.com/Azure/azure-functions-signalrservice-extension) to [negotiate](https://github.com/Azure/azure-functions-signalrservice-extension/blob/dev/samples/simple-chat/csharp/FunctionApp/Functions.cs#L22) the client to Azure SignalR. And you can also utilize the binding to [send messages](https://github.com/Azure/azure-functions-signalrservice-extension/blob/dev/samples/simple-chat/csharp/FunctionApp/Functions.cs#L40). Samples to negotiate client and send messages can be found [here](https://github.com/Azure/azure-functions-signalrservice-extension/tree/dev/samples). Further information can be found [here](https://github.com/Azure/azure-functions-signalrservice-extension). -1. When you use SignalR clients in Azure function, there might be a better architecture to your scenario. Check if you design a proper serverless architecture. You can refer to [Real-time serverless applications with the SignalR Service bindings in Azure Functions](https://www.nuget.org/packages/Microsoft.Azure.WebJobs.Extensions.SignalRService). - - -## Server connection drops - -When the app server starts, in the background, the Azure SDK starts to initiate server connections to the remote Azure SignalR. As described in [Internals of Azure SignalR Service](internal.md), Azure SignalR routes incoming client traffics to these server connections. Once a server connection is dropped, all the client connections it serves will be closed too. - -As the connections between the app server and SignalR Service are persistent connections, they may experience network connectivity issues. In the Server SDK, we have **Always Reconnect** strategy to server connections. As the best practice, we also encourage users to add continuous reconnect logic to the clients with a random delay time to avoid massive simultaneous requests to the server. - -On a regular basis there are new version releases for the Azure SignalR Service, and sometimes the Azure wide OS patching or upgrades or occasionally interruption from our dependent services. These may bring in a very short period of service disruption, but as long as client-side has the disconnect/reconnect mechanism, the impact is minimal like any client-side caused disconnect-reconnect. - -This section describes several possibilities leading to server connection drop and provides some guidance on how to identify the root cause. 
- -### Possible errors seen from server-side: -1. `[Error]Connection "..." to the service was dropped` -2. `The remote party closed the WebSocket connection without completing the close handshake` -3. `Service timeout. 30.00ms elapsed without receiving a message from service.` - -### Root cause: -Server-service connection is closed by **ASRS**(**A**zure **S**ignal**R** **S**ervice). - -### Troubleshooting Guide -1. Open app server-side log to see if anything abnormal took place -2. Check app server-side event log to see if the app server restarted -3. Create an issue to us providing the time frame, and email the resource name to us - -## Tips - - -* How to view the outgoing request from client? -Take ASP.NET Core one for example (ASP.NET one is similar): - 1. From browser: - - Take Chrome as an example, you can use **F12** to open the console window, and switch to **Network** tab. You might need to refresh the page using **F5** to capture the network from the very beginning. - - ![Chrome View Network](./images/chrome_network.gif) - - 2. From C# client: - - You can view local web traffics using [Fiddler](https://www.telerik.com/fiddler). WebSocket traffics are supported since Fiddler 4.5. - - ![Fiddler View Network](./images/fiddler_view_network.png) - - - -* How to restart client connection? - - Here are the [Sample codes](../samples/) containing restarting connection logic with *ALWAYS RETRY* strategy: - - * [ASP.NET Core C# Client](../samples/ChatSample/ChatSample.CSharpClient/Program.cs#L64) - - * [ASP.NET Core JavaScript Client](../samples/ChatSample/ChatSample/wwwroot/index.html#L164) - - * [ASP.NET C# Client](../samples/AspNet.ChatSample/AspNet.ChatSample.CSharpClient/Program.cs#L78) - - * [ASP.NET JavaScript Client](../samples/AspNet.ChatSample/AspNet.ChatSample.JavaScriptClient/wwwroot/index.html#L71) - +This article has been moved to [here](https://docs.microsoft.com/azure/azure-signalr/signalr-howto-troubleshoot-guide). 
diff --git a/docs/use-signalr-service.md b/docs/use-signalr-service.md index 98ddbd049..072bcfe20 100644 --- a/docs/use-signalr-service.md +++ b/docs/use-signalr-service.md @@ -174,8 +174,8 @@ services.AddSignalR() options.AccessTokenLifetime = TimeSpan.FromDays(1); options.ClaimsProvider = context => context.User.Claims; - option.GracefulShutdown.Mode = GracefulShutdownMode.WaitForClientsClose; - option.GracefulShutdown.Timeout = TimeSpan.FromSeconds(10); + options.GracefulShutdown.Mode = GracefulShutdownMode.WaitForClientsClose; + options.GracefulShutdown.Timeout = TimeSpan.FromSeconds(10); }); ``` diff --git a/samples/ChatSample/ChatSample/Startup.cs b/samples/ChatSample/ChatSample/Startup.cs index b8cb34ccc..9d62184db 100644 --- a/samples/ChatSample/ChatSample/Startup.cs +++ b/samples/ChatSample/ChatSample/Startup.cs @@ -1,6 +1,7 @@ using System; using Microsoft.AspNetCore.Builder; using Microsoft.AspNetCore.Hosting; +using Microsoft.AspNetCore.SignalR; using Microsoft.Azure.SignalR; using Microsoft.Extensions.DependencyInjection; using Microsoft.Extensions.Hosting; @@ -18,7 +19,12 @@ public void ConfigureServices(IServiceCollection services) .AddAzureSignalR(option => { option.GracefulShutdown.Mode = GracefulShutdownMode.WaitForClientsClose; - option.GracefulShutdown.Timeout = TimeSpan.FromSeconds(10); + option.GracefulShutdown.Timeout = TimeSpan.FromSeconds(30); + + option.GracefulShutdown.Add(async (c) => + { + await c.Clients.All.SendAsync("exit"); + }); }) .AddMessagePackProtocol(); } diff --git a/samples/ChatSample/ChatSample/wwwroot/index.html b/samples/ChatSample/ChatSample/wwwroot/index.html index b71043a24..e8a78cc93 100644 --- a/samples/ChatSample/ChatSample/wwwroot/index.html +++ b/samples/ChatSample/ChatSample/wwwroot/index.html @@ -28,7 +28,17 @@

+ + + +