When two services come up in Freshping, only one gets resolved in PagerDuty #2

Closed
Aldaviva opened this issue Sep 25, 2024 · 2 comments

Comments

@Aldaviva
Owner

Aldaviva commented Sep 25, 2024

  1. Both HTTP (#11) and SMTP (#10) services went down, probably because of an Internet outage.
  2. FreshPager successfully triggered two incidents in PagerDuty.
  3. Both services came back up at the same time.
  4. While handling the SMTP recovery, FreshPager resolved the HTTP incident, because it used the HTTP incident's deduplication key. It never resolved the SMTP incident.

Logs

Sep 19 01:11:54 erebus systemd[1]: Starting FreshPager...
Sep 19 01:11:56 erebus freshpager[29668]: Microsoft.Hosting.Lifetime[14] Now listening on: http://[::]:37374
Sep 19 01:11:56 erebus freshpager[29668]: Microsoft.Hosting.Lifetime[0] Application started. Hosting environment: Production; Content root path: /opt/freshpager
Sep 19 01:11:56 erebus systemd[1]: Started FreshPager.
Sep 25 07:20:31 erebus freshpager[29668]: Program[0] Aldaviva SMTP is down
Sep 25 07:20:31 erebus freshpager[29668]: Program[0] Aldaviva HTTP is down
Sep 25 07:20:31 erebus freshpager[29668]: System.Net.Http.HttpClient.Default.LogicalHandler[100] Start processing HTTP request POST https://events.pagerduty.com/v2/enqueue
Sep 25 07:20:31 erebus freshpager[29668]: System.Net.Http.HttpClient.Default.LogicalHandler[100] Start processing HTTP request POST https://events.pagerduty.com/v2/enqueue
Sep 25 07:20:31 erebus freshpager[29668]: System.Net.Http.HttpClient.Default.ClientHandler[100] Sending HTTP request POST https://events.pagerduty.com/v2/enqueue
Sep 25 07:20:31 erebus freshpager[29668]: System.Net.Http.HttpClient.Default.ClientHandler[100] Sending HTTP request POST https://events.pagerduty.com/v2/enqueue
Sep 25 07:20:34 erebus freshpager[29668]: System.Net.Http.HttpClient.Default.ClientHandler[101] Received HTTP response headers after 3157.3395ms - 202
Sep 25 07:20:34 erebus freshpager[29668]: System.Net.Http.HttpClient.Default.ClientHandler[101] Received HTTP response headers after 3157.3067ms - 202
Sep 25 07:20:35 erebus freshpager[29668]: System.Net.Http.HttpClient.Default.LogicalHandler[101] End processing HTTP request after 3307.7762ms - 202
Sep 25 07:20:35 erebus freshpager[29668]: System.Net.Http.HttpClient.Default.LogicalHandler[101] End processing HTTP request after 3307.8208ms - 202
Sep 25 07:20:35 erebus freshpager[29668]: Program[0] Triggered alert in PagerDuty for Aldaviva SMTP being down, got deduplication key 81e40b63294248c7bb43b07e6ceae4c9
Sep 25 07:20:35 erebus freshpager[29668]: Program[0] Triggered alert in PagerDuty for Aldaviva HTTP being down, got deduplication key f0cd8daed7b54a8494e4f45ba985e54e
Sep 25 07:20:35 erebus freshpager[29668]: Microsoft.AspNetCore.Http.Result.CreatedResult[1] Setting HTTP status code 201.
Sep 25 07:20:35 erebus freshpager[29668]: Microsoft.AspNetCore.Http.Result.CreatedResult[1] Setting HTTP status code 201.
Sep 25 07:23:48 erebus freshpager[29668]: Program[0] Aldaviva SMTP is available
Sep 25 07:23:48 erebus freshpager[29668]: Program[0] Aldaviva HTTP is available
Sep 25 07:23:48 erebus freshpager[29668]: Program[0] No known PagerDuty alerts for service Aldaviva HTTP, not resolving anything
Sep 25 07:23:48 erebus freshpager[29668]: Microsoft.AspNetCore.Http.Result.CreatedResult[1] Setting HTTP status code 201.
Sep 25 07:23:48 erebus freshpager[29668]: System.Net.Http.HttpClient.Default.LogicalHandler[100] Start processing HTTP request POST https://events.pagerduty.com/v2/enqueue
Sep 25 07:23:48 erebus freshpager[29668]: System.Net.Http.HttpClient.Default.ClientHandler[100] Sending HTTP request POST https://events.pagerduty.com/v2/enqueue
Sep 25 07:23:49 erebus freshpager[29668]: System.Net.Http.HttpClient.Default.ClientHandler[101] Received HTTP response headers after 252.854ms - 202
Sep 25 07:23:49 erebus freshpager[29668]: System.Net.Http.HttpClient.Default.LogicalHandler[101] End processing HTTP request after 253.6792ms - 202
Sep 25 07:23:49 erebus freshpager[29668]: Program[0] Resolved PagerDuty alert for Aldaviva SMTP being up, using deduplication key f0cd8daed7b54a8494e4f45ba985e54e
Sep 25 07:23:49 erebus freshpager[29668]: Microsoft.AspNetCore.Http.Result.CreatedResult[1] Setting HTTP status code 201.
Aldaviva added the bug (Something isn't working) label Sep 25, 2024
Aldaviva self-assigned this Sep 25, 2024
@Aldaviva
Owner Author

Aldaviva commented Sep 25, 2024

The problem is that FreshPager only keeps track of at most one deduplication key per PagerDuty service, but each PagerDuty service can map to more than one Freshping check. As a result, it can't resolve more than one incident per PagerDuty service.
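For context, a minimal Python sketch of the two event bodies involved, based on the PagerDuty Events API v2 endpoint that appears in the logs (`https://events.pagerduty.com/v2/enqueue`). The helper names `trigger_event` and `resolve_event` are hypothetical, the severity value is an assumption, and this is not FreshPager's actual code. It only shows why the deduplication key returned by a trigger must be remembered in order to resolve that same incident later.

```python
def trigger_event(routing_key: str, summary: str, source: str) -> dict:
    """Body of a 'trigger' event. PagerDuty's response carries a dedup_key."""
    return {
        "routing_key": routing_key,  # integration key of the PagerDuty service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": "critical",  # assumed severity, for illustration only
        },
    }


def resolve_event(routing_key: str, dedup_key: str) -> dict:
    """Body of a 'resolve' event. It must carry the dedup_key of the incident to close."""
    return {
        "routing_key": routing_key,
        "event_action": "resolve",
        "dedup_key": dedup_key,
    }


# Hypothetical integration key; the dedup key is the SMTP one from the logs above.
down = trigger_event("S1-integration-key", "Aldaviva SMTP is down", "Freshping")
up = resolve_event("S1-integration-key", "81e40b63294248c7bb43b07e6ceae4c9")
```

If the stored deduplication key belongs to a different incident, the resolve event closes that other incident instead, which is what the logs show.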

Linearized execution

  1. Freshping check C1 goes down
  2. FreshPager triggers an incident on PagerDuty service S1 and stores the deduplication key D1 under the integration key for S1
  3. Freshping check C2 goes down
  4. FreshPager triggers an incident on PagerDuty service S1 and stores the deduplication key D2 under the integration key for S1, overwriting and erasing D1 (D1 incident can now never be automatically resolved)
  5. Freshping check C1 comes up
  6. FreshPager looks up the deduplication key stored under the integration key for S1. That entry used to hold D1, but it was overwritten with D2 in step 4, so FreshPager resolves the D2 incident and removes the entry. In PagerDuty it now looks like C2 is up, even though C2 is still down.
  7. Freshping check C2 comes up
  8. FreshPager fails to look up any deduplication key for the S1 integration key, because the entry (which held D2 by then, not D1) was removed in step 6 while resolving the D2 alert.
  9. FreshPager never resolves the D1 alert, so D1 stays open until a human manually resolves it.
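
The steps above can be replayed with a small Python simulation. This is a sketch under assumptions, not FreshPager's real implementation: the class `KeyedByIntegration`, its method names, and the fake `D1`/`D2` key generator are all hypothetical, and PagerDuty is replaced by an in-memory set of open incidents.

```python
class KeyedByIntegration:
    """Stores at most one deduplication key per PagerDuty integration key (the buggy design)."""

    def __init__(self) -> None:
        self.dedup_by_integration: dict[str, str] = {}
        self.open_incidents: set[str] = set()  # stand-in for PagerDuty's state
        self._counter = 0

    def check_down(self, integration_key: str, check: str) -> str:
        self._counter += 1
        dedup_key = f"D{self._counter}"  # fake key; PagerDuty would return a real one
        self.open_incidents.add(dedup_key)
        # Overwrites any key already stored for this service (step 4).
        self.dedup_by_integration[integration_key] = dedup_key
        return dedup_key

    def check_up(self, integration_key: str, check: str):
        # The check name is ignored here, which is the root of the problem.
        dedup_key = self.dedup_by_integration.pop(integration_key, None)
        if dedup_key is None:
            return None  # "No known PagerDuty alerts ..., not resolving anything"
        self.open_incidents.discard(dedup_key)
        return dedup_key


pager = KeyedByIntegration()
pager.check_down("S1", "C1")                  # steps 1-2: stores D1
pager.check_down("S1", "C2")                  # steps 3-4: stores D2, erasing D1
resolved_first = pager.check_up("S1", "C1")   # steps 5-6: resolves D2
resolved_second = pager.check_up("S1", "C2")  # steps 7-8: nothing to look up
still_open = pager.open_incidents             # step 9: D1 is left open
```

After the run, `resolved_first` is `"D2"`, `resolved_second` is `None`, and `still_open` contains only `"D1"`.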

This is effectively a race that only occurs when both services go down at the same time; otherwise they would have shared one deduplication key from the start. To avoid the race, we could process requests serially, but each one takes about 3 seconds. Maybe the deduplication keys should be indexed by something other than the integration key instead.

@Aldaviva
Owner Author

Trying to index deduplication keys by Freshping check instead of by PagerDuty integration key. Deployed on two servers; will monitor for failures.
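
A rough Python sketch of that idea, assuming each Freshping check has some stable identifier to use as the dictionary key. The class `KeyedByCheck`, its methods, and the fake `D1`/`D2` keys are hypothetical, and this is not the code from the referenced commits.

```python
class KeyedByCheck:
    """Stores one deduplication key per Freshping check, not per PagerDuty service."""

    def __init__(self) -> None:
        self.dedup_by_check: dict[str, str] = {}
        self.open_incidents: set[str] = set()  # stand-in for PagerDuty's state
        self._counter = 0

    def check_down(self, integration_key: str, check: str) -> str:
        self._counter += 1
        dedup_key = f"D{self._counter}"  # fake key; PagerDuty would return a real one
        self.open_incidents.add(dedup_key)
        self.dedup_by_check[check] = dedup_key  # other checks' keys are untouched
        return dedup_key

    def check_up(self, integration_key: str, check: str):
        dedup_key = self.dedup_by_check.pop(check, None)
        if dedup_key is None:
            return None  # nothing known for this check
        self.open_incidents.discard(dedup_key)
        return dedup_key


pager = KeyedByCheck()
pager.check_down("S1", "C1")           # stores D1 under C1
pager.check_down("S1", "C2")           # stores D2 under C2; D1 survives
first = pager.check_up("S1", "C1")     # resolves D1
second = pager.check_up("S1", "C2")    # resolves D2
remaining = pager.open_incidents       # empty: both incidents were resolved
```

With the same sequence of events as in the linearized execution, each recovery now resolves its own incident and nothing is left open.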

Aldaviva added a commit that referenced this issue Sep 25, 2024
… PagerDuty [index deduplication keys by check name instead of integration key]
Aldaviva added a commit that referenced this issue Sep 26, 2024
… name can be changed later in Freshping. Added tests for #2.
@Aldaviva Aldaviva added this to the 2.0.0 milestone Sep 26, 2024