Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(metrics): Add failure_reason label to ingress_controller_configuration_push_count metric #2965

Merged
merged 13 commits into from
Sep 27, 2022

Conversation

czeslavo
Copy link
Contributor

@czeslavo czeslavo commented Sep 22, 2022

What this PR does / why we need it:

  • adds failure_reason label to the already existing ingress_controller_configuration_push_count metric which is reported every time configuration is pushed to the data plane
  • the label enables distinguishing config conflicts reasons: conflict, network, other.

Which issue this PR fixes:

Part of #2484

Special notes for your reviewer:

I was able to trigger configuration conflicts by adding consumers violating the unique username constraint. Admin API was responding with HTTP code 409. I'm leveraging this, although it's not documented in Admin API's OpenAPI definitions.

In general, it's hard to rely on deck to return any specific error therefore some error heuristics were needed.

PR Readiness Checklist:

Complete these before marking the PR as ready to review:

  • the CHANGELOG.md release notes have been updated to reflect any significant (and particularly user-facing) changes introduced by this PR

@czeslavo czeslavo temporarily deployed to Configure ci September 22, 2022 18:21 Inactive
@czeslavo czeslavo temporarily deployed to Configure ci September 22, 2022 18:47 Inactive
@czeslavo czeslavo marked this pull request as ready for review September 23, 2022 08:08
@czeslavo czeslavo requested a review from a team as a code owner September 23, 2022 08:08
@czeslavo czeslavo temporarily deployed to Configure ci September 23, 2022 08:14 Inactive
@czeslavo czeslavo temporarily deployed to Configure ci September 23, 2022 08:40 Inactive
Copy link
Contributor

@rainest rainest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this behave with these failures under database mode? Initial read suggests this will never mark a conflict with that, since it'll get back some sort of deck failure instead of any underlying go-kong failures.

We may not really be able to do anything yet given that the deck syncer doesn't always have the most useful errors, but a brief read indicates that we do get back an error array that we can walk to see if any are conflicts.

Copy link
Member

@pmalek pmalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please resolve the conflicts

internal/metrics/prometheus.go Show resolved Hide resolved
internal/dataplane/sendconfig/sendconfig_test.go Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@czeslavo czeslavo temporarily deployed to Configure ci September 26, 2022 10:00 Inactive
@czeslavo czeslavo temporarily deployed to Configure ci September 26, 2022 10:24 Inactive
@czeslavo czeslavo temporarily deployed to Configure ci September 26, 2022 10:42 Inactive
Co-authored-by: Patryk Małek <patryk.malek@konghq.com>
@czeslavo czeslavo temporarily deployed to Configure ci September 26, 2022 10:46 Inactive
@czeslavo czeslavo marked this pull request as draft September 26, 2022 10:46
@czeslavo czeslavo temporarily deployed to Configure ci September 26, 2022 11:15 Inactive
@czeslavo czeslavo temporarily deployed to Configure ci September 26, 2022 21:09 Inactive
@czeslavo czeslavo temporarily deployed to Configure ci September 26, 2022 21:29 Inactive
@czeslavo czeslavo temporarily deployed to Configure ci September 26, 2022 21:30 Inactive
@czeslavo czeslavo temporarily deployed to Configure ci September 26, 2022 21:33 Inactive
@czeslavo czeslavo temporarily deployed to Configure ci September 26, 2022 21:44 Inactive
@czeslavo czeslavo temporarily deployed to Configure ci September 26, 2022 21:46 Inactive
@czeslavo czeslavo temporarily deployed to Configure ci September 26, 2022 21:46 Inactive
@czeslavo
Copy link
Contributor Author

How does this behave with these failures under database mode? Initial read suggests this will never mark a conflict with that, since it'll get back some sort of deck failure instead of any underlying go-kong failures.

We may not really be able to do anything yet given that the deck syncer doesn't always have the most useful errors, but a brief read indicates that we do get back an error array that we can walk to see if any are conflicts.

I've tested that with Postgres mode and you were right, in case of KongConsumer username conflict, the failure comes from deck's state.Get or dump.Get which would have been missed without some additional heuristics: 2fbcb50

I extended that with detecting network errors specifically as well as it was cheap enough.

@czeslavo czeslavo marked this pull request as ready for review September 26, 2022 21:59
@czeslavo czeslavo temporarily deployed to Configure ci September 26, 2022 22:11 Inactive
move changes back into 2.7, some rewrites.
@rainest rainest temporarily deployed to Configure ci September 26, 2022 22:28 Inactive
Copy link
Contributor

@rainest rainest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved changelog back to 2.7 with some minor edits. It's still stuck as I try and wait for E2E tests to slowly reveal if they're no longer dying due to environment failures. We already have a minor bump and this is pretty innocuous, so it may as well go out.

deck/format string changes look addressed, so 👍

@rainest rainest enabled auto-merge (squash) September 26, 2022 22:30
@rainest rainest temporarily deployed to Configure ci September 26, 2022 23:53 Inactive
@rainest rainest temporarily deployed to Configure ci September 27, 2022 00:15 Inactive
@rainest rainest merged commit ac4d48b into main Sep 27, 2022
@rainest rainest deleted the config-push-failure-label branch September 27, 2022 00:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants