Tweak dashboard retries functionality for 401/403 errors #22652

Closed
dkwon17 opened this issue Nov 2, 2023 · 5 comments
Assignees
Labels
area/dashboard kind/enhancement A feature request - must adhere to the feature request template. severity/P1 Has a major impact to usage or development of the system.

Comments

@dkwon17
Contributor

dkwon17 commented Nov 2, 2023

Is your enhancement related to a problem? Please describe

A retry feature has been implemented in the dashboard for requests that fail with 401/403 errors.

Here are the results of this fix in the dogfooding cluster. To summarize, out of 118 failing requests, ~94% succeeded thanks to retries.

My dashboard refresh script was able to reproduce the error for a few hours on Oct 31:
image

According to the metrics, out of about 2849 refreshes to the dashboard (maybe this is too much; I reduced the refresh frequency to 60 times every 10 mins), about 118 requests required a retry. Note that the sample size (118) is quite small because the 401/403 issues are rare on the dogfooding cluster.

Out of 118 requests, 111 of them (~94%) succeeded thanks to the retries:

  • 92 requests succeeded after 1 retry
  • 14 requests succeeded after 2 retries
  • 2 requests succeeded after 3 retries
  • 3 requests succeeded after 4 retries (**)

Out of 118 requests, 7 of them (~6%) failed despite the retries:

  • 2 requests failed after 3 retries
  • 5 requests failed after 7 retries (**)

(**) Retry counts greater than 3 are unique to the provision request. The provision request is currently retried up to 7 times because it is also retried in testBackends().
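
For anyone who wants to re-check the figures above, here is a small tally sketch. The counts are copied from this comment; the variable and function names are made up for illustration and do not come from the dashboard code. It also prints the cumulative share recovered after each retry count, keeping in mind that only the provision request goes past 3 retries.

```typescript
// Each pair is [retries used, number of requests], copied from the lists above.
const succeededAfter: Array<[number, number]> = [[1, 92], [2, 14], [3, 2], [4, 3]];
const failedAfter: Array<[number, number]> = [[3, 2], [7, 5]];

// Sum the request counts of a list of [retries, count] pairs.
function totalRequests(rows: Array<[number, number]>): number {
  return rows.reduce((sum, row) => sum + row[1], 0);
}

// Percentage of `part` in `whole`, formatted with one decimal place.
function percent(part: number, whole: number): string {
  return ((part / whole) * 100).toFixed(1);
}

const succeeded = totalRequests(succeededAfter); // 111
const failed = totalRequests(failedAfter);       // 7
const retried = succeeded + failed;              // 118

console.log(`succeeded: ${succeeded}/${retried} (${percent(succeeded, retried)}%)`);
console.log(`failed:    ${failed}/${retried} (${percent(failed, retried)}%)`);

// Running total of requests recovered once N retries are allowed.
let recovered = 0;
for (const [retries, count] of succeededAfter) {
  recovered += count;
  console.log(`up to ${retries} retries: ${recovered} recovered (${percent(recovered, retried)}%)`);
}
```

The first retry alone accounts for 92 of the 118 requests, so most of the gain comes early.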

Describe the solution you'd like

The 94% success rate for requests that would have otherwise all failed is nice. Some ideas to further tweak the retries to improve the percentage could be:

  • increase the number of retries to 4 (code)
  • have an exponential delay between requests, rather than a fixed delay (code)
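
To make the two ideas concrete, here is a minimal sketch of a retry helper with a configurable retry count and an exponential delay. It is not the dashboard's actual implementation: `RetryOptions`, `isRetryable`, `backoffDelayMs`, and `retryOnAuthError` are hypothetical names, and it assumes a failed request throws an error object carrying a numeric `status` field.

```typescript
// Hypothetical options; names and values are illustrative, not the dashboard's.
interface RetryOptions {
  maxRetries: number;  // retries allowed after the initial attempt, e.g. 4
  baseDelayMs: number; // delay before the first retry
  factor: number;      // multiplier applied per retry; 2 doubles the delay each time
  sleep?: (ms: number) => Promise<void>; // injectable so tests can skip real waiting
}

// Assumption: a failed request throws an object with a numeric `status` field.
function isRetryable(error: unknown): boolean {
  const status = (error as { status?: unknown } | null)?.status;
  return status === 401 || status === 403;
}

// Delay before the given retry (1-based): base, base * factor, base * factor^2, ...
function backoffDelayMs(retry: number, baseDelayMs: number, factor: number): number {
  return baseDelayMs * Math.pow(factor, retry - 1);
}

// Run `request`, retrying only on 401/403 until `maxRetries` retries are used up.
async function retryOnAuthError<T>(
  request: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const sleep =
    options.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  for (let retriesUsed = 0; ; retriesUsed++) {
    try {
      return await request();
    } catch (error) {
      // Give up on non-auth errors or once the retry budget is spent.
      if (!isRetryable(error) || retriesUsed >= options.maxRetries) {
        throw error;
      }
      await sleep(backoffDelayMs(retriesUsed + 1, options.baseDelayMs, options.factor));
    }
  }
}
```

With, say, `baseDelayMs: 500` and `factor: 2`, four retries would wait 500, 1000, 2000, and 4000 ms, so a request that keeps failing takes about 7.5 s of waiting before it gives up. Those numbers are only an example; the right values depend on how long the 401/403 condition typically lasts.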

Describe alternatives you've considered

No response

Additional context

No response

@dkwon17 dkwon17 added the kind/enhancement A feature request - must adhere to the feature request template. label Nov 2, 2023
@dkwon17
Contributor Author

dkwon17 commented Nov 2, 2023

cc @ibuziuk @tolusha

@che-bot che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Nov 2, 2023
@dkwon17 dkwon17 added severity/P1 Has a major impact to usage or development of the system. area/dashboard and removed status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. labels Nov 2, 2023
@tolusha
Contributor

tolusha commented Nov 3, 2023

@dkwon17
Thank you for the detailed investigation.
What are those 2 requests that failed after 3 retries?

@dkwon17
Contributor Author

dkwon17 commented Nov 3, 2023

@tolusha no problem, the two requests are:

  • dashboard/api/namespace/dkwon17-che/devworkspaces (one instance of failure after 3 retries)
  • dashboard/api/userprofile/dkwon17-che (one instance of failure after 3 retries)

@ibuziuk
Member

ibuziuk commented Nov 6, 2023

@dkwon17 wdyt, can we include the fix for 7.77.x for 3.10?

@dkwon17
Contributor Author

dkwon17 commented Nov 6, 2023

@ibuziuk that works for me, I'll try to have it for 7.77.x
