Skip to content
This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Warn if vaildator appears to be repairing excessively #33320

Closed

Conversation

jbiseda
Copy link
Contributor

@jbiseda jbiseda commented Sep 20, 2023

Problem

It is possible for validators to stay in sync with a cluster solely using repair. This is substantially less efficient than receiving data using turbine and most likely indicates network related misconfiguration. Misconfigured nodes generating substantial repair traffic skews metric data.

Summary of Changes

Maintain slot repair history and generate warnings if a node appears to be excessively repairing slots while indicating that it is "caught up" with the cluster.

Fixes #

@jbiseda jbiseda force-pushed the jbiseda/sustained-repair-warning branch 2 times, most recently from 18e6e57 to b396244 Compare September 20, 2023 20:39
@codecov
Copy link

codecov bot commented Sep 20, 2023

Codecov Report

Merging #33320 (df3e5c2) into master (982d29c) will decrease coverage by 0.1%.
Report is 3 commits behind head on master.
The diff coverage is 86.4%.

@@            Coverage Diff            @@
##           master   #33320     +/-   ##
=========================================
- Coverage    81.8%    81.8%   -0.1%     
=========================================
  Files         806      806             
  Lines      217499   217660    +161     
=========================================
+ Hits       178019   178138    +119     
- Misses      39480    39522     +42     

@jbiseda jbiseda force-pushed the jbiseda/sustained-repair-warning branch 5 times, most recently from 466a2e7 to ed49337 Compare September 22, 2023 23:16
@jbiseda jbiseda changed the title Warn if vaildator appears to repairing excessively Warn if vaildator appears to be repairing excessively Sep 22, 2023
@jbiseda jbiseda force-pushed the jbiseda/sustained-repair-warning branch 4 times, most recently from 3e61009 to bc04854 Compare September 22, 2023 23:47
@jbiseda jbiseda marked this pull request as ready for review September 23, 2023 02:05
@t-nelson
Copy link
Contributor

mind breaking the plumbing out to a separate commit to ease review? same pr is fine, just wanna see the functional changes in isolation to i don't miss anything

@jbiseda jbiseda force-pushed the jbiseda/sustained-repair-warning branch 2 times, most recently from 2549a13 to 062e8ec Compare September 25, 2023 19:29
@behzadnouri
Copy link
Contributor

The change seems overly intrusive (34 files, 347 loc) just to emit a warning, and most operators do not monitor logs anyways.
On our side, we already have the ability to identify nodes repairing excessively from the existing metrics as in #33166, which is more convenient than inspecting logs.

@jbiseda
Copy link
Contributor Author

jbiseda commented Oct 2, 2023

The change seems overly intrusive (34 files, 347 loc) just to emit a warning, and most operators do not monitor logs anyways. On our side, we already have the ability to identify nodes repairing excessively from the existing metrics as in #33166, which is more convenient than inspecting logs.

Granted it touches lots of files, but it's pretty basic refactoring. Yes, we have the tools to investigate as in #33166, but it still requires an investigation. Inspecting logs is inconvenient for us, but more convenient for operators. If this reduces instances where we have to investigate I think it would be a win.

@jbiseda jbiseda force-pushed the jbiseda/sustained-repair-warning branch from 062e8ec to b3937c8 Compare October 10, 2023 19:57
@jbiseda
Copy link
Contributor Author

jbiseda commented Oct 10, 2023

Granted it touches lots of files, but it's pretty basic refactoring. Yes, we have the tools to investigate as in #33166, but it still requires an investigation. Inspecting logs is inconvenient for us, but more convenient for operators. If this reduces instances where we have to investigate I think it would be a win.

Do operators look at warnings in the log? @t-nelson ping @behzadnouri

@jbiseda jbiseda force-pushed the jbiseda/sustained-repair-warning branch from b3937c8 to df3e5c2 Compare October 10, 2023 20:56
@t-nelson
Copy link
Contributor

Do operators look at warnings in the log? @t-nelson ping @behzadnouri

does posting screenshots on discord constitute "looking"?

@jbiseda
Copy link
Contributor Author

jbiseda commented Oct 10, 2023

Do operators look at warnings in the log? @t-nelson ping @behzadnouri

does posting screenshots on discord constitute "looking"?

I guess I'm asking if it's worth trying to help operators with log messages? 🤷

@behzadnouri
Copy link
Contributor

Granted it touches lots of files, but it's pretty basic refactoring. Yes, we have the tools to investigate as in #33166, but it still requires an investigation. Inspecting logs is inconvenient for us, but more convenient for operators. If this reduces instances where we have to investigate I think it would be a win.

Do operators look at warnings in the log? @t-nelson ping @behzadnouri

If the code change was small then adding some more logs was probably fine.
But in this case, my view is that we are not really getting enough value out of the added log message to justify the extra code or the maintenance burden incurred when changing the adjacent code in the future.

@t-nelson
Copy link
Contributor

considering only the second commit, it's not so bad. the first is big 'cause fmt wrapped a bunch of function calls

@jbiseda
Copy link
Contributor Author

jbiseda commented Oct 13, 2023

The health status calculation is being changed by: #33651. I'll update this PR following that change.

@t-nelson
Copy link
Contributor

how many nodes are typically in a state that would be logging with this? how frequently?

we can already observe these nodes from metrics, right?

@jbiseda
Copy link
Contributor Author

jbiseda commented Oct 14, 2023

how many nodes are typically in a state that would be logging with this? how frequently?

Not a huge amount, maybe ~1 per week in the top 100. I'll look across the full set of validators.

we can already observe these nodes from metrics, right?

Yeah, we see a bump in repair, then find the outliers. I was hoping this would help eliminate some noise.

@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Oct 30, 2023
@github-actions github-actions bot closed this Nov 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
stale [bot only] Added to stale content; results in auto-close after a week.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants