No alerts for when capacity utilization threshold breached #194

Closed
julienlim opened this issue Oct 11, 2017 · 13 comments

Comments

@julienlim
Member

julienlim commented Oct 11, 2017

I redeployed Tendrl using tendrl-ansible from master just now and created a near-full condition on one of my bricks, but no alerts were generated to say that the threshold was breached on node3:/bricks/brick1.

I also noticed that in the Tendrl UI, the chart showing brick utilization is blue. This may confuse users, who see it differently in the Grafana dashboards, since it does not match the threshold colors used in the Host dashboards.

[Screenshot: 2017-10-11, 4:23:15 PM]

[Screenshot: 2017-10-11, 4:21:00 PM]

[Screenshot: 2017-10-11, 4:21:29 PM]

[Screenshot: 2017-10-11, 4:23:54 PM]

[Screenshot: 2017-10-11, 4:27:11 PM]

@cloudbehl @gnehapk @a2batic @Tendrl/tendrl-qe @mcarrano

@rishubhjain
Contributor

@cloudbehl Please check the issue.

@cloudbehl
Member

@GowthamShanmugam Can you check why the brick alert is not generated when the threshold is breached?

@gnehapk Has color coding been added to the progress bar? If utilization exceeds 80% it should be yellow, and if it exceeds 90% it should be red.
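For reference, the color rule proposed above can be sketched as a small function. This is a minimal illustration, not the Tendrl UI code: the function name `utilization_color` is hypothetical, and since the thread does not say what color to use below 80%, the `"default"` return value is a placeholder assumption.

```python
def utilization_color(percent: float) -> str:
    """Map a brick utilization percentage to a progress-bar color.

    Hypothetical helper following the rule proposed in this thread:
    above 80% is yellow, above 90% is red. The color below 80% is not
    specified, so "default" is only a placeholder.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"utilization must be between 0 and 100, got {percent}")
    if percent > 90.0:
        return "red"
    if percent > 80.0:
        return "yellow"
    return "default"


if __name__ == "__main__":
    for value in (45.0, 80.0, 85.5, 92.3):
        print(f"{value:5.1f}% -> {utilization_color(value)}")
```

Using strict `>` comparisons reads "exceeds 80%" literally, so exactly 80% stays in the default color; if the Grafana Host dashboards treat the boundary as inclusive, the UI should match them rather than this sketch.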

@GowthamShanmugam
Collaborator

@cloudbehl Verified, it works fine. This small PR needs to be merged to receive volume alerts from Grafana: #217

@rishubhjain
Contributor

@julienlim Please verify whether the issue is still reproducible or not.

@GowthamShanmugam
Collaborator

@rishubhjain These patches need to be merged for this. They fix the alert count, the alert filter, and catching Grafana volume alerts:

#219
Tendrl/gluster-integration#453
Tendrl/commons#766
Tendrl/node-agent#657

@GowthamShanmugam
Collaborator

@julienlim This issue is fixed; please verify.

@julienlim
Member Author

julienlim commented Nov 3, 2017

@GowthamShanmugam @rishubhjain @cloudbehl @gnehapk Verified that when I caused one of the bricks to breach its utilization threshold, an alert was generated.
[Screenshot: 2017-11-03, 1:15:34 PM]

However, the alert counter on hosts is not correctly synchronized with the alerts (node2 should show 2 alerts, not 1).

[Screenshot: 2017-11-03, 1:16:15 PM]

I checked vol1 (the volume containing the breached brick node2:/bricks/brick1), and its alert counter was incremented to 1.
[Screenshot: 2017-11-03, 1:16:31 PM]

My environment:

  • Tendrl 1.5.4 deployed
  • Tendrl server on node0
  • Tendrl node-agents on node1, node2, and node3 (all three are in the same Gluster cluster)

I filled up node2:/bricks/brick1 to over 90% to trigger a threshold breach, and the threshold alert did occur. Other alerts for node1, node2, and node3 were also triggered by memory utilization above 80%.

@Tendrl/tendrl-qe

@GowthamShanmugam
Collaborator

GowthamShanmugam commented Nov 4, 2017

@julienlim Memory utilization on node2 (80.4%) is very close to the 80% threshold, so the alert might be replaced by a clear event before you check the alert count, because utilization may drop below 80% in the meantime. Please paste the notification UI screen as well, so we can verify easily.

I have run into this many times, so please share the notification screen; then we can identify the cause easily.
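To illustrate the explanation above, here is a minimal, hypothetical model of threshold alerts paired with clear events. It is not Tendrl's implementation: the class `AlertTracker` and its methods are invented for this sketch, and it assumes an alert is raised when a value goes above the threshold and removed (cleared) as soon as the value falls back to or below it.

```python
from dataclasses import dataclass, field


@dataclass
class AlertTracker:
    """Hypothetical model of threshold alerts with clear events.

    Illustrates why a per-node alert counter can drop when a metric
    hovering near its threshold falls back below it before the count
    is checked. Not Tendrl code.
    """

    threshold: float
    # Active alerts, keyed by (node, resource).
    active: set = field(default_factory=set)

    def update(self, node: str, resource: str, value: float) -> str:
        key = (node, resource)
        if value > self.threshold:
            if key in self.active:
                return "no change"
            self.active.add(key)
            return "alert raised"
        if key in self.active:
            # The clear event replaces the earlier alert, so it no
            # longer contributes to the node's alert count.
            self.active.discard(key)
            return "alert cleared"
        return "no change"

    def count_for_node(self, node: str) -> int:
        return sum(1 for n, _ in self.active if n == node)


if __name__ == "__main__":
    memory = AlertTracker(threshold=80.0)
    print(memory.update("node2", "memory", 80.4))  # alert raised
    print(memory.count_for_node("node2"))          # 1
    print(memory.update("node2", "memory", 79.6))  # alert cleared
    print(memory.count_for_node("node2"))          # 0
```

In this model, a memory reading of 80.4% that dips to 79.6% between the alert and the moment someone looks at the host list leaves the counter one lower than the alert history suggests, which is why the notification screen is the better record to compare against.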

@julienlim
Member Author

@GowthamShanmugam

Here are the alerts and notifications screenshots, along with the host list:

[Screenshot: 2017-11-06, 9:17:20 AM]

[Screenshot: 2017-11-06, 9:17:30 AM]

@GowthamShanmugam
Collaborator

@Tendrl/tendrl-core Should a brick utilization threshold breach also be counted in the node alert count? Currently I have added only brick-down alerts to the node alert count.

I am sure a brick utilization threshold breach should not be counted in the volume alert count. Is it necessary to add it to node_alert count?

@shtripat
Member

@nthomas-redhat @brainfunked @r0h4n Please comment.
I feel it is good to include brick utilization threshold breaches in the node-level alert count.

@r0h4n
Copy link
Contributor

r0h4n commented Nov 17, 2017

I would suggest we track the new change requests in a separate issue.

@r0h4n r0h4n closed this as completed Nov 17, 2017
@julienlim
Member Author

@r0h4n Can you please list what the GitHub issues are so I can track them?

Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants