Gluster metrics #206
Conversation
This is a WIP spec with some unknowns. @cloudbehl Please add the grafana side of things either to the same spec or to a different one. @r0h4n @brainfunked @shtripat @nthomas-redhat @anivargi Please review

specs/gluster_metrics.adoc (Outdated)

== Problem description

Admin using tendrl for gluster cluster management, should be able to see the
s/see the a easy/see an easy/
Done
specs/gluster_metrics.adoc (Outdated)

**** Dropped packets -- collectd plugin
**** Errors/overruns -- collectd plugin
** Volume level:
*** Status -- gluster get-state glusterd odir /var/run file glusterd-state
When you say get-state, do you mean the SDS sync from gluster-integration, or does collectd also run this every few seconds?
Collectd also runs this. So I believe it's better to use a different file for the get-state o/p, something like:
gluster get-state glusterd odir /var/run file collectd-glusterd-state
so that the files read and deleted by gluster-integration and collectd differ and they don't end up in a race condition.
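A minimal sketch of the per-consumer namespacing being proposed (the helper name is illustrative; only `collectd-glusterd-state` comes from the comment above):

```python
def build_get_state_cmd(consumer):
    """Build a `gluster get-state` command whose output file is
    namespaced per consumer, so gluster-integration and collectd
    never read/delete each other's state dump."""
    odir = "/var/run"
    fname = "%s-glusterd-state" % consumer
    cmd = ["gluster", "get-state", "glusterd",
           "odir", odir, "file", fname]
    return cmd, "%s/%s" % (odir, fname)

# Each consumer reads and deletes only its own dump file:
cmd, path = build_get_state_cmd("collectd")
# path == "/var/run/collectd-glusterd-state"
```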
Force-pushed f730f5b to 07a54ba
specs/gluster_metrics.adoc (Outdated)

Failure: 90
brick_utilization:
Warning: 75
Failure: 90
@rishubhjain @GowthamShanmugam I think you will need these threshold values while implementing grafana's alerting.
Yeah, the thresholds will be part of Grafana configuration.
Force-pushed 33a1f51 to 41d0815
LGTM. No specific comments.
@Tendrl/tendrl-core plz review.
Force-pushed 41d0815 to 252f32e
I understand that a lot of aggregation is done at the grafana level. But we need to capture them here and clearly state that they will be handled at grafana.
specs/gluster_metrics.adoc (Outdated)

xml attribute hierarchy in cli response:
cliOutput -> volProfile -> brick -> cumulativeStats -> totalRead
cliOutput -> volProfile -> brick -> cumulativeStats -> totalWrite
I understand that read and write are collected, but please mention how IOPS is calculated.
Done
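As a hedged illustration of the point under discussion (not necessarily the spec's exact formula): totalRead/totalWrite are cumulative counters, so a per-second rate can be derived from two successive samples; the function and variable names are illustrative:

```python
def rate_from_cumulative(prev_total, curr_total, interval_secs):
    """Derive a per-second rate from two samples of a cumulative
    counter (e.g. cumulativeStats -> totalRead), guarding against
    counter resets (e.g. after a brick process restart)."""
    if curr_total < prev_total:   # counter reset: rate is unknown
        return 0.0
    return (curr_total - prev_total) / float(interval_secs)

# Two samples of totalRead taken 10 seconds apart:
read_bps = rate_from_cumulative(1000000, 1500000, 10)  # 50000.0 bytes/s
```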
specs/gluster_metrics.adoc (Outdated)

xml attribute hierarchy in cli response:
cliOutput -> volProfile -> brick -> cumulativeStats -> totalRead
cliOutput -> volProfile -> brick -> cumulativeStats -> totalWrite
How is the drill-down from brick to disk addressed? Are you planning to collect disk-level iostat or mount-point-level stats? If the brick is created on RAID, how is this addressed? Please provide the details here.
Drill-down is something we still need to figure out, and it is slated for implementation in subsequent drops. It is most likely that the relation from a brick to its disk needs to be communicated to grafana by monitoring-integration. The details of how this happens have not been worked out yet. @anivargi can you please add if I am missing something?
As of now what we have is:
- Disk plugin (https://collectd.org/wiki/index.php/Plugin:Disk), which gives performance statistics (IOPS and IO time) of hard disks and, where supported, partitions.
- DF plugin (https://collectd.org/wiki/index.php/Plugin:DF), which gives file system usage information, i.e. basically how much space on a mounted partition is used and how much is available.
The iostat plugin is not packaged with collectd; I will need to read more in this regard.
Also, the spec currently only deals with whatever is available today and with the upcoming immediate drop. The spec is expected to evolve as we progress.
specs/gluster_metrics.adoc (Outdated)

Note:
** This is of particular importance in case of certain operations where
the command execution too frequently can affect the performance.
* A generic flow as in:
You can link #219 here
Done
specs/gluster_metrics.adoc (Outdated)

* The aggreageations that were previously done by the performance-monitoring
application will now be done at grafana.
** The rules of aggregation will be communicated to grafana by
monitoring-integration. Details of this are out of scope of this spec
Can you link to a spec/issue if that exists?
Done
specs/gluster_metrics.adoc (Outdated)

Based on "peer in cluster" and "connected" in get-state output
*** Number of volumes
**** Total -- gluster get-state glusterd...
**** Status wise -- gluster get-state glusterd...
near full?
Regarding metrics or entities that cross a threshold: @cloudbehl, since the thresholds are configured at the grafana end, is this feasible via grafana?
Also, is a "get top 5 most used" kind of thing possible in grafana? @cloudbehl Cc @anivargi
this spec should list all the metrics and specifically state that this will be done at grafana. Also link with the spec, if any
Done
specs/gluster_metrics.adoc (Outdated)

the collectd plugin is currently in execution, will be accessible and
info regarding the other bricks in cluster is not available and plugins
are stateless. This if required will not be found and updated to
graphite by monitoring-integration.
These are required, can grafana do this?
Nope. We need some other module (like monitoring-integration) to be able to do this.
Why can't grafana do this?
Because the individual brick statuses are mapped to integral values, and I am not sure if grafana can look at the value of a metric, classify it based on the value, and count each of the classifications. @cloudbehl is this facilitated by grafana?
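What the comment describes -- bucketing an integer-encoded status metric back into named states and counting each bucket -- is roughly the following; the status-to-integer mapping shown is an assumption for illustration, not taken from the spec:

```python
from collections import Counter

# Hypothetical encoding of brick status as integers in graphite.
STATUS_NAMES = {0: "Stopped", 1: "Started"}

def count_statuses(samples):
    """Classify integer-encoded brick status samples by name and
    count each classification, e.g. Counter({'Started': 3, 'Stopped': 1})."""
    return Counter(STATUS_NAMES.get(v, "Unknown") for v in samples)

count_statuses([1, 1, 0, 1])  # Counter({'Started': 3, 'Stopped': 1})
```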
specs/gluster_metrics.adoc (Outdated)

*** Pending Heal -- gluster volume heal all statistics
*** Rebalance status -- gluster get-state glusterd..
*** Number of connections -- gluster volume status all clients --xml
** Cluster level:
Please mention how these metrics are collected/calculated (grafana?):
- Cluster utilization
- Cluster IOPS, latency
- Disk stats at cluster level (aggregate - IOPS, latency, throughput, near-full etc.)
- Network stats - hosts with high memory and CPU utilization
Aggregations are planned to be done by grafana in accordance with rules configured by monitoring-integration. @anivargi or @cloudbehl can you add details in this regard, like how those rules can be communicated?
All of these are not slated for the upcoming drop, and hence the spec will evolve to contain this information.
Regarding hosts that cross a threshold: @cloudbehl, since the thresholds are configured at the grafana end, is this feasible?
Also, is a "get top 5 most used" kind of thing possible in grafana? @cloudbehl Cc @anivargi
This spec should list all the metrics and specifically state that this will be done at grafana. Also link with the spec, if any
Done
specs/gluster_metrics.adoc (Outdated)

**** Bytes sent and received(rx, tx) -- collectd plugin
**** Dropped packets -- collectd plugin
**** Errors/overruns -- collectd plugin
** Volume level:
Utilization?
IOPS/latency?
Disk stats at cluster level (aggregate - IOPS, latency, throughput, near-full etc.)?
Aggregations are planned to be done by grafana in accordance with rules configured by monitoring-integration. @anivargi or @cloudbehl can you add details in this regard, like how those rules can be communicated?
All of these are not slated for the upcoming drop, and hence the spec will evolve to contain this information.
This spec should list all the metrics and specifically state that this will be done at grafana. Also link with the spec, if any
Done
specs/gluster_metrics.adoc (Outdated)

*** Status -- gluster get-state glusterd odir /var/run file collectd-glusterd-state
**** Counters will be made available via collectd plugins based on get-state result
*** Bricks counter
**** Total -- gluster get-state glusterd...
brick status wise counts?
The collectd plugin does not have access to the statuses of bricks from other nodes (get-state provides the status only of bricks on the node where it is executed). So monitoring-integration, which has access to etcd (where the gluster-integrations on all gluster nodes sync their respective brick statuses), can push this stat to graphite as an exception.
@nthomas-redhat @anivargi @brainfunked Please suggest
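For illustration, pushing such an aggregated counter to graphite uses carbon's plaintext protocol (one `metric value timestamp` line per datapoint); the metric path, cluster name, and host/port defaults below are assumptions, not from the spec:

```python
import socket
import time

def graphite_line(path, value, ts=None):
    """Format one datapoint in graphite's plaintext protocol."""
    if ts is None:
        ts = time.time()
    return "%s %s %d\n" % (path, value, ts)

def push_brick_status_counts(counts, host="localhost", port=2003):
    """Send status-wise brick counts (e.g. {'started': 3, 'stopped': 1})
    to carbon. The path 'tendrl.cluster1.bricks.count.<status>' is a
    hypothetical naming scheme."""
    payload = "".join(
        graphite_line("tendrl.cluster1.bricks.count.%s" % status, n)
        for status, n in counts.items())
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload.encode())
```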
I agree. As part of performance improvements, a similar thing is planned for the utilization, status and connections aggregations in the gluster SDS sync as well. In that case we would have a separate thread which executes a flow once the cluster is imported/created to sync the aggregated details, and schedules it to run at regular intervals after that.
Something similar would be required here as well, I feel.
@anmolbabu, what was discussed earlier was to collect individual status per node and aggregate at grafana, right?
@nthomas-redhat Yes, that stands to be the case even now.
The only thing is the aggregation logic: for example, brick-utilization-to-volume-utilization aggregation should ideally not be a mere summation of the utilization of all bricks; instead it differs based on volume type. So this aggregation logic needs to be communicated to grafana by monitoring-integration.
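A hedged sketch of the point being made (the actual rules would come from monitoring-integration; the two rules below are simplified assumptions, not the spec's formulas):

```python
def volume_used_bytes(volume_type, brick_used, replica_count=1):
    """Aggregate per-brick used bytes into a volume-level figure.

    Simplified illustration of type-dependent aggregation:
      - Distribute: volume usage is the plain sum over bricks.
      - Replicate: each byte is stored replica_count times, so the
        naive sum over-counts by that factor.
    """
    total = sum(brick_used)
    if volume_type == "Replicate":
        return total / replica_count
    return total  # Distribute

# 2x replica volume, 4 bricks with 10 GB used each:
volume_used_bytes("Replicate", [10, 10, 10, 10], replica_count=2)  # 20.0
```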
specs/gluster_metrics.adoc (Outdated)

*** Memory -- collectd plugin
*** Storage(Mount point) -- collectd plugin
*** SWAP -- collectd plugin
*** Disk IOPS -- collectd plugin
Is this at the individual-disk level, or an aggregate for the node, like IOPS, latency, throughput, near-full etc.?
Also, are we planning to collect brick status-wise counts here?
Aggregations are planned to be done by grafana in accordance with rules configured by monitoring-integration. @anivargi or @cloudbehl can you add details in this regard, like how those rules can be communicated?
All of these are not slated for the upcoming drop, and hence the spec will evolve to contain this information.
Agreed, but this spec should list all the metrics and specifically state that this will be done at grafana. Also link with the spec, if any
Done
Force-pushed 252f32e to c4780be
specs/gluster_metrics.adoc (Outdated)

** Node level:
*** CPU -- collectd plugin
*** Memory -- collectd plugin
*** Storage(Mount point) -- collectd plugin
We should probably decide on a set of predefined mounts that we need to monitor, right? Like /root, /boot, /home.
Why?
I feel we can expose all the data but allow the UI to decide what it wants to show. If we limit the mount points to be monitored, there is every chance we lose out on some or many important mount points created from the CLI (e.g. volume mounts, disk mounts etc.).
Otherwise, the moment we detect any of these mount points via any tendrl service, the collectd config file responsible for mount point stats will need to be reconfigured, and collectd restarted, every time on any (all) nodes wherever this happens.
The only concern with monitoring all mount points on the node is that all the glusterfs bricks are mount points, and we are already monitoring them as part of brick utilization collection. If we monitor all mount points we will be repeating the monitoring of the gluster brick mount points.
I acknowledge this. But then, if monitoring-integration can bury this info in the templates that it communicates to grafana, I would prefer that rather than limiting collectd to monitor only a pre-defined set of mount points. The reason is, as I said, if we ever dynamically discover that we need a new mount point to be monitored, that involves the following:
- a collectd config change
- restarting collectd to reflect the config change
and this on every node where the change happened.
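The reconfigure-and-restart cost being argued about can be sketched as follows; the template text, file path, and systemd unit name are assumptions for illustration, not tendrl's actual configuration:

```python
import subprocess

# Hypothetical template for collectd's df plugin configuration.
DF_TEMPLATE = """LoadPlugin df
<Plugin df>
%s  IgnoreSelected false
</Plugin>
"""

def render_df_config(mount_points):
    """Render a df plugin config that monitors only the given mounts."""
    lines = "".join('  MountPoint "%s"\n' % m for m in mount_points)
    return DF_TEMPLATE % lines

def apply_config(mount_points, path="/etc/collectd.d/tendrl_df.conf"):
    """Write the config and restart collectd -- this pair of steps is
    what would have to run on every node whenever a newly discovered
    mount point (e.g. a brick mount) must be added to monitoring."""
    with open(path, "w") as f:
        f.write(render_df_config(mount_points))
    subprocess.check_call(["systemctl", "restart", "collectd"])
```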
ack.
Force-pushed c4780be to d124ef2
specs/gluster_metrics.adoc (Outdated)

**** Total -- gluster get-state glusterd...
*** Pending Heal -- gluster volume heal all statistics
*** Rebalance status -- gluster get-state glusterd..
*** Number of connections -- gluster volume status all clients --xml
I guess this is a costly operation. However, get-state gives you per-brick client info for the bricks present on that particular node. I feel that can be used; what do you think?
I don't currently see this in the get-state o/p; is it planned for some time soon?
This is available in the latest version (it was added recently). You can check the gluster nightly builds.
Ack
specs/gluster_metrics.adoc (Outdated)

**** Bytes sent and received(rx, tx) -- collectd plugin
**** Dropped packets -- collectd plugin
**** Errors/overruns -- collectd plugin
** Volume level:
Hey, aren't the snapshot count for a volume and the number of geo-rep sessions for a volume (with ok and faulty counts) needed?
Hmm, I think we narrowed the scope to https://docs.google.com/spreadsheets/d/1X2JREn0TybSqLHgx9vcv-zjn4S9X9_Xz0JdhADew5-Q/edit?ts=5965bec8#gid=1819780204 and I missed this, but that shouldn't be too difficult to add from the get-state o/p.
I'll add them.
Force-pushed d124ef2 to 4e7bcdc
Looks good to me
Force-pushed 4e7bcdc to 6e34587
@r0h4n @nthomas-redhat @shtripat @brainfunked As discussed, I have trimmed this spec to deal with the milestone1 part.
and a genric command to generate config file from template as in:

----
https://github.com/Tendrl/node-monitoring/blob/master/tendrl/node_monitoring/commands/config_manager.py
Change the name of the binary, as it's tendrl_monitoring_config_manager.py.
Removing them from here; they are already covered in https://github.com/Tendrl/specifications/blob/master/specs/refactoring_of_node_monitoring_flows_into_node_agent.adoc#proposed-change
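For context, the config-from-template step referenced above can be sketched generically; the placeholder names and template text are illustrative, and the real config_manager.py may work differently:

```python
from string import Template

def generate_config(template_text, params):
    """Fill a config template with parameters; Template.substitute
    raises KeyError if any placeholder is left unfilled, so a
    half-rendered config can never be written out."""
    return Template(template_text).substitute(params)

conf = generate_config(
    'Host "$graphite_host"\nPort "$graphite_port"\n',
    {"graphite_host": "127.0.0.1", "graphite_port": 2003},
)
# conf == 'Host "127.0.0.1"\nPort "2003"\n'
```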
==== Notifications/Monitoring impact:

The configuration method would now change slightly in accordance with
Can we add more details here, or does the Proposed Change section have enough details?
Removing them from here; they are already covered in https://github.com/Tendrl/specifications/blob/master/specs/refactoring_of_node_monitoring_flows_into_node_agent.adoc#proposed-change
Ack
=== Other deployer impact:

The configuration of collectd from monitoring-integration will slightly change
Same as above, because "The configuration of collectd from monitoring-integration will slightly change" creates confusion, I feel.
Yes, removing it, as this configuration happens internally and requires no user intervention.
Ack. Can we have some details of these internal configurations?
=== Work Items:

Github issues will be raised and linked here once the repo that hosts the
I think this is available now as monitoring-integration. Can we add specific issues here?
Done. I just needed to link the issue on node-agent: Tendrl/node-agent#549
Gluster metrics
tendrl-bug-id: Tendrl#188
Signed-off-by: anmolbabu <anmolbudugutta@gmail.com>
Force-pushed 6e34587 to e27841f
Note:
The aggregated/derived metrics that will be made available at grafana
level will be dealt with in detail by spec issues:
* https://github.com/Tendrl/specifications/issues/179
If possible could you provide a link instead of text?
Gluster metrics
tendrl-bug-id: #188
Signed-off-by: anmolbabu <anmolbudugutta@gmail.com>