Gluster metrics #206

anmolbabu · 2017-07-24T11:03:21Z

Gluster metrics

tendrl-bug-id: #188
Signed-off-by: anmolbabu anmolbudugutta@gmail.com

anmolbabu · 2017-07-24T11:10:24Z

This is WIP spec with some unknowns:
status wise counters for bricks:
This brick status is availble in get-state for the brick belonging to the node on which the get-state command is fired.. So there's no way collectd residing on a node can push this counter. The best that can be done from collectd's perspective is to push brick status as a integral number(as ceph-metrics does) to graphite and a way to aggregate this at grafana needs to be found..
Note This is not a mere count query. As the actual status is number encoded in the metric value and not in metric name.. Encoding it in metric name makes no sense as when the status changes, the previous metric and the current metric both co-exist leading to unnecessary confusions

@cloudbehl Please add the grafana side of things either to the same spec or to a different one..

@r0h4n @brainfunked @shtripat @nthomas-redhat @anivargi Please review

shtripat · 2017-07-24T11:09:10Z

specs/gluster_metrics.adoc

+
+== Problem description
+
+Admin using tendrl for gluster cluster management, should be able to see the


s/see the a easy/see an easy/

shtripat · 2017-07-24T11:13:16Z

specs/gluster_metrics.adoc

+            **** Dropped packets -- collectd plugin
+            **** Errors/overruns -- collectd plugin
+    ** Volume level:
+        *** Status -- gluster get-state glusterd odir /var/run file glusterd-state


When you say get-state do you mean sds sycn from gluster-integration or collectd as well runs this every few seconds?

Collectd also runs this. So I believe its better to use a different file for get-state o/p something like:
gluster get-state glusterd odir /var/run file collectd-glusterd-state
So that the files read and deleted by gluster-integration and collectd will differ and they don't end-up in a race condition

cloudbehl · 2017-07-24T11:40:14Z

specs/gluster_metrics.adoc

+      Failure: 90
+    brick_utilization:
+      Warning: 75
+      Failure: 90


@rishubhjain @GowthamShanmugam I think these threshold values you will need to while implementing grafana's alerting.

Yeah, the thresholds will be part of Grafana configuration.

anmolbabu · 2017-07-26T09:33:15Z

#213 (comment)

shtripat

LGTM. No specific comments.
@Tendrl/tendrl-core plz review.

nthomas-redhat

I understand that lot of aggregation is done at grafana level. But we need capture them here and clearly state that it will be handled at grafana

nthomas-redhat · 2017-08-01T03:45:23Z

specs/gluster_metrics.adoc

+
+            xml attribute hierarchy in cli response:
+            cliOutput -> volProfile -> brick -> cumulativeStats -> totalRead
+            cliOutput -> volProfile -> brick -> cumulativeStats -> totalWrite


I understand that read and write collected but please mention how IOPS is calculated

nthomas-redhat · 2017-08-01T03:48:40Z

specs/gluster_metrics.adoc

+            xml attribute hierarchy in cli response:
+            cliOutput -> volProfile -> brick -> cumulativeStats -> totalRead
+            cliOutput -> volProfile -> brick -> cumulativeStats -> totalWrite
+


How the drill down from brick to disk is addressed? Are you planning to collect the disk level iostat or at a mount point level? If the brick is created on RAID, then how this is addressed? please provide the details here

Drill down is something we need to figure out yet and it is slated for implementation in subsequent drops.. It is most likely that the relation from brick to its disk needs to be communicated to grafana by monitoring-integration.The details of how this happens is not yet found out @anivargi can you please add if I am missing something
As of now what we have is :

Disk Plugin(https://collectd.org/wiki/index.php/Plugin:Disk) that gives performance statistics(iops and io time) of hard-disks and, where supported, partitions.

df plugin(https://collectd.org/wiki/index.php/Plugin:DF) that gives file system usage information, i. e. basically how much space on a mounted partition is used and how much is available.
The iostat plugin is not packaged with collectd I will need to read more in this regard..

Also, the spec only currently deals with whatever is available today and for the upcoming immediate drop.. The spec is expected to evolve as we progress..

nthomas-redhat · 2017-08-01T03:52:40Z

specs/gluster_metrics.adoc

+        Note:
+        ** This is of particular importance in case of certain operations where
+           the command execution too frequently can affect the performance.
+* A generic flow as in:


You can link #219 here

nthomas-redhat · 2017-08-01T03:54:00Z

specs/gluster_metrics.adoc

+* The aggreageations that were previously done by the performance-monitoring
+  application will now be done at grafana.
+    ** The rules of aggregation will be communicated to grafana by
+       monitoring-integration. Details of this are out of scope of this spec


Can you link to as spec/issue if that exists?

nthomas-redhat · 2017-08-01T03:57:31Z

specs/gluster_metrics.adoc

+                 Based on "peer in cluster" and "connected" in get-state output
+        *** Number of volumes
+            **** Total -- gluster get-state glusterd...
+            **** Status wise -- gluster get-state glusterd...


Regarding metrics or entities that cross threshold kind of information @cloudbehl since the thresholds are configured at grafana end, is this feasible via grafana.
Also get top 5 most used kind of thing is that possible in grafana?? @cloudbehl Cc @anivargi

this spec should list all the metrics and specifically state that this will done at grafana. Also link with spec if any

nthomas-redhat · 2017-08-01T03:58:30Z

specs/gluster_metrics.adoc

+       the collectd plugin is currently in execution, will be accessible and
+       info regarding the other bricks in cluster is not available and plugins
+       are stateless. This if required will not be found and updated to
+       graphite by monitoring-integration.


These are required, can grafana do this?

Nope. We need some other module(like monitoring-integration) to be able to do this

Why grafana can't do this?

Because the individual brick statuses are mapped to integeral values and grafana I am not sure if it can look at the value of metric and classify them based on values and count each of the classifications @cloudbehl is this facilitated by grafana??

nthomas-redhat · 2017-08-01T04:03:08Z

specs/gluster_metrics.adoc

+        *** Pending Heal -- gluster volume heal all statistics
+        *** Rebalance status -- gluster get-state glusterd..
+        *** Number of connections -- gluster volume status all clients --xml
+    ** Cluster level:


Please mention how the these metrics are collected/calculated(grafana?)

Cluster utilization

cluster iops, latency

Disks stats at cluster level(aggregate - iops, latency, throghput, nearful etcs)
4.Network stats

Hosts with high memory and cpu utilization

Aggregations are planned to be done by grafana in accordance with rules configured by monitoring-integration.. @anivargi or @cloudbehl can you add details in this regard like how those rules can be comminicated..

All these are not slated for upcoming drop and hence the spec will evolve to contain these information..

Regarding hosts that cross threshold kind of information @cloudbehl since the thresholds are configured at grafana end, is this feasible.
Also get top 5 most used kind of thing is that possible in grafana?? @cloudbehl Cc @anivargi

This spec should list all the metrics and specifically state that this will done at grafana. Also link with spec if any

nthomas-redhat · 2017-08-01T04:07:13Z

specs/gluster_metrics.adoc

+            **** Bytes sent and received(rx, tx) -- collectd plugin
+            **** Dropped packets -- collectd plugin
+            **** Errors/overruns -- collectd plugin
+    ** Volume level:


utilization?
iops/latency
Disks stats at cluster level(aggregate - iops, latency, throghput, nearful etcs)

Aggregations are planned to be done by grafana in accordance with rules configured by monitoring-integration.. @anivargi or @cloudbehl can you add details in this regard like how those rules can be comminicated..

All these are not slated for upcoming drop and hence the spec will evolve to contain these information..

This spec should list all the metrics and specifically state that this will done at grafana. Also link with spec if any

nthomas-redhat · 2017-08-01T04:08:28Z

specs/gluster_metrics.adoc

+        *** Status -- gluster get-state glusterd odir /var/run file collectd-glusterd-state
+            **** Counters will be made available via collectd plugins based on get-state result
+        *** Bricks counter
+            **** Total -- gluster get-state glusterd...


brick status wise counts?

Collectd plugin does not have access to statuses of bricks from other nodes(get-state provide brick status of brick from the node where it is currently executed). So, monitoring-integration which will have access to etcd(where gluster-integrations from all gluster nodes sync their respective brick statuses to etcd) can push this stat to graphite as an exception

@nthomas-redhat @anivargi @brainfunked Please suggest

I agree. As part of perfirmance improvements, the similar thing is planned for utilization, status, connections aggregations in gluster sds sync as well. IN that case we would have a separate thread which executes a flow once cluster is imported/created to sync the aggregated details and schedules it to run at regular intervals post that.
Something similar would be required here as well I feel.

@anmolbabu , what is discussed earlier was to collect individual status per node and aggregate at grafana right?

@nthomas-redhat Yes that stands to be the case even now.
Only thing is the aggregation logic like for example, brick utilization to volume utilization aggregation should not ideally be a mere summation of utilization of all bricks but instead this differs based on volume type.. So these agregation logic need to be communicated to grafana from monitoring-integration..

nthomas-redhat · 2017-08-01T04:11:18Z

specs/gluster_metrics.adoc

+        *** Memory -- collectd plugin
+        *** Storage(Mount point) -- collectd plugin
+        *** SWAP -- collectd plugin
+        *** Disk IOPS -- collectd plugin


Is this individual disk level or aggregate for node like iops, latency, throghput, nearful etc
Also are we planning to collect brick status wise counts here?

Aggregations are planned to be done by grafana in accordance with rules configured by monitoring-integration.. @anivargi or @cloudbehl can you add details in this regard like how those rules can be comminicated..

All these are not slated for upcoming drop and hence the spec will evolve to contain these information..

Agreed, but this spec should list all the metrics and specifically state that this will done at grafana. Also link with spec if any

nnDarshan · 2017-08-01T06:40:34Z

specs/gluster_metrics.adoc

+    ** Node level:
+        *** CPU -- collectd plugin
+        *** Memory -- collectd plugin
+        *** Storage(Mount point) -- collectd plugin


We should probably decide on set of predefined mounts that we need to monitor right ? like /root, /boot, /home.

Why??

I feel we can expose all data but allow UI to decide what it wants to show.. And if we limit the mount-points to be monitored, there is every chance we loose out some or many important mount points created from cli(ex: volume mounts, disk mounts etc..)..

Otherwise the moment we detect even some of these mount points via any tendrl service the collectd config file responsible for mount point stats will need to reconfigured and collectd needs to be restarted every time on any(all) nodes where ever this happens..

only concern with monitoring all mount points on the node is that all the glusterfs bricks are mount points and we are monitoring them as part of brick utilization collection. if we monitor all mount points we will be repeating the monitoring of gluster brick mount points.

I acknowledge this.. But then if monitoring-integration can bury this info in templates that it communicates to grafana, I would prefer that rather than limiting collectd to monitor only a pre-defined set of mount points.. The reason is as I said, if we ever dynamically discover that we need a new mount-point to be monitored, that involves the following:

collectd config change

restart collectd to reflect config change to it.
and this on every node where the change happened..

nnDarshan · 2017-08-01T06:45:02Z

specs/gluster_metrics.adoc

+            **** Total -- gluster get-state glusterd...
+        *** Pending Heal -- gluster volume heal all statistics
+        *** Rebalance status -- gluster get-state glusterd..
+        *** Number of connections -- gluster volume status all clients --xml


I guess this is a costly operation. However get-state gives you per brick client info that is present in that particular node. I feel that can be used, what do you think ?

I don't see this currently from get-state o/p is it planned for some-time soon??

This is available in latest version(this was added recently). you can check gluster nightly builds

nnDarshan · 2017-08-01T07:07:00Z

specs/gluster_metrics.adoc

+            **** Bytes sent and received(rx, tx) -- collectd plugin
+            **** Dropped packets -- collectd plugin
+            **** Errors/overruns -- collectd plugin
+    ** Volume level:


Hey, isn't snapshot count for a volume, number of geo-rep sessions for a volume (with ok and faulty count) needed

Hmm I think we narrowed the scope to https://docs.google.com/spreadsheets/d/1X2JREn0TybSqLHgx9vcv-zjn4S9X9_Xz0JdhADew5-Q/edit?ts=5965bec8#gid=1819780204 and I missed this but that shouldn't be too difficult to add from get-state o/p.
I'll add them..

nnDarshan · 2017-08-01T09:24:44Z

Looks good to me

anmolbabu · 2017-08-08T08:10:20Z

@r0h4n @nthomas-redhat @shtripat @brainfunked As discussedI have trimmed this spec to deal with milestone1 part
I'll raise Gluster Metrics-milestone 2 once this is merged
Please review..

shtripat · 2017-08-08T08:17:42Z