// vim: tw=79

= Gluster Metrics

== Introduction

This specification deals with the monitoring of tendrl-managed gluster
clusters. It explains the following:

* Which gluster metrics are to be enabled.
* How each of the metrics is going to be implemented.
* The alert thresholds (defaults) in grafana for each metric.
* Any flows to be added to the node agent for configuring any collectd plugins
needed by the individual metrics.


== Problem description

An admin using tendrl for gluster cluster management should be able to see an
easy-to-interpret representation of his/her gluster cluster(s), with
appropriate utilization trend patterns, and any issue in the cluster(s)
should be made noticeable to the admin with no or minimal effort.

== Use Cases

A gluster cluster is imported (or created) in tendrl. From this point on, the
cluster needs to be monitored for utilization and status behaviour.

== Proposed change

The gluster metrics will be enabled at the node level from collectd, and any
aggregations thereon will be done by grafana, either on its own or using the
aggregation rules communicated to it by the monitoring-integration module.

* Following are the gluster metrics to be enabled from collectd along with the
source of the information:
** Brick level:
*** IOPS -- gluster volume profile <vol_name> info --xml
*** FOP Latency -- gluster volume profile <vol_name> info --xml (a parsing
sketch is given after this list)
*** LVM Thin Pool metadata usage -- lvm vgs command
*** LVM thin pool data usage -- lvm vgs command
** Node level:
*** CPU -- collectd plugin
*** Memory -- collectd plugin
*** Storage(Mount point) -- collectd plugin
*** SWAP -- collectd plugin
*** Disk IOPS -- collectd plugin
*** Network
**** Bytes sent and received(rx, tx) -- collectd plugin
**** Dropped packets -- collectd plugin
**** Errors/overruns -- collectd plugin
** Volume level:
*** Status -- gluster get-state glusterd odir /var/run file collectd-glusterd-state
**** Counters will be made available via collectd plugins based on get-state result
*** Bricks counter
**** Total -- gluster get-state glusterd...
*** Pending Heal -- gluster volume heal all statistics
*** Rebalance status -- gluster get-state glusterd..
*** Number of connections -- gluster volume status all clients --xml
** Cluster level:
*** Status -- Custom logic
*** Quorum status --
*** Number of nodes
**** Total -- gluster get-state glusterd...
**** Status-wise --
*** Number of volumes
**** Total -- gluster get-state glusterd...
**** Status-wise
*** Number of bricks
**** Total -- gluster get-state glusterd...
**** Status-wise
** Status-wise brick and cluster counters can't be made available at the
collectd level, because a collectd plugin only has access to the brick
info corresponding to the node on which it is currently executing; info
regarding the other bricks in the cluster is not available, and the
plugins are stateless. Such counters, if required, will instead be
computed and updated in graphite by monitoring-integration.
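
For illustration, the brick-level IOPS and FOP latency items above could be
collected by a plugin roughly along the following lines. This is only a
minimal sketch: the XML element names follow the usual layout of the gluster
CLI output and should be verified against the installed gluster version, and
the function name is illustrative.

[source,python]
----
import subprocess
import xml.etree.ElementTree as ElementTree


def get_brick_fop_stats(volume_name):
    """Collect per-brick FOP hit counts and average latencies from
    `gluster volume profile <vol_name> info --xml`."""
    out = subprocess.check_output(
        ["gluster", "volume", "profile", volume_name, "info", "--xml"]
    )
    root = ElementTree.fromstring(out)
    stats = {}
    # Element names below need verification against the gluster version
    # in use.
    for brick in root.findall(".//volProfile/brick"):
        fops = {}
        for fop in brick.findall(".//cumulativeStats/fopStats/fop"):
            fops[fop.findtext("name")] = {
                "hits": int(fop.findtext("hits") or 0),
                "avg_latency": float(fop.findtext("avgLatency") or 0.0),
            }
        stats[brick.findtext("brickName")] = fops
    return stats
----
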
* A new pluggable approach within our gluster metric collectors will be
developed, such that the metric collection plugins are loaded and activated
dynamically, given just a one-time configuration specifying the cluster_id
and the graphite host and port addresses. The plugins hooked into this
pluggable framework are executed concurrently as green threads (an approach
currently being experimented with), and hence their atomicity needs to be
ensured. A minimal sketch of such a framework is given after this list.
* The cluster topology is learnt once per execution cycle, and the same
knowledge is shared between the plugins, unless a plugin unavoidably has to
fetch it by other means anyway.
ex: gluster volume status all clients --xml is used to get the
connections count, but its output already readily gives the cluster
topology, which can then be used directly.
Note:
* The above approach runs every plugin in the framework at the same
interval.
* Attempts are being made to have a mechanism by which plugins can
register with the framework as light-weight, heavy-weight etc., and
accordingly the plugin execution interval will differ.
** This is of particular importance for certain operations, where running
the underlying command too frequently can affect performance.
* A generic flow as in:

----
https://github.com/Tendrl/node-monitoring/blob/master/tendrl/node_monitoring/flows/configure_collectd/__init__.py
----

and a generic command to generate a config file from a template, as in:

----
https://github.com/Tendrl/node-monitoring/blob/master/tendrl/node_monitoring/commands/config_manager.py
----

will be added to node-agent.
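
A minimal sketch of the pluggable collection framework described above,
assuming gevent is used for the green threads and that every plugin is a
module exposing a run(conf, topology) callable; the package name and the
configuration handling are illustrative only:

[source,python]
----
import importlib
import pkgutil

import gevent

# Hypothetical package holding the individual metric collection plugins.
import gluster_plugins


def load_plugins():
    """Discover and import every plugin module in the package."""
    return [
        importlib.import_module("%s.%s" % (gluster_plugins.__name__, name))
        for _, name, _ in pkgutil.iter_modules(gluster_plugins.__path__)
    ]


def run_cycle(conf, topology):
    """Run every plugin concurrently as a green thread for one cycle.

    `conf` carries the one-time configuration (cluster_id, graphite host
    and port); `topology` is the cluster topology learnt once per cycle
    and shared between the plugins.
    """
    greenlets = [
        gevent.spawn(plugin.run, conf, topology) for plugin in load_plugins()
    ]
    gevent.joinall(greenlets)
----

Because the plugins run concurrently, anything shared between them (such as
the topology object) is treated as read-only inside a plugin, in line with
the atomicity requirement noted above.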

=== Alternatives

None

=== Data model impact:

The name-spacing of metrics will be as follows:

* tendrl.clusters.<cluster_id> will be the prefix for all metrics.
* Node level metrics follow the format:
tendrl.clusters.<cluster_id>.nodes.<node_name>.<plugin_name>.<plugin_attrs>
* Cluster level metrics follow the format:
tendrl.clusters.<cluster_id>.<plugin_name>.<plugin_attrs...>
* Volume level metrics follow the format:
tendrl.clusters.<cluster_id>.volumes.<volume_name>.<plugin_name>.
<plugin_attrs>
* Brick level metrics follow the format:
tendrl.clusters.<cluster_id>.volumes.<volume_name>.nodes.<node_name>.bricks.
<brick_path>.<plugin_name>.<plugin_attrs>
and the same would also be maintained @
tendrl.clusters.<cluster_id>.nodes.<node_name>.bricks.<brick_path>.
<plugin_name>.<plugin_attrs>
for mapping directly from node level.
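
As an illustration of the name-spacing above, a brick-level datapoint could
be pushed to graphite over the carbon plaintext protocol roughly as follows;
the host, port, metric names and the sanitisation of the brick path are
assumptions, not the final implementation:

[source,python]
----
import socket
import time


def push_metric(graphite_host, graphite_port, path, value):
    """Send one datapoint using the carbon plaintext protocol:
    '<metric path> <value> <timestamp>'."""
    line = "%s %s %d\n" % (path, value, int(time.time()))
    sock = socket.create_connection((graphite_host, graphite_port))
    try:
        sock.sendall(line.encode("utf-8"))
    finally:
        sock.close()


# The '/' in the brick path is assumed to be replaced, since graphite
# treats '.' and '/' specially in metric paths.
brick_path = "/bricks/brick1".replace("/", "|")
push_metric(
    "graphite.example.com",
    2003,
    "tendrl.clusters.ab12.volumes.vol1.nodes.node1.bricks.%s.iops.read"
    % brick_path,
    42,
)
----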

=== Impacted Modules:

==== Tendrl API impact:

None

==== Notifications/Monitoring impact:

The configuration method will now change slightly, in accordance with the
details in "Proposed change".

==== Tendrl/common impact:

None

==== Tendrl/node_agent impact:

None

==== Sds integration impact:

None

=== Security impact:

None

=== Other end user impact:

The main consumer of this is the tendrl-grafana dashboard.
The impact on, and due to, the new dashboard will be detailed in a separate
spec.

=== Performance impact:

None

=== Other deployer impact:

The configuration of collectd from monitoring-integration will change
slightly in terms of the number of plugins to configure and the attributes
to be passed in the configuration.
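
As a sketch of what this could look like (the template name, file paths and
attribute names below are illustrative, the real flow being the node-agent
flow referenced under "Proposed change"), the per-plugin collectd
configuration could be rendered from a template as follows:

[source,python]
----
from jinja2 import Environment, FileSystemLoader


def write_collectd_conf(template_dir, template_name, dest_path, attrs):
    """Render a collectd plugin configuration from a jinja2 template and
    write it to the collectd config directory."""
    env = Environment(loader=FileSystemLoader(template_dir))
    rendered = env.get_template(template_name).render(**attrs)
    with open(dest_path, "w") as conf_file:
        conf_file.write(rendered)


# Attributes passed down by monitoring-integration; names are illustrative.
write_collectd_conf(
    "/etc/collectd_template",
    "tendrl_gluster_metrics.conf.j2",
    "/etc/collectd.d/tendrl_gluster_metrics.conf",
    {
        "cluster_id": "ab12",
        "graphite_host": "graphite.example.com",
        "graphite_port": 2003,
        "interval": 60,
    },
)
----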

=== Developer impact:

* The aggregations that were previously done by the performance-monitoring
application will now be done by grafana.
** The rules of aggregation will be communicated to grafana by
monitoring-integration. Details of this are out of the scope of this spec.
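
As an illustrative example of such a rule (the series names and grafana
template variables are assumptions), a volume-level read IOPS panel could
aggregate the brick-level series with a built-in graphite function instead
of relying on pre-computed values:

----
sumSeries(tendrl.clusters.$cluster_id.volumes.$volume.nodes.*.bricks.*.iops.read)
----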

== Implementation:


=== Assignee(s):

Primary assignee:
Metrics from collectd: anmolbabu <anmolbudugutta@gmail.com>
Dashboard: cloudbehl <cloudbehl@gmail.com>

=== Work Items:

Github issues will be raised and linked here once the repo that hosts the
different plugins is finalised, especially in view of the merging of
node-monitoring and node-agent.

== Dependencies:

None

== Testing:

The plugins push stats to graphite, and the values stored in graphite will
need to be tested for correctness.
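
A minimal sketch of how such a check could be automated, assuming
graphite-web's render API is reachable from the test host; the metric path
below is illustrative:

[source,python]
----
import requests


def fetch_datapoints(graphite_web, target):
    """Fetch the last 10 minutes of a metric from graphite-web's render
    API and return its (value, timestamp) pairs."""
    resp = requests.get(
        "%s/render" % graphite_web,
        params={"target": target, "from": "-10min", "format": "json"},
    )
    resp.raise_for_status()
    series = resp.json()
    return series[0]["datapoints"] if series else []


points = fetch_datapoints(
    "http://graphite.example.com",
    "tendrl.clusters.ab12.nodes.node1.memory.percent-used",
)
assert any(value is not None for value, _ in points), "no datapoints found"
----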

== Documentation impact:

None

== References:

Attempts in this regard can be found @:
https://github.com/Tendrl/node-monitoring/pull/79
