// vim: tw=79

= Gluster Metrics

== Introduction

This specification deals with the monitoring of tendrl-managed gluster
clusters. It explains the following:

* Which gluster metrics are to be enabled.
* How each of the metrics is going to be implemented.
* The alert thresholds (defaults) in grafana for each metric.
* Any flows to be added to the node agent for configuring any collectd plugins
needed by the individual metrics.


== Problem description

An admin using tendrl for gluster cluster management should be able to see an
easy-to-interpret representation of his/her gluster cluster(s), with
appropriate utilization trend patterns, and any issue in the cluster(s)
should be made noticeable to the admin with no or minimal effort.

== Use Cases

A gluster cluster is imported (or created) in tendrl. From this point on, the
cluster needs to be monitored for utilization and status behaviour.

== Proposed change

The gluster metrics will be enabled at the node level from collectd, and any
aggregations thereon will be done by grafana, either on its own or using the
aggregation rules communicated to it by the monitoring-integration module.

* Following are the gluster metrics to be enabled from collectd along with the
source of the information:
** Brick level:
*** IOPS -- gluster volume profile <vol_name> info --xml
*** FOP Latency -- gluster volume profile <vol_name> info --xml (a parsing
sketch is given after this list)
*** LVM Thin Pool metadata usage -- lvm vgs command
*** LVM thin pool data usage -- lvm vgs command
** Node level:
*** CPU -- collectd plugin
*** Memory -- collectd plugin
*** Storage(Mount point) -- collectd plugin
*** SWAP -- collectd plugin
*** Disk IOPS -- collectd plugin
*** Network
**** Bytes sent and received(rx, tx) -- collectd plugin
**** Dropped packets -- collectd plugin
**** Errors/overruns -- collectd plugin
** Volume level:
*** Status -- gluster get-state glusterd odir /var/run file collectd-glusterd-state
**** Counters will be made available via collectd plugins based on get-state result
*** Bricks counter
**** Total -- gluster get-state glusterd...
*** Pending Heal -- gluster volume heal all statistics
*** Rebalance status -- gluster get-state glusterd..
*** Number of connections -- gluster volume status all clients --xml
** Cluster level:
*** Status -- Custom logic
*** Quorum status --
*** Number of nodes
**** Total -- gluster get-state glusterd...
**** Status-wise --
*** Number of volumes
**** Total -- gluster get-state glusterd...
**** Status-wise
*** Number of bricks
**** Total -- gluster get-state glusterd...
**** Status-wise
** Status-wise brick and cluster counters can't be made available at the
collectd level, because a collectd plugin only has access to the brick
info corresponding to the node on which it is currently executing; info
regarding the other bricks in the cluster is not available, and the
plugins are stateless. Such counters, if required, will instead be
computed and updated in graphite by monitoring-integration.
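
For illustration, the brick-level IOPS and FOP latency items above could be
collected by a plugin roughly along the following lines. This is only a
minimal sketch: the XML element names follow the usual layout of the gluster
CLI output and should be verified against the installed gluster version, and
the function name is illustrative.

[source,python]
----
import subprocess
import xml.etree.ElementTree as ElementTree


def get_brick_fop_stats(volume_name):
    """Collect per-brick FOP hit counts and average latencies from
    `gluster volume profile <vol_name> info --xml`."""
    out = subprocess.check_output(
        ["gluster", "volume", "profile", volume_name, "info", "--xml"]
    )
    root = ElementTree.fromstring(out)
    stats = {}
    # Element names below need verification against the gluster version
    # in use.
    for brick in root.findall(".//volProfile/brick"):
        fops = {}
        for fop in brick.findall(".//cumulativeStats/fopStats/fop"):
            fops[fop.findtext("name")] = {
                "hits": int(fop.findtext("hits") or 0),
                "avg_latency": float(fop.findtext("avgLatency") or 0.0),
            }
        stats[brick.findtext("brickName")] = fops
    return stats
----
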
* A new pluggable approach within our gluster metric collectors will be
developed, such that the metric collection plugins are loaded and activated
dynamically, given just a one-time configuration specifying the cluster_id
and the graphite host and port addresses. The plugins hooked into this
pluggable framework are executed concurrently as green threads (an approach
currently being experimented with), and hence their atomicity needs to be
ensured. A minimal sketch of such a framework is given after this list.
* The cluster topology is learnt once per execution cycle, and the same
knowledge is shared between the plugins, unless a plugin unavoidably has to
fetch it by other means anyway.
ex: gluster volume status all clients --xml is used to get the
connections count, but its output already readily gives the cluster
topology, which can then be used directly.
Note:
* The above approach runs every plugin in the framework at the same
interval.
* Attempts are being made to have a mechanism by which plugins can
register with the framework as light-weight, heavy-weight etc., and
accordingly the plugin execution interval will differ.
** This is of particular importance for certain operations, where running
the underlying command too frequently can affect performance.
* A generic flow as in:

----
https://github.com/Tendrl/node-monitoring/blob/master/tendrl/node_monitoring/flows/configure_collectd/__init__.py
----

and a generic command to generate a config file from a template, as in:

----
https://github.com/Tendrl/node-monitoring/blob/master/tendrl/node_monitoring/commands/config_manager.py
----

will be added to node-agent.
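
A minimal sketch of the pluggable collection framework described above,
assuming gevent is used for the green threads and that every plugin is a
module exposing a run(conf, topology) callable; the package name and the
configuration handling are illustrative only:

[source,python]
----
import importlib
import pkgutil

import gevent

# Hypothetical package holding the individual metric collection plugins.
import gluster_plugins


def load_plugins():
    """Discover and import every plugin module in the package."""
    return [
        importlib.import_module("%s.%s" % (gluster_plugins.__name__, name))
        for _, name, _ in pkgutil.iter_modules(gluster_plugins.__path__)
    ]


def run_cycle(conf, topology):
    """Run every plugin concurrently as a green thread for one cycle.

    `conf` carries the one-time configuration (cluster_id, graphite host
    and port); `topology` is the cluster topology learnt once per cycle
    and shared between the plugins.
    """
    greenlets = [
        gevent.spawn(plugin.run, conf, topology) for plugin in load_plugins()
    ]
    gevent.joinall(greenlets)
----

Because the plugins run concurrently, anything shared between them (such as
the topology object) is treated as read-only inside a plugin, in line with
the atomicity requirement noted above.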

=== Alternatives

None

=== Data model impact:

The name-spacing of metrics will be as follows:

* tendrl.clusters.<cluster_id> will be the prefix for all metrics.
* Node level metrics follow the format:
tendrl.clusters.<cluster_id>.nodes.<node_name>.<plugin_name>.<plugin_attrs>
* Cluster level metrics follow the format:
tendrl.clusters.<cluster_id>.<plugin_name>.<plugin_attrs...>
* Volume level metrics follow the format:
tendrl.clusters.<cluster_id>.volumes.<volume_name>.<plugin_name>.
<plugin_attrs>
* Brick level metrics follow the format:
tendrl.clusters.<cluster_id>.volumes.<volume_name>.nodes.<node_name>.bricks.
<brick_path>.<plugin_name>.<plugin_attrs>
and the same would also be maintained @
tendrl.clusters.<cluster_id>.nodes.<node_name>.bricks.<brick_path>.
<plugin_name>.<plugin_attrs>
for mapping directly from node level.
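
As an illustration of the name-spacing above, a brick-level datapoint could
be pushed to graphite over the carbon plaintext protocol roughly as follows;
the host, port, metric names and the sanitisation of the brick path are
assumptions, not the final implementation:

[source,python]
----
import socket
import time


def push_metric(graphite_host, graphite_port, path, value):
    """Send one datapoint using the carbon plaintext protocol:
    '<metric path> <value> <timestamp>'."""
    line = "%s %s %d\n" % (path, value, int(time.time()))
    sock = socket.create_connection((graphite_host, graphite_port))
    try:
        sock.sendall(line.encode("utf-8"))
    finally:
        sock.close()


# The '/' in the brick path is assumed to be replaced, since graphite
# treats '.' and '/' specially in metric paths.
brick_path = "/bricks/brick1".replace("/", "|")
push_metric(
    "graphite.example.com",
    2003,
    "tendrl.clusters.ab12.volumes.vol1.nodes.node1.bricks.%s.iops.read"
    % brick_path,
    42,
)
----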

=== Impacted Modules:

==== Tendrl API impact:

None

==== Notifications/Monitoring impact:

The configuration method will now change slightly, in accordance with the
details in "Proposed change".

==== Tendrl/common impact:

None

==== Tendrl/node_agent impact:

None

==== Sds integration impact:

None

=== Security impact:

None

=== Other end user impact:

The main consumer of this is the tendrl-grafana dashboard.
The impact on, and due to, the new dashboard will be detailed in a separate
spec.

=== Performance impact:

None

=== Other deployer impact:

The configuration of collectd from monitoring-integration will change
slightly in terms of the number of plugins to configure and the attributes
to be passed in the configuration.
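
As a sketch of what this could look like (the template name, file paths and
attribute names below are illustrative, the real flow being the node-agent
flow referenced under "Proposed change"), the per-plugin collectd
configuration could be rendered from a template as follows:

[source,python]
----
from jinja2 import Environment, FileSystemLoader


def write_collectd_conf(template_dir, template_name, dest_path, attrs):
    """Render a collectd plugin configuration from a jinja2 template and
    write it to the collectd config directory."""
    env = Environment(loader=FileSystemLoader(template_dir))
    rendered = env.get_template(template_name).render(**attrs)
    with open(dest_path, "w") as conf_file:
        conf_file.write(rendered)


# Attributes passed down by monitoring-integration; names are illustrative.
write_collectd_conf(
    "/etc/collectd_template",
    "tendrl_gluster_metrics.conf.j2",
    "/etc/collectd.d/tendrl_gluster_metrics.conf",
    {
        "cluster_id": "ab12",
        "graphite_host": "graphite.example.com",
        "graphite_port": 2003,
        "interval": 60,
    },
)
----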

=== Developer impact:

* The aggregations that were previously done by the performance-monitoring
application will now be done by grafana.
** The rules of aggregation will be communicated to grafana by
monitoring-integration. Details of this are out of the scope of this spec.
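
As an illustrative example of such a rule (the series names and grafana
template variables are assumptions), a volume-level read IOPS panel could
aggregate the brick-level series with a built-in graphite function instead
of relying on pre-computed values:

----
sumSeries(tendrl.clusters.$cluster_id.volumes.$volume.nodes.*.bricks.*.iops.read)
----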

== Implementation:


=== Assignee(s):

Primary assignee:
Metrics from collectd: anmolbabu <anmolbudugutta@gmail.com>
Dashboard: cloudbehl <cloudbehl@gmail.com>

=== Work Items:

Github issues will be raised and linked here once the repo that hosts the
different plugins is finalised, especially in view of the merging of
node-monitoring and node-agent.

== Dependencies:

None

== Testing:

The plugins push stats to graphite, and the values stored in graphite will
need to be tested for correctness.
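
A minimal sketch of how such a check could be automated, assuming
graphite-web's render API is reachable from the test host; the metric path
below is illustrative:

[source,python]
----
import requests


def fetch_datapoints(graphite_web, target):
    """Fetch the last 10 minutes of a metric from graphite-web's render
    API and return its (value, timestamp) pairs."""
    resp = requests.get(
        "%s/render" % graphite_web,
        params={"target": target, "from": "-10min", "format": "json"},
    )
    resp.raise_for_status()
    series = resp.json()
    return series[0]["datapoints"] if series else []


points = fetch_datapoints(
    "http://graphite.example.com",
    "tendrl.clusters.ab12.nodes.node1.memory.percent-used",
)
assert any(value is not None for value, _ in points), "no datapoints found"
----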

== Documentation impact:

None

== References:

Attempts in this regard can be found @:
https://github.com/Tendrl/node-monitoring/pull/79
