Commit a4b7b7b: Gluster Metrics - Milestone 2

tendrl-bug-id: Tendrl#188
Signed-off-by: anmolbabu <anmolbudugutta@gmail.com>
Committed by anmolbabu on Aug 8, 2017 (1 parent: 19ca209)
Changed file: specs/gluster-metrics-milestone2.adoc (229 additions, 0 deletions)

// vim: tw=79

= Gluster Metrics

== Introduction

This specification deals with monitoring tendrl-managed gluster clusters.
It explains the following:

* Which gluster metrics are to be enabled.
* How each of the metrics is going to be implemented.
* Any flows to be added to the node-agent for configuring the collectd
plugins needed by the individual metrics.

This specification deals only with the metrics implemented in milestone 2.

== Problem description

An admin using tendrl for gluster cluster management should be able to see an
easy-to-interpret representation of their gluster cluster(s), with appropriate
utilization trend patterns, and any issue in the cluster(s) should be made
noticeable to the admin with minimal or no effort.

== Use Cases

A gluster cluster is imported (or created) in tendrl. From this point on, the
cluster needs to be monitored for utilization and status behaviour.

== Proposed change

The gluster metrics will be enabled at the node level from collectd, and any
aggregations thereon will be done by grafana, either on its own or using the
aggregation rules communicated to it by the monitoring-integration module.

* Following are the gluster metrics to be enabled from collectd along with the
source of the information:
** Brick level:
*** FOP Latency -- gluster volume profile <vol_name> info --xml

XML attribute hierarchy in the CLI response:

----
cliOutput -> volProfile -> brick -> cumulativeStats -> fopStats -> fop -> avgLatency
cliOutput -> volProfile -> brick -> cumulativeStats -> fopStats -> fop -> minLatency
cliOutput -> volProfile -> brick -> cumulativeStats -> fopStats -> fop -> maxLatency
----
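
For illustration, the following is a minimal sketch (in Python, using the
standard xml.etree.ElementTree module) of how these attributes could be
pulled out of the CLI response; element names beyond those listed above
(such as the per-fop "name" child) are assumptions, not confirmed API:

----
import xml.etree.ElementTree as ElementTree


def parse_fop_latencies(xml_out):
    # Root element is <cliOutput>; walk down per the hierarchy above.
    root = ElementTree.fromstring(xml_out)
    result = {}
    for idx, brick in enumerate(root.findall('volProfile/brick')):
        fops = {}
        for fop in brick.findall('cumulativeStats/fopStats/fop'):
            # Assumption: each <fop> names itself in a <name> child.
            name = fop.find('name').text
            fops[name] = {
                'avgLatency': float(fop.find('avgLatency').text),
                'minLatency': float(fop.find('minLatency').text),
                'maxLatency': float(fop.find('maxLatency').text),
            }
        # Bricks are keyed by position here; the actual plugin would
        # map them back to brick paths.
        result[idx] = fops
    return result
----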

*** LVM thin pool metadata usage -- lvm vgs command
*** LVM thin pool data usage -- lvm vgs command

The thin pool usage is derived from the command output (name-prefixed,
'$'-separated fields) roughly as follows:

----
import os


def get_lvs(cmd_output_lines):
    # Parse '$'-separated LVM2_<FIELD>=<value> pairs, one LV per line,
    # as produced by e.g.:
    #   lvm vgs --unquoted --noheadings --nameprefixes --separator $ \
    #       -o lv_name,lv_attr,lv_path,pool_lv,lv_size,data_percent,\
    #          lv_metadata_size,metadata_percent,vg_name
    l = map(lambda x: dict(x),
            map(lambda x: [e.split('=') for e in x],
                map(lambda x: x.strip().split('$'), cmd_output_lines)))
    d = {}
    for i in l:
        if i['LVM2_LV_ATTR'][0] == 't':
            # Thin pools are keyed as "<vg_name>/<lv_name>".
            k = "%s/%s" % (i['LVM2_VG_NAME'], i['LVM2_LV_NAME'])
        else:
            # Other LVs are keyed by their resolved device path.
            k = os.path.realpath(i['LVM2_LV_PATH'])
        d.update({k: i})
    return d


# Given lvs = get_lvs(...) and the brick's device path, derive the data
# and metadata usage of the backing thin pool; an lv_attr starting with
# 'V' marks a thin volume. 'out' is the plugin's per-brick result dict.
if lvs and device in lvs and \
        lvs[device]['LVM2_LV_ATTR'][0] == 'V':
    thinpool = "%s/%s" % (lvs[device]['LVM2_VG_NAME'],
                          lvs[device]['LVM2_POOL_LV'])
    out['thinpool_size'] = float(
        lvs[thinpool]['LVM2_LV_SIZE']) / 1024
    out['thinpool_used_percent'] = float(
        lvs[thinpool]['LVM2_DATA_PERCENT'])
    out['metadata_size'] = float(
        lvs[thinpool]['LVM2_LV_METADATA_SIZE']) / 1024
    out['metadata_used_percent'] = float(
        lvs[thinpool]['LVM2_METADATA_PERCENT'])
    out['thinpool_free'] = out['thinpool_size'] * (
        1 - out['thinpool_used_percent'] / 100.0)
    out['thinpool_used'] = out['thinpool_size'] - out['thinpool_free']
    out['metadata_free'] = out['metadata_size'] * (
        1 - out['metadata_used_percent'] / 100.0)
    out['metadata_used'] = out['metadata_size'] - out['metadata_free']
----

** Volume level:
*** Pending Heal -- gluster volume heal all statistics
*** Total Heal -- gluster volume heal all statistics
*** Split-brain files pending heal
*** Rebalance status -- gluster get-state glusterd..
*** Geo-rep sessions count
**** Faulty -- gluster get-state -- slated for upcoming glusterfs builds
**** Ok -- gluster get-state -- slated for upcoming glusterfs builds
*** Number of connections -- gluster get-state
*** Snapshot count -- gluster get-state
*** Volume status (Degraded)
** Cluster level:
*** Quorum status
*** Degraded volume count
** Brick- and cluster-level status-wise counters can't be made available from
collectd, as a collectd plugin can only access the brick info of the node on
which it is executing; info regarding the other bricks in the cluster is not
available, and the plugins are stateless. If required, these counters will
instead be computed and updated to graphite by monitoring-integration.
* The plugin "tendrl_glusterfs_health_counters" in node-agent reads and parses
get-state o/p so, all counters will be added from this plugin.
* The plugin "tendrl_glusterfs_brick_utilization" currently pushes lvm thin
pool metadata and data usage but for some reason it is currently appearing
as None. Need to debug this as part of this milestone.
* The plugin "tendrl_glusterfs_profile_info" already uses volume profile info
and updates brick level io details. Now, it needs to also update the fop
stats which is also available via same command.
* A new plugin that executes gluster volume heal all statistics and parses the
output to get heal stats needs to be added. It also needs to be configured
from the node-agent import flow. The name of the plugin will be
"tendrl_glusterfs_heal_info" (a sketch follows this list).
* Volume degraded status -- to be explored.
* Cluster quorum status -- to be explored.
* The aggregated/derived metrics that will be made available at the grafana
level, in accordance with rules configured by monitoring-integration, are
briefed in:
----
https://github.com/Tendrl/specifications/issues/179
----
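
Below is a minimal sketch of what the new "tendrl_glusterfs_heal_info"
collectd plugin could look like; the heal-statistics parsing itself is left
as a placeholder and the metric names are illustrative, not final:

----
import collectd


def get_heal_stats():
    # Placeholder: run `gluster volume heal all statistics` and parse
    # per-volume pending/total heal counts. The actual parsing is part
    # of the plugin implementation and is not sketched here.
    return {'vol1': {'healed_count': 0, 'heal_pending_count': 0}}


def read_callback():
    for volume, counters in get_heal_stats().items():
        for counter, value in counters.items():
            metric = collectd.Values(type='gauge')
            metric.plugin = 'heal_info'
            metric.plugin_instance = volume
            metric.type_instance = counter
            metric.values = [value]
            metric.dispatch()


collectd.register_read(read_callback)
----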

=== Alternatives

None

=== Data model impact:

The name-spacing of metrics follows these conventions:

* tendrl.clusters.<cluster_id> will be the prefix for all metrics.
* Node level metrics follow the format:
tendrl.clusters.<cluster_id>.nodes.<node_name>.<plugin_name>.<plugin_attrs>
* Cluster level metrics follow the format:
tendrl.clusters.<cluster_id>.<plugin_name>.<plugin_attrs...>
* Volume level metrics follow the format:
tendrl.clusters.<cluster_id>.volumes.<volume_name>.<plugin_name>.
<plugin_attrs>
* Brick level metrics follow the format:
tendrl.clusters.<cluster_id>.volumes.<volume_name>.nodes.<node_name>.bricks.
<brick_path>.<plugin_name>.<plugin_attrs>
and the same will also be maintained at
tendrl.clusters.<cluster_id>.nodes.<node_name>.bricks.<brick_path>.
<plugin_name>.<plugin_attrs>
for direct mapping from the node level.
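
For example, with these conventions a brick-level heal counter (metric names
purely illustrative) would land at a path like:

----
tendrl.clusters.<cluster_id>.volumes.vol1.nodes.node1.bricks.<brick_path>.heal_info.heal_pending_count
----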

=== Impacted Modules:

==== Tendrl API impact:

None

==== Notifications/Monitoring impact:

The configuration of the collectd plugins will change slightly, in accordance
with the details in the "Proposed change" section.

==== Tendrl/common impact:

None

==== Tendrl/node_agent impact:

* The collectd plugins and templates are maintained and packaged in
node-agent, so the changes described in the "Proposed change" section belong
to node-agent.
* A new plugin, "tendrl_glusterfs_heal_info", will be added to push heal info
stats. This also involves adding a template to configure the new plugin, as
sketched below.
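
A minimal sketch of such a template, following the stock collectd
Python-plugin configuration format (the module path is illustrative):

----
<LoadPlugin python>
  Globals true
</LoadPlugin>

<Plugin python>
  # Directory where node-agent deploys the tendrl gluster plugins
  # (illustrative path).
  ModulePath "/usr/lib64/collectd/gluster"
  Import "tendrl_glusterfs_heal_info"
</Plugin>
----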

==== Sds integration impact:

None

=== Security impact:

None

=== Other end user impact:

The main consumer of this is the tendrl-grafana dashboard.
The impact in and due to the new dashboard will be detailed in a different
spec.

=== Performance impact:

None

=== Other deployer impact:

The configuration of collectd from monitoring-integration will change
slightly, in terms of the number of plugins to configure and the attributes
to be passed in the configuration.

=== Developer impact:

* The aggregations that were previously done by the performance-monitoring
application will now be done in grafana.
** The rules of aggregation will be communicated to grafana by
monitoring-integration. Details of this are out of the scope of this spec and
will be covered as part of:

----
https://github.com/Tendrl/specifications/pull/218
----
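
As an illustration of the kind of aggregation that moves to grafana, a
cluster-wide pending-heal total could be derived in a grafana panel with a
graphite target such as (metric names illustrative):

----
sumSeries(tendrl.clusters.<cluster_id>.volumes.*.heal_info.heal_pending_count)
----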

== Implementation:


=== Assignee(s):

Primary assignees:

* Metrics from collectd: anmolbabu <anmolbudugutta@gmail.com>
* Dashboard: cloudbehl <cloudbehl@gmail.com>

=== Work Items:

https://github.com/Tendrl/node-agent/issues/575

== Dependencies:

None

== Testing:

The plugins push stats to graphite, and the values in graphite will need to
be tested for correctness.
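
For instance, a pushed metric could be spot-checked against the graphite
render API (host and metric path illustrative):

----
import requests

resp = requests.get(
    'http://graphite.example.com/render',
    params={
        'target': 'tendrl.clusters.<cluster_id>.volumes.vol1.'
                  'heal_info.heal_pending_count',
        'from': '-10min',
        'format': 'json',
    })
for series in resp.json():
    # Each series carries a list of [value, timestamp] datapoints.
    print(series['target'], series['datapoints'][-5:])
----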

== Documentation impact:

None

== References:
