// Commit: Gluster metrics. tendrl-bug-id: Tendrl#188. Signed-off-by: anmolbabu <anmolbudugutta@gmail.com>
// vim: tw=79

= Gluster Metrics

== Introduction

This specification deals with monitoring a Tendrl-managed Gluster cluster.
It explains the following:

* Which Gluster metrics are to be enabled.
* How each of the metrics is going to be implemented.
* Any flows to be added to the node agent for configuring any collectd plugins
  needed by the individual metrics.

== Problem description

An admin using Tendrl for Gluster cluster management should be able to see an
easy-to-interpret representation of their Gluster cluster(s), with appropriate
utilization trend patterns, and any issue in the cluster(s) should be made
noticeable to the admin with no or minimal effort.

== Use Cases

A Gluster cluster is imported (or created) in Tendrl. From this point, the
cluster needs to be monitored for utilization and status behaviour.

== Proposed change

The Gluster metrics will be enabled at the collectd level, and any
aggregations thereon will be done by Grafana, either on its own or using the
aggregation rules communicated to it by the monitoring-integration module.

* Following are the Gluster metrics to be enabled from collectd, along with
  the source of the information:
** Brick level:
*** IOPS -- gluster volume profile <vol_name> info --xml

xml attribute hierarchy in cli response:
cliOutput -> volProfile -> brick -> cumulativeStats -> totalRead
cliOutput -> volProfile -> brick -> cumulativeStats -> totalWrite
iops <- totalRead + totalWrite
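
As a sketch, the IOPS derivation above could be implemented as follows. The
element names mirror the attribute hierarchy listed above; the trimmed sample
document, the `brickName` tag and the helper name are illustrative assumptions,
not taken from an actual plugin:

[source,python]
----
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for the real `gluster volume profile
# <vol_name> info --xml` output.
SAMPLE = """\
<cliOutput>
  <volProfile>
    <brick>
      <brickName>node1:/bricks/b1</brickName>
      <cumulativeStats>
        <totalRead>1024</totalRead>
        <totalWrite>2048</totalWrite>
      </cumulativeStats>
    </brick>
  </volProfile>
</cliOutput>
"""

def brick_iops(xml_text):
    """Return {brick_name: totalRead + totalWrite} for each brick."""
    root = ET.fromstring(xml_text)  # root element is <cliOutput>
    result = {}
    for brick in root.findall("./volProfile/brick"):
        stats = brick.find("cumulativeStats")
        read = int(stats.findtext("totalRead", default="0"))
        write = int(stats.findtext("totalWrite", default="0"))
        result[brick.findtext("brickName")] = read + write
    return result
----

In the real collector, `xml_text` would be the captured stdout of the CLI
invocation rather than an inline sample.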

*** FOP Latency -- gluster volume profile <vol_name> info --xml

xml attribute hierarchy in cli response:
cliOutput -> volProfile -> brick -> cumulativeStats -> fopStats -> fop -> avgLatency
cliOutput -> volProfile -> brick -> cumulativeStats -> fopStats -> fop -> minLatency
cliOutput -> volProfile -> brick -> cumulativeStats -> fopStats -> fop -> maxLatency

*** Brick Utilization -- python os.statvfs on brick path

total -> data.f_blocks * data.f_bsize
free -> data.f_bfree * data.f_bsize
used_percent -> 100 - (100.0 * free / total)

*** Inode utilization -- python os.statvfs on brick path

free_inode -> data.f_ffree
total_inode -> data.f_files
used_percent_inode -> 100 - (100.0 * free_inode / total_inode)
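
The two statvfs-based computations above can be sketched together as a single
helper; the formulas are exactly those listed, and the function name is
illustrative:

[source,python]
----
import os

def brick_utilization(path):
    """Capacity and inode utilization of the filesystem backing `path`,
    using the os.statvfs formulas listed above."""
    data = os.statvfs(path)
    total = data.f_blocks * data.f_bsize
    free = data.f_bfree * data.f_bsize
    total_inode = data.f_files
    free_inode = data.f_ffree
    return {
        "total": total,
        "free": free,
        "used_percent": 100 - (100.0 * free / total),
        "used_percent_inode": 100 - (100.0 * free_inode / total_inode),
    }
----

In the real plugin, `path` would be the brick path learnt from the cluster
topology.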

*** LVM thin pool metadata usage -- lvm lvs command
*** LVM thin pool data usage -- lvm lvs command

The usage is computed from the parsed lvs output as follows:
[source,python]
----
import os

def parse_lvs(out):
    """Parse `lvs` output lines (produced with --nameprefixes, --noheadings
    and --separator '$') into a dict keyed by "vg/lv" for thin pools
    (lv_attr starts with 't') or by the real device path otherwise."""
    lvs = {}
    for line in out:
        fields = dict(e.split('=') for e in line.strip().split('$'))
        if fields['LVM2_LV_ATTR'][0] == 't':
            key = "%s/%s" % (fields['LVM2_VG_NAME'], fields['LVM2_LV_NAME'])
        else:
            key = os.path.realpath(fields['LVM2_LV_PATH'])
        lvs[key] = fields
    return lvs

def thin_pool_usage(lvs, device, out):
    """Fill `out` with data/metadata usage for a thin volume (lv_attr
    starts with 'V'), looking up its pool via LVM2_POOL_LV."""
    if lvs and device in lvs and \
            lvs[device]['LVM2_LV_ATTR'][0] == 'V':
        thinpool = "%s/%s" % (lvs[device]['LVM2_VG_NAME'],
                              lvs[device]['LVM2_POOL_LV'])
        out['thinpool_size'] = float(
            lvs[thinpool]['LVM2_LV_SIZE']) / 1024
        out['thinpool_used_percent'] = float(
            lvs[thinpool]['LVM2_DATA_PERCENT'])
        out['metadata_size'] = float(
            lvs[thinpool]['LVM2_LV_METADATA_SIZE']) / 1024
        out['metadata_used_percent'] = float(
            lvs[thinpool]['LVM2_METADATA_PERCENT'])
        out['thinpool_free'] = out['thinpool_size'] * (
            1 - out['thinpool_used_percent'] / 100.0)
        out['thinpool_used'] = out['thinpool_size'] - out['thinpool_free']
        out['metadata_free'] = out['metadata_size'] * (
            1 - out['metadata_used_percent'] / 100.0)
        out['metadata_used'] = out['metadata_size'] - out['metadata_free']
    return out
----

** Node level:
*** CPU -- collectd plugin
*** Memory -- collectd plugin
*** Storage (mount point) -- collectd plugin
*** Swap -- collectd plugin
*** Disk IOPS -- collectd plugin
*** Network
**** Bytes sent and received (rx, tx) -- collectd plugin
**** Dropped packets -- collectd plugin
**** Errors/overruns -- collectd plugin
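
The node-level metrics above map onto stock collectd plugins; a minimal
configuration fragment could look like the following (the plugin names are
standard collectd plugins, but the exact option values used by
monitoring-integration are not specified here):

----
LoadPlugin cpu
LoadPlugin memory
LoadPlugin df
LoadPlugin swap
LoadPlugin disk
LoadPlugin interface

<Plugin df>
  # Report mount-point utilization as percentages
  ValuesPercentage true
</Plugin>
----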
** Volume level:
*** Status -- gluster get-state glusterd odir /var/run file collectd-glusterd-state
**** Counters will be made available via collectd plugins based on the get-state result
*** Bricks counter
**** Total -- gluster get-state glusterd...
*** Pending Heal -- gluster volume heal all statistics
*** Rebalance status -- gluster get-state glusterd...
*** Number of connections -- gluster volume status all clients --xml
** Cluster level:
*** Quorum status
*** Number of nodes
**** Total -- gluster get-state glusterd...
**** Status-wise
Based on "peer in cluster" and "connected" in the get-state output
*** Number of volumes
**** Total -- gluster get-state glusterd...
**** Status-wise -- gluster get-state glusterd...
*** Number of bricks
**** Total -- gluster get-state glusterd...
** Brick and cluster status-wise counters can't be made available at the
   collectd level: a plugin only has access to the brick info of the node it
   is currently executing on, info regarding the other bricks in the cluster
   is not available, and plugins are stateless. If required, these counters
   will instead be computed and updated to Graphite by monitoring-integration.
* A new pluggable approach within our Gluster metric collectors will be
  developed, such that the metric collection plugins are loaded and
  activated dynamically, given just a one-time configuration specifying the
  cluster_id and the Graphite host and port. The plugins hooked into the
  pluggable framework are executed concurrently as green threads (being
  experimented with), and hence their atomicity needs to be ensured.
* The cluster topology is learnt once per execution cycle and the same
  knowledge is shared between the plugins, unless it is unavoidable to fetch
  it by a different means.
  ex: gluster volume status clients info --xml should be used to get the
  connections count, but that already readily gives the cluster topology,
  which can be used directly.

Note:

* The above approach runs every plugin in the framework at the same interval.
* Attempts are being made to have a mechanism by which plugins can register
  to the framework as light-weight, heavy-weight etc., and accordingly the
  plugin execution interval will differ. This is of particular importance for
  certain operations where executing the command too frequently can affect
  performance.
* A generic flow as in:

----
https://github.com/Tendrl/node-monitoring/blob/master/tendrl/node_monitoring/flows/configure_collectd/__init__.py
----

and a generic command to generate a config file from a template as in:

----
https://github.com/Tendrl/node-monitoring/blob/master/tendrl/node_monitoring/commands/config_manager.py
----

will be added to node-agent. More details can be found at:

----
https://github.com/Tendrl/specifications/pull/219
----
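
The pluggable collection cycle described above can be sketched roughly as
follows. All names here (Plugin, run_cycle, the sample plugins and topology)
are hypothetical; the green-thread execution the spec mentions is still being
experimented with, so this sketch approximates it with standard-library
worker threads:

[source,python]
----
from concurrent.futures import ThreadPoolExecutor

class Plugin:
    """Hypothetical base class; each real plugin collects one metric family."""
    def collect(self, topology):
        raise NotImplementedError

class NodeCountPlugin(Plugin):
    def collect(self, topology):
        return {"nodes.total": len(topology["nodes"])}

class VolumeCountPlugin(Plugin):
    def collect(self, topology):
        return {"volumes.total": len(topology["volumes"])}

def run_cycle(plugins, topology, prefix):
    """One execution cycle: the topology is learnt once and shared by all
    plugins, which run concurrently; results are namespaced under `prefix`
    before being pushed to Graphite."""
    metrics = {}
    with ThreadPoolExecutor(max_workers=len(plugins)) as pool:
        for result in pool.map(lambda p: p.collect(topology), plugins):
            for key, value in result.items():
                metrics["%s.%s" % (prefix, key)] = value
    return metrics

# Illustrative topology and cluster id.
topology = {"nodes": ["n1", "n2"], "volumes": ["v1"]}
metrics = run_cycle([NodeCountPlugin(), VolumeCountPlugin()],
                    topology, "tendrl.clusters.cid1")
----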

=== Alternatives

None

=== Data model impact:

The name-spacing of metrics is as follows:
* tendrl.clusters.<cluster_id> will be the prefix for all metrics.
* Node level metrics follow the format:
  tendrl.clusters.<cluster_id>.nodes.<node_name>.<plugin_name>.<plugin_attrs>
* Cluster level metrics follow the format:
  tendrl.clusters.<cluster_id>.<plugin_name>.<plugin_attrs...>
* Volume level metrics follow the format:
  tendrl.clusters.<cluster_id>.volumes.<volume_name>.<plugin_name>.<plugin_attrs>
* Brick level metrics follow the format:
  tendrl.clusters.<cluster_id>.volumes.<volume_name>.nodes.<node_name>.bricks.<brick_path>.<plugin_name>.<plugin_attrs>
  and the same would also be maintained at
  tendrl.clusters.<cluster_id>.nodes.<node_name>.bricks.<brick_path>.<plugin_name>.<plugin_attrs>
  for mapping directly from the node level.
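
For illustration, with a hypothetical cluster id cid1, node node1 and volume
vol1, a node-level CPU metric and a volume-level status metric would be stored
under names such as (the plugin attribute names are illustrative, and the
escaping of brick paths in metric names is not decided in this spec):

----
tendrl.clusters.cid1.nodes.node1.cpu.percent-user
tendrl.clusters.cid1.volumes.vol1.status
----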

=== Impacted Modules:

==== Tendrl API impact:

None

==== Notifications/Monitoring impact:

The configuration method will now change slightly, in accordance with the
details in "Proposed change".

==== Tendrl/common impact:

None

==== Tendrl/node_agent impact:

None

==== Sds integration impact:

None

=== Security impact:

None

=== Other end user impact:

The main consumer of this is the tendrl-grafana dashboard.
The impact in and due to the new dashboard will be detailed in a different
spec.

=== Performance impact:

None

=== Other deployer impact:

The configuration of collectd from monitoring-integration will change slightly
in terms of the number of plugins to configure and the attributes to be passed
in configuration.

=== Developer impact:

* The aggregations that were previously done by the performance-monitoring
  application will now be done by Grafana.
** The rules of aggregation will be communicated to Grafana by
   monitoring-integration. Details of this are out of scope of this spec
   and will be covered as part of:

----
https://github.com/Tendrl/specifications/pull/218
----

== Implementation:

=== Assignee(s):

Primary assignee:

* Metrics from collectd: anmolbabu <anmolbudugutta@gmail.com>
* Dashboard: cloudbehl <cloudbehl@gmail.com>

=== Work Items:

Github issues will be raised and linked here once the repo that hosts the
different plugins is finalised, especially in view of the merging of
node-monitoring and node-agent.

== Dependencies:

None

== Testing:

The plugins push stats to Graphite, and these stats will need to be tested
for correctness.

== Documentation impact:

None

== References:

Attempts in this regard can be found at:
https://github.com/Tendrl/node-monitoring/pull/79