diff --git a/specs/gluster_metrics.adoc b/specs/gluster_metrics.adoc
new file mode 100644
index 0000000..28e7dbb
--- /dev/null
+++ b/specs/gluster_metrics.adoc
@@ -0,0 +1,282 @@
// vim: tw=79

= Gluster Metrics

== Introduction

This specification deals with monitoring of tendrl-managed gluster clusters.
It explains the following:

* Which gluster metrics are to be enabled.
* How each of the metrics is going to be implemented.
* Any flows to be added to the node agent for configuring the collectd plugins
  needed by the individual metrics.


== Problem description

An admin using tendrl for gluster cluster management should be able to see an
easy-to-interpret representation of his/her gluster cluster(s), with
appropriate utilization trend patterns. Any issue in the cluster(s) should be
made noticeable to the admin with no or minimal effort.

== Use Cases

A gluster cluster is imported (or created) in tendrl. From this point on, the
cluster needs to be monitored for utilization and status behaviour.

== Proposed change

The gluster metrics will be collected by collectd at the node level, and any
aggregation thereon will be done by grafana, either on its own or using the
aggregation rules communicated to it by the monitoring-integration module.

* Following are the gluster metrics to be enabled from collectd, along with
  the source of the information for each:
  ** Brick level:
    *** IOPS -- gluster volume profile info --xml

        XML attribute hierarchy in the cli response:
        cliOutput -> volProfile -> brick -> cumulativeStats -> totalRead
        cliOutput -> volProfile -> brick -> cumulativeStats -> totalWrite
        iops <- totalRead + totalWrite

    *** FOP latency -- gluster volume profile info --xml

        XML attribute hierarchy in the cli response:
        cliOutput -> volProfile -> brick -> cumulativeStats -> fopStats -> fop -> avgLatency
        cliOutput -> volProfile -> brick -> cumulativeStats -> fopStats -> fop -> minLatency
        cliOutput -> volProfile -> brick -> cumulativeStats -> fopStats -> fop -> maxLatency

    *** Brick utilization -- python os.statvfs on the brick path

        total -> data.f_blocks * data.f_bsize
        free -> data.f_bfree * data.f_bsize
        used_percent -> 100 - (100.0 * free / total)

    *** Inode utilization -- python os.statvfs on the brick path

        total_inode -> data.f_files
        free_inode -> data.f_ffree
        used_percent_inode -> 100 - (100.0 * free_inode / total_inode)

    *** LVM thin pool metadata usage -- lvm vgs command
    *** LVM thin pool data usage -- lvm vgs command

        The thin pool data and metadata usage are derived from the LVM2_*
        name-prefixed fields reported by the lvm command, roughly as follows:

----
import os


def parse_lvm_output(out):
    # `out` is an iterable of lines of '$'-separated LVM2_KEY=value pairs,
    # as produced by an lvm reporting command run with name prefixes and
    # '$' as the field separator.
    parsed = map(lambda x: dict(x),
                 map(lambda x: [e.split('=') for e in x],
                     map(lambda x: x.strip().split('$'), out)))
    d = {}
    for i in parsed:
        if i['LVM2_LV_ATTR'][0] == 't':
            # thin pools are keyed as "<vg_name>/<lv_name>"
            k = "%s/%s" % (i['LVM2_VG_NAME'], i['LVM2_LV_NAME'])
        else:
            k = os.path.realpath(i['LVM2_LV_PATH'])
        d[k] = i
    return d


def thin_pool_usage(lvs, device):
    # `lvs` is the dict returned by parse_lvm_output(); `device` is the
    # real path of the LV backing the brick.
    out = {}
    if lvs and device in lvs and \
            lvs[device]['LVM2_LV_ATTR'][0] == 'V':
        # 'V' => thinly provisioned volume; find its pool via LVM2_POOL_LV
        thinpool = "%s/%s" % (lvs[device]['LVM2_VG_NAME'],
                              lvs[device]['LVM2_POOL_LV'])
        out['thinpool_size'] = float(
            lvs[thinpool]['LVM2_LV_SIZE']) / 1024
        out['thinpool_used_percent'] = float(
            lvs[thinpool]['LVM2_DATA_PERCENT'])
        out['metadata_size'] = float(
            lvs[thinpool]['LVM2_LV_METADATA_SIZE']) / 1024
        out['metadata_used_percent'] = float(
            lvs[thinpool]['LVM2_METADATA_PERCENT'])
        out['thinpool_free'] = out['thinpool_size'] * (
            1 - out['thinpool_used_percent'] / 100.0)
        out['thinpool_used'] = out['thinpool_size'] - out['thinpool_free']
        out['metadata_free'] = out['metadata_size'] * (
            1 - out['metadata_used_percent'] / 100.0)
        out['metadata_used'] = out['metadata_size'] - out['metadata_free']
    return out
----
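
        The brick space and inode utilization values above come directly from
        python's os.statvfs. A minimal sketch of that collection step,
        assuming the brick path is already known to the plugin (the function
        name and the returned keys here are illustrative, not a finalised
        plugin API):

----
import os


def brick_utilization(brick_path):
    """Space and inode utilization for a brick, via os.statvfs."""
    data = os.statvfs(brick_path)

    # Space utilization: used_percent = 100 - (100.0 * free / total)
    total = data.f_blocks * data.f_bsize
    free = data.f_bfree * data.f_bsize
    used_percent = 100 - (100.0 * free / total) if total else 0

    # Inode utilization: used_percent_inode = 100 - (100.0 * free / total)
    total_inode = data.f_files
    free_inode = data.f_ffree
    used_percent_inode = (
        100 - (100.0 * free_inode / total_inode) if total_inode else 0
    )

    return {
        'total': total,
        'free': free,
        'used_percent': used_percent,
        'total_inode': total_inode,
        'free_inode': free_inode,
        'used_percent_inode': used_percent_inode,
    }
----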
  ** Node level:
    *** CPU -- collectd plugin
    *** Memory -- collectd plugin
    *** Storage (mount point) -- collectd plugin
    *** Swap -- collectd plugin
    *** Disk IOPS -- collectd plugin
    *** Network
      **** Bytes sent and received (rx, tx) -- collectd plugin
      **** Dropped packets -- collectd plugin
      **** Errors/overruns -- collectd plugin
  ** Volume level:
    *** Status -- gluster get-state glusterd odir /var/run file
        collectd-glusterd-state
      **** Counters will be made available via collectd plugins based on the
           get-state result
    *** Bricks counter
      **** Total -- gluster get-state glusterd...
    *** Pending heal -- gluster volume heal all statistics
    *** Rebalance status -- gluster get-state glusterd...
    *** Number of connections -- gluster volume status all clients --xml
  ** Cluster level:
    *** Quorum status
    *** Number of nodes
      **** Total -- gluster get-state glusterd...
      **** Status-wise

           Based on "peer in cluster" and "connected" in the get-state output
    *** Number of volumes
      **** Total -- gluster get-state glusterd...
      **** Status-wise -- gluster get-state glusterd...
    *** Number of bricks
      **** Total -- gluster get-state glusterd...
  ** Brick and cluster status-wise counters can't be made available from the
     level of collectd, as a collectd plugin only has access to the brick info
     of the node it is currently executing on, has no information about the
     other bricks in the cluster, and is stateless. If required, these
     counters will instead be computed and updated to graphite by
     monitoring-integration.
* A new pluggable approach will be developed within our gluster metric
  collectors, such that the metric collection plugins are loaded and activated
  dynamically, given just a one-time configuration specifying the cluster_id
  and the graphite host and port. The plugins hooked into this pluggable
  framework are executed concurrently as green threads (currently being
  experimented with), and hence their atomicity needs to be ensured. A rough
  sketch of such a framework is shown after this list.
  ** The cluster topology is learnt once per execution cycle and the same
     knowledge is shared between plugins, unless a plugin unavoidably has to
     fetch it by different means anyway. For example, gluster volume status
     clients info --xml has to be used to get the connections count, and its
     output already readily gives the cluster topology, which can then be
     used directly.

  Note:
  ** The above approach runs every plugin in the framework at the same
     interval.
  ** Attempts are being made to provide a mechanism by which plugins can
     register with the framework as light-weight, heavy-weight etc., and the
     plugin execution interval would differ accordingly. This is of particular
     importance for operations where running the underlying command too
     frequently can affect performance.
* A generic flow as in:

----
https://github.com/Tendrl/node-monitoring/blob/master/tendrl/node_monitoring/flows/configure_collectd/__init__.py
----

and a generic command to generate the config file from a template as in:

----
https://github.com/Tendrl/node-monitoring/blob/master/tendrl/node_monitoring/commands/config_manager.py
----

will be added to node-agent. More details can be found at:

----
https://github.com/Tendrl/specifications/pull/219
----
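
The plugin framework referred to above is still being finalised. As a rough,
non-authoritative sketch of the intent only (the package layout, the Plugin
class contract and the method names below are hypothetical, and plain worker
threads stand in for the green threads being experimented with):

----
import importlib
import pkgutil
from concurrent.futures import ThreadPoolExecutor

# One-time configuration handed to the framework (values illustrative).
CONFIG = {
    'cluster_id': '<cluster-id>',
    'graphite_host': '<graphite-host>',
    'graphite_port': 2003,
}


def load_plugins(package):
    """Discover and instantiate every plugin module under `package`.

    Each plugin module is assumed to expose a Plugin class with a
    collect(config, cluster_topology) method returning a dict of
    metric-path -> value (this contract is hypothetical).
    """
    plugins = []
    for _, name, _ in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module('%s.%s' % (package.__name__, name))
        plugins.append(module.Plugin())
    return plugins


def run_cycle(plugins, cluster_topology):
    """Run one collection cycle with all plugins in parallel.

    The cluster topology is learnt once per cycle and shared with every
    plugin; plugins must therefore stay atomic and self-contained, as they
    run concurrently.
    """
    with ThreadPoolExecutor(max_workers=max(len(plugins), 1)) as executor:
        futures = [executor.submit(p.collect, CONFIG, cluster_topology)
                   for p in plugins]
        return [f.result() for f in futures]
----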

=== Alternatives

None

=== Data model impact:

The name-spacing of the metrics will follow the scheme below (placeholders
shown in angle brackets):

* tendrl.clusters.<cluster-id> will be the prefix for all metrics.
* Node level metrics follow the format:
  tendrl.clusters.<cluster-id>.nodes.<node-name>.<metric-name>
* Cluster level metrics follow the format:
  tendrl.clusters.<cluster-id>.<metric-name>
* Volume level metrics follow the format:
  tendrl.clusters.<cluster-id>.volumes.<volume-name>.<metric-name>
* Brick level metrics follow the format:
  tendrl.clusters.<cluster-id>.volumes.<volume-name>.nodes.<node-name>.bricks.<brick-path>.<metric-name>
  and the same would also be maintained at
  tendrl.clusters.<cluster-id>.nodes.<node-name>.bricks.<brick-path>.<metric-name>
  for mapping directly from the node level.

=== Impacted Modules:

==== Tendrl API impact:

None

==== Notifications/Monitoring impact:

The configuration method would now change slightly, in accordance with the
details in "Proposed change".

==== Tendrl/common impact:

None

==== Tendrl/node_agent impact:

None

==== Sds integration impact:

None

=== Security impact:

None

=== Other end user impact:

The main consumer of this is the tendrl-grafana dashboard.
The impact in and due to the new dashboard will be detailed in a different
spec.

=== Performance impact:

None

=== Other deployer impact:

The configuration of collectd from monitoring-integration will change slightly
in terms of the number of plugins to configure and the attributes to be passed
in the configuration.

=== Developer impact:

* The aggregations that were previously done by the performance-monitoring
  application will now be done by grafana.
  ** The rules of aggregation will be communicated to grafana by
     monitoring-integration. Details of this are out of scope of this spec
     and will be covered as part of:

----
https://github.com/Tendrl/specifications/pull/218
----

== Implementation:


=== Assignee(s):

Primary assignee:
  Metrics from collectd: anmolbabu
  Dashboard: cloudbehl

=== Work Items:

Github issues will be raised and linked here once the repo that hosts the
different plugins is finalised, especially in view of the possible merging of
node-monitoring and node-agent.

== Dependencies:

None

== Testing:

The plugins push stats to graphite, and those stats will need to be verified
in graphite for correctness.

== Documentation impact:

None

== References:

Attempts in this regard can be found at:
https://github.com/Tendrl/node-monitoring/pull/79