From 4a80e580567c88ae54623c15047a60a9420b8cdd Mon Sep 17 00:00:00 2001 From: Hemant Kumar Date: Thu, 13 Jul 2017 13:45:38 -0400 Subject: [PATCH] Add a proposal for high level metrics This document adds a proposal for gathering metrics at operation level for volume operations. This ensures metrics can be captured regardless of individual volume plugin implementation. --- .../design-proposals/volume-metrics.md | 141 ++++++++++++++++++ 1 file changed, 141 insertions(+) create mode 100644 contributors/design-proposals/volume-metrics.md diff --git a/contributors/design-proposals/volume-metrics.md b/contributors/design-proposals/volume-metrics.md new file mode 100644 index 00000000000..79a45cb89c9 --- /dev/null +++ b/contributors/design-proposals/volume-metrics.md @@ -0,0 +1,141 @@ +# Volume operation metrics + +## Goal + +Capture high level metrics for various volume operations in Kubernetes. + +## Motivation + +Currently we don't have high level metrics that captures time taken +and success/failures rates of various volume operations. + +This proposal aims to implement capturing of these metrics at a level +higher than individual volume plugins. + +## Implementation + +### Metric format and collection + +Volume metrics emitted will fall under category of service metrics +as defined in [Kubernetes Monitoring Architecture](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/monitoring_architecture.md). + + +The metrics will be emitted using [Prometheus format](https://prometheus.io/docs/instrumenting/exposition_formats/) and available for collection +from `/metrics` HTTP endpoint of kubelet and controller-manager. + + +Any collector which can parse Prometheus metric format should be able to collect +metrics from these endpoints. + +A more detailed description of monitoring pipeline can be found in [Monitoring architecture](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/monitoring_architecture.md#monitoring-pipeline) document. + +### Metric Types + +Since we are interested in count(or rate) and time it takes to perform certain volume operation - we will use [Histogram](https://prometheus.io/docs/practices/histograms/) type for +emitting these metrics. + +We will be using `HistogramVec` type so as we can attach dimensions at runtime. All +the volume operation metrics will be named `storage_operation_duration_seconds`. +Name of operation and volume plugin's name will be emitted as dimensions. If for some reason +volume plugin's name is not available when operation is performed - label's value can be set +to ``. + + +We are also interested in count of volume operation failures and hence a metric of type `NewCounterVec` +will be used for keeping track of errors. The error metric will be similarly named `storage_operation_errors_total`. + +Following is a sample of metrics (not exhaustive) that will be added by this proposal: + + +``` +storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "volume_attach" } +storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "volume_detach" } +storage_operation_duration_seconds { volume_plugin = "glusterfs", operation_name = "provision" } +storage_operation_duration_seconds { volume_plugin = "gce-pd", operation_name = "volume_delete" } +storage_operation_duration_seconds { volume_plugin = "vsphere", operation_name = "volume_mount" } +storage_operation_duration_seconds { volume_plugin = "iscsi" , operation_name = "volume_unmount" } +storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "mount_device" } +storage_operation_duration_seconds { volume_plugin = "aws-ebs", operation_name = "unmount_device" } +storage_operation_duration_seconds { volume_plugin = "cinder" , operation_name = "verify_volume" } +``` + +Similarly errors will be named: + +``` +storage_operation_errors_total { volume_plugin = "aws-ebs", operation_name = "volume_attach" } +storage_operation_errors_total { volume_plugin = "aws-ebs", operation_name = "volume_detach" } +storage_operation_errors_total { volume_plugin = "glusterfs", operation_name = "provision" } +storage_operation_errors_total { volume_plugin = "gce-pd", operation_name = "volume_delete" } +storage_operation_errors_total { volume_plugin = "vsphere", operation_name = "volume_mount" } +storage_operation_errors_total { volume_plugin = "iscsi" , operation_name = "volume_unmount" } +storage_operation_errors_total { volume_plugin = "aws-ebs", operation_name = "mount_device" } +storage_operation_errors_total { volume_plugin = "aws-ebs", operation_name = "unmount_device" } +storage_operation_errors_total { volume_plugin = "cinder" , operation_name = "verify_volume" } +``` + +### Implementation Detail + +We propose following changes as part of implementation details. + +1. All volume operations are executed via `goroutinemap.Run` or `nestedpendingoperations.Run`. +`Run` function interface of these two types can be changed to include a `operationComplete` callback argument. + + For example: + + ```go + // nestedpendingoperations.go + Run(v1.UniqueVolumeName, types.UniquePodName, func() error, opComplete func(error)) error + // goroutinemap + Run(string, func() error, opComplete func(error)) error + ``` + + This will enable us to know when a volume operation is complete. + +2. All `GenXXX` functions in `operation_generator.go` should return plugin name in addition to function and error. + + for example: + + ```go + GenerateMountVolumeFunc(waitForAttachTimeout time.Duration, + volumeToMount VolumeToMount, + actualStateOfWorldMounterUpdater + ActualStateOfWorldMounterUpdater, isRemount bool) (func() error, pluginName string, err error) + ``` + + Similarly `pv_controller.scheduleOperation` will take plugin name as additional parameter: + + ```go + func (ctrl *PersistentVolumeController) scheduleOperation( + operationName string, + pluginName string, + operation func() error) + ``` + +3. Above changes will enable us to gather required metrics in `operation_executor` or when scheduling a operation in +pv controller. + + For example, metrics for time it takes to attach Volume can be captured via: + + ```go + func operationExecutorHook(plugin, operationName string) func(error) { + requestTime := time.Now() + opComplete := func(err error) { + timeTaken := time.Since(requestTime).Seconds() + // Create metric with operation name and plugin name + } + return onComplete + } + attachFunc, plugin, err := + oe.operationGenerator.GenerateAttachVolumeFunc(volumeToAttach, actualStateOfWorld) + opCompleteFunc := operationExecutorHook(plugin, "volume_attach") + return oe.pendingOperations.Run( + volumeToAttach.VolumeName, "" /* podName */, attachFunc, opCompleteFunc) + ``` + + `operationExecutorHook` function is a hook that is registered in operation_executor and it will + initialize necessary metric params and will return a function. This will will be called when + operation is complete and will finalize metric creation and finally emit the metrics. + +### Conclusion + +Collection of metrics at operation level ensures almost no code change to volume plugin interface and a very minimum change to controllers.