From 41d4cb8c6b9f2c17acd1ecd7eee3b948bbd53c7c Mon Sep 17 00:00:00 2001 From: Naman Nandan Date: Mon, 31 Jul 2023 16:54:03 -0700 Subject: [PATCH 1/7] Add backward compatibility issues to doc --- docs/metrics.md | 28 ++++++++++++++++++++++++++-- 1 file changed, 26 insertions(+), 2 deletions(-) diff --git a/docs/metrics.md b/docs/metrics.md index 9993948683..249065edfb 100644 --- a/docs/metrics.md +++ b/docs/metrics.md @@ -10,6 +10,7 @@ * [Custom Metrics API](#custom-metrics-api) * [Logging custom metrics](#log-custom-metrics) * [Metrics YAML Parsing and Metrics API example](#Metrics-YAML-File-Parsing-and-Metrics-API-Custom-Handler-Example) +* [Backwards compatibility warnings](#backwards-compatibility-warnings) ## Introduction @@ -28,7 +29,7 @@ Metrics are collected by default at the following locations in `log` mode: The location of log files and metric files can be configured in the [log4j2.xml](https://github.com/pytorch/serve/blob/master/frontend/server/src/main/resources/log4j2.xml) file -In `prometheus` mode, all metrics are made available in prometheus format via the [metrics](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md) API endpoint. +In `prometheus` mode, all metrics are made available in prometheus format via the [metrics API endpoint](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md). ## Frontend Metrics @@ -187,7 +188,10 @@ model_metrics: # backend metrics ``` -Default metrics are provided in the [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, but the user can either delete them to their liking / ignore them altogether, because these metrics will not be emitted unless they are edited. +Note that **only** the metrics defined in the **metrics configuration file** can be emitted to logs or made available via the metrics API endpoint. This is done to ensure that the metrics configuration file serves as a central inventory of all the metrics that Torchserve can emit. 
+ +Default metrics are provided in the [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, but the user can either delete them to their liking / ignore them altogether, because these metrics will not be emitted unless they are edited.\ +When adding custom `model_metrics` in the metrics cofiguration file, ensure to include `ModelName` and `Level` dimension names towards the end of the list of dimensions since they are included by default by the custom metrics API. ### How it works @@ -622,3 +626,23 @@ class CustomHandlerExample: # except this time with gauge metric type object metrics.add_size("GaugeModelMetricNameExample", 42.5) ``` + +## Backwards compatibility warnings +1. Starting [v0.6.1](https://github.com/pytorch/serve/releases/tag/v0.6.1), the `add_metric` API signature changed\ + from [add_metric(name, value, unit, idx=None, dimensions=None)](https://github.com/pytorch/serve/blob/61f1c4182e6e864c9ef1af99439854af3409d325/ts/metrics/metrics_store.py#L184)\ + to [add_metric(metric_name, unit, dimension_names, metric_type)](https://github.com/pytorch/serve/blob/35ef00f9e62bb7fcec9cec92630ae757f9fb0db0/ts/metrics/metric_cache_abstract.py#L272).\ + Usage of the new API is shown [above](#specifying-metric-types).\ + There are two approaches available when migrating to the new custom metrics API: + - Replace the call to `add_metric` in versions prior to v0.6.1 with calls to the following methods: + ``` + metric1 = metrics.add_metric("GenericMetric", unit=unit, dimension_names=["name1", "name2", ...], metric_type=MetricTypes.GAUGE) + metric1.add_or_update(value, dimension_values=["value1", "value2", ...]) + ``` + - Replace the call to `add_metric` in versions prior to v0.6.1 with one of the suitable custom metrics APIs where applicable: [add_counter](#add-counter-based-metrics), [add_time](#add-time-based-metrics), + [add_size](#add-size-based-metrics) or [add_percent](#add-percentage-based-metrics) +2. 
Starting [v0.8.0](https://github.com/pytorch/serve/releases/tag/v0.8.0), only metrics that are defined in the metrics config file (default: [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml))
+   are either all logged to `ts_metrics.log` and `model_metrics.log` or made available via the [metrics API endpoint](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md)
+   based on the `metrics_mode` configuration as described [above](#introduction).\
+   The default `metrics_mode` is `log` mode.\
+   This is unlike previous versions, where all metrics were only logged to `ts_metrics.log` and `model_metrics.log` except for `ts_inference_requests_total`, `ts_inference_latency_microseconds` and `ts_queue_latency_microseconds`,
+   which were only available via the metrics API endpoint.

From e26d9205e4beb54a1ca83aeda09355228f966739 Mon Sep 17 00:00:00 2001
From: Naman Nandan
Date: Fri, 4 Aug 2023 00:46:14 -0700
Subject: [PATCH 2/7] Update example for custom metrics

---
 docs/metrics.md                           |  26 ++-
 examples/custom_metrics/README.md         | 205 ++++++++++++------
 examples/custom_metrics/config.properties |  12 +
 examples/custom_metrics/metrics.yaml      |  97 +++++++++
 examples/custom_metrics/mnist_handler.py  |  83 ++++++-
 .../custom_metrics/torchserve_custom.mtail |  24 --
 6 files changed, 335 insertions(+), 112 deletions(-)

diff --git a/docs/metrics.md b/docs/metrics.md
index 249065edfb..7ddde3dbd7 100644
--- a/docs/metrics.md
+++ b/docs/metrics.md
@@ -191,7 +191,8 @@ model_metrics: # backend metrics
 ```

Note that **only** the metrics defined in the **metrics configuration file** can be emitted to logs or made available via the metrics API endpoint. This is done to ensure that the metrics configuration file serves as a central inventory of all the metrics that Torchserve can emit.
Default metrics are provided in the [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, but the user can either delete them to their liking / ignore them altogether, because these metrics will not be emitted unless they are edited.\ -When adding custom `model_metrics` in the metrics cofiguration file, ensure to include `ModelName` and `Level` dimension names towards the end of the list of dimensions since they are included by default by the custom metrics API. +When adding custom `model_metrics` in the metrics cofiguration file, ensure to include `ModelName` and `Level` dimension names towards the end of the list of dimensions since they are included by default by the following custom metrics APIs: [add_counter](#add-counter-based-metrics), +[add_time](#add-time-based-metrics), [add_size](#add-size-based-metrics) or [add_percent](#add-percentage-based-metrics). ### How it works @@ -377,7 +378,7 @@ Add time-based by invoking the following method: Function API ```python - def add_time(self, metric_name: str, value: int or float, idx=None, unit: str = 'ms', dimensions: list = None, + def add_time(self, name: str, value: int or float, idx=None, unit: str = 'ms', dimensions: list = None, metric_type: MetricTypes = MetricTypes.GAUGE): """ Add a time based metric like latency, default unit is 'ms' @@ -385,7 +386,7 @@ Function API Parameters ---------- - metric_name : str + name : str metric name value: int value of metric @@ -422,7 +423,7 @@ Add size-based metrics by invoking the following method: Function API ```python - def add_size(self, metric_name: str, value: int or float, idx=None, unit: str = 'MB', dimensions: list = None, + def add_size(self, name: str, value: int or float, idx=None, unit: str = 'MB', dimensions: list = None, metric_type: MetricTypes = MetricTypes.GAUGE): """ Add a size based metric @@ -430,7 +431,7 @@ Function API Parameters ---------- - metric_name : str + name : str metric name value: int, float value of 
metric @@ -467,7 +468,7 @@ Percentage based metrics can be added by invoking the following method: Function API ```python - def add_percent(self, metric_name: str, value: int or float, idx=None, dimensions: list = None, + def add_percent(self, name: str, value: int or float, idx=None, dimensions: list = None, metric_type: MetricTypes = MetricTypes.GAUGE): """ Add a percentage based metric @@ -475,7 +476,7 @@ Function API Parameters ---------- - metric_name : str + name : str metric name value: int, float value of metric @@ -489,6 +490,8 @@ Function API ``` +**Inferred unit**: `percent` + To add custom percentage-based metrics: ```python @@ -507,14 +510,13 @@ Counter based metrics can be added by invoking the following method Function API ```python - def add_counter(self, metric_name: str, value: int or float, idx=None, dimensions: list = None, - metric_type: MetricTypes = MetricTypes.COUNTER): + def add_counter(self, name: str, value: int or float, idx=None, dimensions: list = None): """ Add a counter metric or increment an existing counter metric Default metric type is counter Parameters ---------- - metric_name : str + name : str metric name value: int or float value of metric @@ -522,11 +524,11 @@ Function API request_id index in batch dimensions: list list of dimensions for the metric - metric_type: MetricTypes - type for defining different operations, defaulted to counter metric type for Counter metrics """ ``` +**Inferred unit**: `count` + ### Getting a metric Users can get a metric from the cache. The Metric object is returned, so the user can access the methods of the Metric: (i.e. 
`Metric.update(value)`, `Metric.__str__`) diff --git a/examples/custom_metrics/README.md b/examples/custom_metrics/README.md index 149a71cf8d..6a199c5f41 100644 --- a/examples/custom_metrics/README.md +++ b/examples/custom_metrics/README.md @@ -1,93 +1,164 @@ -# Monitoring Torchserve custom metrics with mtail metrics exporter and prometheus +# Torchserve custom metrics with prometheus support -In this example, we show how to use a pre-trained custom MNIST model and export the custom metrics using mtail and prometheus +In this example, we show how to use a pre-trained custom MNIST model and export custom metrics using prometheus. -We used the following pytorch example to train the basic MNIST model for digit recognition : https://github.com/pytorch/examples/tree/master/mnist +We use the following pytorch example of MNIST model for digit recognition : https://github.com/pytorch/examples/tree/master/mnist -Run the commands given in following steps from the parent directory of the root of the repository. For example, if you cloned the repository into /home/my_path/serve, run the steps from /home/my_path +Run the commands given in following steps from the root directory of the repository. For example, if you cloned the repository into /home/my_path/serve, run the steps from /home/my_path/serve ## Steps -- Step 1: In this example we introduce a new custom metric `SizeOfImage` in the custom handler and export it using mtail. 
+- Step 1: In this example we add the following custom metrics and access them in prometheus format via the [metrics API endpoint](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md): + - InferenceRequestCount + - PostprocessCallCount + - RequestBatchSize + - SizeOfImage + - HandlerMethodTime + - ExamplePercentMetric - ```python - def preprocess(self, data): - metrics = self.context.metrics - input = data[0].get('body') - metrics.add_size('SizeOfImage', len(input) / 1024, None, 'kB') - return ImageClassifier.preprocess(self, data) - ``` + The custom metrics configuration file `metrics.yaml` in this example builds on top of the [default metrics configuration file](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) to include the custom metrics listed above. + The `config.properties` file in this example configures torchserve to use the custom metrics configuration file and sets the `metrics_mode` to `prometheus`. The custom handler + `mnist_handler.py` updates the metrics listed above. - Refer: [Custom Metrics](https://github.com/pytorch/serve/blob/master/docs/metrics.md#custom-metrics-api) + Refer: [Custom Metrics](https://github.com/pytorch/serve/blob/master/docs/metrics.md#custom-metrics-api)\ Refer: [Custom Handler](https://github.com/pytorch/serve/blob/master/docs/custom_service.md#custom-handlers) -- Step 2: Create a torch model archive using the torch-model-archiver utility to archive the above files. +- Step 2: Create a torch model archive using the torch-model-archiver utility. ```bash torch-model-archiver --model-name mnist --version 1.0 --model-file examples/image_classifier/mnist/mnist.py --serialized-file examples/image_classifier/mnist/mnist_cnn.pt --handler examples/custom_metrics/mnist_handler.py ``` -- Step 3: Register the model on TorchServe using the above model archive file. +- Step 3: Register the model to torchserve using the above model archive file. 
```bash mkdir model_store mv mnist.mar model_store/ - torchserve --start --model-store model_store --models mnist=mnist.mar + torchserve --ncs --start --model-store model_store --ts-config examples/custom_metrics/config.properties --models mnist=mnist.mar ``` -- Step 4: Install [mtail](https://github.com/google/mtail/releases) - - ```bash - wget https://github.com/google/mtail/releases/download/v3.0.0-rc47/mtail_3.0.0-rc47_Linux_x86_64.tar.gz - tar -xvzf mtail_3.0.0-rc47_Linux_x86_64.tar.gz - chmod +x mtail - ``` - -- Step 5: Create a mtail program. In this example we using a program to export default custom metrics. - - Refer: [mtail Programming Guide](https://google.github.io/mtail/Programming-Guide.html). - -- Step 6: Start mtail export by running the below command - - ```bash - ./mtail --progs examples/custom_metrics/torchserve_custom.mtail --logs logs/model_metrics.log - ``` - - The mtail program parses the log file extracts info by matching patterns and presents as JSON, Prometheus and other databases. https://google.github.io/mtail/Interoperability.html - -- Step 7: Make Inference request +- Step 4: Make Inference request ```bash curl http://127.0.0.1:8080/predictions/mnist -T examples/image_classifier/mnist/test_data/0.png ``` - The inference request logs the time taken for prediction to the model_metrics.log file. - Mtail parses the file and is served at 3903 port - - `http://localhost:3903` - -- Step 8: Sart Prometheus with mtailtarget added to scarpe config - - - Download [Prometheus](https://prometheus.io/download/) - - - Add mtail target added to scrape config in the config file - - ```yaml - scrape_configs: - # The job name is added as a label `job=` to any timeseries scraped from this config. - - job_name: "prometheus" - - # metrics_path defaults to '/metrics' - # scheme defaults to 'http'. 
- - static_configs: - - targets: ["localhost:9090", "localhost:3903"] - ``` - - - Start Prometheus with config file - - ```bash - ./prometheus --config.file prometheus.yml - ``` - - The exported logs from mtail are scraped by prometheus on 3903 port. +- Step 5: Install prometheus using the instructions [here](https://prometheus.io/download/#prometheus). + +- Step 6: Create a minimal `prometheus.yaml` config file as below and run `./prometheus --config.file=prometheus.yaml`. + +```yaml +global: + scrape_interval: 15s + evaluation_interval: 15s + +scrape_configs: + - job_name: 'prometheus' + static_configs: + - targets: ['localhost:9090'] + - job_name: 'torchserve' + static_configs: + - targets: ['localhost:8082'] #TorchServe metrics endpoint +``` + +- Step 7: Test metrics API endpoint +```console +curl http://127.0.0.1:8082/metrics + +# HELP PredictionTime Torchserve prometheus gauge metric with unit: ms +# TYPE PredictionTime gauge +PredictionTime{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 23.3 +# HELP GPUMemoryUtilization Torchserve prometheus gauge metric with unit: Percent +# TYPE GPUMemoryUtilization gauge +# HELP ts_queue_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds +# TYPE ts_queue_latency_microseconds counter +ts_queue_latency_microseconds{model_name="mnist",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 164.607 +# HELP WorkerLoadTime Torchserve prometheus gauge metric with unit: Milliseconds +# TYPE WorkerLoadTime gauge +WorkerLoadTime{WorkerName="W-9000-mnist_1.0",Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 5818.0 +# HELP SizeOfImage Torchserve prometheus gauge metric with unit: kB +# TYPE SizeOfImage gauge +SizeOfImage{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 0.265625 +# HELP PostprocessCallCount Torchserve prometheus counter metric with unit: count +# TYPE PostprocessCallCount counter 
+PostprocessCallCount{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 1.0 +# HELP GPUUtilization Torchserve prometheus gauge metric with unit: Percent +# TYPE GPUUtilization gauge +# HELP Requests5XX Torchserve prometheus counter metric with unit: Count +# TYPE Requests5XX counter +# HELP HandlerMethodTime Torchserve prometheus gauge metric with unit: ms +# TYPE HandlerMethodTime gauge +HandlerMethodTime{MethodName="preprocess",ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 13.740777969360352 +# HELP MemoryAvailable Torchserve prometheus gauge metric with unit: Megabytes +# TYPE MemoryAvailable gauge +MemoryAvailable{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 4584.23828125 +# HELP InferenceRequestCount Torchserve prometheus counter metric with unit: count +# TYPE InferenceRequestCount counter +InferenceRequestCount{Hostname="88665a372f4b.ant.amazon.com",} 1.0 +# HELP ts_inference_requests_total Torchserve prometheus counter metric with unit: Count +# TYPE ts_inference_requests_total counter +ts_inference_requests_total{model_name="mnist",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 1.0 +# HELP HandlerTime Torchserve prometheus gauge metric with unit: ms +# TYPE HandlerTime gauge +HandlerTime{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 23.17 +# HELP ExamplePercentMetric Torchserve prometheus histogram metric with unit: percent +# TYPE ExamplePercentMetric histogram +ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.005",} 0.0 +ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.01",} 0.0 +ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.025",} 0.0 +ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.05",} 0.0 
+ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.075",} 0.0 +ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.1",} 0.0 +ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.25",} 0.0 +ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.5",} 0.0 +ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.75",} 0.0 +ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="1.0",} 0.0 +ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="2.5",} 0.0 +ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="5.0",} 0.0 +ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="7.5",} 0.0 +ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="10.0",} 0.0 +ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="+Inf",} 1.0 +ExamplePercentMetric_count{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 1.0 +ExamplePercentMetric_sum{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 50.0 +# HELP WorkerThreadTime Torchserve prometheus gauge metric with unit: Milliseconds +# TYPE WorkerThreadTime gauge +WorkerThreadTime{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 3.0 +# HELP Requests2XX Torchserve prometheus counter metric with unit: Count +# TYPE Requests2XX counter +Requests2XX{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 1.0 +# HELP QueueTime Torchserve prometheus gauge metric with unit: Milliseconds +# TYPE QueueTime gauge 
+QueueTime{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 0.0 +# HELP MemoryUtilization Torchserve prometheus gauge metric with unit: Percent +# TYPE MemoryUtilization gauge +MemoryUtilization{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 72.0 +# HELP GPUMemoryUsed Torchserve prometheus gauge metric with unit: Megabytes +# TYPE GPUMemoryUsed gauge +# HELP ts_inference_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds +# TYPE ts_inference_latency_microseconds counter +ts_inference_latency_microseconds{model_name="mnist",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 26736.381 +# HELP DiskAvailable Torchserve prometheus gauge metric with unit: Gigabytes +# TYPE DiskAvailable gauge +DiskAvailable{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 306.9124526977539 +# HELP RequestBatchSize Torchserve prometheus gauge metric with unit: count +# TYPE RequestBatchSize gauge +RequestBatchSize{ModelName="mnist",Hostname="88665a372f4b.ant.amazon.com",} 1.0 +# HELP DiskUsage Torchserve prometheus gauge metric with unit: Gigabytes +# TYPE DiskUsage gauge +DiskUsage{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 8.438858032226562 +# HELP Requests4XX Torchserve prometheus counter metric with unit: Count +# TYPE Requests4XX counter +# HELP MemoryUsed Torchserve prometheus gauge metric with unit: Megabytes +# TYPE MemoryUsed gauge +MemoryUsed{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 8699.48046875 +# HELP CPUUtilization Torchserve prometheus gauge metric with unit: Percent +# TYPE CPUUtilization gauge +CPUUtilization{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 33.3 +# HELP DiskUtilization Torchserve prometheus gauge metric with unit: Percent +# TYPE DiskUtilization gauge +DiskUtilization{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 2.7 +``` + +- Step 8: Navigate to `http://localhost:9090/` on a browser to execute queries and create graphs + +Screenshot 2023-08-03 at 6 46 47 PM 
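The samples in the Step 7 output above follow the Prometheus text exposition format: `MetricName{label1="value1",...} value`. As a side note (not part of the example's files), here is a minimal sketch of pulling one such sample apart; it assumes simple samples with no escaped quotes in label values, which holds for the output shown in this README.

```python
import re

# Matches a simple exposition-format sample: name, {label list}, numeric value.
# Tolerates TorchServe's trailing comma inside the braces.
SAMPLE_RE = re.compile(r'^(\w+)\{(.*?)\}\s+([-+0-9.eE]+)$')

def parse_sample(line: str):
    """Parse one Prometheus text-format sample into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a simple metric sample: {line!r}")
    name, label_body, value = match.groups()
    labels = {}
    for pair in filter(None, label_body.split(",")):  # drop empty trailing entry
        key, _, val = pair.partition("=")
        labels[key] = val.strip('"')
    return name, labels, float(value)

line = 'SizeOfImage{ModelName="mnist",Level="Model",Hostname="host-1",} 0.265625'
name, labels, value = parse_sample(line)
print(name, labels["ModelName"], value)
```

In practice a Prometheus server (Step 6) or an official client library does this parsing for you; the sketch is only meant to make the label/value structure of the scraped output concrete.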
diff --git a/examples/custom_metrics/config.properties b/examples/custom_metrics/config.properties new file mode 100644 index 0000000000..02607ac36d --- /dev/null +++ b/examples/custom_metrics/config.properties @@ -0,0 +1,12 @@ +metrics_mode=prometheus +metrics_config=examples/custom_metrics/metrics.yaml +models={\ + "mnist": {\ + "1.0": {\ + "defaultVersion": true,\ + "marName": "mnist.mar",\ + "minWorkers": 1,\ + "maxWorkers": 1\ + }\ + }\ +} diff --git a/examples/custom_metrics/metrics.yaml b/examples/custom_metrics/metrics.yaml new file mode 100644 index 0000000000..699ca19aea --- /dev/null +++ b/examples/custom_metrics/metrics.yaml @@ -0,0 +1,97 @@ +dimensions: + - &model_name "ModelName" + - &worker_name "WorkerName" + - &level "Level" + - &device_id "DeviceId" + - &hostname "Hostname" + +ts_metrics: + counter: + - name: Requests2XX + unit: Count + dimensions: [*level, *hostname] + - name: Requests4XX + unit: Count + dimensions: [*level, *hostname] + - name: Requests5XX + unit: Count + dimensions: [*level, *hostname] + - name: ts_inference_requests_total + unit: Count + dimensions: ["model_name", "model_version", "hostname"] + - name: ts_inference_latency_microseconds + unit: Microseconds + dimensions: ["model_name", "model_version", "hostname"] + - name: ts_queue_latency_microseconds + unit: Microseconds + dimensions: ["model_name", "model_version", "hostname"] + gauge: + - name: QueueTime + unit: Milliseconds + dimensions: [*level, *hostname] + - name: WorkerThreadTime + unit: Milliseconds + dimensions: [*level, *hostname] + - name: WorkerLoadTime + unit: Milliseconds + dimensions: [*worker_name, *level, *hostname] + - name: CPUUtilization + unit: Percent + dimensions: [*level, *hostname] + - name: MemoryUsed + unit: Megabytes + dimensions: [*level, *hostname] + - name: MemoryAvailable + unit: Megabytes + dimensions: [*level, *hostname] + - name: MemoryUtilization + unit: Percent + dimensions: [*level, *hostname] + - name: DiskUsage + unit: Gigabytes + 
dimensions: [*level, *hostname] + - name: DiskUtilization + unit: Percent + dimensions: [*level, *hostname] + - name: DiskAvailable + unit: Gigabytes + dimensions: [*level, *hostname] + - name: GPUMemoryUtilization + unit: Percent + dimensions: [*level, *device_id, *hostname] + - name: GPUMemoryUsed + unit: Megabytes + dimensions: [*level, *device_id, *hostname] + - name: GPUUtilization + unit: Percent + dimensions: [*level, *device_id, *hostname] + +model_metrics: + # Dimension "Hostname" is automatically added for model metrics in the backend + counter: + - name: InferenceRequestCount + unit: count + dimensions: [] + - name: PostprocessCallCount + unit: count + dimensions: [*model_name, *level] + gauge: + - name: HandlerTime + unit: ms + dimensions: [*model_name, *level] + - name: PredictionTime + unit: ms + dimensions: [*model_name, *level] + - name: RequestBatchSize + unit: count + dimensions: ["ModelName"] + - name: SizeOfImage + unit: kB + dimensions: [*model_name, *level] + - name: HandlerMethodTime + unit: ms + dimensions: ["MethodName", *model_name, *level] + histogram: + - name: ExamplePercentMetric + unit: percent + dimensions: [*model_name, *level] diff --git a/examples/custom_metrics/mnist_handler.py b/examples/custom_metrics/mnist_handler.py index 919b3a8f83..db162d753d 100644 --- a/examples/custom_metrics/mnist_handler.py +++ b/examples/custom_metrics/mnist_handler.py @@ -1,6 +1,9 @@ -import io -from PIL import Image +import time + from torchvision import transforms + +from ts.metrics.dimension import Dimension +from ts.metrics.metric_type_enum import MetricTypes from ts.torch_handler.image_classifier import ImageClassifier @@ -8,13 +11,29 @@ class MNISTDigitClassifier(ImageClassifier): """ MNISTDigitClassifier handler class. This handler extends class ImageClassifier from image_classifier.py, a default handler. This handler takes an image and returns the number in that image. 
- - Here method postprocess() has been overridden while others are reused from parent class. """ image_processing = transforms.Compose( - [transforms.ToTensor(), - transforms.Normalize((0.1307,), (0.3081,))]) + [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))] + ) + + def initialize(self, context): + super().initialize(context) + metrics = context.metrics + + # Usage of "add_metric" + self.inf_request_count = metrics.add_metric( + metric_name="InferenceRequestCount", + unit="count", + dimension_names=[], + metric_type=MetricTypes.COUNTER, + ) + metrics.add_metric( + metric_name="RequestBatchSize", + unit="count", + dimension_names=["ModelName"], + metric_type=MetricTypes.GAUGE, + ) def preprocess(self, data): """ @@ -27,10 +46,43 @@ def preprocess(self, data): Returns: tensor: Returns the tensor data of the input """ + preprocess_start = time.time() + metrics = self.context.metrics - input = data[0].get('body') - metrics.add_size('SizeOfImage', len(input) / 1024, None, 'kB') - return ImageClassifier.preprocess(self, data) + + # Usage of "add_or_update" + self.inf_request_count.add_or_update(value=1, dimension_values=[]) + + # Usage of "get_metric" + request_batch_size_metric = metrics.get_metric( + metric_name="RequestBatchSize", metric_type=MetricTypes.GAUGE + ) + request_batch_size_metric.add_or_update( + value=len(data), dimension_values=[self.context.model_name] + ) + + input = data[0].get("body") + + # Usage of "add_size" + metrics.add_size( + name="SizeOfImage", value=len(input) / 1024, idx=None, unit="kB" + ) + + preprocessed_image = ImageClassifier.preprocess(self, data) + + preprocess_stop = time.time() + + # usage of add_time + metrics.add_time( + name="HandlerMethodTime", + value=(preprocess_stop - preprocess_start) * 1000, + idx=None, + unit="ms", + dimensions=[Dimension(name="MethodName", value="preprocess")], + metric_type=MetricTypes.GAUGE, + ) + + return preprocessed_image def postprocess(self, data): """The post process of MNIST 
converts the predicted output response to a label. @@ -41,4 +93,17 @@ def postprocess(self, data): Returns: list : A list of dictionary with predictons and explanations are returned. """ + # usage of add_counter + self.context.metrics.add_counter( + name="PostprocessCallCount", value=1, idx=None, dimensions=[] + ) + # usage of add_percent + self.context.metrics.add_percent( + name="ExamplePercentMetric", + value=50, + idx=None, + dimensions=[], + metric_type=MetricTypes.HISTOGRAM, + ) + return data.argmax(1).flatten().tolist() diff --git a/examples/custom_metrics/torchserve_custom.mtail b/examples/custom_metrics/torchserve_custom.mtail deleted file mode 100644 index 15e642d762..0000000000 --- a/examples/custom_metrics/torchserve_custom.mtail +++ /dev/null @@ -1,24 +0,0 @@ -counter request_count -gauge image_size -gauge model_name -gauge level -gauge host_name -gauge request_id -gauge time_stamp - -# Sample log -# 2021-08-27 21:15:03,376 - PredictionTime.Milliseconds:109.74|#ModelName:bert,Level:Model|#hostname:ubuntu-ThinkPad-E14,requestID:09ed6c2c-9380-480d-a61a-66bfea958c1d,timestamp:1630079103 -# 2021-08-27 21:15:03,376 - HandlerTime.Milliseconds:109.74|#ModelName:bert,Level:Model|#hostname:ubuntu-ThinkPad-E14,requestID:09ed6c2c-9380-480d-a61a-66bfea958c1d,timestamp:1630079103 -# 2021-09-02 00:24:34,001 - InferenceTime.Milliseconds:3.05|#ModelName:mnist,Level:Model|#hostname:ubuntu-ThinkPad-E14,requestID:ce9a3631-e509-4a82-91c4-482cd2a15cd9,timestamp:1630522474 - -const PATTERN /SizeOfImage\.Kilobytes:(\d+\.\d+)\|#ModelName:([a-zA-Z]+),Level:([a-zA-Z]+)\|#hostname:([a-zA-Z0-9-]+),requestID:([a-zA-Z0-9-]+),timestamp:([0-9]+)/ - -PATTERN{ - request_count++ - image_size = $1 - model_name = $2 - level = $3 - host_name = $4 - request_id = $5 - time_stamp = $6 -} From e1339e1792756d180c1bdfaafe370359043ee0bb Mon Sep 17 00:00:00 2001 From: Naman Nandan Date: Fri, 4 Aug 2023 00:57:15 -0700 Subject: [PATCH 3/7] fix lint error --- docs/metrics.md | 2 +- 
ts_scripts/spellcheck_conf/wordlist.txt | 5 +++++ 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/docs/metrics.md b/docs/metrics.md index 7ddde3dbd7..168418d1d2 100644 --- a/docs/metrics.md +++ b/docs/metrics.md @@ -191,7 +191,7 @@ model_metrics: # backend metrics Note that **only** the metrics defined in the **metrics configuration file** can be emitted to logs or made available via the metrics API endpoint. This is done to ensure that the metrics configuration file serves as a central inventory of all the metrics that Torchserve can emit. Default metrics are provided in the [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, but the user can either delete them to their liking / ignore them altogether, because these metrics will not be emitted unless they are edited.\ -When adding custom `model_metrics` in the metrics cofiguration file, ensure to include `ModelName` and `Level` dimension names towards the end of the list of dimensions since they are included by default by the following custom metrics APIs: [add_counter](#add-counter-based-metrics), +When adding custom `model_metrics` in the metrics configuration file, ensure to include `ModelName` and `Level` dimension names towards the end of the list of dimensions since they are included by default by the following custom metrics APIs: [add_counter](#add-counter-based-metrics), [add_time](#add-time-based-metrics), [add_size](#add-size-based-metrics) or [add_percent](#add-percentage-based-metrics). 
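The dimension-ordering requirement in the hunk above can be made concrete with a toy model. This is an illustration only, not TorchServe source: it assumes, as the doc text implies, that the custom metrics APIs append the default dimension values (model name, level) after the user-supplied ones, so the configured dimension names must end with `ModelName` and `Level` for names and values to pair up.

```python
def pair_dimensions(yaml_dimension_names, user_values, model_name="mnist", level="Model"):
    """Toy model of dimension pairing: default values are appended last,
    then zipped positionally against the names from the metrics config."""
    values = list(user_values) + [model_name, level]
    return dict(zip(yaml_dimension_names, values))

# Correct ordering: defaults listed last in the config -> labels line up.
ok = pair_dimensions(["MethodName", "ModelName", "Level"], ["preprocess"])
print(ok)   # {'MethodName': 'preprocess', 'ModelName': 'mnist', 'Level': 'Model'}

# Wrong ordering: defaults listed first -> every label gets the wrong value.
bad = pair_dimensions(["ModelName", "Level", "MethodName"], ["preprocess"])
print(bad)
```

The `HandlerMethodTime` entry in this example's `metrics.yaml` (`dimensions: ["MethodName", *model_name, *level]`) follows exactly this ordering.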
diff --git a/ts_scripts/spellcheck_conf/wordlist.txt b/ts_scripts/spellcheck_conf/wordlist.txt index 902439747a..7692ca780d 100644 --- a/ts_scripts/spellcheck_conf/wordlist.txt +++ b/ts_scripts/spellcheck_conf/wordlist.txt @@ -1068,3 +1068,8 @@ chatGPT baseimage cuDNN Xformer +ExamplePercentMetric +HandlerMethodTime +InferenceRequestCount +PostprocessCallCount +RequestBatchSize From 9af12bfd1fd9e0cc11637b559cfda31a31270175 Mon Sep 17 00:00:00 2001 From: Naman Nandan Date: Thu, 17 Aug 2023 13:39:56 -0700 Subject: [PATCH 4/7] Update custom metrics example to work with backwards compatible API --- examples/custom_metrics/README.md | 110 ++++++++++++----------- examples/custom_metrics/metrics.yaml | 6 ++ examples/custom_metrics/mnist_handler.py | 46 +++++++--- 3 files changed, 100 insertions(+), 62 deletions(-) diff --git a/examples/custom_metrics/README.md b/examples/custom_metrics/README.md index 6a199c5f41..53199ab3b3 100644 --- a/examples/custom_metrics/README.md +++ b/examples/custom_metrics/README.md @@ -10,6 +10,8 @@ Run the commands given in following steps from the root directory of the reposit - Step 1: In this example we add the following custom metrics and access them in prometheus format via the [metrics API endpoint](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md): - InferenceRequestCount + - InitializeCallCount + - PreprocessCallCount - PostprocessCallCount - RequestBatchSize - SizeOfImage @@ -65,42 +67,49 @@ scrape_configs: ```console curl http://127.0.0.1:8082/metrics +# HELP Requests2XX Torchserve prometheus counter metric with unit: Count +# TYPE Requests2XX counter +Requests2XX{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 1.0 # HELP PredictionTime Torchserve prometheus gauge metric with unit: ms # TYPE PredictionTime gauge -PredictionTime{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 23.3 -# HELP GPUMemoryUtilization Torchserve prometheus gauge metric with unit: Percent -# TYPE 
GPUMemoryUtilization gauge -# HELP ts_queue_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds -# TYPE ts_queue_latency_microseconds counter -ts_queue_latency_microseconds{model_name="mnist",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 164.607 +PredictionTime{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 62.78 +# HELP DiskUsage Torchserve prometheus gauge metric with unit: Gigabytes +# TYPE DiskUsage gauge +DiskUsage{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 8.438858032226562 # HELP WorkerLoadTime Torchserve prometheus gauge metric with unit: Milliseconds # TYPE WorkerLoadTime gauge -WorkerLoadTime{WorkerName="W-9000-mnist_1.0",Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 5818.0 -# HELP SizeOfImage Torchserve prometheus gauge metric with unit: kB -# TYPE SizeOfImage gauge -SizeOfImage{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 0.265625 -# HELP PostprocessCallCount Torchserve prometheus counter metric with unit: count -# TYPE PostprocessCallCount counter -PostprocessCallCount{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 1.0 -# HELP GPUUtilization Torchserve prometheus gauge metric with unit: Percent -# TYPE GPUUtilization gauge +WorkerLoadTime{WorkerName="W-9000-mnist_1.0",Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 7425.0 # HELP Requests5XX Torchserve prometheus counter metric with unit: Count # TYPE Requests5XX counter -# HELP HandlerMethodTime Torchserve prometheus gauge metric with unit: ms -# TYPE HandlerMethodTime gauge -HandlerMethodTime{MethodName="preprocess",ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 13.740777969360352 -# HELP MemoryAvailable Torchserve prometheus gauge metric with unit: Megabytes -# TYPE MemoryAvailable gauge -MemoryAvailable{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 4584.23828125 -# HELP InferenceRequestCount Torchserve prometheus 
counter metric with unit: count -# TYPE InferenceRequestCount counter -InferenceRequestCount{Hostname="88665a372f4b.ant.amazon.com",} 1.0 +# HELP CPUUtilization Torchserve prometheus gauge metric with unit: Percent +# TYPE CPUUtilization gauge +CPUUtilization{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 100.0 +# HELP WorkerThreadTime Torchserve prometheus gauge metric with unit: Milliseconds +# TYPE WorkerThreadTime gauge +WorkerThreadTime{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 3.0 +# HELP DiskAvailable Torchserve prometheus gauge metric with unit: Gigabytes +# TYPE DiskAvailable gauge +DiskAvailable{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 308.94310760498047 # HELP ts_inference_requests_total Torchserve prometheus counter metric with unit: Count # TYPE ts_inference_requests_total counter ts_inference_requests_total{model_name="mnist",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 1.0 +# HELP GPUMemoryUtilization Torchserve prometheus gauge metric with unit: Percent +# TYPE GPUMemoryUtilization gauge # HELP HandlerTime Torchserve prometheus gauge metric with unit: ms # TYPE HandlerTime gauge -HandlerTime{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 23.17 +HandlerTime{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 62.64 +# HELP ts_inference_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds +# TYPE ts_inference_latency_microseconds counter +ts_inference_latency_microseconds{model_name="mnist",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 64694.367 +# HELP MemoryUtilization Torchserve prometheus gauge metric with unit: Percent +# TYPE MemoryUtilization gauge +MemoryUtilization{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 53.1 +# HELP MemoryAvailable Torchserve prometheus gauge metric with unit: Megabytes +# TYPE MemoryAvailable gauge 
+MemoryAvailable{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 7677.29296875 +# HELP PostprocessCallCount Torchserve prometheus counter metric with unit: count +# TYPE PostprocessCallCount counter +PostprocessCallCount{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 1.0 # HELP ExamplePercentMetric Torchserve prometheus histogram metric with unit: percent # TYPE ExamplePercentMetric histogram ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="0.005",} 0.0 @@ -120,43 +129,42 @@ ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f ExamplePercentMetric_bucket{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",le="+Inf",} 1.0 ExamplePercentMetric_count{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 1.0 ExamplePercentMetric_sum{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 50.0 -# HELP WorkerThreadTime Torchserve prometheus gauge metric with unit: Milliseconds -# TYPE WorkerThreadTime gauge -WorkerThreadTime{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 3.0 -# HELP Requests2XX Torchserve prometheus counter metric with unit: Count -# TYPE Requests2XX counter -Requests2XX{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 1.0 +# HELP GPUUtilization Torchserve prometheus gauge metric with unit: Percent +# TYPE GPUUtilization gauge +# HELP MemoryUsed Torchserve prometheus gauge metric with unit: Megabytes +# TYPE MemoryUsed gauge +MemoryUsed{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 7903.734375 # HELP QueueTime Torchserve prometheus gauge metric with unit: Milliseconds # TYPE QueueTime gauge QueueTime{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 0.0 -# HELP MemoryUtilization Torchserve prometheus gauge metric with unit: Percent -# TYPE MemoryUtilization gauge -MemoryUtilization{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 72.0 -# HELP GPUMemoryUsed 
Torchserve prometheus gauge metric with unit: Megabytes -# TYPE GPUMemoryUsed gauge -# HELP ts_inference_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds -# TYPE ts_inference_latency_microseconds counter -ts_inference_latency_microseconds{model_name="mnist",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 26736.381 -# HELP DiskAvailable Torchserve prometheus gauge metric with unit: Gigabytes -# TYPE DiskAvailable gauge -DiskAvailable{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 306.9124526977539 +# HELP ts_queue_latency_microseconds Torchserve prometheus counter metric with unit: Microseconds +# TYPE ts_queue_latency_microseconds counter +ts_queue_latency_microseconds{model_name="mnist",model_version="default",hostname="88665a372f4b.ant.amazon.com",} 115.79 +# HELP PreprocessCallCount Torchserve prometheus counter metric with unit: count +# TYPE PreprocessCallCount counter +PreprocessCallCount{ModelName="mnist",Hostname="88665a372f4b.ant.amazon.com",} 1.0 # HELP RequestBatchSize Torchserve prometheus gauge metric with unit: count # TYPE RequestBatchSize gauge RequestBatchSize{ModelName="mnist",Hostname="88665a372f4b.ant.amazon.com",} 1.0 -# HELP DiskUsage Torchserve prometheus gauge metric with unit: Gigabytes -# TYPE DiskUsage gauge -DiskUsage{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 8.438858032226562 +# HELP SizeOfImage Torchserve prometheus gauge metric with unit: kB +# TYPE SizeOfImage gauge +SizeOfImage{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 0.265625 # HELP Requests4XX Torchserve prometheus counter metric with unit: Count # TYPE Requests4XX counter -# HELP MemoryUsed Torchserve prometheus gauge metric with unit: Megabytes -# TYPE MemoryUsed gauge -MemoryUsed{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 8699.48046875 -# HELP CPUUtilization Torchserve prometheus gauge metric with unit: Percent -# TYPE CPUUtilization gauge 
-CPUUtilization{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 33.3 +# HELP HandlerMethodTime Torchserve prometheus gauge metric with unit: ms +# TYPE HandlerMethodTime gauge +HandlerMethodTime{MethodName="preprocess",ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 25.554895401000977 +# HELP InitializeCallCount Torchserve prometheus counter metric with unit: count +# TYPE InitializeCallCount counter +InitializeCallCount{ModelName="mnist",Level="Model",Hostname="88665a372f4b.ant.amazon.com",} 1.0 # HELP DiskUtilization Torchserve prometheus gauge metric with unit: Percent # TYPE DiskUtilization gauge DiskUtilization{Level="Host",Hostname="88665a372f4b.ant.amazon.com",} 2.7 +# HELP GPUMemoryUsed Torchserve prometheus gauge metric with unit: Megabytes +# TYPE GPUMemoryUsed gauge +# HELP InferenceRequestCount Torchserve prometheus counter metric with unit: count +# TYPE InferenceRequestCount counter +InferenceRequestCount{Hostname="88665a372f4b.ant.amazon.com",} 1.0 ``` - Step 8: Navigate to `http://localhost:9090/` on a browser to execute queries and create graphs diff --git a/examples/custom_metrics/metrics.yaml b/examples/custom_metrics/metrics.yaml index 699ca19aea..a4f31fdfe3 100644 --- a/examples/custom_metrics/metrics.yaml +++ b/examples/custom_metrics/metrics.yaml @@ -72,6 +72,12 @@ model_metrics: - name: InferenceRequestCount unit: count dimensions: [] + - name: InitializeCallCount + unit: count + dimensions: [*model_name, *level] + - name: PreprocessCallCount + unit: count + dimensions: [*model_name] - name: PostprocessCallCount unit: count dimensions: [*model_name, *level] diff --git a/examples/custom_metrics/mnist_handler.py b/examples/custom_metrics/mnist_handler.py index db162d753d..632afd5a82 100644 --- a/examples/custom_metrics/mnist_handler.py +++ b/examples/custom_metrics/mnist_handler.py @@ -21,18 +21,31 @@ def initialize(self, context): super().initialize(context) metrics = context.metrics - # Usage of "add_metric" - 
self.inf_request_count = metrics.add_metric( + # "add_metric_to_cache" will only register/override(if already present) a metric object in the metric cache and will not emit it + self.inf_request_count = metrics.add_metric_to_cache( metric_name="InferenceRequestCount", unit="count", dimension_names=[], metric_type=MetricTypes.COUNTER, ) - metrics.add_metric( - metric_name="RequestBatchSize", + metrics.add_metric_to_cache( + metric_name="PreprocessCallCount", unit="count", dimension_names=["ModelName"], - metric_type=MetricTypes.GAUGE, + metric_type=MetricTypes.COUNTER, + ) + + # "add_metric" will register the metric if not already present in metric cache, + # include the "ModelName" and "Level" dimensions by default and emit it + metrics.add_metric( + name="InitializeCallCount", + value=1, + unit="count", + dimensions=[ + Dimension(name="ModelName", value=context.model_name), + Dimension(name="Level", value="Model"), + ], + metric_type=MetricTypes.COUNTER, ) def preprocess(self, data): @@ -50,10 +63,17 @@ def preprocess(self, data): metrics = self.context.metrics - # Usage of "add_or_update" + # "add_or_update" will emit the metric self.inf_request_count.add_or_update(value=1, dimension_values=[]) - # Usage of "get_metric" + # "get_metric" will fetch the corresponding metric from metric cache if present + preprocess_call_count_metric = metrics.get_metric( + metric_name="PreprocessCallCount", metric_type=MetricTypes.COUNTER + ) + preprocess_call_count_metric.add_or_update( + value=1, dimension_values=[self.context.model_name] + ) + request_batch_size_metric = metrics.get_metric( metric_name="RequestBatchSize", metric_type=MetricTypes.GAUGE ) @@ -63,7 +83,8 @@ def preprocess(self, data): input = data[0].get("body") - # Usage of "add_size" + # "add_size" will register the metric if not already present in metric cache, + # include the "ModelName" and "Level" dimensions by default and emit it metrics.add_size( name="SizeOfImage", value=len(input) / 1024, idx=None, 
unit="kB" ) @@ -72,7 +93,8 @@ def preprocess(self, data): preprocess_stop = time.time() - # usage of add_time + # "add_time" will register the metric if not already present in metric cache, + # include the "ModelName" and "Level" dimensions by default and emit it metrics.add_time( name="HandlerMethodTime", value=(preprocess_stop - preprocess_start) * 1000, @@ -93,11 +115,13 @@ def postprocess(self, data): Returns: list : A list of dictionary with predictons and explanations are returned. """ - # usage of add_counter + # "add_counter" will register the metric if not already present in metric cache, + # include the "ModelName" and "Level" dimensions by default and emit it self.context.metrics.add_counter( name="PostprocessCallCount", value=1, idx=None, dimensions=[] ) - # usage of add_percent + # "add_percent" will register the metric if not already present in metric cache, + # include the "ModelName" and "Level" dimensions by default and emit it self.context.metrics.add_percent( name="ExamplePercentMetric", value=50, From d7fe4eb9cddfe63c290802747f8b107a47d95300 Mon Sep 17 00:00:00 2001 From: Naman Nandan Date: Thu, 17 Aug 2023 18:19:01 -0700 Subject: [PATCH 5/7] Update custom metrics API documentation --- docs/metrics.md | 124 ++++++++++++++++++++------- ts/metrics/metric_cache_yaml_impl.py | 2 +- 2 files changed, 94 insertions(+), 32 deletions(-) diff --git a/docs/metrics.md b/docs/metrics.md index 168418d1d2..9c350c9dc6 100644 --- a/docs/metrics.md +++ b/docs/metrics.md @@ -10,7 +10,7 @@ * [Custom Metrics API](#custom-metrics-api) * [Logging custom metrics](#log-custom-metrics) * [Metrics YAML Parsing and Metrics API example](#Metrics-YAML-File-Parsing-and-Metrics-API-Custom-Handler-Example) -* [Backwards compatibility warnings](#backwards-compatibility-warnings) +* [Backwards compatibility warnings](#backwards-compatibility-warnings-and-upgrade-guide) ## Introduction @@ -197,7 +197,7 @@ When adding custom `model_metrics` in the metrics configuration file, ensure 
to ### How it works -Whenever torchserve starts, the [backend worker](https://github.com/pytorch/serve/blob/master/ts/model_service_worker.py) initializes `service.context.metrics` with the [MetricsCache](https://github.com/pytorch/serve/blob/master/ts/metrics/metric_cache_yaml_impl.py) object. The `model_metrics` (backend metrics) section within the specified yaml file will be parsed, and Metric objects will be created based on the parsed section and added that are added to the cache. +Whenever torchserve starts, the [backend worker](https://github.com/pytorch/serve/blob/master/ts/model_service_worker.py) initializes `service.context.metrics` with the [MetricsCache](https://github.com/pytorch/serve/blob/master/ts/metrics/metric_cache_yaml_impl.py) object. The `model_metrics` (backend metrics) section within the specified yaml file will be parsed, and Metric objects will be created based on the parsed section and added to the cache. This is all done internally, so the user does not have to do anything other than specifying the desired yaml file. @@ -248,7 +248,7 @@ When adding any metric via Metrics API, users have the ability to override the m `metric_type=MetricTypes.[COUNTER/GAUGE/HISTOGRAM]`. ```python -metric1 = metrics.add_metric("GenericMetric", unit=unit, dimension_names=["name1", "name2", ...], metric_type=MetricTypes.GAUGE) +metric1 = metrics.add_metric_to_cache("GenericMetric", unit=unit, dimension_names=["name1", "name2", ...], metric_type=MetricTypes.GAUGE) metric.add_or_update(value, dimension_values=["value1", "value2", ...]) # Backwards compatible, combines the above two method calls @@ -316,31 +316,35 @@ dimN= Dimension(name_n, value_n) One can add metrics with generic units using the following function. 
-Function API +#### Function API to add generic metrics without default dimensions ```python - def add_metric(self, metric_name: str, unit: str, idx=None, dimension_names: list = None, - metric_type: MetricTypes = MetricTypes.COUNTER) -> None: + def add_metric_to_cache( + self, + metric_name: str, + unit: str, + dimension_names: list = [], + metric_type: MetricTypes = MetricTypes.COUNTER, + ) -> CachingMetric: """ - Create a new metric and add into cache. - Add a metric which is generic with custom metrics + Create a new metric and add into cache. Override existing metric if already present. Parameters ---------- - metric_name: str + metric_name str Name of metric - value: int, float - value of metric - unit: str - unit of metric - idx: int - request_id index in batch - dimensions: list - list of dimensions for the metric - metric_type: MetricTypes - Type of metric + unit str + unit can be one of ms, percent, count, MB, GB or a generic string + dimension_names list + list of dimension name strings for the metric + metric_type MetricTypes + Type of metric Counter, Gauge, Histogram + Returns + ------- + newly created Metrics object """ + def add_or_update( self, value: int or float, @@ -365,10 +369,52 @@ Function API # Add Distance as a metric # dimensions = [dim1, dim2, dim3, ..., dimN] # Assuming batch size is 1 for example -metric = metrics.add_metric('DistanceInKM', unit='km', dimension_names=[...]) +metric = metrics.add_metric_to_cache('DistanceInKM', unit='km', dimension_names=[...]) metric.add_or_update(distance, dimension_values=[...]) ``` +Note that calling `add_metric_to_cache` will not emit the metric, `add_or_update` will need to be called on the metric object as shown above. 
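The register-then-emit split described above can be illustrated with a minimal stdlib-only mock (the real `MetricsCache` and `CachingMetric` classes live in `ts.metrics` and differ in detail; this sketch only shows that registration and emission are separate steps):

```python
class CachingMetric:
    """Mock of a cached metric: emission pairs cached dimension names with per-call values."""

    def __init__(self, metric_name, unit, dimension_names, metric_type):
        self.metric_name = metric_name
        self.unit = unit
        self.dimension_names = dimension_names
        self.metric_type = metric_type
        self.emitted = []

    def add_or_update(self, value, dimension_values):
        # Emission happens here, not at registration time.
        self.emitted.append((value, dict(zip(self.dimension_names, dimension_values))))


class MetricsCache:
    """Mock of the metric cache: add_metric_to_cache registers/overrides but never emits."""

    def __init__(self):
        self._cache = {}

    def add_metric_to_cache(self, metric_name, unit, dimension_names=None, metric_type="counter"):
        metric = CachingMetric(metric_name, unit, dimension_names or [], metric_type)
        self._cache[(metric_name, metric_type)] = metric  # register or override, no emission
        return metric


metrics = MetricsCache()
metric = metrics.add_metric_to_cache("DistanceInKM", unit="km", dimension_names=["ModelName", "Level"])
assert metric.emitted == []  # registration alone emits nothing
metric.add_or_update(10, dimension_values=["mnist", "Model"])  # emission happens here
assert metric.emitted == [(10, {"ModelName": "mnist", "Level": "Model"})]
```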
+
+#### Function API to add generic metrics with default dimensions
+
+```python
+    def add_metric(
+        self,
+        name: str,
+        value: int or float,
+        unit: str,
+        idx: str = None,
+        dimensions: list = [],
+        metric_type: MetricTypes = MetricTypes.COUNTER,
+    ):
+        """
+        Add a generic metric
+        Default metric type is counter
+
+        Parameters
+        ----------
+        name : str
+            metric name
+        value: int or float
+            value of the metric
+        unit: str
+            unit of metric
+        idx: str
+            request id to be associated with the metric
+        dimensions: list
+            list of Dimension objects for the metric
+        metric_type MetricTypes
+            Type of metric Counter, Gauge, Histogram
+        """
+```
+
+```python
+# Add Distance as a metric
+# dimensions = [dim1, dim2, dim3, ..., dimN]
+# Unlike "add_metric_to_cache", "add_metric" emits the metric directly and does not return a metric object
+metrics.add_metric('DistanceInKM', value=10, unit='km', dimensions=[...])
+```
+
+
 ### Add time-based metrics
 
 **Time-based metrics are defaulted to a `GAUGE` metric type**
@@ -629,22 +675,38 @@ class CustomHandlerExample:
         metrics.add_size("GaugeModelMetricNameExample", 42.5)
 ```
 
-## Backwards compatibility warnings
+## Backwards compatibility warnings and upgrade guide
 1. 
Starting [v0.6.1](https://github.com/pytorch/serve/releases/tag/v0.6.1), the `add_metric` API signature changed\ - from [add_metric(name, value, unit, idx=None, dimensions=None)](https://github.com/pytorch/serve/blob/61f1c4182e6e864c9ef1af99439854af3409d325/ts/metrics/metrics_store.py#L184)\ - to [add_metric(metric_name, unit, dimension_names, metric_type)](https://github.com/pytorch/serve/blob/35ef00f9e62bb7fcec9cec92630ae757f9fb0db0/ts/metrics/metric_cache_abstract.py#L272).\ + from: [add_metric(name, value, unit, idx=None, dimensions=None)](https://github.com/pytorch/serve/blob/61f1c4182e6e864c9ef1af99439854af3409d325/ts/metrics/metrics_store.py#L184)\ + to: [add_metric(metric_name, unit, dimension_names=None, metric_type=MetricTypes.COUNTER)](https://github.com/pytorch/serve/blob/35ef00f9e62bb7fcec9cec92630ae757f9fb0db0/ts/metrics/metric_cache_abstract.py#L272).\ + In versions greater than v0.8.1 the `add_metric` API signature was updated to support backwards compatibility:\ + from: [add_metric(metric_name, unit, dimension_names=None, metric_type=MetricTypes.COUNTER)](https://github.com/pytorch/serve/blob/35ef00f9e62bb7fcec9cec92630ae757f9fb0db0/ts/metrics/metric_cache_abstract.py#L272)\ + to: `add_metric(name, value, unit, idx=None, dimensions=[], metric_type=MetricTypes.COUNTER)`\ Usage of the new API is shown [above](#specifying-metric-types).\ + **Upgrade paths**: + - **[< v0.6.1] to [v0.6.1 - v0.8.1]**\ There are two approaches available when migrating to the new custom metrics API: - - Replace the call to `add_metric` in versions prior to v0.6.1 with calls to the following methods: - ``` - metric1 = metrics.add_metric("GenericMetric", unit=unit, dimension_names=["name1", "name2", ...], metric_type=MetricTypes.GAUGE) - metric1.add_or_update(value, dimension_values=["value1", "value2", ...]) - ``` - - Replace the call to `add_metric` in versions prior to v0.6.1 with one of the suitable custom metrics APIs where applicable: 
[add_counter](#add-counter-based-metrics), [add_time](#add-time-based-metrics), - [add_size](#add-size-based-metrics) or [add_percent](#add-percentage-based-metrics) + - Replace the call to `add_metric` with calls to the following methods: + ```python + metric1 = metrics.add_metric("GenericMetric", unit=unit, dimension_names=["name1", "name2", ...], metric_type=MetricTypes.GAUGE) + metric1.add_or_update(value, dimension_values=["value1", "value2", ...]) + ``` + - Replace the call to `add_metric` in versions prior to v0.6.1 with one of the suitable custom metrics APIs where applicable: [add_counter](#add-counter-based-metrics), [add_time](#add-time-based-metrics), + [add_size](#add-size-based-metrics) or [add_percent](#add-percentage-based-metrics) + - **[< v0.6.1] to [> v0.8.1]**\ + The call to `add_metric` is backwards compatible but the metric type is inferred to be `COUNTER`. If the metric is of a different type, an additional argument `metric_type` will need to be provided to the `add_metric` + call shown below + ```python + metrics.add_metric(name='GenericMetric', value=10, unit='count', dimensions=[...], metric_type=MetricTypes.GAUGE) + ``` + - **[v0.6.1 - v0.8.1] to [> v0.8.1]**\ + Replace the call to `add_metric` with `add_metric_to_cache`. 2. 
Starting [v0.8.0](https://github.com/pytorch/serve/releases/tag/v0.8.0), only metrics that are defined in the metrics config file(default: [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml)) are either all logged to `ts_metrics.log` and `model_metrics.log` or made available via the [metrics API endpoint](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md) based on the `metrics_mode` configuration as described [above](#introduction).\ The default `metrics_mode` is `log` mode.\ This is unlike in previous versions where all metrics were only logged to `ts_metrics.log` and `model_metrics.log` except for `ts_inference_requests_total`, `ts_inference_latency_microseconds` and `ts_queue_latency_microseconds` - which were only available via the metrics API endpoint. + which were only available via the metrics API endpoint.\ + **Upgrade paths**: + - **[< v0.8.0] to [>= v0.8.0]**\ + Specify all the custom metrics added to the custom handler in the metrics configuration file as shown [above](#central-metrics-yaml-file-definition). diff --git a/ts/metrics/metric_cache_yaml_impl.py b/ts/metrics/metric_cache_yaml_impl.py index 7206c83c30..fa170dd816 100644 --- a/ts/metrics/metric_cache_yaml_impl.py +++ b/ts/metrics/metric_cache_yaml_impl.py @@ -109,7 +109,7 @@ def add_metric_to_cache( metric_type: MetricTypes = MetricTypes.COUNTER, ) -> CachingMetric: """ - Create a new metric and add into cache + Create a new metric and add into cache. Override existing metric with same name if present. 
Parameters ---------- From c282d5f839c725e348cc898923ba19328b77800a Mon Sep 17 00:00:00 2001 From: Naman Nandan Date: Thu, 17 Aug 2023 18:28:21 -0700 Subject: [PATCH 6/7] Fix linter error --- ts_scripts/spellcheck_conf/wordlist.txt | 2 ++ 1 file changed, 2 insertions(+) diff --git a/ts_scripts/spellcheck_conf/wordlist.txt b/ts_scripts/spellcheck_conf/wordlist.txt index 7692ca780d..7618579767 100644 --- a/ts_scripts/spellcheck_conf/wordlist.txt +++ b/ts_scripts/spellcheck_conf/wordlist.txt @@ -1073,3 +1073,5 @@ HandlerMethodTime InferenceRequestCount PostprocessCallCount RequestBatchSize +InitializeCallCount +PreprocessCallCount From c4965dd798fb74564077484b12141b6f64e92a9f Mon Sep 17 00:00:00 2001 From: Naman Nandan Date: Fri, 18 Aug 2023 15:35:42 -0700 Subject: [PATCH 7/7] fix documentation --- docs/metrics.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/metrics.md b/docs/metrics.md index 9c350c9dc6..48b2065feb 100644 --- a/docs/metrics.md +++ b/docs/metrics.md @@ -10,7 +10,7 @@ * [Custom Metrics API](#custom-metrics-api) * [Logging custom metrics](#log-custom-metrics) * [Metrics YAML Parsing and Metrics API example](#Metrics-YAML-File-Parsing-and-Metrics-API-Custom-Handler-Example) -* [Backwards compatibility warnings](#backwards-compatibility-warnings-and-upgrade-guide) +* [Backwards compatibility warnings and upgrade guide](#backwards-compatibility-warnings-and-upgrade-guide) ## Introduction @@ -191,7 +191,8 @@ model_metrics: # backend metrics Note that **only** the metrics defined in the **metrics configuration file** can be emitted to logs or made available via the metrics API endpoint. This is done to ensure that the metrics configuration file serves as a central inventory of all the metrics that Torchserve can emit. 
 Default metrics are provided in the [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, but the user can either delete them to their liking / ignore them altogether, because these metrics will not be emitted unless they are edited.\
-When adding custom `model_metrics` in the metrics configuration file, ensure to include `ModelName` and `Level` dimension names towards the end of the list of dimensions since they are included by default by the following custom metrics APIs: [add_counter](#add-counter-based-metrics),
+When adding custom `model_metrics` in the metrics configuration file, be sure to include the `ModelName` and `Level` dimension names towards the end of the list of dimensions, since they are included by default by the following custom metrics APIs:
+[add_metric](#function-api-to-add-generic-metrics-with-default-dimensions), [add_counter](#add-counter-based-metrics),
 [add_time](#add-time-based-metrics), [add_size](#add-size-based-metrics) or [add_percent](#add-percentage-based-metrics).
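As an illustrative sketch of the guidance in the hunk above, a custom model metric entry in the metrics configuration file could look like the following (the metric name and the `MethodName` dimension are hypothetical; `*model_name` and `*level` are the YAML anchors already defined in the default metrics.yaml):

```yaml
model_metrics:
  - name: ExampleHandlerMetric   # hypothetical metric emitted from a custom handler
    unit: count
    # Custom dimension names come first; ModelName and Level come last, because the
    # add_metric/add_counter/add_time/add_size/add_percent APIs append those two
    # dimensions by default.
    dimensions: ["MethodName", *model_name, *level]
```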