# Updated example for custom metrics and add backwards compatibility warnings and upgrade guide for metrics APIs #2516

Merged Aug 24, 2023 (13 commits)

**docs/metrics.md** (52 changes: 39 additions & 13 deletions)
* [Custom Metrics API](#custom-metrics-api)
* [Logging custom metrics](#log-custom-metrics)
* [Metrics YAML Parsing and Metrics API example](#Metrics-YAML-File-Parsing-and-Metrics-API-Custom-Handler-Example)
* [Backwards compatibility warnings](#backwards-compatibility-warnings)

## Introduction

Metrics are collected by default at the following locations in `log` mode:

* Frontend metrics log file: `log_directory/ts_metrics.log`
* Backend metrics log file: `log_directory/model_metrics.log`

The location of log files and metric files can be configured in the [log4j2.xml](https://github.com/pytorch/serve/blob/master/frontend/server/src/main/resources/log4j2.xml) file

In `prometheus` mode, all metrics are made available in prometheus format via the [metrics API endpoint](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md).
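
For instance, a minimal sketch of switching modes and scraping the endpoint; the `metrics_mode` key in `config.properties` and the default metrics port `8082` are assumptions based on a standard TorchServe setup:

```properties
# config.properties sketch: switch metrics from the default 'log' mode to 'prometheus' mode
metrics_mode=prometheus
```

```python
import requests

# Scrape the metrics API endpoint; port 8082 is TorchServe's assumed default here.
# The response body is in Prometheus exposition format.
response = requests.get("http://127.0.0.1:8082/metrics")
print(response.text)
```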

## Frontend Metrics

```yaml
# ...
model_metrics: # backend metrics
```


Note that **only** the metrics defined in the **metrics configuration file** can be emitted to logs or made available via the metrics API endpoint. This ensures that the metrics configuration file serves as a central inventory of all the metrics that TorchServe can emit.

Default metrics are provided in the [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, but the user can delete them or ignore them altogether, because these metrics are not emitted unless they are edited.\
When adding custom `model_metrics` in the metrics configuration file, be sure to include the `ModelName` and `Level` dimension names towards the end of the list of dimensions, since they are included by default by the following custom metrics APIs: [add_counter](#add-counter-based-metrics),
[add_time](#add-time-based-metrics), [add_size](#add-size-based-metrics) and [add_percent](#add-percentage-based-metrics).
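
For illustration, a custom backend metric entry in the metrics configuration file could look like the following sketch; the metric name, unit, and the extra `model_version` dimension are hypothetical, and the layout mirrors the default metrics.yaml:

```yaml
model_metrics:
  counter:
    - name: InferenceRequestCount        # hypothetical custom metric
      unit: count
      dimensions: [model_version, ModelName, Level]   # ModelName and Level kept at the end
```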


### How it works

### Add time-based metrics

Add time-based metrics by invoking the following method:
Function API

```python
def add_time(self, name: str, value: int or float, idx=None, unit: str = 'ms', dimensions: list = None,
             metric_type: MetricTypes = MetricTypes.GAUGE):
    """
    Add a time based metric like latency, default unit is 'ms'
    Default metric type is gauge

    Parameters
    ----------
    name : str
        metric name
    value: int, float
        value of metric
    idx: int
        request_id index in batch
    unit: str
        unit of metric, default 'ms'
    dimensions: list
        list of dimensions for the metric
    metric_type: MetricTypes
        type for defining different operations, defaulted to gauge metric type for time metrics
    """
```
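
A brief usage sketch follows; it assumes a module-level custom handler entry point where `context.metrics` provides the metrics cache object, and the metric name and timing are illustrative:

```python
import time


def handle(data, context):
    metrics = context.metrics  # metrics cache object, assumed per the custom handler contract
    start_time = time.time()
    # ... run inference on `data` here ...
    stop_time = time.time()
    # Emits a gauge-type time metric in milliseconds (metric name is illustrative)
    metrics.add_time("HandlerTime", round((stop_time - start_time) * 1000, 2), None, "ms")
```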

### Add size-based metrics

Add size-based metrics by invoking the following method:
Function API

```python
def add_size(self, name: str, value: int or float, idx=None, unit: str = 'MB', dimensions: list = None,
             metric_type: MetricTypes = MetricTypes.GAUGE):
    """
    Add a size based metric
    Default metric type is gauge

    Parameters
    ----------
    name : str
        metric name
    value: int, float
        value of metric
    idx: int
        request_id index in batch
    unit: str
        unit of metric, default 'MB'
    dimensions: list
        list of dimensions for the metric
    metric_type: MetricTypes
        type for defining different operations, defaulted to gauge metric type for size metrics
    """
```
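
A brief usage sketch, under the same assumed handler context as above (the metric name and computation are illustrative):

```python
import sys


def handle(data, context):
    metrics = context.metrics  # assumed custom handler context
    payload_mb = sys.getsizeof(data) / (1024 * 1024)
    # Emits a gauge-type size metric; the unit defaults to 'MB'
    metrics.add_size("RequestPayloadSize", round(payload_mb, 3))
```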

### Add percentage-based metrics

Percentage-based metrics can be added by invoking the following method:
Function API

```python
def add_percent(self, name: str, value: int or float, idx=None, dimensions: list = None,
                metric_type: MetricTypes = MetricTypes.GAUGE):
    """
    Add a percentage based metric
    Default metric type is gauge

    Parameters
    ----------
    name : str
        metric name
    value: int, float
        value of metric
    idx: int
        request_id index in batch
    dimensions: list
        list of dimensions for the metric
    metric_type: MetricTypes
        type for defining different operations, defaulted to gauge metric type for percent metrics
    """
```

**Inferred unit**: `percent`

To add custom percentage-based metrics:

```python
# Illustrative sketch; assumes a custom handler where `metrics = context.metrics`
metrics.add_percent("MemoryUtilizationPercent", 52.5)
```

### Add counter-based metrics

Counter-based metrics can be added by invoking the following method:
Function API

```python
def add_counter(self, name: str, value: int or float, idx=None, dimensions: list = None):
    """
    Add a counter metric or increment an existing counter metric
    Default metric type is counter

    Parameters
    ----------
    name : str
        metric name
    value: int or float
        value of metric
    idx: int
        request_id index in batch
    dimensions: list
        list of dimensions for the metric
    """
```

**Inferred unit**: `count`
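
A brief usage sketch, under the same assumed handler context as above (the metric name is illustrative):

```python
def handle(data, context):
    metrics = context.metrics  # assumed custom handler context
    # Creates the counter on first use and increments it on subsequent calls
    metrics.add_counter("InferenceRequestCount", 1)
```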

### Getting a metric

Users can get a metric from the cache. The `Metric` object is returned, so the user can access the methods of the metric (e.g. `Metric.update(value)`, `Metric.__str__`).
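
A sketch of working with the returned `Metric` object directly, built on the `add_metric`/`add_or_update` usage shown in the backwards compatibility section below (the names, values and import path are illustrative assumptions):

```python
from ts.metrics.metric_type_enum import MetricTypes  # assumed import path

metric = metrics.add_metric(
    "GenericMetric",
    unit="ms",
    dimension_names=["ModelName", "Level"],
    metric_type=MetricTypes.GAUGE,
)
metric.add_or_update(42.0, dimension_values=["my_model", "Model"])  # dimension values supplied at update time
print(metric)  # Metric.__str__ renders the metric
```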

## Metrics YAML File Parsing and Metrics API Custom Handler Example

```python
class CustomHandlerExample:
    def handle(self, data, context):
        metrics = context.metrics
        # ... earlier custom metric calls from this example ...
        # except this time with gauge metric type object
        metrics.add_size("GaugeModelMetricNameExample", 42.5)
```

## Backwards compatibility warnings
1. Starting with [v0.6.1](https://github.com/pytorch/serve/releases/tag/v0.6.1), the `add_metric` API signature changed\
from [add_metric(name, value, unit, idx=None, dimensions=None)](https://github.com/pytorch/serve/blob/61f1c4182e6e864c9ef1af99439854af3409d325/ts/metrics/metrics_store.py#L184)\
to [add_metric(metric_name, unit, dimension_names, metric_type)](https://github.com/pytorch/serve/blob/35ef00f9e62bb7fcec9cec92630ae757f9fb0db0/ts/metrics/metric_cache_abstract.py#L272).\
Usage of the new API is shown [above](#specifying-metric-types).\
There are two approaches available when migrating to the new custom metrics API:
- Replace the call to `add_metric` in versions prior to v0.6.1 with calls to the following methods:
```python
metric1 = metrics.add_metric("GenericMetric", unit=unit, dimension_names=["name1", "name2", ...], metric_type=MetricTypes.GAUGE)
metric1.add_or_update(value, dimension_values=["value1", "value2", ...])
```
- Replace the call to `add_metric` in versions prior to v0.6.1 with one of the suitable custom metrics APIs where applicable: [add_counter](#add-counter-based-metrics), [add_time](#add-time-based-metrics),
[add_size](#add-size-based-metrics) or [add_percent](#add-percentage-based-metrics).
2. Starting with [v0.8.0](https://github.com/pytorch/serve/releases/tag/v0.8.0), only metrics that are defined in the metrics configuration file (default: [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml))
are either all logged to `ts_metrics.log` and `model_metrics.log` or all made available via the [metrics API endpoint](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md),
based on the `metrics_mode` configuration described [above](#introduction).\
The default `metrics_mode` is `log`.\
This is unlike previous versions, where all metrics were logged only to `ts_metrics.log` and `model_metrics.log`, except for `ts_inference_requests_total`, `ts_inference_latency_microseconds` and `ts_queue_latency_microseconds`,
which were only available via the metrics API endpoint.

**Review discussion on the `add_metric` signature change:**

> **msaroufim (Member):** So the old metrics were always counters? If that's the case, then keeping BC automatically shouldn't be too hard.

> **Collaborator (Author):** The prior metrics implementation did not have types associated with metrics; the new implementation adds support for `MetricTypes`. In addition, while the prior implementation did not have a way to specify metrics and their specifications (name, unit, dimension names and type) in a central configuration file, the new implementation introduces this. As a result, the semantics of the `add_metric` method changed from "create a metric object and store it in a list to emit" to "add a metric object consisting only of its specifications (name, unit, dimension names and type) to a metrics cache". The dimension values are provided at the time of updating a metric using the `add_or_update` method.

> **Collaborator (Author):** A couple of options to ensure backwards compatibility:
> 1. Introduce a new method in `metric_cache_abstract.py`, say `add_metric_bc`, which has the same signature as the old `add_metric` API. This method can internally call `add_metric` and then `add_or_update` on the metric object. The default metric type in this case would be counter.
> 2. Rename the new `add_metric` method to `add_metric_to_cache` and reimplement `add_metric` with the same signature as the old implementation. The `add_metric` API can then internally call `add_metric_to_cache` and `add_or_update` on the metric object.
>
> Please share your thoughts on these approaches.

> **msaroufim (Member), Aug 10, 2023:** I like 2. `add_metric_to_cache()` seems more like an internal detail, whereas what a user wants to do is `add_metric()`. While the semantics do change, the code won't break, and that seems like a win.

> **Collaborator (Author):** Draft PR to implement backwards compatibility for the `add_metric` API: #2525
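
To make warning 1 concrete, a minimal migration sketch follows; the metric name, values, dimensions and import path are hypothetical, with the old call shape following the pre-v0.6.1 signature linked above:

```python
from ts.metrics.metric_type_enum import MetricTypes  # assumed import path

# Pre-v0.6.1 style (old signature: add_metric(name, value, unit, idx=None, dimensions=None)):
# metrics.add_metric("InferenceLatencyMs", 2.78, "ms")

# Post-v0.6.1 equivalent: register the metric's specification once, then supply the
# value and dimension values at update time via add_or_update.
metric = metrics.add_metric(
    "InferenceLatencyMs",
    unit="ms",
    dimension_names=["ModelName", "Level"],
    metric_type=MetricTypes.GAUGE,
)
metric.add_or_update(2.78, dimension_values=["my_model", "Model"])
```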