Feature: implement the framework of the new metrics system #922

empiredan · 2022-03-04T10:51:38Z

1. Motivation

The main reason for introducing a new metrics system is the poor naming of perf-counters, as described in 2020-08-27-metric-api.md. The names are verbose: they contain meaningless words, and some terms that describe common properties would be better expressed as labels/tags.

There are also other reasons to refactor the perf-counter.

First, the perf-counter does not strictly separate metric types such as gauge and counter. For example, set is not an efficient operation for a counter.

Second, as the base class, perf-counter declares the abstract interfaces for every type of metric, and every subclass has to implement all of them. For example, it is unreasonable for a gauge to have a get_percentile method, even though it can simply be implemented as dassert(false, ...);.

Third, perf_counter_number_atomic has performance problems. It should be optimized, and a more efficient counter should be provided.

2. The Framework

Instead of implementing a new framework ourselves, an alternative is to adopt a mature and actively maintained metrics library for C++. However, there are few such libraries. https://github.com/jupp0r/prometheus-cpp is a possible candidate. It implements the Prometheus data model, which Prometheus supports natively, and its data can also be exported and translated into other metric data models. Nevertheless, its implementation is not efficient enough for a high-concurrency system; even sampling for Summary (a Prometheus metric type, also known as percentile) will block the task thread.

Therefore, as we have done before, we should introduce an abstraction layer for the new metrics system with well-defined metric types; then collect the metric data and translate it into the metric model supported by each specific monitoring system, such as Prometheus or Open Falcon; finally, the monitoring system gathers the metric data, where it can be observed conveniently.

There are some requirements for the new abstraction:

It should be applicable to all of the existing metrics;
It should support managing both the common and the exclusive properties of metrics, which may later be used as labels/tags;
It should provide separate and well-defined metric types;
Metric sampling should be fast, without excessive CPU and memory consumption, and must never block the task thread.

The new abstraction layer is inspired by Kudu Metrics. Next I'll describe how it is implemented.

2.1 Metric Registry

All metrics are maintained in the metric registry, which is a singleton. When a metric entity (see 2.2) is instantiated, it is registered in the registry.

The main use of the registry is to collect all metric data and export it to a specific data sink (see 2.5), since the registry knows every metric.
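A minimal sketch of the singleton idea follows. The names toy_registry, toy_entity, find_or_create and entity_ids are hypothetical and only illustrate the shape described above; the actual interface is left to the registry sub-issue.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a metric entity; only its ID matters here.
struct toy_entity {
    std::string id;
};

// Hypothetical registry: a process-wide singleton that knows every entity.
class toy_registry {
public:
    static toy_registry &instance() {
        static toy_registry r; // initialized once; thread-safe since C++11
        return r;
    }

    // Return the entity with this ID, creating and registering it if absent.
    std::shared_ptr<toy_entity> find_or_create(const std::string &id) {
        std::lock_guard<std::mutex> guard(_mtx);
        auto &slot = _entities[id];
        if (!slot) {
            slot = std::make_shared<toy_entity>(toy_entity{id});
        }
        return slot;
    }

    // IDs of all registered entities, which is what a collector would walk.
    std::vector<std::string> entity_ids() const {
        std::lock_guard<std::mutex> guard(_mtx);
        std::vector<std::string> ids;
        for (const auto &kv : _entities) {
            ids.push_back(kv.first);
        }
        return ids;
    }

private:
    toy_registry() = default;

    mutable std::mutex _mtx;
    std::map<std::string, std::shared_ptr<toy_entity>> _entities;
};
```

Because every entity passes through find_or_create, the registry can enumerate all of them when it is time to export data.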

2.2 Metric Entity Prototype

Each metric belongs to a level, called an entity: for example, the memory usage of a server, the put latency of a table, or the get latency of a replica. Server, table and replica are all entities.

The METRIC_DEFINE_entity(...) macro defines an entity prototype. An entity prototype describes a class of entities in the abstract. To become a concrete entity, a prototype must be instantiated with some attributes. As mentioned in 2.1, during instantiation the entity is also registered in the registry under its unique ID.

For example, a server can be instantiated with its IP or hostname; a table can be instantiated with its app id or name; a replica can be instantiated with its app id and partition id.
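The relationship between a prototype and a concrete entity could be sketched as below. All names (demo_entity, demo_entity_prototype, DEMO_DEFINE_entity) are hypothetical, the macro expansion is an assumption, and registration in the registry is omitted to keep the sketch short.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

using attr_map = std::map<std::string, std::string>;

// Hypothetical concrete entity: a type name, a unique ID and its attributes.
struct demo_entity {
    std::string type;
    std::string id;
    attr_map attrs;
};

// Hypothetical prototype: it only knows the entity type, e.g. "replica".
class demo_entity_prototype {
public:
    explicit demo_entity_prototype(std::string type) : _type(std::move(type)) {}

    // Turn the prototype into a concrete entity with an ID and attributes.
    std::shared_ptr<demo_entity> instantiate(const std::string &id,
                                             attr_map attrs = {}) const {
        return std::make_shared<demo_entity>(
            demo_entity{_type, id, std::move(attrs)});
    }

private:
    std::string _type;
};

// Roughly what a METRIC_DEFINE_entity-style macro could expand to (assumption).
#define DEMO_DEFINE_entity(name) \
    demo_entity_prototype DEMO_ENTITY_##name(#name)

DEMO_DEFINE_entity(replica);
```

With this shape, DEMO_ENTITY_replica.instantiate("test_app1:2", {{"table", "test_app1"}, {"partition", "2"}}) yields an entity whose attributes can later be emitted as labels/tags.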

2.3 Metric Prototype

Similar to an entity, a metric is instantiated from a metric prototype. A metric prototype defines the basic meta information of a metric, including the entity it is attached to, its name, its unit (such as bytes, milliseconds, operations per second, etc.), and its description. To define a metric prototype, use one of the METRIC_DEFINE_* macros.

Once a metric prototype is defined, it can be used to instantiate a metric, which is attached to a specific entity and thus maintained by the registry.
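As a rough illustration, a prototype can be thought of as immutable meta information that every instance points back to. The names below (demo_unit, demo_metric_meta, demo_metric, demo_metric_prototype) are hypothetical, and the entity-type check is an assumption about how attachment might be validated.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical units; the real set is not specified in this proposal.
enum class demo_unit { kBytes, kMilliSeconds, kNanoSeconds };

// Meta information carried by a prototype.
struct demo_metric_meta {
    std::string entity_type;
    std::string name;
    demo_unit unit;
    std::string description;
};

// A metric instance: shares the prototype's meta and owns its own value.
struct demo_metric {
    const demo_metric_meta *meta;
    std::string entity_id;
    int64_t value;
};

class demo_metric_prototype {
public:
    explicit demo_metric_prototype(demo_metric_meta meta)
        : _meta(std::move(meta)) {}

    // Attach a new metric to an entity; the entity type must match the one
    // the prototype was defined for.
    demo_metric instantiate(const std::string &entity_type,
                            const std::string &entity_id) const {
        if (entity_type != _meta.entity_type) {
            throw std::invalid_argument("metric " + _meta.name +
                                        " cannot be attached to " + entity_type);
        }
        return demo_metric{&_meta, entity_id, 0};
    }

private:
    demo_metric_meta _meta;
};
```

Keeping the meta information in the prototype means it is stored once, no matter how many entities carry the metric.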

2.4 The Types of Metrics

Based on the metrics currently used in Pegasus, the new framework has 5 types: gauge, counter, volatile counter, meter and percentile. For simplicity, only a brief introduction is given for each type here; the implementation details will be described in the issue for each type.

2.4.1 The Gauge

A gauge is a point-in-time measurement. The value read from a gauge is exactly the value that was most recently set.

The gauge supports 2 primitive types: int64_t for integer values and double for floating-point values.

An example of a gauge is meta*eon.meta_service*unalive_nodes, which shows the current number of dead replica servers.
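A gauge with these semantics can be sketched in a few lines. The class name demo_gauge and its methods are hypothetical; the sketch only shows the set/read behaviour and the restriction to the two value types.

```cpp
#include <atomic>
#include <cstdint>
#include <type_traits>

// Hypothetical gauge: stores the most recently set value.
template <typename T>
class demo_gauge {
    static_assert(std::is_same<T, int64_t>::value ||
                      std::is_same<T, double>::value,
                  "a gauge holds either int64_t or double");

public:
    explicit demo_gauge(T initial = T()) : _value(initial) {}

    // Overwrite the current value.
    void set(T v) { _value.store(v, std::memory_order_relaxed); }

    // Read back exactly what was last set.
    T value() const { return _value.load(std::memory_order_relaxed); }

private:
    std::atomic<T> _value;
};
```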

2.4.2 The Counter

Typically, a counter is a metric that only increases monotonically (like Counter in Prometheus). Pegasus has metrics of this kind, such as zion*profiler*RPC_RRDB_RRDB_PUT.cancelled.

However, Pegasus also has another kind of counter that may sometimes decrease, such as replica*app.pegasus*manual.compact.running.count, which increases when a new manual compaction starts and decreases when it finishes.

Therefore, the counter implemented for Pegasus should support both increment and decrement, just like Counters in Dropwizard.
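The interface of such a counter might look like the sketch below. demo_counter is a hypothetical name, and a single atomic is used purely for brevity: it does not address the contention concern raised in the motivation, which a more elaborate design (for example, striping the value across several cells) would have to solve.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical counter that can move in both directions.
class demo_counter {
public:
    void increment() { increment_by(1); }
    void decrement() { increment_by(-1); }

    // A negative delta decreases the counter.
    void increment_by(int64_t delta) {
        _value.fetch_add(delta, std::memory_order_relaxed);
    }

    // Reading does not modify the counter.
    int64_t value() const { return _value.load(std::memory_order_relaxed); }

private:
    std::atomic<int64_t> _value{0};
};
```

For the manual compaction example, increment() would be called when a compaction starts and decrement() when it finishes.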

2.4.3 The Volatile Counter

Fetching the value of a general counter is trivial: just return its current value.

However, many metrics in Pegasus are "recent", which means that the historically accumulated count is ignored. A "recent" counter is reset to 0 immediately after its value is fetched, so "recent" means the interval between 2 successive reads of the counter. For example, replica*eon.failure_detector*recent_beacon_fail_count shows how many beacons have failed recently.

For this scenario, a special counter (called a volatile counter) is implemented. The only difference from a general counter is that once its value is read, a volatile counter is immediately reset to 0, while a general counter is left unchanged.
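The read-and-reset behaviour maps naturally onto an atomic exchange, as in this sketch. The name demo_volatile_counter and its methods are hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical volatile counter: reading it also clears it.
class demo_volatile_counter {
public:
    void increment_by(int64_t delta) {
        _value.fetch_add(delta, std::memory_order_relaxed);
    }

    // Return the count accumulated since the previous read and reset it to 0
    // in a single atomic step, so no increment is lost between read and reset.
    int64_t fetch_and_reset() {
        return _value.exchange(0, std::memory_order_relaxed);
    }

private:
    std::atomic<int64_t> _value{0};
};
```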

2.4.4 The Meter

A meter measures the rate of occurrences of a set of events over time. In Pegasus the typical application is QPS, such as zion*profiler*RPC_RRDB_RRDB_GET.qps, which measures the rate at which a replica server reads a single value.

It should be noted that the underlying counter of a meter is volatile, which means it will be reset to 0 immediately after the value of the meter is fetched.
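One way to picture a meter is a volatile count divided by the length of the interval since the previous fetch. In the sketch below the names demo_meter, mark and fetch_rate are hypothetical, and the caller supplies the elapsed time explicitly, which is an assumption made to keep the example deterministic.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical meter built on a volatile count.
class demo_meter {
public:
    // Record n occurrences of the event.
    void mark(int64_t n = 1) { _count.fetch_add(n, std::memory_order_relaxed); }

    // Events per second over the given interval. The underlying count is
    // reset to 0 as part of the fetch.
    double fetch_rate(std::chrono::milliseconds elapsed) {
        const int64_t n = _count.exchange(0, std::memory_order_relaxed);
        if (elapsed.count() <= 0) {
            return 0.0;
        }
        return static_cast<double>(n) * 1000.0 /
               static_cast<double>(elapsed.count());
    }

private:
    std::atomic<int64_t> _count{0};
};
```

For instance, 500 marked requests fetched after 2000 milliseconds give a rate of 250 per second, and an immediate second fetch sees a count of 0.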

2.4.5 The Percentile

Like Summary in Prometheus, the percentile metric type samples observations and periodically calculates configurable percentiles (p50, p75, p90, p95, p99, p999, etc.) over the recent samples.

The most common use of the percentile is latency, such as zion*profiler*RPC_RRDB_RRDB_PUT.latency.server.p999, which is the 99.9th percentile of the server-side latency, measured from the moment the task is pushed into the queue to the point the server begins to reply to the client.
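The calculation step alone can be sketched with std::nth_element, which avoids fully sorting the samples. The function name percentile_at and the per-mille parameter (500 for p50, 990 for p99, 999 for p999) are assumptions made for this example; how samples are collected without blocking the task thread is a separate question that the sketch does not cover.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the sample at rank size * permille / 1000
// (clamped to the last element), or 0 if there are no samples.
// The vector is taken by value because nth_element reorders it.
int64_t percentile_at(std::vector<int64_t> samples, std::size_t permille) {
    if (samples.empty()) {
        return 0;
    }
    std::size_t idx = samples.size() * permille / 1000;
    if (idx >= samples.size()) {
        idx = samples.size() - 1;
    }
    // Partially order the samples so that samples[idx] holds the value it
    // would have if the whole vector were sorted.
    std::nth_element(samples.begin(),
                     samples.begin() + static_cast<std::ptrdiff_t>(idx),
                     samples.end());
    return samples[idx];
}
```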

2.5 Metric Data Sink

As described at the beginning of this chapter, the ultimate goal is to show the metrics in the monitoring system. The metric data sink, as its name implies, represents the metric model of a monitoring system, into which the collected metric data is written.

Given this design, the base class of the data sink must be abstract, with a subclass implemented for each target metric model, such as Prometheus or Open Falcon.

In the subclass, the metric data is translated into the specific model. For Prometheus, a client should be created and initialized, listening on a port from which the Prometheus server pulls data.
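The split between an abstract sink and a model-specific subclass might look like the sketch below. The names metric_sample, demo_sink and demo_prometheus_text_sink are hypothetical. The subclass only renders samples as lines in the Prometheus text exposition style (name, optional labels, value); it does not escape label values and does not start the HTTP endpoint that a real Prometheus sink would need.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Hypothetical neutral representation of one collected value.
struct metric_sample {
    std::string name;
    std::map<std::string, std::string> labels;
    int64_t value;
};

// Hypothetical abstract sink: one subclass per target metric model.
class demo_sink {
public:
    virtual ~demo_sink() = default;
    virtual void write(const metric_sample &s) = 0;
};

// Renders each sample as: name{label="value",...} value
class demo_prometheus_text_sink : public demo_sink {
public:
    void write(const metric_sample &s) override {
        _out << s.name;
        if (!s.labels.empty()) {
            _out << '{';
            bool first = true;
            for (const auto &kv : s.labels) {
                if (!first) {
                    _out << ',';
                }
                _out << kv.first << "=\"" << kv.second << '"';
                first = false;
            }
            _out << '}';
        }
        _out << ' ' << s.value << '\n';
    }

    // Everything written so far.
    std::string text() const { return _out.str(); }

private:
    std::ostringstream _out;
};
```

An Open Falcon sink would implement the same write interface but translate the sample into that system's model instead.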

2.6 Clean the Stale Metrics

A table can be dropped in Pegasus. Once it will no longer be recalled, all of its metrics become useless.

In the old metrics system (i.e. perf-counters), expired metrics are never cleared and remain in memory. Since a replica server process can run for a very long time, the stale metrics that are never cleaned up amount to a memory leak.

Thus a mechanism should be introduced to drop the stale metrics. We can use the mechanism employed by Kudu (see MetricEntity::RetireOldMetrics()).

Since a metric is held through ref_ptr, we can just check its reference count. Generally, if a table is in use, the count of each of its metrics is 2: one for the ref_ptr in the metric entity (which in turn is held by the metric registry), and another for the ref_ptr held by the user class (the table-related class). Once the user object is destructed, the count drops to 1, which means the metric is no longer needed. This can be detected periodically. If a metric is still unused after a configurable period of time, it is dropped from memory.

The same applies to the table-level entity: once a table is dropped (so the entity's ref count drops to 1), the whole table entity can be removed, which clears all of its metrics. In short, we can retire metrics from memory according to their ref counts.
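The retirement rule can be sketched with std::shared_ptr standing in for ref_ptr. Everything here is hypothetical (demo_metric_obj, demo_entity_holder, retire_old_metrics and its parameters), the current time is passed in to keep the example deterministic, and the sketch is single-threaded: a real implementation would need locking around the map.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

struct demo_metric_obj {
    int64_t value = 0;
};

// Hypothetical entity that owns one reference to each of its metrics.
class demo_entity_holder {
public:
    using clock = std::chrono::steady_clock;

    // Hand out a second reference to the user of the metric.
    std::shared_ptr<demo_metric_obj> find_or_create(const std::string &name) {
        auto &slot = _metrics[name];
        if (!slot.metric) {
            slot.metric = std::make_shared<demo_metric_obj>();
        }
        slot.marked_unused = false; // in use again
        return slot.metric;
    }

    // Called periodically. A metric whose only reference is the entity's own
    // is first marked as unused, then erased once it has stayed unused for at
    // least `retention`. Returns the number of metrics retired by this call.
    std::size_t retire_old_metrics(clock::time_point now,
                                   clock::duration retention) {
        std::size_t retired = 0;
        for (auto it = _metrics.begin(); it != _metrics.end();) {
            auto &slot = it->second;
            if (slot.metric.use_count() > 1) {
                slot.marked_unused = false; // still held by a user object
                ++it;
            } else if (!slot.marked_unused) {
                slot.marked_unused = true; // start the grace period
                slot.unused_since = now;
                ++it;
            } else if (now - slot.unused_since >= retention) {
                it = _metrics.erase(it);
                ++retired;
            } else {
                ++it;
            }
        }
        return retired;
    }

    std::size_t size() const { return _metrics.size(); }

private:
    struct slot_t {
        std::shared_ptr<demo_metric_obj> metric;
        bool marked_unused = false;
        clock::time_point unused_since;
    };
    std::map<std::string, slot_t> _metrics;
};
```

The grace period gives a table that is recalled shortly after being dropped the chance to pick up its existing metrics again instead of recreating them.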

3. Examples

3.1 Define & Instantiate Metric Entities

In this section, some examples are given to show how to define and instantiate metric entities.

3.1.1 Define & Instantiate Server Entity

// Define server entity
METRIC_DEFINE_entity(server);

// Instantiate server entity
auto server_entity = METRIC_ENTITY_server.instantiate("server");

3.1.2 Define & Instantiate Table Entity

// Define table entity
METRIC_DEFINE_entity(table);

// Instantiate table entity
std::string table_name("test_app1");
auto table_entity = METRIC_ENTITY_table.instantiate(
    table_name, 
    {{"table", table_name}}
);

3.1.3 Define & Instantiate Replica Entity

// Define replica entity
METRIC_DEFINE_entity(replica);

// Instantiate replica entity
std::string table_name("test_app1");
int32_t partition_id = 2;
auto replica_id = fmt::format("{}:{}", table_name, partition_id);
auto replica_entity = METRIC_ENTITY_replica.instantiate(
    replica_id, 
    {{"table", table_name}, {"partition", std::to_string(partition_id)}}
);

3.2 Define & Instantiate Metrics

In this section, the put latency is taken as an example to show how to define and instantiate metrics.

3.2.1 Define & Instantiate Server-level Metric

// Define server-level metric
METRIC_DEFINE_percentile(server, server_put_latency, kNanoSeconds,
    "the server-level latency of put requests");

// Instantiate metric
auto server_put_latency = METRIC_server_put_latency.instantiate(
    server_entity, 
    {90, 95, 99}
);

3.2.2 Define & Instantiate Table-level Metric

// Define table-level metric
METRIC_DEFINE_percentile(table, table_put_latency, kNanoSeconds,
    "the table-level latency of put requests");

// Instantiate metric
auto table_put_latency = METRIC_table_put_latency.instantiate(
    table_entity, 
    {90, 99}
);

3.2.3 Define & Instantiate Replica-level Metric

// Define replica-level metric
METRIC_DEFINE_percentile(replica, replica_put_latency, kNanoSeconds,
    "the replica-level latency of put requests");

// Instantiate metric
auto replica_put_latency = METRIC_replica_put_latency.instantiate(
    replica_entity, 
    {50, 75, 90, 95, 99, 999}
);

4. Schedule

All of the sub tasks to implement the new framework are listed here to track the status of each of them.

4.1 Implement the Metric Entity & its Prototype

Feature: implement the metric entity and its prototype #925

4.2 Implement the Metric Registry

Feature: implement the metric registry #927

4.3 Implement the Metric & its Prototype

Feature: implement the metric and its prototype #928

4.4 Implement the Metric Type of Gauge

4.5 Implement the Metric Types of Counters (includes the Volatile Counter)

4.6 Implement the Metric Type of Percentile

4.7 Merge prometheus-rebased-dev into master branch

Feature(new_metrics): merge prometheus-dev into master branch #1010

4.8 Collect Metrics

4.9 Implement the Metric Data Sink of Prometheus

4.10 Implement the Metric Data Sink of Open Falcon

4.11 Clean the Stale Metrics

empiredan added the type/enhancement Indicates new feature requests label Mar 4, 2022

This was referenced Mar 4, 2022

Introduce new metric API #756

Closed

Feature: implement the metric entity and its prototype #925

Closed

This was referenced Mar 11, 2022

Feature: implement the metric registry #927

Closed

Feature: implement the metric and its prototype #928

Closed

This was referenced Mar 22, 2022

Feature: implement the gauge #929

Closed

Feature: implement the counter #931

Closed

Feature: implement the volatile counter #933

Closed

empiredan mentioned this issue Jun 1, 2022

Feature(new_metrics): implement the percentile #991

Closed

empiredan mentioned this issue Aug 15, 2022

Feature(new_metrics): introduce data sink to collect snapshot from each metric periodically #1116

Open

empiredan mentioned this issue Jan 5, 2023

Feature(new_metrics): retire stale metric entities that are not used by any other object #1303

Closed

empiredan mentioned this issue Jan 28, 2023

Feature(new_metrics): migrate to the new metrics system #1325

Open

3 tasks
