From b20b99e0b3e69eb3b0ded0225b2468ea8c28bb81 Mon Sep 17 00:00:00 2001
From: Reiley Yang
Date: Thu, 25 Feb 2021 12:20:33 -0800
Subject: [PATCH] Metrics prototype scenario (#146)

* metrics prototype scenario
* clarify GA/stable
* add example to exemplar
* adjust the proposed timeline considering most folks will be on vacation in Dec.
* adjust the ToC
* adjust wording, fix typos and nits
* address review comments

Co-authored-by: Cijo Thomas
Co-authored-by: Leighton Chen
---
 .../0146-metrics-prototype-scenarios.md | 222 ++++++++++++++++++
 1 file changed, 222 insertions(+)
 create mode 100644 oteps/metrics/0146-metrics-prototype-scenarios.md

diff --git a/oteps/metrics/0146-metrics-prototype-scenarios.md b/oteps/metrics/0146-metrics-prototype-scenarios.md
new file mode 100644
index 000000000..fe370bde3
--- /dev/null
+++ b/oteps/metrics/0146-metrics-prototype-scenarios.md
@@ -0,0 +1,222 @@

# Scenarios for Metrics API/SDK Prototyping

With the stable release of the tracing specification, the OpenTelemetry
community is ready to put more energy into the metrics API/SDK. The goal is to
get the metrics API/SDK specification to the
[`Experimental`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#experimental)
state by the end of 5/2021, and to make it
[`Stable`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#stable)
before the end of 2021:

* By 5/31/2021, we should be confident enough to recommend that language
  client owners work on a metrics preview release. This means that starting
  from 6/1/2021 the specification should contain no major surprises or big
  changes, and we will start recommending that client maintainers implement
  it. We might still introduce additional features, but the bar for doing so
  will be high.

* By 9/30/2021, we should mark the metrics API/SDK specification as
  [`Feature-freeze`](https://github.com/open-telemetry/opentelemetry-specification/blob/1afab39e5658f807315abf2f3256809293bfd421/specification/document-status.md#feature-freeze),
  and focus on bug fixes and editorial changes.

* By 11/30/2021, we want a stable release of the metrics API/SDK
  specification, with multiple language SIGs providing RC (release candidate)
  or
  [stable](https://github.com/open-telemetry/opentelemetry-specification/blob/9047c91412d3d4b7f28b0f7346d8c5034b509849/specification/versioning-and-stability.md#stable)
  clients.

In this document, we will focus on two scenarios that we use for prototyping
the metrics API/SDK. The goal is to have two scenarios which clearly capture
the major requirements, so we can work with language client SIGs to prototype,
gather the learnings, and determine the scope and stages. Later, the scenarios
can be used as examples and test cases for all the language clients.
Here are the languages we've agreed to use during the prototyping:

* C#
* Java
* Python

To avoid taking on such an enormous task all at once, we will use an
incremental approach and divide the work into multiple stages:

1. Do an end-to-end prototype to gain an overall understanding of the problem
   domain. During this stage we should also clarify the scope and be able to
   articulate it precisely. For example:

   * Why do we want to introduce a brand new metrics API instead of adopting a
     well-established one (e.g. Prometheus or Micrometer)? What makes the
     OpenTelemetry metrics API different (e.g. Baggage)?
   * Do we need to consider an OpenCensus Stats API shim, or is that out of
     scope?

2. Focus on a core subset of the API, covering the end-to-end library
   instrumentation scenario. At this stage we don't expect to cover all the
   APIs, as some of them might be very similar (e.g. if we know how to record
   an integer, we don't have to work on float/double yet; we can add them
   later by replicating what we've done for integers).

3. Focus on a core subset of the SDK. This will give us the end-to-end
   application.

4. Repeat stage 2 to cover the complete set of APIs.

5. Repeat stage 3 to cover the complete SDK.

## Scenario 1: Grocery

The **Grocery** scenario covers how a developer could use the metrics API and
SDK in a final application. It is a self-contained application which covers:

* How to instrument the code in a vendor-agnostic way
* How to configure the SDK and exporter

Considering there might be multiple grocery stores, the metrics we collect
will have the store name as a dimension, which is fairly static (it does not
change while the store is running).

The store has a plentiful supply of potatoes and tomatoes, with the following
prices:

* Potato: $1.00 / ea
* Tomato: $3.00 / ea

Each customer has a unique name (e.g. customerA, customerB), and a customer
can come to the store multiple times. Here is the Python snippet:

```python
store = GroceryStore("Portland")
store.process_order("customerA", {"potato": 2, "tomato": 3})
store.process_order("customerB", {"tomato": 10})
store.process_order("customerC", {"potato": 2})
store.process_order("customerA", {"tomato": 1})
```

We will need the following metrics every minute:

**Order info:**

| Store    | Customer  | Number of Orders | Amount (USD) |
| -------- | --------- | ---------------- | ------------ |
| Portland | customerA | 2                | 14.00        |
| Portland | customerB | 1                | 30.00        |
| Portland | customerC | 1                | 2.00         |

**Items sold:**

| Store    | Customer  | Item   | Count |
| -------- | --------- | ------ | ----- |
| Portland | customerA | potato | 2     |
| Portland | customerA | tomato | 4     |
| Portland | customerB | tomato | 10    |
| Portland | customerC | potato | 2     |
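To make the scenario concrete, here is a minimal, self-contained sketch of how
`GroceryStore.process_order` might record these metrics. The `Counter` class,
the metric names (`grocery.orders`, `grocery.amount_usd`, `grocery.items_sold`)
and the price list are illustrative stand-ins for whatever API the prototypes
end up proposing; nothing below is part of this proposal.

```python
# Illustrative sketch only: the Counter class is a stand-in for a metrics-API
# counter instrument; metric names, dimension names and prices are assumptions.
from collections import defaultdict

PRICES = {"potato": 1.00, "tomato": 3.00}  # assumed price list


class Counter:
    """Stand-in for a counter instrument that supports dimensions."""

    def __init__(self, name):
        self.name = name
        self.values = defaultdict(float)

    def add(self, value, **dimensions):
        # The dimensions (store, customer, item, ...) form the aggregation key.
        self.values[tuple(sorted(dimensions.items()))] += value


class GroceryStore:
    def __init__(self, store_name):
        self.store_name = store_name  # static dimension, set once per store
        self.orders = Counter("grocery.orders")
        self.amount = Counter("grocery.amount_usd")
        self.items_sold = Counter("grocery.items_sold")

    def process_order(self, customer, items):
        # Record one order, its total amount, and the per-item counts.
        self.orders.add(1, store=self.store_name, customer=customer)
        self.amount.add(
            sum(PRICES[item] * count for item, count in items.items()),
            store=self.store_name,
            customer=customer,
        )
        for item, count in items.items():
            self.items_sold.add(
                count, store=self.store_name, customer=customer, item=item
            )
```

Running the order sequence from the earlier snippet against this sketch yields
exactly the per-customer totals shown in the two tables; the real SDK would
additionally handle the one-minute collection interval and the export.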
## Scenario 2: HTTP Server

The _HTTP Server_ scenario covers how a library developer X could use the
metrics API to instrument a library, and how an application developer Y can
configure that library to use the OpenTelemetry SDK in a final application. X
and Y work for different companies and they don't communicate. The demo has
two parts - the libraries (the HTTP lib and the ClimateControl lib, owned by
X) and the server app (owned by Y):

* How developer X could instrument the library code in a vendor-agnostic way
  * Performance is critical for X
  * X doesn't know which metrics and which dimensions Y will pick
  * X doesn't know the aggregation time window, nor the final destination of
    the metrics
  * X would like to provide some default recommendations (e.g. default
    dimensions, aggregation time window, histogram buckets) so consumers of
    the library can have a better onboarding experience.
* How developer Y could configure the SDK and exporter
  * How should Y hook up the metrics SDK with the library
  * How should Y configure the time window(s) and destination(s)
  * How should Y pick the metrics and the dimensions

### Library Requirements

The library developer (developer X) will provide two libraries:

* Server climate control library - a library which monitors and controls the
  temperature and humidity of the server.
* HTTP server library - a library which provides HTTP service.

Both libraries will provide out-of-the-box metrics. The metrics fall into two
categories (a sketch illustrating both categories follows the
**Received HTTP requests** table below):

* Push metrics - the value is reported (via the API) when it becomes
  available, and collected (via the SDK) based on the ask from consumer(s).
  If there is no ask from the consumer, the API call is a no-op and the data
  is dropped on the floor.
* Pull metrics - the value is always available, and is only reported and
  collected based on the ask from consumer(s). If there is no ask from the
  consumer, the value is not reported at all (e.g. there is no API call to
  fetch the temperature unless someone is asking for the temperature).

#### Server Climate Control Library

Note: the **Host Name** should leverage the [`OpenTelemetry
Resource`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/sdk.md),
so it should be covered by the metrics SDK rather than the API, and strictly
speaking it is not considered a "dimension" from the SDK perspective.

**Server temperature:**

| Host Name | Temperature (F) |
| --------- | --------------- |
| MachineA  | 65.3            |

**Server humidity:**

| Host Name | Humidity (%) |
| --------- | ------------ |
| MachineA  | 21           |

#### HTTP Server Library

**Received HTTP requests:**

Note: the **Client Type** is passed in via the [`OpenTelemetry
Baggage`](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/baggage/api.md);
strictly speaking it is not part of the metrics API, but it is considered a
"dimension" from the metrics SDK perspective.

| Host Name | Process ID | Client Type | HTTP Method | HTTP Host | HTTP Flavor | Peer IP   | Peer Port | Host IP   | Host Port |
| --------- | ---------- | ----------- | ----------- | --------- | ----------- | --------- | --------- | --------- | --------- |
| MachineA  | 1234       | Android     | GET         | otel.org  | 1.1         | 127.0.0.1 | 51327     | 127.0.0.1 | 80        |
| MachineA  | 1234       | Android     | POST        | otel.org  | 1.1         | 127.0.0.1 | 51328     | 127.0.0.1 | 80        |
| MachineA  | 1234       | iOS         | PUT         | otel.org  | 1.1         | 127.0.0.1 | 51329     | 127.0.0.1 | 80        |
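The sketch below illustrates, under assumed stand-in names, how developer X's
libraries might expose both categories of metrics: a push-style counter
recorded on every received request, with the Client Type resolved from Baggage
and treated as a dimension, and a pull-style observable gauge whose callback
reads the temperature only when a consumer collects it. None of the names
(`Counter`, `ObservableGauge`, `read_temperature_f`, `client.type`, etc.) are
the proposed API; they exist only to make the push vs. pull distinction
concrete.

```python
# Illustrative stand-ins only: these classes are NOT the proposed OpenTelemetry
# API; they show the push vs. pull distinction in runnable form.
import random


class Counter:
    """Push-style instrument: the library reports a value whenever it has one."""

    def __init__(self, name):
        self.name = name

    def add(self, value, **dimensions):
        # In a real SDK this would be a no-op unless some consumer asked for
        # the metric; here we print to show what would be recorded.
        print(f"record {self.name} +{value} {dimensions}")


class ObservableGauge:
    """Pull-style instrument: the callback runs only when a consumer collects."""

    def __init__(self, name, callback):
        self.name = name
        self.callback = callback

    def collect(self):
        return self.callback()


def read_temperature_f():
    # Stand-in for the climate control library probing the actual sensor.
    return 65.3 + random.uniform(-0.5, 0.5)


# --- Developer X's HTTP server library (push metric) ---
received_requests = Counter("http.server.received_requests")


def on_request(method, host, baggage):
    # Client Type is not an API argument; it is read from Baggage and becomes
    # a dimension from the SDK perspective.
    received_requests.add(
        1,
        http_method=method,
        http_host=host,
        client_type=baggage.get("client.type", "unknown"),
    )


# --- Developer X's climate control library (pull metric) ---
server_temperature = ObservableGauge("server.temperature_f", read_temperature_f)

# The HTTP library records eagerly on every request ...
on_request("GET", "otel.org", {"client.type": "Android"})
# ... while the temperature is only read when the SDK collects it.
print(f"collect {server_temperature.name} = {server_temperature.collect():.1f}")
```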
**HTTP server request duration:**

Note: the server duration is only available for **finished HTTP requests**.

| Host Name | Process ID | Client Type | HTTP Method | HTTP Host | HTTP Status Code | HTTP Flavor | Peer IP   | Peer Port | Host IP   | Host Port | Duration (ms) |
| --------- | ---------- | ----------- | ----------- | --------- | ---------------- | ----------- | --------- | --------- | --------- | --------- | ------------- |
| MachineA  | 1234       | Android     | GET         | otel.org  | 200              | 1.1         | 127.0.0.1 | 51327     | 127.0.0.1 | 80        | 8.5           |
| MachineA  | 1234       | Android     | POST        | otel.org  | 304              | 1.1         | 127.0.0.1 | 51328     | 127.0.0.1 | 80        | 100.0         |

### Application Requirements

The application owner (developer Y) only wants the following metrics:

* Server temperature - reported every 5 seconds
* Server humidity - reported every minute
* HTTP server request duration, reported every 5 seconds, with a subset of
  the dimensions:
  * Host Name
  * HTTP Method
  * HTTP Host
  * HTTP Status Code
  * Client Type
  * 90%, 95%, 99% and 99.9% server duration
* HTTP request counters, reported every 5 seconds:
  * Total number of received HTTP requests
  * Total number of finished HTTP requests
  * Number of currently-in-flight HTTP requests (concurrent HTTP requests)

  | Host Name | Process ID | HTTP Host | Received Requests | Finished Requests | Concurrent Requests |
  | --------- | ---------- | --------- | ----------------- | ----------------- | ------------------- |
  | MachineA  | 1234       | otel.org  | 630               | 601               | 29                  |
  | MachineA  | 5678       | otel.org  | 1005              | 1001              | 4                   |

* Exception samples (exemplar) - when an HTTP 5xx occurs, developer Y wants
  to see a sample request with trace id, span id and all the dimensions (IP,
  Port, etc.)

  | Trace ID                         | Span ID          | Host Name | Process ID | Client Type | HTTP Method | HTTP Host | HTTP Status Code | HTTP Flavor | Peer IP   | Peer Port | Host IP   | Host Port | Exception            |
  | -------------------------------- | ---------------- | --------- | ---------- | ----------- | ----------- | --------- | ---------------- | ----------- | --------- | --------- | --------- | --------- | -------------------- |
  | 8389584945550f40820b96ce1ceb9299 | 745239d26e408342 | MachineA  | 1234       | iOS         | PUT         | otel.org  | 500              | 1.1         | 127.0.0.1 | 51329     | 127.0.0.1 | 80        | SocketException(...) |
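To tie these application requirements back to the open questions above (how Y
configures time windows, destinations, and dimension subsets without talking
to X), here is a rough, hypothetical configuration sketch. The `View` shape
and every name in it are assumptions made for illustration; how the SDK
actually exposes this kind of configuration, and how 5xx exemplars are
attached, is exactly what the prototypes are meant to figure out.

```python
# Hypothetical configuration sketch for developer Y; names and structure are
# assumptions, not the proposed SDK surface.
from dataclasses import dataclass, field
from typing import List


@dataclass
class View:
    instrument: str                       # which library metric to collect
    interval_seconds: int                 # aggregation / collection time window
    dimensions: List[str] = field(default_factory=list)  # dimension subset to keep
    aggregation: str = "sum"              # e.g. "sum", "last_value", "histogram"
    percentiles: List[float] = field(default_factory=list)


request_counter_dims = ["host.name", "process.id", "http.host"]

views = [
    View("server.temperature_f", interval_seconds=5, aggregation="last_value"),
    View("server.humidity_percent", interval_seconds=60, aggregation="last_value"),
    View(
        "http.server.duration_ms",
        interval_seconds=5,
        dimensions=["host.name", "http.method", "http.host",
                    "http.status_code", "client.type"],
        aggregation="histogram",
        percentiles=[0.90, 0.95, 0.99, 0.999],
    ),
    View("http.server.received_requests", interval_seconds=5,
         dimensions=request_counter_dims),
    View("http.server.finished_requests", interval_seconds=5,
         dimensions=request_counter_dims),
    View("http.server.active_requests", interval_seconds=5,
         dimensions=request_counter_dims),
]

for view in views:
    print(view)
```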