From 6419946b8b89a00ef9c5ae156c22c20d433ff4a7 Mon Sep 17 00:00:00 2001 From: jajanet Date: Fri, 3 Jul 2020 11:44:15 -0400 Subject: [PATCH 01/17] Fix links that weren't working --- specification/glossary.md | 2 +- specification/trace/sdk.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/specification/glossary.md b/specification/glossary.md index bf4b34c4ee6..e6bc756a861 100644 --- a/specification/glossary.md +++ b/specification/glossary.md @@ -33,7 +33,7 @@ Example: `org.mongodb.client`. ### Instrumentation Library -Denotes the library that provides the instrumentation for a given [Instrumented Library](#instrumented_library). +Denotes the library that provides the instrumentation for a given [Instrumented Library](#instrumented-library). *Instrumented Library* and *Instrumentation Library* may be the same library if it has built-in OpenTelemetry instrumentation. diff --git a/specification/trace/sdk.md b/specification/trace/sdk.md index b5f58a59d41..ac11dd45bc6 100644 --- a/specification/trace/sdk.md +++ b/specification/trace/sdk.md @@ -142,7 +142,7 @@ TODO: Split out the parent handling. ## Tracer Creation New `Tracer` instances are always created through a `TracerProvider` (see -[API](api.md#tracerprovicer)). The `name` and `version` arguments +[API](api.md#tracerprovider)). The `name` and `version` arguments supplied to the `TracerProvider` must be used to create an [`InstrumentationLibrary`][otep-83] instance which is stored on the created `Tracer`. 
From 2eb0e139ef031d607433e8db06dc98863406cc6c Mon Sep 17 00:00:00 2001 From: jajanet Date: Fri, 7 Aug 2020 13:30:07 -0400 Subject: [PATCH 02/17] Start zPages spec with basic details --- experimental/zpages.md | 112 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 112 insertions(+) create mode 100644 experimental/zpages.md diff --git a/experimental/zpages.md b/experimental/zpages.md new file mode 100644 index 00000000000..ea5f99a6cbe --- /dev/null +++ b/experimental/zpages.md @@ -0,0 +1,112 @@ +# zPages +## Table of Contents +- [Overview](#overview) + - [Types of zPages](types-of-zpages) + - [TraceZ](#tracez) + - [TraceConfigz](#traceconfigz) + - [RPCz](#rpcz) + - [StatsZ](#statsz) +- [Design and Implementation](#design) + - [TraceZ](#tracez-details) + - [TraceConfigZ](#traceconfigz-details) + - [RPCz](#rpcz-details) + - [StatsZ](#statsz-details) + - [Shared zPages Components](#shared-zpages-components) + - [Wrapper](#wrapper) + - [HTTP Server](#http-server) +- [Future possibilities / Exploration](#future-possibilities) + - [Out-process Implementation](#out-process) + - [Shared Static Files](#shared-static-files) + +## Overview +zPages are webpages that allow easy viewing of tracing/metrics information. When included for a process, zPages will display basic information about that process on a webpage. + +The idea of "zPages" originates from one of OpenTelemetry's predecessors, [OpenCensus](https://opencensus.io/). You can read more about it [here](https://opencensus.io/zpages). OpenCensus has different zPage implementations in [Java](https://opencensus.io/zpages/java/), [Go](https://opencensus.io/zpages/go/), and [Node](https://opencensus.io/zpages/node/) and there has been similar internal solutions developed at companies like Uber. Within OpenTelemetry, zPages are also either developed or being developed in [C#](https://github.com/open-telemetry/opentelemetry-dotnet/tree/master/src/OpenTelemetry.Exporter.ZPages), Java, and C++. 
+ +While zPages are uniquely useful in being more lightweight and quicker compared to installing external exporters like Jaeger and Zipkin, they still offer many useful ways to debug and gain insight into applications. The uses depend on the type of zPage, which is detailed below + +## Types of zPages +### TraceZ +TraceZ shows information on tracing, including aggregation counts for latency, running, and errors for spans grouped by name. It also allows users to look closer at details for spans that are sampled. + +This type of zPage is useful particularly for debugging latency issues (slow parts of applications), deadlocks (running spans that don't end), and errors (where error happen and what types). They're also good for spotting patterns, like seeing what speeds are typical for operations with a given span name. + +You can read about TraceZ more [here](https://opencensus.io/zpages/java/#tracez). + +### TraceConfigZ + +TraceConfigZ is closely related to TraceZ, allowing the user to modify how spans are sampled or how much data to keep in TraceZ by updating the TraceZ components accordingly. + +For example, the sampling probability can be increased, decreased, or customized in other ways (i.e. depending on span parentage). Number of kept attritubtes, annotations, events, and links can also be adjusted. + +You can read about TraceConfigZ more [here](https://opencensus.io/zpages/java/#traceconfigz). + +### RPCz +RPCz provides details on sent and received RPC messages, which is categorized by RPC methods. This includes overall and error counts, average latency per call, RPCs sent per second, and input/output size per second. + +You can read about RPCz more [here](https://docs.google.com/document/d/1RWNyUIaKTYK12tck_rQjki4jTyHFfkD8sk54mjwRwso/edit#) and [here](https://opencensus.io/zpages/java/#rpcz). + +### StatsZ +StatsZ is focused more on metrics, displays stats and measues for any exported views. 
These views are grouped into directories using their namespaces + +You can read more about StatZ [here](https://opencensus.io/zpages/java/#statsz) + +## Design and Implementation +### TraceZ Details +To implement TraceZ, spans need to be collected, aggregated, and rendered on a webpage. + +For OpenTelemetry, a custom `span processor` can be made to interface with the `Tracer` API to collect spans. This span processor collects references to running spans and exports completed spans to its own memory or directly to an aggregator. An alternative to a span processor is using some sort of profiler. + +A `data aggregator` keeps a track of counts for running, error, and latency buckets for spans grouped by their name. It also samples some spans to provide users with more information. To prevent memory overload, only some spans are sampled for each bucket for each span name; for example, if that sampled span max number is set to 5, then only up to 55 pieces of span data can be kept for each span name in the aggregator (sampled_max * number of buckets = 5 * [running + error + 9 latency buckets] = 5 * 11 = 55). + +When the user visits the TraceZ endpoint, likely something similar to host:port/tracez, then the distribution of latencies for span names will be rendered. When clicking on buckets counts for a span name, additional details on individual sampled spans for that bucket would be shown. These details would include trace ID, parent ID, span ID, start time, attributes, and more depending on the type of bucket (running, error, or latency) and what's implemented/recorded in the other components. See [HTTP Server](#http-server) for more information on implementation. + +For all of these, the thread safety of all of these components needs to be taken into account. 
With a span processor, data aggregator, and HTTP server configuration, there needs to be tests that ensure correct, deterministic, and safe behavior when the different components try to access the same data structures concurrently. Additionally, the span data itself needs to be thread-safe since those fields will be accessed or copied in the aggregator and server level. + +### TraceConfigz Details +> TODO + +### RPCz Details +> TODO + +### StatsZ Details +> TODO + +## Shared zPages Components +### Wrapper +Each implementation ideally creates a wrapper class for zPages that allows users to add them all with minimal effort like an API, which could be as simple as adding 2 lines of code to include zPages. (importing zPages and initializing it). + +This wrapper class acts as an injection point, running all of the different zPages in a background thread without users needing knowledge of how they work under the hood. An example: if a language has TraceZ and RPCz implemented, then the wrapper will spin up servers for both when a user constructs a zPages class instance. Adding StatsZ would mean those developers would only add additional code to extract the needed information and display them on a page. + +### HTTP Server +All zPages have some sort of HTTP Server component to render their information on a webpage when a host:port and endpoint is accessed. + +Traditionally, zPages have approached this by rendering webpages purely on the server-side. This means it would simply resources (HTML, CSS and possibly Javascript) when the user accesses a given endpoint. Depending on the type of zPages, a pure server-side approach uses data to generate HTML pages using hardcoded strings from scratch or using a template. All zPages need some server-side rendering. + +Optionally, there could also be an API layer that translates native data structures to JSON strings for a frontend to use when designated endpoints are accessed. 
This API layer would be paired with a frontend that provides client-side functionality, which would need Javascript. The frontend Javascript would use the API by requesting information at endpoints to add updates to the HTML DOM without unnecessarily requesting and re-rendering static resources. This makes initial page loads quicker and requires no knowledge of client-side rendering. + +------------- +## Future Possibilities +### Out-process +zPages that compatible across different languages, with processing of information happening outside of applications +- Pros + - zPages can be added to any OpenTelemetry repository, and future development can be completely focused here instead +- Cons + - More complicated than using local methods, and requires extra setup (i.e. RPC communication setup) in applications to somehow send information to zPages to work + +### Shared Static Files +All HTML, CSS, and Javascript files would be used across different OT language repositories for their in-process zPages +- Pros + - When client-side features are rolled out (including filtering/sorting data, interval refreshing, unit toggles), changes are all centralized + - Rendering logic and responsibility is focused and can be more effective, zPages developers can focus on other priorities + - Less difficult to share frontend information post-setup, follows OpenTelemetry's philosophy of being standardized +- Cons + - Adds computation of converting native data structures into JSON strings and serving these static files. 
May need extra libraries + - Some process has to be created to update the static files in a repository and serving them at the correct endpoints + - Initial setup may be difficult (one way this can be achieved is with Github modules) + +- Proxy/Shim layer + + +> TODO: Link spec where possible, Add pictures, design docs links + From 26b66acd790eca269959944b853609b4cb469062 Mon Sep 17 00:00:00 2001 From: jajanet Date: Fri, 7 Aug 2020 13:32:05 -0400 Subject: [PATCH 03/17] Formatting, todo details --- experimental/zpages.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/experimental/zpages.md b/experimental/zpages.md index ea5f99a6cbe..6a84d2f3eec 100644 --- a/experimental/zpages.md +++ b/experimental/zpages.md @@ -105,8 +105,8 @@ All HTML, CSS, and Javascript files would be used across different OT language r - Some process has to be created to update the static files in a repository and serving them at the correct endpoints - Initial setup may be difficult (one way this can be achieved is with Github modules) -- Proxy/Shim layer - +### Proxy/Shim layer +> TODO -> TODO: Link spec where possible, Add pictures, design docs links +> GENERAL TODO: Link spec where possible, add pictures/figures and design docs links From d8e31eb49d4ac25033491746f5dab0c164ec6002 Mon Sep 17 00:00:00 2001 From: jajanet Date: Fri, 7 Aug 2020 13:45:14 -0400 Subject: [PATCH 04/17] Fix link --- experimental/zpages.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/experimental/zpages.md b/experimental/zpages.md index 6a84d2f3eec..0169e1a189b 100644 --- a/experimental/zpages.md +++ b/experimental/zpages.md @@ -1,7 +1,7 @@ # zPages ## Table of Contents - [Overview](#overview) - - [Types of zPages](types-of-zpages) + - [Types of zPages](#types-of-zpages) - [TraceZ](#tracez) - [TraceConfigz](#traceconfigz) - [RPCz](#rpcz) From 146420a0b99951ffdd17ee665a3a3de02311334b Mon Sep 17 00:00:00 2001 From: jajanet Date: Fri, 7 Aug 2020 13:47:16 -0400 Subject: [PATCH 
05/17] Fix design link --- experimental/zpages.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/experimental/zpages.md b/experimental/zpages.md index 0169e1a189b..19b27890b0b 100644 --- a/experimental/zpages.md +++ b/experimental/zpages.md @@ -6,7 +6,7 @@ - [TraceConfigz](#traceconfigz) - [RPCz](#rpcz) - [StatsZ](#statsz) -- [Design and Implementation](#design) +- [Design and Implementation](#design-and-implementation) - [TraceZ](#tracez-details) - [TraceConfigZ](#traceconfigz-details) - [RPCz](#rpcz-details) From e4744fcdf90aa2ba89f787e1e822202d1a230c45 Mon Sep 17 00:00:00 2001 From: jajanet Date: Fri, 7 Aug 2020 14:05:17 -0400 Subject: [PATCH 06/17] Fix some typos, add otep link --- experimental/zpages.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/experimental/zpages.md b/experimental/zpages.md index 19b27890b0b..af55ffb8938 100644 --- a/experimental/zpages.md +++ b/experimental/zpages.md @@ -21,15 +21,15 @@ ## Overview zPages are webpages that allow easy viewing of tracing/metrics information. When included for a process, zPages will display basic information about that process on a webpage. -The idea of "zPages" originates from one of OpenTelemetry's predecessors, [OpenCensus](https://opencensus.io/). You can read more about it [here](https://opencensus.io/zpages). OpenCensus has different zPage implementations in [Java](https://opencensus.io/zpages/java/), [Go](https://opencensus.io/zpages/go/), and [Node](https://opencensus.io/zpages/node/) and there has been similar internal solutions developed at companies like Uber. Within OpenTelemetry, zPages are also either developed or being developed in [C#](https://github.com/open-telemetry/opentelemetry-dotnet/tree/master/src/OpenTelemetry.Exporter.ZPages), Java, and C++. +The idea of "zPages" originates from one of OpenTelemetry's predecessors, [OpenCensus](https://opencensus.io/). 
You can read more about zPages from the OpenCensus docs [here](https://opencensus.io/zpages) or the OTEP [here](https://github.com/open-telemetry/oteps/blob/master/text/0110-z-pages.md). OpenCensus has different zPage implementations in [Java](https://opencensus.io/zpages/java/), [Go](https://opencensus.io/zpages/go/), and [Node](https://opencensus.io/zpages/node/) and there has been similar internal solutions developed at companies like Uber. Within OpenTelemetry, zPages are also either developed or being developed in [C#](https://github.com/open-telemetry/opentelemetry-dotnet/tree/master/src/OpenTelemetry.Exporter.ZPages), Java, and C++. -While zPages are uniquely useful in being more lightweight and quicker compared to installing external exporters like Jaeger and Zipkin, they still offer many useful ways to debug and gain insight into applications. The uses depend on the type of zPage, which is detailed below +While zPages are uniquely useful in being more lightweight and quicker compared to installing external exporters like Jaeger and Zipkin, they still offer many useful ways to debug and gain insight into applications. The uses depend on the type of zPage, which is detailed below. ## Types of zPages ### TraceZ TraceZ shows information on tracing, including aggregation counts for latency, running, and errors for spans grouped by name. It also allows users to look closer at details for spans that are sampled. -This type of zPage is useful particularly for debugging latency issues (slow parts of applications), deadlocks (running spans that don't end), and errors (where error happen and what types). They're also good for spotting patterns, like seeing what speeds are typical for operations with a given span name. +This type of zPage is useful particularly for debugging latency issues (slow parts of applications), deadlocks (running spans that don't end), and errors (where error happen and what types). 
They're also good for spotting patterns by showing which latency speeds are typical for operations with a given span name. You can read about TraceZ more [here](https://opencensus.io/zpages/java/#tracez). @@ -47,9 +47,9 @@ RPCz provides details on sent and received RPC messages, which is categorized b You can read about RPCz more [here](https://docs.google.com/document/d/1RWNyUIaKTYK12tck_rQjki4jTyHFfkD8sk54mjwRwso/edit#) and [here](https://opencensus.io/zpages/java/#rpcz). ### StatsZ -StatsZ is focused more on metrics, displays stats and measues for any exported views. These views are grouped into directories using their namespaces +StatsZ is focused more on metrics, as it displays stats and measues for exported views. These views are grouped into directories using their namespaces -You can read more about StatZ [here](https://opencensus.io/zpages/java/#statsz) +You can read more about StatsZ [here](https://opencensus.io/zpages/java/#statsz) ## Design and Implementation ### TraceZ Details @@ -81,7 +81,7 @@ This wrapper class acts as an injection point, running all of the different zPag ### HTTP Server All zPages have some sort of HTTP Server component to render their information on a webpage when a host:port and endpoint is accessed. -Traditionally, zPages have approached this by rendering webpages purely on the server-side. This means it would simply resources (HTML, CSS and possibly Javascript) when the user accesses a given endpoint. Depending on the type of zPages, a pure server-side approach uses data to generate HTML pages using hardcoded strings from scratch or using a template. All zPages need some server-side rendering. +Traditionally, zPages have approached this by rendering webpages purely on the server-side. This means the server would only serve statuc resources (HTML, CSS and possibly Javascript) when the user accesses a given endpoint. 
Based on the type of zPage and the server language used, a pure server-side approach would generate HTML pages using hardcoded strings from scratch or using a template. All zPages need some server-side rendering. Optionally, there could also be an API layer that translates native data structures to JSON strings for a frontend to use when designated endpoints are accessed. This API layer would be paired with a frontend that provides client-side functionality, which would need Javascript. The frontend Javascript would use the API by requesting information at endpoints to add updates to the HTML DOM without unnecessarily requesting and re-rendering static resources. This makes initial page loads quicker and requires no knowledge of client-side rendering. From 83f2aad34f4dc497fdad5f144e1d2f15170769f5 Mon Sep 17 00:00:00 2001 From: jajanet Date: Fri, 7 Aug 2020 14:26:41 -0400 Subject: [PATCH 07/17] OT->OTel, other detail cleanup --- experimental/zpages.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/experimental/zpages.md b/experimental/zpages.md index af55ffb8938..7bb1c20f294 100644 --- a/experimental/zpages.md +++ b/experimental/zpages.md @@ -19,7 +19,7 @@ - [Shared Static Files](#shared-static-files) ## Overview -zPages are webpages that allow easy viewing of tracing/metrics information. When included for a process, zPages will display basic information about that process on a webpage. +zPages are an alternative to external exporters, calculating and serving in-process webpages that show various tracing and metrics information when included. The idea of "zPages" originates from one of OpenTelemetry's predecessors, [OpenCensus](https://opencensus.io/). You can read more about zPages from the OpenCensus docs [here](https://opencensus.io/zpages) or the OTEP [here](https://github.com/open-telemetry/oteps/blob/master/text/0110-z-pages.md). 
OpenCensus has different zPage implementations in [Java](https://opencensus.io/zpages/java/), [Go](https://opencensus.io/zpages/go/), and [Node](https://opencensus.io/zpages/node/) and there has been similar internal solutions developed at companies like Uber. Within OpenTelemetry, zPages are also either developed or being developed in [C#](https://github.com/open-telemetry/opentelemetry-dotnet/tree/master/src/OpenTelemetry.Exporter.ZPages), Java, and C++. @@ -95,7 +95,7 @@ zPages that compatible across different languages, with processing of informatio - More complicated than using local methods, and requires extra setup (i.e. RPC communication setup) in applications to somehow send information to zPages to work ### Shared Static Files -All HTML, CSS, and Javascript files would be used across different OT language repositories for their in-process zPages +All HTML, CSS, and Javascript files would be used across different OTel language repositories for their in-process zPages - Pros - When client-side features are rolled out (including filtering/sorting data, interval refreshing, unit toggles), changes are all centralized - Rendering logic and responsibility is focused and can be more effective, zPages developers can focus on other priorities From 628c50aa93733fa5a47f0f53fa339d72aa69ac86 Mon Sep 17 00:00:00 2001 From: Janet Vu Date: Tue, 11 Aug 2020 20:37:45 -0400 Subject: [PATCH 08/17] Define zpages deadlock more accuractely Co-authored-by: Sergey Kanzhelev --- experimental/zpages.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/experimental/zpages.md b/experimental/zpages.md index 7bb1c20f294..c6048004bc2 100644 --- a/experimental/zpages.md +++ b/experimental/zpages.md @@ -29,7 +29,7 @@ While zPages are uniquely useful in being more lightweight and quicker compared ### TraceZ TraceZ shows information on tracing, including aggregation counts for latency, running, and errors for spans grouped by name. 
It also allows users to look closer at details for spans that are sampled. -This type of zPage is useful particularly for debugging latency issues (slow parts of applications), deadlocks (running spans that don't end), and errors (where error happen and what types). They're also good for spotting patterns by showing which latency speeds are typical for operations with a given span name. +This type of zPage is useful particularly for debugging latency issues (slow parts of applications), deadlocks and instrumentation problems (e.g. running spans that don't end), and errors (where error happen and what types). They're also good for spotting patterns by showing which latency speeds are typical for operations with a given span name. You can read about TraceZ more [here](https://opencensus.io/zpages/java/#tracez). @@ -109,4 +109,3 @@ All HTML, CSS, and Javascript files would be used across different OTel language > TODO > GENERAL TODO: Link spec where possible, add pictures/figures and design docs links - From 91d29bc71b9f66d1ef7ea8cb1a100c07610f2649 Mon Sep 17 00:00:00 2001 From: Janet Vu Date: Tue, 11 Aug 2020 21:00:42 -0400 Subject: [PATCH 09/17] Add suggestions and clarifications --- experimental/zpages.md | 47 +++++++++++++++++++++--------------------- 1 file changed, 24 insertions(+), 23 deletions(-) diff --git a/experimental/zpages.md b/experimental/zpages.md index c6048004bc2..e35b296be2b 100644 --- a/experimental/zpages.md +++ b/experimental/zpages.md @@ -19,37 +19,30 @@ - [Shared Static Files](#shared-static-files) ## Overview -zPages are an alternative to external exporters, calculating and serving in-process webpages that show various tracing and metrics information when included. +zPages are an in-process alternative to external exporters. When included, they collect and aggregate tracing and metrics information in the background; this data is served on webpages when requested. 
The idea of "zPages" originates from one of OpenTelemetry's predecessors, [OpenCensus](https://opencensus.io/). You can read more about zPages from the OpenCensus docs [here](https://opencensus.io/zpages) or the OTEP [here](https://github.com/open-telemetry/oteps/blob/master/text/0110-z-pages.md). OpenCensus has different zPage implementations in [Java](https://opencensus.io/zpages/java/), [Go](https://opencensus.io/zpages/go/), and [Node](https://opencensus.io/zpages/node/) and there has been similar internal solutions developed at companies like Uber. Within OpenTelemetry, zPages are also either developed or being developed in [C#](https://github.com/open-telemetry/opentelemetry-dotnet/tree/master/src/OpenTelemetry.Exporter.ZPages), Java, and C++. -While zPages are uniquely useful in being more lightweight and quicker compared to installing external exporters like Jaeger and Zipkin, they still offer many useful ways to debug and gain insight into applications. The uses depend on the type of zPage, which is detailed below. +zPages are uniquely useful in a couple of different ways. One is that they're more lightweight and quicker compared to installing external tracing systems like Jaeger and Zipkin, yet they still share useful ways to debug and gain insight into instrumented applications; these uses depend on the type of zPage, which is detailed below. For high throughput applications, zPages can also analyze more telemetry with the limited set of supported scenarios than external exporters; this is because zPages are in-memory while external exporters are typically configured to send subset of telemetry for reach analysis to save costs. ## Types of zPages ### TraceZ -TraceZ shows information on tracing, including aggregation counts for latency, running, and errors for spans grouped by name. It also allows users to look closer at details for spans that are sampled. 
+TraceZ shows information on tracing, including aggregation counts for latency, running, and errors for spans grouped by name. In addition to these counts, TraceZ also keeps samples for error, running, and each of the latency buckets for a given span name to allow users to look closer at span fields. This is particularly useful compared to external exporters, which would otherwise likely sample them out. -This type of zPage is useful particularly for debugging latency issues (slow parts of applications), deadlocks and instrumentation problems (e.g. running spans that don't end), and errors (where error happen and what types). They're also good for spotting patterns by showing which latency speeds are typical for operations with a given span name. - -You can read about TraceZ more [here](https://opencensus.io/zpages/java/#tracez). +This zPage is particularly useful for debugging latency issues (slow parts of applications), deadlocks and instrumentation problems (running spans that don't end), and errors (where errors happen and what types). It's also good for spotting patterns by showing which latency speeds are typical for operations with a given span name. ### TraceConfigZ -TraceConfigZ is closely related to TraceZ, allowing the user to modify how spans are sampled or how much data to keep in TraceZ by updating the TraceZ components accordingly. - -For example, the sampling probability can be increased, decreased, or customized in other ways (i.e. depending on span parentage). Number of kept attritubtes, annotations, events, and links can also be adjusted. +TraceConfigZ is closely related to and requires implementation of TraceZ, allowing the user to modify how spans are sampled or how much data to keep in TraceZ by updating the TraceZ components accordingly. -You can read about TraceConfigZ more [here](https://opencensus.io/zpages/java/#traceconfigz). +For example, the sampling probability can be increased, decreased, or customized in other ways (e.g.
depending on span parentage). The number of kept attributes, annotations, events, and links can also be adjusted. This would be useful for users that want to more accurately capture span insights or to scale better for exceptionally large and complex applications. ### RPCz RPCz provides details on sent and received RPC messages, which is categorized by RPC methods. This includes overall and error counts, average latency per call, RPCs sent per second, and input/output size per second. -You can read about RPCz more [here](https://docs.google.com/document/d/1RWNyUIaKTYK12tck_rQjki4jTyHFfkD8sk54mjwRwso/edit#) and [here](https://opencensus.io/zpages/java/#rpcz). - ### StatsZ -StatsZ is focused more on metrics, as it displays stats and measues for exported views. These views are grouped into directories using their namespaces +StatsZ is focused on metrics, as it displays metrics and measures for exported views. These views are grouped into directories using their namespaces. -You can read more about StatsZ [here](https://opencensus.io/zpages/java/#statsz) ## Design and Implementation ### TraceZ Details @@ -57,9 +50,9 @@ To implement TraceZ, spans need to be collected, aggregated, and rendered on a w For OpenTelemetry, a custom `span processor` can be made to interface with the `Tracer` API to collect spans. This span processor collects references to running spans and exports completed spans to its own memory or directly to an aggregator. An alternative to a span processor is using some sort of profiler. -A `data aggregator` keeps a track of counts for running, error, and latency buckets for spans grouped by their name. It also samples some spans to provide users with more information.
To prevent memory overload, only some spans are sampled for each bucket for each span name; for example, if that sampled span max number is set to 5, then only up to 55 pieces of span data can be kept for each span name in the aggregator (sampled_max * number of buckets = 5 * [running + error + 9 latency buckets] = 5 * 11 = 55). +A `data aggregator` keeps track of counts for running, error, and latency buckets for spans grouped by their name. It also samples some spans to provide users with more information. To prevent memory overload, only some spans are sampled for each bucket for each span name; for example, if that sampled span max number is set to 5, then only up to 55 pieces of span data can be kept for each span name in the aggregator (sampled_max * number of buckets = 5 * [running + error + 9 latency buckets] = 5 * 11 = 55). -When the user visits the TraceZ endpoint, likely something similar to host:port/tracez, then the distribution of latencies for span names will be rendered. When clicking on buckets counts for a span name, additional details on individual sampled spans for that bucket would be shown. These details would include trace ID, parent ID, span ID, start time, attributes, and more depending on the type of bucket (running, error, or latency) and what's implemented/recorded in the other components. See [HTTP Server](#http-server) for more information on implementation. +When the user visits the TraceZ endpoint, likely something similar to `host:port/tracez`, then the distribution of latencies for span names will be rendered. When clicking on buckets counts for a span name, additional details on individual sampled spans for that bucket would be shown. These details would include trace ID, parent ID, span ID, start time, attributes, and more depending on the type of bucket (running, error, or latency) and what's implemented/recorded in the other components. See [HTTP Server](#http-server) for more information on implementation. 
For all of these, the thread safety of all of these components needs to be taken into account. With a span processor, data aggregator, and HTTP server configuration, there needs to be tests that ensure correct, deterministic, and safe behavior when the different components try to access the same data structures concurrently. Additionally, the span data itself needs to be thread-safe since those fields will be accessed or copied in the aggregator and server level. @@ -74,25 +67,33 @@ For all of these, the thread safety of all of these components needs to be taken ## Shared zPages Components ### Wrapper -Each implementation ideally creates a wrapper class for zPages that allows users to add them all with minimal effort like an API, which could be as simple as adding 2 lines of code to include zPages. (importing zPages and initializing it). +A zPages wrapper class acts as an API or injection point for zPages, instantiating and running all of the different zPages in the background without users needing knowledge of how they work under the hood. + +An example of what happens when a user includes a wrapper: if OTel Python has TraceZ and RPCz implemented and added to that wrapper, that wrapper will create instances of all the needed components (processors, aggregators, etc) for both zPages when zPages is initialized. If or when other zPages are added to OTel Python, developers adding them would only need to add the corresponding initialization code for those components in the wrapper. -This wrapper class acts as an injection point, running all of the different zPages in a background thread without users needing knowledge of how they work under the hood. An example: if a language has TraceZ and RPCz implemented, then the wrapper will spin up servers for both when a user constructs a zPages class instance. Adding StatsZ would mean those developers would only add additional code to extract the needed information and display them on a page. 
+Each zPages implementation ideally creates a wrapper class for zPages, since they would allow users to add all zPages with minimal effort. These wrappers should be as simple as adding 2 lines of code to include zPages (zPages import + initialization line). ### HTTP Server All zPages have some sort of HTTP Server component to render their information on a webpage when a host:port and endpoint is accessed. -Traditionally, zPages have approached this by rendering webpages purely on the server-side. This means the server would only serve statuc resources (HTML, CSS and possibly Javascript) when the user accesses a given endpoint. Based on the type of zPage and the server language used, a pure server-side approach would generate HTML pages using hardcoded strings from scratch or using a template. All zPages need some server-side rendering. +Traditionally, zPages have approached this by rendering webpages purely on the server-side. This means the server would only serve static resources (HTML, CSS and possibly Javascript) when the user accesses a given endpoint. Based on the type of zPage and the server language used, a pure server-side approach would generate HTML pages using hardcoded strings from scratch or using a template; this would tightly couple the data and UI layer. + +All zPages need some server-side rendering, but the data and UI layer could optionally be separated by adding client-side functionality. + +Instead of directly translating native data structures to HTML strings based on the stored information, the data layer would do 2 things depending on the webpage endpoint accessed: 1. Serve the static HTML, JS, and CSS files, which are consistent, not server generated, and not data dependent and 2. Act like a web/HTTP API by translating stored data to JSON strings. Whether the data layer does one or the other depends on which URL endpoint is accessed; the former is intended for the initial zPages load, and former for user interactions. 
+ +The UI/frontend/rendering layer is the HTML, CSS, and Javascript itself, in contrast to the logic to serve those files. This frontend uses the data layer's API on the client-side within the browser with Javascript by accessing certain endpoints depending on the user's actions. The data returned interacts with the Javascript, which determines and executes the logic necessary to render update to the HTML DOM. Modifying the HTML DOM means there are no unnecessary requesting and re-rendering static files, and only parts of the webpage are changed. This makes subsequent data queries quicker and requires no knowledge of client-side rendering for the zPages developer. -Optionally, there could also be an API layer that translates native data structures to JSON strings for a frontend to use when designated endpoints are accessed. This API layer would be paired with a frontend that provides client-side functionality, which would need Javascript. The frontend Javascript would use the API by requesting information at endpoints to add updates to the HTML DOM without unnecessarily requesting and re-rendering static resources. This makes initial page loads quicker and requires no knowledge of client-side rendering. +In either case, a benefit of reasoning about the zPages HTTP server as a separate component means that zPages can be mounted in a existing server. For example, this can be done in Java by calling this zPage logic from a servlet. ------------- ## Future Possibilities ### Out-process -zPages that compatible across different languages, with processing of information happening outside of applications +Out-process zPages are ones that are compatible across different languages, which executes process of tracing and metrics information outside of applications (as opposed to in-process zPages). 
- Pros - - zPages can be added to any OpenTelemetry repository, and future development can be completely focused here instead + - zPages can be added to any OpenTelemetry repository, and future development can be completely focused there - Cons - - More complicated than using local methods, and requires extra setup (i.e. RPC communication setup) in applications to somehow send information to zPages to work + - More complicated than using local methods since it require setup (i.e. RPC or exporters) to allow zPages and applications to communication to work. This would make it similar to other tracing systems. ### Shared Static Files All HTML, CSS, and Javascript files would be used across different OTel language repositories for their in-process zPages From 385485b09772217779f9e974f9af52a1a2a391b0 Mon Sep 17 00:00:00 2001 From: Janet Vu Date: Tue, 11 Aug 2020 21:23:30 -0400 Subject: [PATCH 10/17] Fix typos and grammar errors --- experimental/zpages.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/experimental/zpages.md b/experimental/zpages.md index e35b296be2b..dbfcf8f0299 100644 --- a/experimental/zpages.md +++ b/experimental/zpages.md @@ -27,7 +27,7 @@ zPages are uniquely useful in a couple of different ways. One is that they're mo ## Types of zPages ### TraceZ -TraceZ shows information on tracing, including aggregation counts for latency, running, and errors for spans grouped by name. In addition to these counts, TraceZ also keeps samples for for error, running, and each of the latency buckets for a given span names to allows users to look closer at span fields. This is particularlu useful compared to external exporters, which would otherwise likely sample them out. +TraceZ shows information on tracing, including aggregation counts for latency, running, and errors for spans grouped by name. 
In addition to these counts, TraceZ also keeps samples for for error, running, and each of the latency buckets for given span names to allow users to look closer at span fields. This is particularly useful compared to external exporters, which would otherwise likely sample them out. This zPage is particularly useful for debugging latency issues (slow parts of applications), deadlocks and instrumentation problems (running spans that don't end), and errors (where error happen and what types). They're also good for spotting patterns by showing which latency speeds are typical for operations with a given span name. @@ -35,7 +35,7 @@ This zPage is particularly useful for debugging latency issues (slow parts of ap TraceConfigZ is closely related to and requires implementation of TraceZ, allowing the user to modify how spans are sampled or how much data to keep in TraceZ by updating the TraceZ components accordingly. -For example, the sampling probability can be increased, decreased, or customized in other ways (i.e. depending on span parentage). Number of kept attritubtes, annotations, events, and links can also be adjusted. This would be useful for users that want to more accurately capture span insights or allow scaling better for exceptionally large and complex applications. +For example, the sampling probability can be increased, decreased, or customized in other ways (i.e. depending on span parentage). Number of kept attributes, annotations, events, and links can also be adjusted. This would be useful for users that want to more accurately capture span insights or allow scaling better for exceptionally large and complex applications. ### RPCz RPCz provides details on sent and received RPC messages, which is categorized by RPC methods. This includes overall and error counts, average latency per call, RPCs sent per second, and input/output size per second. 
@@ -80,20 +80,20 @@ Traditionally, zPages have approached this by rendering webpages purely on the s All zPages need some server-side rendering, but the data and UI layer could optionally be separated by adding client-side functionality. -Instead of directly translating native data structures to HTML strings based on the stored information, the data layer would do 2 things depending on the webpage endpoint accessed: 1. Serve the static HTML, JS, and CSS files, which are consistent, not server generated, and not data dependent and 2. Act like a web/HTTP API by translating stored data to JSON strings. Whether the data layer does one or the other depends on which URL endpoint is accessed; the former is intended for the initial zPages load, and former for user interactions. +Instead of directly translating native data structures to HTML strings based on the stored information, the data layer would do 2 things depending on the webpage endpoint accessed: 1. Serve the static HTML, JS, and CSS files, which are consistent, not server generated, and not data dependent and 2. Act like a web/HTTP API by translating stored data to JSON strings. Whether the data layer does one or the other depends on which URL endpoint is accessed; the former is intended for the initial zPages load, and latter for user interactions. The UI/frontend/rendering layer is the HTML, CSS, and Javascript itself, in contrast to the logic to serve those files. This frontend uses the data layer's API on the client-side within the browser with Javascript by accessing certain endpoints depending on the user's actions. The data returned interacts with the Javascript, which determines and executes the logic necessary to render update to the HTML DOM. Modifying the HTML DOM means there are no unnecessary requesting and re-rendering static files, and only parts of the webpage are changed. This makes subsequent data queries quicker and requires no knowledge of client-side rendering for the zPages developer. 
-In either case, a benefit of reasoning about the zPages HTTP server as a separate component means that zPages can be mounted in a existing server. For example, this can be done in Java by calling this zPage logic from a servlet. +In either case, a benefit of reasoning about the zPages HTTP server as a separate component means that zPages can be mounted in an existing server. For example, this can be done in Java by calling this zPage logic from a servlet. ------------- ## Future Possibilities ### Out-process -Out-process zPages are ones that are compatible across different languages, which executes process of tracing and metrics information outside of applications (as opposed to in-process zPages). +Out-process zPages are ones that are compatible across different languages, which executes the processing of tracing and metrics information outside of applications (as opposed to in-process zPages). - Pros - zPages can be added to any OpenTelemetry repository, and future development can be completely focused there - Cons - - More complicated than using local methods since it require setup (i.e. RPC or exporters) to allow zPages and applications to communication to work. This would make it similar to other tracing systems. + - More complicated than using local methods since it require setup (i.e. RPC or exporters) to allow zPages and applications to communicate. This would make it similar to other tracing systems. 
### Shared Static Files All HTML, CSS, and Javascript files would be used across different OTel language repositories for their in-process zPages From 92ff8c9e6240b522eb9a4422d10f79be0b7d9c41 Mon Sep 17 00:00:00 2001 From: Janet Vu Date: Tue, 11 Aug 2020 21:27:24 -0400 Subject: [PATCH 11/17] Update tracez sentence flow --- experimental/zpages.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/experimental/zpages.md b/experimental/zpages.md index dbfcf8f0299..0973b0c802e 100644 --- a/experimental/zpages.md +++ b/experimental/zpages.md @@ -27,7 +27,7 @@ zPages are uniquely useful in a couple of different ways. One is that they're mo ## Types of zPages ### TraceZ -TraceZ shows information on tracing, including aggregation counts for latency, running, and errors for spans grouped by name. In addition to these counts, TraceZ also keeps samples for for error, running, and each of the latency buckets for given span names to allow users to look closer at span fields. This is particularly useful compared to external exporters, which would otherwise likely sample them out. +TraceZ shows information on tracing, including aggregation counts for latency, running, and errors for spans grouped by name. In addition to these counts, TraceZ also keeps a set number of samples for error, running, and latency (including within each duration bucket) spans for each span name to allow users to look closer at span fields. This is particularly useful compared to external exporters that would otherwise likely sample them out. This zPage is particularly useful for debugging latency issues (slow parts of applications), deadlocks and instrumentation problems (running spans that don't end), and errors (where error happen and what types). They're also good for spotting patterns by showing which latency speeds are typical for operations with a given span name. 
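Note on the TraceZ text in the patch above: the behaviour it describes (per-span-name counts for each bucket, plus a capped number of sampled spans per bucket) could be sketched roughly as follows. This is only an illustration, not code from any OpenTelemetry repository. The names `SpanNameStats`, `Aggregator`, `on_end`, and `max_samples_per_name` are hypothetical, and the latency boundaries are assumed values; the spec text only states that there are nine latency buckets alongside the running and error buckets. Running spans are left out of the recording logic to keep the sketch short.

```python
from bisect import bisect_right
from collections import defaultdict, deque

# Assumed lower bounds (seconds) for the nine latency buckets; the spec
# text gives the number of buckets but not their boundaries.
LATENCY_BOUNDS = [0, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]
BUCKETS = ["running", "error"] + [
    f"latency_{i}" for i in range(len(LATENCY_BOUNDS))
]
SAMPLED_MAX = 5  # example value used in the spec text


class SpanNameStats:
    """Counts and bounded samples for a single span name (hypothetical)."""

    def __init__(self, sampled_max=SAMPLED_MAX):
        self.counts = {bucket: 0 for bucket in BUCKETS}
        # deque(maxlen=...) silently drops the oldest sample when full,
        # which keeps memory bounded per bucket.
        self.samples = {bucket: deque(maxlen=sampled_max) for bucket in BUCKETS}

    def record(self, bucket, span):
        self.counts[bucket] += 1
        self.samples[bucket].append(span)


class Aggregator:
    """Groups completed spans by name (hypothetical, not thread-safe)."""

    def __init__(self, sampled_max=SAMPLED_MAX):
        self.by_name = defaultdict(lambda: SpanNameStats(sampled_max))

    def on_end(self, name, duration_s, is_error, span):
        if is_error:
            bucket = "error"
        else:
            # Index of the last boundary that is <= duration, clamped at 0.
            idx = max(0, bisect_right(LATENCY_BOUNDS, duration_s) - 1)
            bucket = f"latency_{idx}"
        self.by_name[name].record(bucket, span)


def max_samples_per_name(sampled_max=SAMPLED_MAX):
    # sampled_max * (running + error + 9 latency buckets)
    return sampled_max * len(BUCKETS)
```

With `SAMPLED_MAX = 5` and 11 buckets, `max_samples_per_name()` gives the same bound of 55 sampled spans per span name that the spec text works out. A real implementation would also need the locking that the thread-safety paragraph of the spec calls for.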
From a68bfe9d21cd0594be616518bf47d6eb71fbf28e Mon Sep 17 00:00:00 2001 From: Janet Vu Date: Wed, 12 Aug 2020 08:53:44 -0400 Subject: [PATCH 12/17] Fix grammar issues, mention OTel collector zpages --- experimental/zpages.md | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/experimental/zpages.md b/experimental/zpages.md index 0973b0c802e..645677a3c7b 100644 --- a/experimental/zpages.md +++ b/experimental/zpages.md @@ -19,17 +19,17 @@ - [Shared Static Files](#shared-static-files) ## Overview -zPages are an in-process alternative to external exporters. When included, they collect and aggregate tracing and metrics information in the background; this data is served on webpages when requested. +zPages are an in-process alternative to external exporters. When included, they collect and aggregate tracing and metrics information in the background; this data is served on web pages when requested. -The idea of "zPages" originates from one of OpenTelemetry's predecessors, [OpenCensus](https://opencensus.io/). You can read more about zPages from the OpenCensus docs [here](https://opencensus.io/zpages) or the OTEP [here](https://github.com/open-telemetry/oteps/blob/master/text/0110-z-pages.md). OpenCensus has different zPage implementations in [Java](https://opencensus.io/zpages/java/), [Go](https://opencensus.io/zpages/go/), and [Node](https://opencensus.io/zpages/node/) and there has been similar internal solutions developed at companies like Uber. Within OpenTelemetry, zPages are also either developed or being developed in [C#](https://github.com/open-telemetry/opentelemetry-dotnet/tree/master/src/OpenTelemetry.Exporter.ZPages), Java, and C++. +The idea of "zPages" originates from one of OpenTelemetry's predecessors, [OpenCensus](https://opencensus.io/). You can read more about zPages from the OpenCensus docs [here](https://opencensus.io/zpages) or the OTEP [here](https://github.com/open-telemetry/oteps/blob/master/text/0110-z-pages.md). 
OpenCensus has different zPage implementations in [Java](https://opencensus.io/zpages/java/), [Go](https://opencensus.io/zpages/go/), and [Node](https://opencensus.io/zpages/node/) and there has been similar internal solutions developed at companies like Uber. Within OpenTelemetry, zPages are also either developed or being developed in [C#](https://github.com/open-telemetry/opentelemetry-dotnet/tree/master/src/OpenTelemetry.Exporter.ZPages), Java, and C++. The OTel Collector also has [an implementation](https://github.com/open-telemetry/opentelemetry-collector/tree/master/extension/zpagesextension) of zPages. -zPages are uniquely useful in a couple of different ways. One is that they're more lightweight and quicker compared to installing external tracing systems like Jaeger and Zipkin, yet they still share useful ways to debug and gain insight into instrumented applications; these uses depend on the type of zPage, which is detailed below. For high throughput applications, zPages can also analyze more telemetry with the limited set of supported scenarios than external exporters; this is because zPages are in-memory while external exporters are typically configured to send subset of telemetry for reach analysis to save costs. +zPages are uniquely useful in a couple of different ways. One is that they're more lightweight and quicker compared to installing external tracing systems like Jaeger and Zipkin, yet they still share useful ways to debug and gain insight into instrumented applications; these uses depend on the type of zPage, which is detailed below. For high throughput applications, zPages can also analyze more telemetry with the limited set of supported scenarios than external exporters; this is because zPages are in-memory while external exporters are typically configured to send a subset of telemetry for reach analysis to save costs. 
## Types of zPages ### TraceZ TraceZ shows information on tracing, including aggregation counts for latency, running, and errors for spans grouped by name. In addition to these counts, TraceZ also keeps a set number of samples for error, running, and latency (including within each duration bucket) spans for each span name to allow users to look closer at span fields. This is particularly useful compared to external exporters that would otherwise likely sample them out. -This zPage is particularly useful for debugging latency issues (slow parts of applications), deadlocks and instrumentation problems (running spans that don't end), and errors (where error happen and what types). They're also good for spotting patterns by showing which latency speeds are typical for operations with a given span name. +This zPage is also useful for debugging latency issues (slow parts of applications), deadlocks and instrumentation problems (running spans that don't end), and errors (where errors happen and what types). They're also good for spotting patterns by showing which latency speeds are typical for operations with a given span name. ### TraceConfigZ @@ -76,13 +76,13 @@ Each zPages implementation ideally creates a wrapper class for zPages, since the ### HTTP Server All zPages have some sort of HTTP Server component to render their information on a webpage when a host:port and endpoint is accessed. -Traditionally, zPages have approached this by rendering webpages purely on the server-side. This means the server would only serve static resources (HTML, CSS and possibly Javascript) when the user accesses a given endpoint. Based on the type of zPage and the server language used, a pure server-side approach would generate HTML pages using hardcoded strings from scratch or using a template; this would tightly couple the data and UI layer. +Traditionally, zPages have approached this by rendering web pages purely on the server-side. 
This means the server would only serve static resources (HTML, CSS and possibly Javascript) when the user accesses a given endpoint. Based on the type of zPage and the server language used, a pure server-side approach would generate HTML pages using hardcoded strings from scratch or using a template; this would tightly couple the data and UI layer. All zPages need some server-side rendering, but the data and UI layer could optionally be separated by adding client-side functionality. Instead of directly translating native data structures to HTML strings based on the stored information, the data layer would do 2 things depending on the webpage endpoint accessed: 1. Serve the static HTML, JS, and CSS files, which are consistent, not server generated, and not data dependent and 2. Act like a web/HTTP API by translating stored data to JSON strings. Whether the data layer does one or the other depends on which URL endpoint is accessed; the former is intended for the initial zPages load, and latter for user interactions. -The UI/frontend/rendering layer is the HTML, CSS, and Javascript itself, in contrast to the logic to serve those files. This frontend uses the data layer's API on the client-side within the browser with Javascript by accessing certain endpoints depending on the user's actions. The data returned interacts with the Javascript, which determines and executes the logic necessary to render update to the HTML DOM. Modifying the HTML DOM means there are no unnecessary requesting and re-rendering static files, and only parts of the webpage are changed. This makes subsequent data queries quicker and requires no knowledge of client-side rendering for the zPages developer. +The UI/frontend/rendering layer is the HTML, CSS, and Javascript itself, in contrast to the logic to serve those files. This frontend uses the data layer's API on the client-side within the browser with Javascript by accessing certain endpoints depending on the user's actions. 
The data returned interacts with the Javascript, which determines and executes the logic necessary to render updates to the HTML DOM. Modifying the HTML DOM means there are no unnecessary requesting and re-rendering static files, and only parts of the webpage are changed. This makes subsequent data queries quicker and requires no knowledge of client-side rendering for the zPages developer. In either case, a benefit of reasoning about the zPages HTTP server as a separate component means that zPages can be mounted in an existing server. For example, this can be done in Java by calling this zPage logic from a servlet. @@ -110,3 +110,6 @@ All HTML, CSS, and Javascript files would be used across different OTel language > TODO > GENERAL TODO: Link spec where possible, add pictures/figures and design docs links + + + From dfeef42cdf16bf7592a1d0393c4f1e0510aca220 Mon Sep 17 00:00:00 2001 From: Janet Vu Date: Mon, 17 Aug 2020 12:23:50 -0400 Subject: [PATCH 13/17] Add security concerns, data url endpoints TODO --- experimental/zpages.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/experimental/zpages.md b/experimental/zpages.md index 645677a3c7b..cface0416b2 100644 --- a/experimental/zpages.md +++ b/experimental/zpages.md @@ -80,11 +80,12 @@ Traditionally, zPages have approached this by rendering web pages purely on the All zPages need some server-side rendering, but the data and UI layer could optionally be separated by adding client-side functionality. -Instead of directly translating native data structures to HTML strings based on the stored information, the data layer would do 2 things depending on the webpage endpoint accessed: 1. Serve the static HTML, JS, and CSS files, which are consistent, not server generated, and not data dependent and 2. Act like a web/HTTP API by translating stored data to JSON strings. 
Whether the data layer does one or the other depends on which URL endpoint is accessed; the former is intended for the initial zPages load, and latter for user interactions. +Instead of directly translating native data structures to HTML strings based on the stored information, the data layer would do 2 things depending on the webpage endpoint accessed: 1. Serve the static HTML, JS, and CSS files, which are consistent, not server generated, and not data dependent and 2. Act like a web REST API by translating stored data to a web-compatible string format (e.g. JSON). Whether the data layer does one or the other depends on which URL endpoint is accessed; the former is intended for the initial zPages load, and latter for user interactions. This separation also allows users to access zPages data, such as when using wget on the latter endpoints, without HTML/CSS.Javascript. +> TODO: data endpoints for serving zPage data should be standardized and documented here (i.e. URL and data formatting, required/optional parameters) The UI/frontend/rendering layer is the HTML, CSS, and Javascript itself, in contrast to the logic to serve those files. This frontend uses the data layer's API on the client-side within the browser with Javascript by accessing certain endpoints depending on the user's actions. The data returned interacts with the Javascript, which determines and executes the logic necessary to render updates to the HTML DOM. Modifying the HTML DOM means there are no unnecessary requesting and re-rendering static files, and only parts of the webpage are changed. This makes subsequent data queries quicker and requires no knowledge of client-side rendering for the zPages developer. -In either case, a benefit of reasoning about the zPages HTTP server as a separate component means that zPages can be mounted in an existing server. For example, this can be done in Java by calling this zPage logic from a servlet. 
+In either case, a benefit of reasoning about the zPages HTTP server as a separate component means that zPages can be mounted in an existing server. For example, this can be done in Java by calling zPages logic from a servlet. It's also worth noting that having zPages in an embedded HTTP server increases the vulnerability of application by increasing its attack surface area. A malicious actor could potentially read sensistive data in telemetry, perform DOS attacks on the HTTP server, or initiate a telemetry storm by reconfiguring how telemetry is collected (i.e. through TraceConfigZ); zPages should be reserved for protected dev environments for most cases because of this. ------------- ## Future Possibilities @@ -110,6 +111,3 @@ All HTML, CSS, and Javascript files would be used across different OTel language > TODO > GENERAL TODO: Link spec where possible, add pictures/figures and design docs links - - - From d7587d88900ff8ef25f11b0840e50200ce393298 Mon Sep 17 00:00:00 2001 From: Janet Vu Date: Mon, 17 Aug 2020 12:28:21 -0400 Subject: [PATCH 14/17] Fix typo, uppercasing --- experimental/zpages.md | 134 ++++++++++++++++++++--------------------- 1 file changed, 67 insertions(+), 67 deletions(-) diff --git a/experimental/zpages.md b/experimental/zpages.md index cface0416b2..375f9edc2d8 100644 --- a/experimental/zpages.md +++ b/experimental/zpages.md @@ -1,113 +1,113 @@ # zPages ## Table of Contents - [Overview](#overview) - - [Types of zPages](#types-of-zpages) - - [TraceZ](#tracez) - - [TraceConfigz](#traceconfigz) - - [RPCz](#rpcz) - - [StatsZ](#statsz) + - [Types of zPages](#types-of-zpages) + - [Tracez](#tracez) + - [TraceConfigz](#traceconfigz) + - [RPCz](#rpcz) + - [Statsz](#statsz) - [Design and Implementation](#design-and-implementation) - - [TraceZ](#tracez-details) - - [TraceConfigZ](#traceconfigz-details) - - [RPCz](#rpcz-details) - - [StatsZ](#statsz-details) - - [Shared zPages Components](#shared-zpages-components) - - [Wrapper](#wrapper) - - [HTTP 
Server](#http-server) + - [Tracez](#tracez-details) + - [TraceConfigz](#traceconfigz-details) + - [RPCz](#rpcz-details) + - [Statsz](#statsz-details) + - [Shared zPages Components](#shared-zpages-components) + - [Wrapper](#wrapper) + - [HTTP Server](#http-server) - [Future possibilities / Exploration](#future-possibilities) - - [Out-process Implementation](#out-process) - - [Shared Static Files](#shared-static-files) - + - [Out-process Implementation](#out-process) + - [Shared Static Files](#shared-static-files) + ## Overview zPages are an in-process alternative to external exporters. When included, they collect and aggregate tracing and metrics information in the background; this data is served on web pages when requested. - + The idea of "zPages" originates from one of OpenTelemetry's predecessors, [OpenCensus](https://opencensus.io/). You can read more about zPages from the OpenCensus docs [here](https://opencensus.io/zpages) or the OTEP [here](https://github.com/open-telemetry/oteps/blob/master/text/0110-z-pages.md). OpenCensus has different zPage implementations in [Java](https://opencensus.io/zpages/java/), [Go](https://opencensus.io/zpages/go/), and [Node](https://opencensus.io/zpages/node/) and there has been similar internal solutions developed at companies like Uber. Within OpenTelemetry, zPages are also either developed or being developed in [C#](https://github.com/open-telemetry/opentelemetry-dotnet/tree/master/src/OpenTelemetry.Exporter.ZPages), Java, and C++. The OTel Collector also has [an implementation](https://github.com/open-telemetry/opentelemetry-collector/tree/master/extension/zpagesextension) of zPages. - + zPages are uniquely useful in a couple of different ways. One is that they're more lightweight and quicker compared to installing external tracing systems like Jaeger and Zipkin, yet they still share useful ways to debug and gain insight into instrumented applications; these uses depend on the type of zPage, which is detailed below. 
For high throughput applications, zPages can also analyze more telemetry with the limited set of supported scenarios than external exporters; this is because zPages are in-memory while external exporters are typically configured to send a subset of telemetry for reach analysis to save costs. - + ## Types of zPages -### TraceZ -TraceZ shows information on tracing, including aggregation counts for latency, running, and errors for spans grouped by name. In addition to these counts, TraceZ also keeps a set number of samples for error, running, and latency (including within each duration bucket) spans for each span name to allow users to look closer at span fields. This is particularly useful compared to external exporters that would otherwise likely sample them out. - +### Tracez +Tracez shows information on tracing, including aggregation counts for latency, running, and errors for spans grouped by name. In addition to these counts, Tracez also keeps a set number of samples for error, running, and latency (including within each duration bucket) spans for each span name to allow users to look closer at span fields. This is particularly useful compared to external exporters that would otherwise likely sample them out. + This zPage is also useful for debugging latency issues (slow parts of applications), deadlocks and instrumentation problems (running spans that don't end), and errors (where errors happen and what types). They're also good for spotting patterns by showing which latency speeds are typical for operations with a given span name. - -### TraceConfigZ - -TraceConfigZ is closely related to and requires implementation of TraceZ, allowing the user to modify how spans are sampled or how much data to keep in TraceZ by updating the TraceZ components accordingly. 
- + +### TraceConfigz + +TraceConfigz is closely related to and requires implementation of Tracez, allowing the user to modify how spans are sampled or how much data to keep in Tracez by updating the Tracez components accordingly. + For example, the sampling probability can be increased, decreased, or customized in other ways (i.e. depending on span parentage). Number of kept attributes, annotations, events, and links can also be adjusted. This would be useful for users that want to more accurately capture span insights or allow scaling better for exceptionally large and complex applications. - + ### RPCz RPCz provides details on sent and received RPC messages, which is categorized by RPC methods. This includes overall and error counts, average latency per call, RPCs sent per second, and input/output size per second. - -### StatsZ -StatsZ is focused on metrics, as it displays metrics and measures for exported views. These views are grouped into directories using their namespaces - - + +### Statsz +Statsz is focused on metrics, as it displays metrics and measures for exported views. These views are grouped into directories using their namespaces + + ## Design and Implementation -### TraceZ Details -To implement TraceZ, spans need to be collected, aggregated, and rendered on a webpage. - +### Tracez Details +To implement Tracez, spans need to be collected, aggregated, and rendered on a webpage. + For OpenTelemetry, a custom `span processor` can be made to interface with the `Tracer` API to collect spans. This span processor collects references to running spans and exports completed spans to its own memory or directly to an aggregator. An alternative to a span processor is using some sort of profiler. - + A `data aggregator` keeps track of counts for running, error, and latency buckets for spans grouped by their name. It also samples some spans to provide users with more information. 
To prevent memory overload, only some spans are sampled for each bucket for each span name; for example, if the maximum number of sampled spans per bucket is set to 5, then only up to 55 pieces of span data can be kept for each span name in the aggregator (sampled_max * number of buckets = 5 * [running + error + 9 latency buckets] = 5 * 11 = 55). - -When the user visits the TraceZ endpoint, likely something similar to `host:port/tracez`, then the distribution of latencies for span names will be rendered. When clicking on buckets counts for a span name, additional details on individual sampled spans for that bucket would be shown. These details would include trace ID, parent ID, span ID, start time, attributes, and more depending on the type of bucket (running, error, or latency) and what's implemented/recorded in the other components. See [HTTP Server](#http-server) for more information on implementation. - + +When the user visits the Tracez endpoint, likely something similar to `host:port/tracez`, the distribution of latencies for span names will be rendered. When the user clicks a bucket count for a span name, additional details on the individual sampled spans for that bucket are shown. These details include trace ID, parent ID, span ID, start time, attributes, and more, depending on the type of bucket (running, error, or latency) and what's implemented/recorded in the other components. See [HTTP Server](#http-server) for more information on implementation. + The thread safety of all of these components needs to be taken into account. With a span processor, data aggregator, and HTTP server configuration, there need to be tests that ensure correct, deterministic, and safe behavior when the different components try to access the same data structures concurrently. Additionally, the span data itself needs to be thread-safe since those fields will be accessed or copied at the aggregator and server levels.
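The aggregation and sampling limits described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation of any OpenTelemetry SDK: the class name `TracezAggregator`, the bucket labels, and the `record`/`snapshot` methods are all hypothetical, and moving a span from the running bucket to a completed bucket is omitted for brevity.

```python
import threading
from collections import defaultdict, deque

# Hypothetical bucket labels: running, error, and nine latency buckets (11 total).
BUCKETS = ["running", "error"] + [f"latency_{i}" for i in range(9)]


class TracezAggregator:
    """Illustrative aggregator: counts every span, keeps only a few samples."""

    def __init__(self, sampled_max=5):
        self._sampled_max = sampled_max
        self._lock = threading.Lock()  # guards both dictionaries below
        self._counts = defaultdict(lambda: dict.fromkeys(BUCKETS, 0))
        self._samples = defaultdict(
            lambda: {b: deque(maxlen=sampled_max) for b in BUCKETS}
        )

    def record(self, span_name, bucket, span_data):
        """Count the span and keep a copy of its data as a sample."""
        with self._lock:
            self._counts[span_name][bucket] += 1
            # deque(maxlen=...) drops the oldest sample once the cap is reached.
            self._samples[span_name][bucket].append(dict(span_data))

    def snapshot(self, span_name):
        """Return copies so the HTTP server never reads shared state unlocked."""
        with self._lock:
            counts = dict(self._counts[span_name])
            samples = {b: list(s) for b, s in self._samples[span_name].items()}
        return counts, samples

    def max_samples_per_name(self):
        # sampled_max * number of buckets, e.g. 5 * 11 = 55
        return self._sampled_max * len(BUCKETS)
```

A single lock keeps the sketch simple; a real implementation might prefer finer-grained locking or lock-free structures, depending on span throughput.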
- + ### TraceConfigz Details > TODO - + ### RPCz Details > TODO - -### StatsZ Details + +### Statsz Details > TODO - + ## Shared zPages Components ### Wrapper A zPages wrapper class acts as an API or injection point for zPages, instantiating and running all of the different zPages in the background without users needing knowledge of how they work under the hood. - -An example of what happens when a user includes a wrapper: if OTel Python has TraceZ and RPCz implemented and added to that wrapper, that wrapper will create instances of all the needed components (processors, aggregators, etc) for both zPages when zPages is initialized. If or when other zPages are added to OTel Python, developers adding them would only need to add the corresponding initialization code for those components in the wrapper. - + +An example of what happens when a user includes a wrapper: if OTel Python has Tracez and RPCz implemented and added to that wrapper, that wrapper will create instances of all the needed components (processors, aggregators, etc) for both zPages when zPages is initialized. If or when other zPages are added to OTel Python, developers adding them would only need to add the corresponding initialization code for those components in the wrapper. + Each zPages implementation ideally creates a wrapper class for zPages, since they would allow users to add all zPages with minimal effort. These wrappers should be as simple as adding 2 lines of code to include zPages (zPages import + initialization line). - + ### HTTP Server All zPages have some sort of HTTP Server component to render their information on a webpage when a host:port and endpoint is accessed. - + Traditionally, zPages have approached this by rendering web pages purely on the server-side. This means the server would only serve static resources (HTML, CSS and possibly Javascript) when the user accesses a given endpoint. 
Based on the type of zPage and the server language used, a pure server-side approach would generate HTML pages using hardcoded strings from scratch or using a template; this would tightly couple the data and UI layer. - + All zPages need some server-side rendering, but the data and UI layer could optionally be separated by adding client-side functionality. - + Instead of directly translating native data structures to HTML strings based on the stored information, the data layer would do 2 things depending on the webpage endpoint accessed: 1. Serve the static HTML, JS, and CSS files, which are consistent, not server generated, and not data dependent and 2. Act like a web REST API by translating stored data to a web-compatible string format (e.g. JSON). Whether the data layer does one or the other depends on which URL endpoint is accessed; the former is intended for the initial zPages load, and latter for user interactions. This separation also allows users to access zPages data, such as when using wget on the latter endpoints, without HTML/CSS.Javascript. > TODO: data endpoints for serving zPage data should be standardized and documented here (i.e. URL and data formatting, required/optional parameters) - -The UI/frontend/rendering layer is the HTML, CSS, and Javascript itself, in contrast to the logic to serve those files. This frontend uses the data layer's API on the client-side within the browser with Javascript by accessing certain endpoints depending on the user's actions. The data returned interacts with the Javascript, which determines and executes the logic necessary to render updates to the HTML DOM. Modifying the HTML DOM means there are no unnecessary requesting and re-rendering static files, and only parts of the webpage are changed. This makes subsequent data queries quicker and requires no knowledge of client-side rendering for the zPages developer. 
- -In either case, a benefit of reasoning about the zPages HTTP server as a separate component means that zPages can be mounted in an existing server. For example, this can be done in Java by calling zPages logic from a servlet. It's also worth noting that having zPages in an embedded HTTP server increases the vulnerability of application by increasing its attack surface area. A malicious actor could potentially read sensistive data in telemetry, perform DOS attacks on the HTTP server, or initiate a telemetry storm by reconfiguring how telemetry is collected (i.e. through TraceConfigZ); zPages should be reserved for protected dev environments for most cases because of this. - + +The UI/frontend/rendering layer is the HTML, CSS, and Javascript itself, in contrast to the logic to serve those files. This frontend uses the data layer's API on the client-side within the browser with Javascript by accessing certain endpoints depending on the user's actions. The data returned interacts with the Javascript, which determines and executes the logic necessary to render updates to the HTML DOM. Modifying the HTML DOM means there are no unnecessary requesting and re-rendering static files, and only parts of the webpage are changed. This makes subsequent data queries quicker and requires no knowledge of client-side rendering for the zPages developer. + +In either case, a benefit of reasoning about the zPages HTTP server as a separate component means that zPages can be mounted in an existing server. For example, this can be done in Java by calling zPages logic from a servlet. It's also worth noting that having zPages in an embedded HTTP server increases the vulnerability of application and security risks by increasing its attack surface area. A malicious actor could potentially read sensitive data in telemetry, perform DOS attacks on the HTTP server, or initiate a telemetry storm by reconfiguring how telemetry is collected (i.e. 
through TraceConfigz); zPages should be reserved for protected dev environments for most cases because of this. + ------------- ## Future Possibilities ### Out-process Out-process zPages are ones that are compatible across different languages, which executes the processing of tracing and metrics information outside of applications (as opposed to in-process zPages). - Pros - - zPages can be added to any OpenTelemetry repository, and future development can be completely focused there + - zPages can be added to any OpenTelemetry repository, and future development can be completely focused there - Cons - - More complicated than using local methods since it require setup (i.e. RPC or exporters) to allow zPages and applications to communicate. This would make it similar to other tracing systems. - + - More complicated than using local methods since it require setup (i.e. RPC or exporters) to allow zPages and applications to communicate. This would make it similar to other tracing systems. + ### Shared Static Files All HTML, CSS, and Javascript files would be used across different OTel language repositories for their in-process zPages - Pros - - When client-side features are rolled out (including filtering/sorting data, interval refreshing, unit toggles), changes are all centralized - - Rendering logic and responsibility is focused and can be more effective, zPages developers can focus on other priorities - - Less difficult to share frontend information post-setup, follows OpenTelemetry's philosophy of being standardized + - When client-side features are rolled out (including filtering/sorting data, interval refreshing, unit toggles), changes are all centralized + - Rendering logic and responsibility is focused and can be more effective, zPages developers can focus on other priorities + - Less difficult to share frontend information post-setup, follows OpenTelemetry's philosophy of being standardized - Cons - - Adds computation of converting native data structures into 
JSON strings and serving these static files. May need extra libraries - - Some process has to be created to update the static files in a repository and serving them at the correct endpoints - - Initial setup may be difficult (one way this can be achieved is with Github modules) - + - Adds computation of converting native data structures into JSON strings and serving these static files. May need extra libraries + - Some process has to be created to update the static files in a repository and serving them at the correct endpoints + - Initial setup may be difficult (one way this can be achieved is with Github modules) + ### Proxy/Shim layer > TODO - + > GENERAL TODO: Link spec where possible, add pictures/figures and design docs links From c819a4bc8cd2851e57c34425c2b76fed0be55435 Mon Sep 17 00:00:00 2001 From: Janet Vu Date: Mon, 17 Aug 2020 12:38:01 -0400 Subject: [PATCH 15/17] Reclarify TraceConfigz --- experimental/zpages.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/experimental/zpages.md b/experimental/zpages.md index 375f9edc2d8..68d1c1b3e7b 100644 --- a/experimental/zpages.md +++ b/experimental/zpages.md @@ -33,8 +33,8 @@ This zPage is also useful for debugging latency issues (slow parts of applicatio ### TraceConfigz -TraceConfigz is closely related to and requires implementation of Tracez, allowing the user to modify how spans are sampled or how much data to keep in Tracez by updating the Tracez components accordingly. - +TraceConfigz allows the user to control how spans are sampled for both zPages and external backends. + For example, the sampling probability can be increased, decreased, or customized in other ways (i.e. depending on span parentage). Number of kept attributes, annotations, events, and links can also be adjusted. This would be useful for users that want to more accurately capture span insights or allow scaling better for exceptionally large and complex applications. 
### RPCz From 2243d64ea94c4fb1b401e97edab5897bbbbbe813 Mon Sep 17 00:00:00 2001 From: Janet Vu Date: Tue, 18 Aug 2020 21:00:55 +0000 Subject: [PATCH 16/17] Nest zPages.md in trace folder, clarify ui/data separation --- experimental/{ => trace}/zpages.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) rename experimental/{ => trace}/zpages.md (94%) diff --git a/experimental/zpages.md b/experimental/trace/zpages.md similarity index 94% rename from experimental/zpages.md rename to experimental/trace/zpages.md index 68d1c1b3e7b..8e9f78f88a8 100644 --- a/experimental/zpages.md +++ b/experimental/trace/zpages.md @@ -78,9 +78,10 @@ All zPages have some sort of HTTP Server component to render their information o Traditionally, zPages have approached this by rendering web pages purely on the server-side. This means the server would only serve static resources (HTML, CSS and possibly Javascript) when the user accesses a given endpoint. Based on the type of zPage and the server language used, a pure server-side approach would generate HTML pages using hardcoded strings from scratch or using a template; this would tightly couple the data and UI layer. -All zPages need some server-side rendering, but the data and UI layer could optionally be separated by adding client-side functionality. - -Instead of directly translating native data structures to HTML strings based on the stored information, the data layer would do 2 things depending on the webpage endpoint accessed: 1. Serve the static HTML, JS, and CSS files, which are consistent, not server generated, and not data dependent and 2. Act like a web REST API by translating stored data to a web-compatible string format (e.g. JSON). Whether the data layer does one or the other depends on which URL endpoint is accessed; the former is intended for the initial zPages load, and latter for user interactions. 
This separation also allows users to access zPages data, such as when using wget on the latter endpoints, without HTML/CSS.Javascript. +All zPages need some server-side rendering, but the data and UI layer could optionally be separated by adding client-side functionality. This separation has benefits including 1.) allowing users to access isolayed zPages data, such as when using wget on the latter endpoints, without HTML/CSS/Javascript nd 2.) adding extensibility to zPages (e.g. the frontend can be centralized and used in multiple OTel language repositories). This approach is detailed below. + +Instead of directly translating native data structures to HTML strings based on the stored information, the data layer would do 2 things depending on the webpage endpoint accessed: 1. Serve the static HTML, JS, and CSS files, which are consistent, not server generated, and not data dependent and 2. Act like a web REST API by translating stored data to JSON. Whether the data layer does one or the other depends on which URL endpoint is accessed; the former is intended for the initial zPages load, and latter for user interactions. If the client requests the data via a request parameter or "Accept" HTTP header, that data should be available as a JSON-encoded response. + > TODO: data endpoints for serving zPage data should be standardized and documented here (i.e. URL and data formatting, required/optional parameters) The UI/frontend/rendering layer is the HTML, CSS, and Javascript itself, in contrast to the logic to serve those files. This frontend uses the data layer's API on the client-side within the browser with Javascript by accessing certain endpoints depending on the user's actions. The data returned interacts with the Javascript, which determines and executes the logic necessary to render updates to the HTML DOM. Modifying the HTML DOM means there are no unnecessary requesting and re-rendering static files, and only parts of the webpage are changed. 
This makes subsequent data queries quicker and requires no knowledge of client-side rendering for the zPages developer. @@ -111,3 +112,4 @@ All HTML, CSS, and Javascript files would be used across different OTel language > TODO > GENERAL TODO: Link spec where possible, add pictures/figures and design docs links + From ca0ec5da187d5053bbe695e426647bdc3e2a2651 Mon Sep 17 00:00:00 2001 From: Janet Vu Date: Wed, 19 Aug 2020 21:39:49 +0000 Subject: [PATCH 17/17] More cleanup and typo fixes --- experimental/trace/zpages.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/experimental/trace/zpages.md b/experimental/trace/zpages.md index 8e9f78f88a8..6cabd9d6caa 100644 --- a/experimental/trace/zpages.md +++ b/experimental/trace/zpages.md @@ -78,11 +78,11 @@ All zPages have some sort of HTTP Server component to render their information o Traditionally, zPages have approached this by rendering web pages purely on the server-side. This means the server would only serve static resources (HTML, CSS and possibly Javascript) when the user accesses a given endpoint. Based on the type of zPage and the server language used, a pure server-side approach would generate HTML pages using hardcoded strings from scratch or using a template; this would tightly couple the data and UI layer. -All zPages need some server-side rendering, but the data and UI layer could optionally be separated by adding client-side functionality. This separation has benefits including 1.) allowing users to access isolayed zPages data, such as when using wget on the latter endpoints, without HTML/CSS/Javascript nd 2.) adding extensibility to zPages (e.g. the frontend can be centralized and used in multiple OTel language repositories). This approach is detailed below. +All zPages need some server-side rendering, but the data and UI layer could optionally be separated by adding client-side functionality. This separation has benefits including 1.) 
allowing users to access isolated zPages data, such as when using wget on the endpoints serving JSON data, without HTML/CSS/Javascript and 2.) adding extensibility to zPages (e.g. the frontend can be centralized and used in multiple OTel language repositories). This approach is detailed below. Instead of directly translating native data structures to HTML strings based on the stored information, the data layer would do 2 things depending on the webpage endpoint accessed: 1. Serve the static HTML, JS, and CSS files, which are consistent, not server generated, and not data dependent and 2. Act like a web REST API by translating stored data to JSON. Whether the data layer does one or the other depends on which URL endpoint is accessed; the former is intended for the initial zPages load, and latter for user interactions. If the client requests the data via a request parameter or "Accept" HTTP header, that data should be available as a JSON-encoded response. -> TODO: data endpoints for serving zPage data should be standardized and documented here (i.e. URL and data formatting, required/optional parameters) +> TODO: add standardized URL endpoints for serving zPage data, along with expected JSON formatting and required/optional parameters The UI/frontend/rendering layer is the HTML, CSS, and Javascript itself, in contrast to the logic to serve those files. This frontend uses the data layer's API on the client-side within the browser with Javascript by accessing certain endpoints depending on the user's actions. The data returned interacts with the Javascript, which determines and executes the logic necessary to render updates to the HTML DOM. Modifying the HTML DOM means there are no unnecessary requesting and re-rendering static files, and only parts of the webpage are changed. This makes subsequent data queries quicker and requires no knowledge of client-side rendering for the zPages developer. 
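As a rough sketch of the data/UI separation described above, a single endpoint could return either the static page or a JSON-encoded response depending on a request parameter or the "Accept" header. The endpoint path, the `format=json` parameter, and the names `wants_json`, `build_response`, and `ZPagesHandler` are assumptions made for illustration; the data endpoints have not been standardized (see the TODO above).

```python
import json
from http.server import BaseHTTPRequestHandler
from urllib.parse import parse_qs, urlparse

# Placeholder for the static, data-independent frontend files.
STATIC_PAGE = "<html><body><div id='tracez'></div></body></html>"


def wants_json(path, accept_header):
    """True if the client asked for data via ?format=json or the Accept header."""
    query = parse_qs(urlparse(path).query)
    if query.get("format") == ["json"]:
        return True
    return "application/json" in (accept_header or "")


def build_response(path, accept_header, data):
    """Return (content_type, body) for either the UI layer or the data layer."""
    if wants_json(path, accept_header):
        return "application/json", json.dumps(data)
    return "text/html", STATIC_PAGE


class ZPagesHandler(BaseHTTPRequestHandler):
    # Hypothetical stand-in for data pulled from an aggregator.
    data = {"spans": []}

    def do_GET(self):
        content_type, body = build_response(
            self.path, self.headers.get("Accept"), self.data
        )
        payload = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

Keeping `build_response` free of socket handling means the same logic could be mounted in an existing server, and it lets a tool such as wget fetch the JSON without any HTML, CSS, or Javascript.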
@@ -113,3 +113,4 @@ All HTML, CSS, and Javascript files would be used across different OTel language > GENERAL TODO: Link spec where possible, add pictures/figures and design docs links +