Commit

Merge branch 'master' into probrequirements

bogdandrutu authored Aug 20, 2020
2 parents 1c12035 + 5b86d4b commit 170f380
Showing 23 changed files with 403 additions and 56 deletions.
9 changes: 5 additions & 4 deletions .github/workflows/auto-assign-tc-members.yml
@@ -1,13 +1,14 @@
 name: 'Auto Assign'
-on:
-  pull_request:
-    types: [assigned, opened, synchronize, reopened]
+on:
+  pull_request_target:
+    types: [opened, reopened]
+
 jobs:
   add-owner:
     runs-on: ubuntu-latest
     steps:
       - name: run
         uses: kentaro-m/auto-assign-action@v1.1.1
-        with:
+        with:
           configuration-path: ".github/auto_assign.yml"
           repo-token: '${{ secrets.GITHUB_TOKEN }}'
5 changes: 3 additions & 2 deletions Makefile
@@ -1,5 +1,5 @@
 # All documents to be used in spell check.
-ALL_DOCS := $(shell find . -name '*.md' -type f | grep -v ^./node_modules | sort)
+ALL_DOCS := $(shell find . -name '*.md' -not -path './.github/*' -type f | grep -v ^./node_modules | sort)

 TOOLS_DIR := ./.tools
 MISSPELL_BINARY=$(TOOLS_DIR)/misspell
@@ -32,4 +32,5 @@ install-markdown-lint:

 .PHONY: markdown-lint
 markdown-lint:
-	@for f in $(ALL_DOCS); do echo $$f; $(MARKDOWN_LINT) -c .markdownlint.yaml $$f; done
+	@echo $(ALL_DOCS)
+	@for f in $(ALL_DOCS); do echo $$f; $(MARKDOWN_LINT) -c .markdownlint.yaml $$f || exit 1; done
1 change: 1 addition & 0 deletions README.md
@@ -19,6 +19,7 @@ The OpenTelemetry specification describes the cross-language requirements and ex
   - [Metrics](specification/metrics/api.md)
 - SDK Specification
   - [Tracing](specification/trace/sdk.md)
+  - [Metrics](specification/metrics/sdk.md)
   - [Resource](specification/resource/sdk.md)
   - [Configuration](specification/sdk-configuration.md)
 - Data Specification
20 changes: 17 additions & 3 deletions experimental/metrics/config-service.md
@@ -20,8 +20,8 @@

<small><i><a href='http://ecotrust-canada.github.io/markdown-toc/'>Table of contents generated with markdown-toc</a></i></small>


## Overview

The OpenTelemetry Metric Configuration Service adds the ability to dynamically
and remotely configure metric collection schedules. A user may specify
collection periods at runtime, and propagate these changes to instrumented
@@ -35,8 +35,8 @@ third-party metric provider has an existing metric configuration service (or
would like to implement one in the future), and if it communicates using this
protocol, it may speak directly with our instrumented applications.


## Service Protocol

Configuration data is communicated between an SDK and a backend (either directly
or indirectly through a Collector) using the following protocol specification.
The SDK is assumed to be the client, and makes the metric config requests. The
@@ -45,6 +45,7 @@ responses. For more details on this arrangement, see
[below](#push-vs-pull-metric-model).

### Metric Config Request

A request consists of two fields: `resource` and an optional
`last_known_fingerprint`.

@@ -63,10 +64,12 @@ If unspecified, the configuration backend will send the full schedules with each
request.
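
Taken together, the request could be modeled roughly as follows (the field names and types here are assumptions for illustration; the normative definition is the protocol's protobuf messages):

```go
// Hypothetical Go model of a metric config request; the wire format is
// defined by the protocol's protobuf messages, not by this sketch.
type MetricConfigRequest struct {
	// Resource identifies the requesting SDK, e.g. by key-value labels
	// such as "service.name=checkout".
	Resource map[string]string

	// LastKnownFingerprint is optional; when empty, the backend sends
	// the full set of schedules with each response.
	LastKnownFingerprint []byte
}
```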

### Metric Config Response

A response consists of three fields: `schedules`, `fingerprint`, and
`suggested_wait_time_sec`.

#### Schedules

`schedules` is a list of metric schedules. Each schedule consists of three
components: `exclusion_patterns`, `inclusion_patterns`, and `period_sec`.

@@ -85,6 +88,7 @@ periods that are divisible by the smallest period (see
collected

#### Fingerprint

`fingerprint` is a sequence of bytes that corresponds to the set of schedules
being sent. There are two requirements on computing fingerprints:

@@ -97,12 +101,14 @@ is the same as the response’s `last_known_fingerprint`, then all other fields of
the response are optional.

#### Wait Time

`suggested_wait_time_sec` is a duration (in seconds) that the SDK should wait
before sending the next metric config request. A response MAY have a
`suggested_wait_time_sec`, but its use is optional, and the SDK need not obey
it. As the name implies, it is simply a suggestion.
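
Gathering the three fields, a response could be modeled along these lines (again a sketch with assumed names, not the normative protobuf):

```go
// Hypothetical Go model of a metric config response.
type MetricConfigResponse struct {
	Schedules            []Schedule
	Fingerprint          []byte // deterministic digest of Schedules
	SuggestedWaitTimeSec int32  // advisory only; the SDK may ignore it
}

// Schedule pairs metric-name patterns with a collection period.
type Schedule struct {
	ExclusionPatterns []string
	InclusionPatterns []string
	PeriodSec         int32 // seconds between collection events
}
```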

### Push vs Pull Metric Model

Note that the configuration service assumes a “push” model of metric export --
that is, metrics are pushed from the SDK to a receiving backend. The backend
serves incoming requests that contain metric data. This is in contrast to the
@@ -114,8 +120,8 @@ metrics, and the need for our configuration service is less relevant. We
therefore assume that all systems using the configuration service deliver
metrics on a push-based model.


## Implementation Details

Because this specification is experimental, and may imply substantial changes to
the existing system, we provide additional details on the example prototype
implementations available on the
@@ -125,6 +131,7 @@ actual implementation in an SDK will likely differ. We offer these details not
as formal specification, but as an example of how this system might look.

### Collection Periods

Though the protocol does not enforce specific collection periods, the SDK MAY
assume that all larger collection periods will be divisible by the smallest
period in a set of schedules, for the sake of optimization. Indeed, it is
@@ -149,6 +156,7 @@ However, the SDK MUST still be able to handle periods of any nonzero integer
duration, even if they violate the divisibility suggestion.
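
For example (a sketch under the divisibility assumption, reusing the `Schedule` shape from the response sketch above), an SDK could drive all schedules from a single ticker that runs at the smallest period:

```go
import "time"

// exportLoop drives all schedules from one ticker at the smallest
// collection period. Error handling and shutdown are omitted.
func exportLoop(schedules []Schedule, collect func(Schedule)) {
	smallest := schedules[0].PeriodSec
	for _, s := range schedules {
		if s.PeriodSec < smallest {
			smallest = s.PeriodSec
		}
	}
	ticker := time.NewTicker(time.Duration(smallest) * time.Second)
	defer ticker.Stop()
	for elapsed := int64(0); ; {
		<-ticker.C
		elapsed += int64(smallest)
		for _, s := range schedules {
			// When every period is divisible by the smallest one, each
			// schedule fires exactly on time. A non-divisible period only
			// fires at common multiples, which is why the divisibility
			// recommendation exists; a real SDK MUST still handle such
			// periods correctly, e.g. by tracking per-schedule deadlines.
			if elapsed%int64(s.PeriodSec) == 0 {
				collect(s)
			}
		}
	}
}
```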

### Go SDK

A prototype implementation of metric configuration is available for the Go SDK,
currently hosted on the [contrib repo](https://github.com/vmingchen/opentelemetry-go-contrib). It provides an
alternative push controller component with the ability to configure collection
@@ -166,6 +174,7 @@ controller, in place of OpenTelemetry’s version, to be able to have access to
this feature.

### Collector Extension

An example configuration backend is implemented as an extension for the
Collector, currently hosted on the [contrib repo](https://github.com/vmingchen/opentelemetry-collector-contrib). When this extension is enabled, the Collector
functions as a potential endpoint for Agent/SDKs to retrieve configuration data.
@@ -184,6 +193,7 @@ The configuration data itself may be specified using one of two sources: a local
file or a connection to a remote backend.

#### Local File

Configuration data can be specified in the form of a local file that the
collector monitors. Changes to the file are immediately reflected in the
Collector’s in-memory representation of the data, so there is no need to restart
@@ -213,6 +223,7 @@ ConfigBlocks:
```
The following rules govern the file-based configurations (a hypothetical example follows the list):
* There MUST be 1 ConfigBlock or more in a ConfigBlocks list
* Each ConfigBlock MAY have a field Resource
* Resource MAY have one or more strings, each a string-representation of a key-value label in a resource. If no strings are specified, then this ConfigBlock matches with any request
@@ -222,6 +233,7 @@ The following rules govern the file-based configurations:
* Each Schedule MUST have a field Period, corresponding to the collection period of the metrics matched by this Schedule
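
To make the rules concrete, a hypothetical file could look like the following; the labels, pattern field names, and period values are invented for illustration, and the collapsed `ConfigBlocks` example above shows the real format:

```yaml
# Hypothetical ConfigBlocks file; labels, patterns, and periods invented.
ConfigBlocks:
  - Resource:
      - "service.name=checkout"
    Schedules:
      - InclusionPatterns:
          - "http/*"
        Period: 30
  - # No Resource strings: this block matches any request.
    Schedules:
      - Period: 300
```
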
##### Matching Behavior
An incoming request specifies a resource for which configuration data should be
returned. A ConfigBlock matches a resource if all strings listed under
ConfigBlock::Resource exactly equal a key-value label in the resource. In the
@@ -236,11 +248,13 @@ across telemetry sources, unless superseded by a more specific ConfigBlock that
asks for a shorter period.
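
A sketch of this matching rule in Go, assuming a resource is represented by its set of key-value label strings (e.g. `"service.name=checkout"`):

```go
// matches reports whether a ConfigBlock applies to a resource: every
// string under the block's Resource field must exactly equal one of the
// resource's key-value labels, and a block with no Resource strings
// matches any resource.
func matches(blockResource []string, resourceLabels map[string]bool) bool {
	for _, label := range blockResource {
		if !resourceLabels[label] {
			return false
		}
	}
	return true
}
```
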
##### Fingerprint Hashing
Fingerprints are generated using an FNV-1a 64-bit hashing scheme. The hash is
uniquely determined by the contents of a ConfigBlock. The order of patterns and
the order of schedules do not impact the resulting hash.
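
One way to satisfy the order-independence property (a sketch, not necessarily what the prototype does) is to canonicalize by sorting before hashing with Go's standard `hash/fnv` package:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// fingerprint hashes a canonical rendering of schedule entries with
// 64-bit FNV-1a. Sorting first makes the result independent of the
// order in which patterns or schedules appear.
func fingerprint(entries []string) uint64 {
	sorted := append([]string(nil), entries...)
	sort.Strings(sorted)
	h := fnv.New64a()
	for _, e := range sorted {
		h.Write([]byte(e))
		h.Write([]byte{0}) // separator so "ab"+"c" != "a"+"bc"
	}
	return h.Sum64()
}

func main() {
	a := fingerprint([]string{"inclusion:http/*", "period:30"})
	b := fingerprint([]string{"period:30", "inclusion:http/*"})
	fmt.Println(a == b) // true: order does not affect the hash
}
```
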
#### Remote Backend
Alternatively, instead of using a local file, the Collector may use another
configuration service backend. This remote backend could be another Collector,
or it could be a third party that implements the configuration service. In the
134 changes: 134 additions & 0 deletions experimental/trace/zpages.md
@@ -0,0 +1,134 @@
# zPages

## Table of Contents

- [Overview](#overview)
- [Types of zPages](#types-of-zpages)
- [Tracez](#tracez)
- [TraceConfigz](#traceconfigz)
- [RPCz](#rpcz)
- [Statsz](#statsz)
- [Design and Implementation](#design-and-implementation)
- [Tracez](#tracez-details)
- [TraceConfigz](#traceconfigz-details)
- [RPCz](#rpcz-details)
- [Statsz](#statsz-details)
- [Shared zPages Components](#shared-zpages-components)
- [Wrapper](#wrapper)
- [HTTP Server](#http-server)
- [Future possibilities / Exploration](#future-possibilities)
- [Out-process Implementation](#out-process)
- [Shared Static Files](#shared-static-files)

## Overview

zPages are an in-process alternative to external exporters. When included, they collect and aggregate tracing and metrics information in the background; this data is served on web pages when requested.

The idea of "zPages" originates from one of OpenTelemetry's predecessors, [OpenCensus](https://opencensus.io/). You can read more about zPages from the OpenCensus docs [here](https://opencensus.io/zpages) or the OTEP [here](https://github.com/open-telemetry/oteps/blob/master/text/0110-z-pages.md). OpenCensus has different zPage implementations in [Java](https://opencensus.io/zpages/java/), [Go](https://opencensus.io/zpages/go/), and [Node](https://opencensus.io/zpages/node/), and similar internal solutions have been developed at companies like Uber. Within OpenTelemetry, zPages have been or are being developed in [C#](https://github.com/open-telemetry/opentelemetry-dotnet/tree/master/src/OpenTelemetry.Exporter.ZPages), Java, and C++. The OTel Collector also has [an implementation](https://github.com/open-telemetry/opentelemetry-collector/tree/master/extension/zpagesextension) of zPages.

zPages are uniquely useful in a couple of ways. First, they're more lightweight and quicker to set up than external tracing systems like Jaeger and Zipkin, yet they still offer useful ways to debug and gain insight into instrumented applications; these uses depend on the type of zPage, as detailed below. Second, for high-throughput applications, zPages can analyze more telemetry than external exporters within their limited set of supported scenarios; this is because zPages aggregate in memory, while external exporters are typically configured to send only a subset of telemetry for rich analysis in order to save costs.

## Types of zPages

### Tracez

Tracez shows tracing information, including aggregate counts of running, error, and latency-bucketed spans, grouped by span name. In addition to these counts, Tracez keeps a set number of sample spans per bucket (running, error, and each latency duration bucket) for each span name, so users can look more closely at span fields. This is particularly useful compared to external exporters, which would otherwise likely sample these spans out.

This zPage is also useful for debugging latency issues (slow parts of applications), deadlocks and instrumentation problems (running spans that don't end), and errors (where errors happen and what types). It's also good for spotting patterns by showing which latencies are typical for operations with a given span name.

### TraceConfigz

TraceConfigz allows the user to control how spans are sampled for both zPages and external backends.

For example, the sampling probability can be increased, decreased, or customized in other ways (e.g. depending on span parentage). The number of kept attributes, annotations, events, and links can also be adjusted. This is useful for users who want to capture span insights more accurately or to scale better for exceptionally large and complex applications.

### RPCz

RPCz provides details on sent and received RPC messages, categorized by RPC method. This includes overall and error counts, average latency per call, RPCs sent per second, and input/output size per second.

### Statsz

Statsz is focused on metrics, as it displays metrics and measures for exported views. These views are grouped into directories using their namespaces.

## Design and Implementation

### Tracez Details

To implement Tracez, spans need to be collected, aggregated, and rendered on a webpage.

For OpenTelemetry, a custom `span processor` can be made to interface with the `Tracer` API to collect spans. This span processor collects references to running spans and exports completed spans to its own memory or directly to an aggregator. An alternative to a span processor is using some sort of profiler.

A `data aggregator` keeps track of counts for running, error, and latency buckets for spans grouped by their name. It also samples some spans to provide users with more information. To prevent memory overload, only some spans are sampled for each bucket for each span name; for example, if the maximum number of sampled spans per bucket is set to 5, then only up to 55 pieces of span data can be kept for each span name in the aggregator (sampled_max * number of buckets = 5 * [running + error + 9 latency buckets] = 5 * 11 = 55).
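
A sketch of what that bookkeeping might look like; the names and the nine-bucket latency layout are assumptions, not a prescribed API:

```go
// Constants for a hypothetical Tracez aggregator.
const (
	numLatencyBuckets = 9
	numBuckets        = numLatencyBuckets + 2 // plus running and error
	maxSamples        = 5                     // sampled spans kept per bucket
)

// SpanData stands in for whatever span snapshot type the SDK keeps.
type SpanData struct {
	TraceID, SpanID, ParentID string
	// ... start time, attributes, etc.
}

// spanGroup aggregates all spans that share one span name. With
// maxSamples = 5 and 11 buckets, it retains at most 55 sampled spans.
type spanGroup struct {
	running int                    // currently running spans
	errors  int                    // spans that ended with an error
	latency [numLatencyBuckets]int // OK spans, counted per duration bucket
	samples [numBuckets][]SpanData // capped at maxSamples entries each
}
```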

When the user visits the Tracez endpoint, likely something similar to `host:port/tracez`, the distribution of latencies for span names will be rendered. When clicking on bucket counts for a span name, additional details on individual sampled spans for that bucket would be shown. These details would include trace ID, parent ID, span ID, start time, attributes, and more depending on the type of bucket (running, error, or latency) and what's implemented/recorded in the other components. See [HTTP Server](#http-server) for more information on implementation.

The thread safety of all of these components needs to be taken into account. With a span processor, data aggregator, and HTTP server configuration, there need to be tests that ensure correct, deterministic, and safe behavior when the different components access the same data structures concurrently. Additionally, the span data itself needs to be thread-safe, since those fields will be accessed or copied at the aggregator and server level.

### TraceConfigz Details

> TODO

### RPCz Details

> TODO

### Statsz Details

> TODO

## Shared zPages Components

### Wrapper

A zPages wrapper class acts as an API or injection point for zPages, instantiating and running all of the different zPages in the background without users needing knowledge of how they work under the hood.

An example of what happens when a user includes a wrapper: if OTel Python has Tracez and RPCz implemented and added to that wrapper, that wrapper will create instances of all the needed components (processors, aggregators, etc) for both zPages when zPages is initialized. If or when other zPages are added to OTel Python, developers adding them would only need to add the corresponding initialization code for those components in the wrapper.

Each zPages implementation ideally creates a wrapper class for zPages, since it would allow users to add all zPages with minimal effort. These wrappers should be as simple as adding two lines of code to include zPages (a zPages import plus an initialization line).
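
In Go, using such a wrapper might look like this (the import path and `Start` function are hypothetical, not an existing OTel API):

```go
package main

// The zpages import path and API below are hypothetical; an actual
// wrapper would be whatever the language SDK ships.
import "example.com/otel/zpages"

func main() {
	zpages.Start() // spins up processors, aggregators, and the HTTP server
	// ... application code ...
}
```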

### HTTP Server

All zPages have some sort of HTTP server component to render their information on a webpage when a host:port and endpoint are accessed.

Traditionally, zPages have approached this by rendering web pages purely on the server-side. This means the server would only serve static resources (HTML, CSS and possibly Javascript) when the user accesses a given endpoint. Based on the type of zPage and the server language used, a pure server-side approach would generate HTML pages using hardcoded strings from scratch or using a template; this would tightly couple the data and UI layer.

All zPages need some server-side rendering, but the data and UI layer could optionally be separated by adding client-side functionality. This separation has benefits including 1.) allowing users to access isolated zPages data, such as when using wget on the endpoints serving JSON data, without HTML/CSS/Javascript and 2.) adding extensibility to zPages (e.g. the frontend can be centralized and used in multiple OTel language repositories). This approach is detailed below.

Instead of directly translating native data structures to HTML strings based on the stored information, the data layer would do two things depending on the URL endpoint accessed: 1. serve the static HTML, JS, and CSS files, which are consistent, not server-generated, and not data-dependent, and 2. act like a REST API by translating stored data to JSON. The former is intended for the initial zPages load, the latter for user interactions. If the client requests the data via a request parameter or "Accept" HTTP header, that data should be available as a JSON-encoded response.
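
A minimal Go sketch of this split; the endpoints, port, handler names, and JSON shape are all assumptions for illustration:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// tracezRow is a hypothetical JSON shape for one span-name aggregate.
type tracezRow struct {
	SpanName string `json:"spanName"`
	Running  int    `json:"running"`
	Errors   int    `json:"errors"`
	Latency  [9]int `json:"latency"`
}

// collectTracezRows stands in for reading the aggregator's state.
func collectTracezRows() []tracezRow { return nil }

func main() {
	// UI layer: static, data-independent files served for the initial load.
	http.HandleFunc("/tracez", func(w http.ResponseWriter, r *http.Request) {
		http.ServeFile(w, r, "static/tracez.html")
	})

	// Data layer: the same aggregates as JSON, consumed by the page's
	// Javascript or scraped directly (e.g. with wget).
	http.HandleFunc("/tracez/data", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(collectTracezRows())
	})

	log.Fatal(http.ListenAndServe("localhost:55679", nil))
}
```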

> TODO: add standardized URL endpoints for serving zPage data, along with expected JSON formatting and required/optional parameters

The UI/frontend/rendering layer is the HTML, CSS, and Javascript itself, in contrast to the logic that serves those files. This frontend uses the data layer's API client-side, within the browser, by accessing certain endpoints depending on the user's actions. The returned data is handled by the Javascript, which determines and executes the logic necessary to render updates to the HTML DOM. Modifying the HTML DOM means there is no unnecessary requesting and re-rendering of static files, and only parts of the webpage are changed. This makes subsequent data queries quicker and requires no knowledge of client-side rendering for the zPages developer.

In either case, a benefit of reasoning about the zPages HTTP server as a separate component is that zPages can be mounted in an existing server; for example, this can be done in Java by calling zPages logic from a servlet. It's also worth noting that embedding an HTTP server increases the application's security risk by increasing its attack surface. A malicious actor could potentially read sensitive data in telemetry, perform DoS attacks on the HTTP server, or initiate a telemetry storm by reconfiguring how telemetry is collected (e.g. through TraceConfigz); because of this, zPages should in most cases be reserved for protected dev environments.

-------------

## Future Possibilities

### Out-process

Out-process zPages are compatible across different languages, executing the processing of tracing and metrics information outside of the application (as opposed to in-process zPages).

- Pros
  - zPages can be added to any OpenTelemetry repository, and future development can be completely focused there
- Cons
  - More complicated than using local methods, since it requires setup (e.g. RPC or exporters) to allow zPages and applications to communicate. This would make it similar to other tracing systems.

### Shared Static Files

All HTML, CSS, and Javascript files would be used across different OTel language repositories for their in-process zPages.

- Pros
  - When client-side features are rolled out (including filtering/sorting data, interval refreshing, unit toggles), changes are all centralized
  - Rendering logic and responsibility are focused and can be more effective; zPages developers can focus on other priorities
  - Makes it less difficult to share frontend information post-setup, and follows OpenTelemetry's philosophy of standardization
- Cons
  - Adds the computation of converting native data structures into JSON strings and serving these static files; may need extra libraries
  - Some process has to be created to update the static files in a repository and serve them at the correct endpoints
  - Initial setup may be difficult (one way this can be achieved is with Git submodules)

### Proxy/Shim layer

> TODO

> GENERAL TODO: Link spec where possible, add pictures/figures and design docs links