Start zPages Experimental Spec #767

Merged
merged 20 commits into from
Aug 20, 2020

Conversation

Contributor

@jajanet commented Aug 7, 2020

This PR begins providing more insight into what zPages are, their background, different types utilized in the past (including uses and other details), and other avenues to explore for zPages.

Implementation insights are also provided for TraceZ based off the findings from the OpenTelemetry C++ TraceZ project. Some links are provided, and more will be added in the future along with recommendations on interfacing more specifically using the OpenTelemetry API/SDK.

Related OTEP

@SergeyKanzhelev

@jajanet jajanet requested a review from a team August 7, 2020 17:41
Contributor

@anuraaga left a comment

Thanks!

experimental/zpages.md (outdated)
-------------
## Future Possibilities
### Out-process
zPages that are compatible across different languages, with processing of information happening outside of applications
Contributor

Is this related to the zpage extension on the collector? I guess the otlp receiver should already have zpages similar to what the SDK provides?

Contributor Author

I didn't realize the collector had zPages, so I'm not sure how/if the collector zPages differs from the individual OTel language implementations of zPages. I will update you and change things once I look into it more, thanks for the information!

experimental/zpages.md (outdated, resolved)
Member

@yurishkuro left a comment

Re-posting my previous comments from the email thread, I don't see them particularly well reflected here:

Many frameworks already provide similar functionality, for example Uber has an internal plugin that renders pages like below (only showing the sidebar menu). I think the following are important for debug pages design:

  • pages framework should not be coupled with OTel SDK, it can be a standalone library, so that SDK can supply providers into it, but other tools can also supply providers
  • OTel providers should be implemented as separate components for data vs. rendering, so that, for example, data providers could be also integrated into other rendering frameworks (like Uber's debug pages below).
  • open question: should rendering be done in-app or should the app only provide data and rendering done via a sidecar? Uber's framework, as well as many OSS projects (e.g. Flink Job Master, Spark jobs, etc), opt for including rendering in-app, simplifying the use of debug pages.
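
The first two points above (a pages framework decoupled from the SDK, and data providers kept separate from rendering) can be sketched roughly as follows. This is a minimal illustration, not an OpenTelemetry API: `DataProvider`, `PageRegistry`, `SpanCountProvider`, the `/traces` path, and the sample counts are all hypothetical names and values invented for the example.

```python
import json
from typing import Callable, Dict, Protocol


class DataProvider(Protocol):
    """Hypothetical interface: returns plain data, knows nothing about rendering."""

    def snapshot(self) -> dict: ...


class PageRegistry:
    """Hypothetical standalone registry; not tied to any telemetry SDK."""

    def __init__(self) -> None:
        self._providers: Dict[str, DataProvider] = {}

    def register(self, path: str, provider: DataProvider) -> None:
        self._providers[path] = provider

    def render(self, path: str, renderer: Callable[[dict], str]) -> str:
        # The renderer is supplied by the caller, so the same data provider
        # can back a JSON endpoint, a text page, or another framework's UI.
        return renderer(self._providers[path].snapshot())


class SpanCountProvider:
    """Example provider an SDK might supply (illustrative data only)."""

    def __init__(self, counts: Dict[str, int]) -> None:
        self._counts = counts

    def snapshot(self) -> dict:
        return {"span_counts": dict(self._counts)}


def render_json(data: dict) -> str:
    return json.dumps(data, sort_keys=True)


def render_text(data: dict) -> str:
    return "\n".join(f"{name}: {n}" for name, n in sorted(data["span_counts"].items()))


registry = PageRegistry()
registry.register("/traces", SpanCountProvider({"GET /users": 3, "GET /orders": 1}))
print(registry.render("/traces", render_json))
print(registry.render("/traces", render_text))
```

Because the registry only depends on the `snapshot()` shape, other tools could register their own providers, and the open question of in-app versus sidecar rendering reduces to where the renderer function runs.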

[image: sidebar menu of Uber's internal debug pages plugin]

@@ -0,0 +1,112 @@
# zPages
Member

I would suggest dropping the "z" theme, which afaik is an internal Google speak. We can call them "debug pages", similar to /debug/vars used in Go, which is way more descriptive than "z-pages".

Member

I like z, because it doesn't have any connotation like debug. Debug is sometimes associated with something slow and not for production.

Spring calls it "Production Ready Endpoints": https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html#production-ready-endpoints Perhaps we can call it the same.

As for endpoint names, dropping z is OK with me.

Member

@jajanet can you call the page "Production Ready Endpoints"? @yurishkuro @anuraaga @arminru were you advocating for simply dropping z from the names of endpoints?

Member

I don't consider "debug" to mean "slow". Debug means investigating performance of a process by looking deeper into the signals that are not available through normal monitoring. Changing logging level is also an example of debugging.

"Production ready" sounds much more narrow, like a healthcheck endpoint.

Another possible name is "introspection endpoints", however that is also more narrow than what the framework should provide.

Contributor

@anuraaga commented Aug 11, 2020

@SergeyKanzhelev I would drop the z from both the endpoints and names, I agree with @yurishkuro we can have a semantic name that is self-explanatory, and URLs with z's are not as clean as with normal English IMO (z suffixes are from an era when these pages were exposed to the public internet).

debug seems fine to me, I might name a similar concept just "internal pages", or the other name suggestions seem fine to me since they're still self explanatory.

Member

Would everybody be OK with "Introspection endpoints"? We might need more opinions, but I feel very strongly that "debug" will be considered as something not acceptable for production by customers.

Contributor Author

@jajanet commented Aug 12, 2020

From the conversation on naming conventions, my understanding is:

  • rename all references of zPages to Introspection Endpoints
  • drop the z from TraceZ, RPCz, etc

Is this correct? Once this is clarified and decided, I'll update the file

Member

@jajanet Sounds good to me!
We could also add an Internal or OpenTelemetry SDK there to further disambiguate it from telemetry about user/app code.

Member

I am late to the party, but perhaps "Diagnostic Pages" or "Diagnostic Endpoints"? I feel it conveys both the debugging and introspection meanings.

OpenCensus also uses the term for describing what it really is:

They are useful to for in-process diagnostics

While zPages are uniquely useful in being more lightweight and quicker compared to installing external exporters like Jaeger and Zipkin, they still offer many useful ways to debug and gain insight into applications. The uses depend on the type of zPage, which is detailed below.

## Types of zPages
### TraceZ
Member

There is no need for Z suffixes

experimental/zpages.md (outdated, resolved)

You can read about TraceConfigZ more [here](https://opencensus.io/zpages/java/#traceconfigz).

### RPCz
Member

This seems like a projection of the overall metrics view, does it need to be pulled out? If it is pulled out to the top level, what conventions is it going to use for detecting relevant metrics?

Contributor Author

From my understanding, metrics and collecting RPC call information are separate? I'm not exactly sure; I think @SergeyKanzhelev could provide better insight.


You can read about RPCz more [here](https://docs.google.com/document/d/1RWNyUIaKTYK12tck_rQjki4jTyHFfkD8sk54mjwRwso/edit#) and [here](https://opencensus.io/zpages/java/#rpcz).

### StatsZ
Member

We're not using "stats" in OTel, we use "metrics".

Contributor Author

@jajanet commented Aug 12, 2020

That makes sense! So everywhere with stats, I should replace with metrics? Also thanks for the detailed information and explanations!

@SergeyKanzhelev SergeyKanzhelev self-assigned this Aug 10, 2020

The idea of "zPages" originates from one of OpenTelemetry's predecessors, [OpenCensus](https://opencensus.io/). You can read more about zPages from the OpenCensus docs [here](https://opencensus.io/zpages) or the OTEP [here](https://github.com/open-telemetry/oteps/blob/master/text/0110-z-pages.md). OpenCensus has different zPage implementations in [Java](https://opencensus.io/zpages/java/), [Go](https://opencensus.io/zpages/go/), and [Node](https://opencensus.io/zpages/node/) and there has been similar internal solutions developed at companies like Uber. Within OpenTelemetry, zPages are also either developed or being developed in [C#](https://github.com/open-telemetry/opentelemetry-dotnet/tree/master/src/OpenTelemetry.Exporter.ZPages), Java, and C++.

While zPages are uniquely useful in being more lightweight and quicker compared to installing external exporters like Jaeger and Zipkin, they still offer many useful ways to debug and gain insight into applications. The uses depend on the type of zPage, which is detailed below.
Member

Please mention that "For high throughput applications, external exporters are typically configured to send a subset of telemetry for reach analysis to save costs, while zPages, being in-memory, can analyze more telemetry with the limited set of supported scenarios."

Contributor Author

Added to this line (26) as:

For high throughput applications, zPages can also analyze more telemetry with the limited set of supported scenarios than external exporters; this is because zPages are in-memory while external exporters are typically configured to send subset of telemetry for reach analysis to save costs.

Let me know if I should change it!


### TraceConfigZ

TraceConfigZ is closely related to TraceZ, allowing the user to modify how spans are sampled or how much data to keep in TraceZ by updating the TraceZ components accordingly.
Member

Do TraceConfig pages control sampling for spans that would be exported by a regular exporter? I don't think TraceConfigZ is related to TraceZ.

Contributor Author

I'm not sure; my understanding was that TraceConfigz is for TraceZ for OpenCensus, but I will look into it more and edit things accordingly. If it makes more sense, the decision here also doesn't need to involve OpenCensus' zPages.

Contributor Author

I checked with someone who implemented TraceConfigz and they said TraceConfigz would affect all sampling for zPages and external backends. I updated the spec accordingly, and I'm wondering if that's what we want in the spec?


For OpenTelemetry, a custom `span processor` can be made to interface with the `Tracer` API to collect spans. This span processor collects references to running spans and exports completed spans to its own memory or directly to an aggregator. An alternative to a span processor is using some sort of profiler.
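
A minimal sketch of the span processor described above, written without any OpenTelemetry dependency. The `Span` dataclass and `InMemorySpanProcessor` are hypothetical stand-ins invented for illustration; the `on_start`/`on_end` method names mirror the general shape of a span processor, but the fields and the `drain_completed` helper are assumptions, not part of any SDK.

```python
import threading
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """Stand-in for an SDK span; only the fields this sketch needs."""
    span_id: int
    name: str
    end_time_ns: Optional[int] = None


class InMemorySpanProcessor:
    """Hypothetical processor: tracks running spans, buffers completed ones."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._running: Dict[int, Span] = {}
        self._completed: List[Span] = []

    def on_start(self, span: Span) -> None:
        with self._lock:
            self._running[span.span_id] = span  # keep a reference, not a copy

    def on_end(self, span: Span) -> None:
        with self._lock:
            self._running.pop(span.span_id, None)
            self._completed.append(span)

    def running_spans(self) -> List[Span]:
        with self._lock:
            return list(self._running.values())

    def drain_completed(self) -> List[Span]:
        """Hand completed spans to an aggregator and clear the buffer."""
        with self._lock:
            done, self._completed = self._completed, []
            return done


processor = InMemorySpanProcessor()
a, b = Span(1, "GET /users"), Span(2, "GET /orders")
processor.on_start(a)
processor.on_start(b)
a.end_time_ns = 1_000
processor.on_end(a)
```

After this sequence, span 2 is still reported as running and span 1 is waiting in the completed buffer. The lock is there because spans typically start and end on application threads while the page handler reads from another.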

A `data aggregator` keeps a track of counts for running, error, and latency buckets for spans grouped by their name. It also samples some spans to provide users with more information. To prevent memory overload, only some spans are sampled for each bucket for each span name; for example, if that sampled span max number is set to 5, then only up to 55 pieces of span data can be kept for each span name in the aggregator (sampled_max * number of buckets = 5 * [running + error + 9 latency buckets] = 5 * 11 = 55).
Member

tracez is also designed to keep samples for every latency bucket. This is a big benefit compared to a regular exporter. zPages will store unique spans that fall into a specific latency bucket or have an error: those that would likely be sampled out by a regular exporter.

Contributor Author

@jajanet commented Aug 12, 2020

Reclarified under line 30 (explanations/overviews on the types of zPages) as:

In addition to these counts, TraceZ also keeps a set number of samples for error, running, and latency (including within each duration bucket) spans for each span name to allow users to look closer at span fields. This is particularly useful compared to external exporters that would otherwise likely sample them out.

Does that sound okay? I meant for this later section to outline more architecture details, but could restructure things as needed.

Member

@SergeyKanzhelev left a comment

Naming is one change that we need to close on. If no comments from Yuri and others, we can keep the current naming.

I think it's critical to explain that tracez works BEFORE sampling and will catch unique samples of Spans that otherwise wouldn't be collected with the regular exporter. See my comments

Co-authored-by: Sergey Kanzhelev <S.Kanzhelev@live.com>
@jajanet jajanet requested review from a team and carlosalberto August 12, 2020 00:37
@carlosalberto carlosalberto added area:sdk Related to the SDK spec:trace Related to the specification/trace directory release:after-ga Not required before GA release, and not going to work on before GA labels Aug 12, 2020
Member

@tigrannajaryan left a comment

I did not see any mention of security. zPages available via an embedded HTTP server increase the attack surface of the application. It can potentially allow a malicious actor to:

  • Read sensitive data contained in telemetry.
  • Perform a DOS attack on the HTTP server.
  • Initiate a telemetry storm by reconfiguring the telemetry collection (e.g. by changing the sampling factor to cause a significant increase in exported telemetry).

I would like to understand what our stance on this is. Should zPages be enabled by default, or should we disable them by default so that they need to be explicitly enabled by the end user?


A `data aggregator` keeps track of counts for running, error, and latency buckets for spans grouped by their name. It also samples some spans to provide users with more information. To prevent memory overload, only some spans are sampled for each bucket for each span name; for example, if that sampled span max number is set to 5, then only up to 55 pieces of span data can be kept for each span name in the aggregator (sampled_max * number of buckets = 5 * [running + error + 9 latency buckets] = 5 * 11 = 55).
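
The memory bound in the paragraph above (5 samples times 11 buckets gives 55 per span name) can be checked with a small sketch. Everything here is an assumption made for illustration: the `Aggregator` class, the power-of-ten latency boundaries (the text only says there are 9 latency buckets), and the choice to keep the first samples seen rather than, say, the most recent ones.

```python
from collections import defaultdict

# Assumed lower bounds (nanoseconds) for the 9 latency buckets: powers of ten
# from 0 up to 100 s. The spec text only states that there are 9 buckets.
LATENCY_BOUNDS_NS = [0, 10_000, 100_000, 1_000_000, 10_000_000,
                     100_000_000, 1_000_000_000, 10_000_000_000, 100_000_000_000]
BUCKETS = ["running", "error"] + [f"latency_{i}" for i in range(len(LATENCY_BOUNDS_NS))]
SAMPLED_MAX = 5


def latency_bucket(duration_ns: int) -> str:
    """Map a non-negative duration to the highest bucket whose bound it reaches."""
    index = max(i for i, bound in enumerate(LATENCY_BOUNDS_NS) if duration_ns >= bound)
    return f"latency_{index}"


class Aggregator:
    """Hypothetical aggregator: unbounded counts, bounded samples per bucket."""

    def __init__(self) -> None:
        self.counts = defaultdict(lambda: dict.fromkeys(BUCKETS, 0))
        self.samples = defaultdict(lambda: {bucket: [] for bucket in BUCKETS})

    def record(self, name: str, bucket: str, span_data: dict) -> None:
        self.counts[name][bucket] += 1   # every span is counted
        kept = self.samples[name][bucket]
        if len(kept) < SAMPLED_MAX:      # but only a few are stored
            kept.append(span_data)

    def sample_count(self, name: str) -> int:
        return sum(len(kept) for kept in self.samples[name].values())


agg = Aggregator()
for bucket in BUCKETS:
    for i in range(100):
        agg.record("GET /users", bucket, {"span_id": i})
print(agg.sample_count("GET /users"))  # 5 * 11 = 55
```

Even though 1,100 spans were recorded for the span name, only 55 pieces of span data are retained, which is the bound the paragraph describes. For simplicity the sketch records a "running" span directly into its bucket; a real aggregator would also need to decrement the running count when a span completes.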

When the user visits the TraceZ endpoint, likely something similar to `host:port/tracez`, then the distribution of latencies for span names will be rendered. When clicking on buckets counts for a span name, additional details on individual sampled spans for that bucket would be shown. These details would include trace ID, parent ID, span ID, start time, attributes, and more depending on the type of bucket (running, error, or latency) and what's implemented/recorded in the other components. See [HTTP Server](#http-server) for more information on implementation.
Member

Do we standardize the port number? It will make the discovery easier.

Member

Not at the moment. I think it's not critical for the experimental stage unless we get some usability results. Perhaps separate research will be needed into which port other things like Spring Boot endpoints standardize on...


Instead of directly translating native data structures to HTML strings based on the stored information, the data layer would do 2 things depending on the webpage endpoint accessed: 1. Serve the static HTML, JS, and CSS files, which are consistent, not server generated, and not data dependent and 2. Act like a web/HTTP API by translating stored data to JSON strings. Whether the data layer does one or the other depends on which URL endpoint is accessed; the former is intended for the initial zPages load, and latter for user interactions.

The UI/frontend/rendering layer is the HTML, CSS, and JavaScript itself, in contrast to the logic to serve those files. This frontend uses the data layer's API on the client-side within the browser with JavaScript by accessing certain endpoints depending on the user's actions. The data returned interacts with the JavaScript, which determines and executes the logic necessary to render updates to the HTML DOM. Modifying the HTML DOM means static files are not unnecessarily re-requested and re-rendered, and only parts of the webpage are changed. This makes subsequent data queries quicker and requires no knowledge of client-side rendering for the zPages developer.
Member

Does this imply that rendered zPages will be ONLY available if the client is JS-enabled? In troubleshooting scenarios I may have no access to a full-blown browser and just fire a wget against the zPage.

I agree that the separation to the data collection and rendering components is useful. I think it is also very useful to expose the zPage data purely as data, e.g. in JSON format, available via a REST API. However, I think there is also a value in simple human-readable HTML/CSS output that is rendered on server side and is available without requiring JS-enabled web clients.

I think a good approach would be to render on the server-side but for each endpoint also support JSON-encoded responses if the client requests it (via a request parameter or via "Accept" HTTP header).
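
The approach suggested in this comment (server-side HTML by default, JSON when the client asks for it) could look roughly like the sketch below. The `format=json` query parameter, the function names, and the table layout are hypothetical; the `Accept` check is deliberately naive (it ignores quality values) and assumes header names have already been normalized to this capitalization.

```python
import html
import json
from typing import Dict, Tuple


def wants_json(query: Dict[str, str], headers: Dict[str, str]) -> bool:
    """JSON if requested via a (hypothetical) ?format=json parameter or Accept."""
    if query.get("format") == "json":
        return True
    return "application/json" in headers.get("Accept", "")


def render_tracez(counts: Dict[str, int],
                  query: Dict[str, str],
                  headers: Dict[str, str]) -> Tuple[str, str]:
    """Return (content_type, body) for a hypothetical span-count page."""
    if wants_json(query, headers):
        return "application/json", json.dumps({"span_counts": counts}, sort_keys=True)
    # Default: plain server-rendered HTML that works with wget or a text browser.
    rows = "".join(
        f"<tr><td>{html.escape(name)}</td><td>{n}</td></tr>"
        for name, n in sorted(counts.items())
    )
    return "text/html", f"<table><tr><th>Span name</th><th>Count</th></tr>{rows}</table>"
```

With this shape, a plain `wget` of the page gets readable HTML with no JavaScript required, while a frontend or script that sends `Accept: application/json` gets the same data as JSON.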

Member

This is how the OTEP is formulated and the experimental code is implemented. Perhaps those endpoints are important to document. @jajanet can you please add a TODO in the document saying that those data endpoints need to be documented as well?

Contributor Author

Yes, that's a good point and I'll do that! I'll also add points on security, as I didn't think about that before

Contributor Author

@jajanet commented Aug 18, 2020

Hey everybody! Based on the discussion, let's have a poll on the new zPages name to help finalize this PR. Probably gonna have this up for at least a couple of days, and/or when there's a clear winner. In any case, the name discussion can be an open one and may be postponed to a later PR depending on the results. Vote using emojis!

Prefix:
Laugh - Introspection
Hooray - Debugging
Heart - Diagnostic

Suffix:
Rocket - Endpoints
Eyes - Pages

@SergeyKanzhelev @anuraaga @yurishkuro @tigrannajaryan @arminru 

@jajanet jajanet requested a review from a team August 18, 2020 21:01
Member

@SergeyKanzhelev left a comment

good starting point.

@SergeyKanzhelev
Member

This is an experimental feature with a lower bar for approvals. Merging for quicker iterations...

@SergeyKanzhelev SergeyKanzhelev merged commit c6280ed into open-telemetry:master Aug 20, 2020
@SergeyKanzhelev
Member

Naming thing should be addressed separately.

@bogdandrutu
Member

@jajanet @SergeyKanzhelev This PR broke the build, can you fix lint errors:
https://app.circleci.com/pipelines/github/open-telemetry/opentelemetry-specification/2038/workflows/bae573d8-2a3d-447f-98d5-0e28032a5621/jobs/3774

carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this pull request Oct 31, 2024
* Fix links that weren't working

* Start zPages spec with basic details

* Formatting, todo details

* Fix link

* Fix design link

* Fix some typos, add otep link

* OT->OTel, other detail cleanup

* Define zpages deadlock more accuractely

Co-authored-by: Sergey Kanzhelev <S.Kanzhelev@live.com>

* Add suggestions and clarifications

* Fix typos and grammar errors

* Update tracez sentence flow

* Fix grammar issues, mention OTel collector zpages

* Add security concerns, data url endpoints TODO

* Fix typo, uppercasing

* Reclarify TraceConfigz

* Nest zPages.md in trace folder, clarify ui/data separation

* More cleanup and typo fixes

Co-authored-by: Sergey Kanzhelev <S.Kanzhelev@live.com>
Labels
area:sdk Related to the SDK release:after-ga Not required before GA release, and not going to work on before GA spec:trace Related to the specification/trace directory

8 participants