Integrate Tracing (derived from OpenTelemetry) #15

Open
CMCDragonkai opened this issue Jul 19, 2022 · 32 comments
Assignees
Labels
development Standard development r&d:polykey:supporting activity Supporting core activity

Comments

@CMCDragonkai
Member

CMCDragonkai commented Jul 19, 2022

Specification

OpenTelemetry is an overly complicated beast. It's far too complex to adopt into a logging system. However, the basic principles of tracing make sense. Here I'm showing how to set one up for comparison testing, so that we can derive a tracing schema and later visualise it ourselves or by passing it into an OTLP-compatible visualiser.

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 14250:14250 \
  -p 14268:14268 \
  -p 14269:14269 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.36

The above command runs Jaeger. Take note of port 4318, which serves the OTLP protocol over HTTP.

Visit localhost:16686 to view the Jaeger UI.

Then any example code, such as https://github.com/open-telemetry/opentelemetry-js/blob/main/examples/basic-tracer-node/index.js, can run and push traces directly to the Docker container.

What is frustrating is:

  1. OpenTelemetry code only exports to stderr as an afterthought; it's not considered first-class usage.
  2. The stderr exporters output via console.log and produce pretty-printed results that are not actual JSON. Thus you cannot just pipe it to a relevant location.
  3. The schema of the span data isn't clear. It seems different parts of the documentation still have old data, or maybe the JS implementation itself hasn't been updated to the new schema.

The plan:

  1. Create your own "span" derived from OpenTelemetry and output it as regular structured JSON
  2. Massage it to be compatible with OpenTelemetry viewers like Jaeger
  3. Use Jaeger's port 4318 to stream the JSON and view data in the interim
  4. Find an easier way to visualise traces, maybe something that can be used in the CLI or in a GUI
  5. For production usage, feed it to any structured log capturer, and then into a viewer that understands trace information
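
As a rough illustration of steps 1 to 3, here is a minimal sketch that converts a plain span object into the OTLP/JSON trace payload shape. `SimpleSpan`, `msToUnixNano` and `toOtlpTraces` are hypothetical names made up for this sketch, and the field layout (`resourceSpans` / `scopeSpans` / `spans`) is what I understand the current OTLP JSON encoding to be; older collectors may expect a different layout, so treat it as an assumption to verify against the Jaeger version in use.

```typescript
// Hypothetical in-memory span shape for this sketch; not an official schema.
interface SimpleSpan {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  parentSpanId?: string;
  name: string;
  startTimeMs: number; // epoch milliseconds
  endTimeMs: number; // epoch milliseconds
}

// OTLP/JSON carries timestamps as decimal nanosecond strings.
// Assumes a positive epoch timestamp in milliseconds.
function msToUnixNano(ms: number): string {
  return `${Math.trunc(ms)}000000`;
}

// Pure conversion: no IO happens here, the caller decides where to send it.
function toOtlpTraces(serviceName: string, spans: SimpleSpan[]) {
  return {
    resourceSpans: [
      {
        resource: {
          attributes: [
            { key: 'service.name', value: { stringValue: serviceName } },
          ],
        },
        scopeSpans: [
          {
            scope: { name: 'manual-sketch' },
            spans: spans.map((s) => ({
              traceId: s.traceId,
              spanId: s.spanId,
              // Only root spans omit the parent reference.
              ...(s.parentSpanId ? { parentSpanId: s.parentSpanId } : {}),
              name: s.name,
              kind: 1, // SPAN_KIND_INTERNAL
              startTimeUnixNano: msToUnixNano(s.startTimeMs),
              endTimeUnixNano: msToUnixNano(s.endTimeMs),
            })),
          },
        ],
      },
    ],
  };
}
```

To view the result in the interim, the payload would be serialised with `JSON.stringify` and sent as an HTTP POST with `Content-Type: application/json` to `http://localhost:4318/v1/traces` (the standard OTLP/HTTP traces path). Whether the all-in-one image above accepts the JSON encoding on that port is something to confirm by trying it.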

Additional context

Tasks

  1. ...
  2. ...
  3. ...
@CMCDragonkai CMCDragonkai added the development Standard development label Jul 19, 2022
@CMCDragonkai
Member Author

It seems a lot of the complexity is due to vendor fragmentation, and they are trying to make everything compatible.

@CMCDragonkai CMCDragonkai mentioned this issue Jul 19, 2022
15 tasks
@CMCDragonkai
Member Author

CMCDragonkai commented Jul 19, 2022

Most tracing tools like https://nodejs.org/api/tracing.html and chrome://tracing expect a finite dataset; that is, a trace is expected to have a beginning and an end. That's why it's always been "request" driven. OpenTelemetry is just deriving stuff that came before, like in https://github.com/gaogaotiantian/viztracer https://github.com/janestreet/magic-trace https://github.com/kunalb/panopticon and more.

I'm interested in more than just request-driven tracing: live infinite traces (call it continuous tracing, which shows finished and live spans at the same time) that are also correlated. I'm guessing we need zoomable levels of detail and the ability to filter out irrelevant information dynamically.

Open telemetry in particular does not appear to emit a span until it is done. I'd imagine knowing when a span started even if it did not end yet would be useful for live continuous tracing.
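
To make that concrete, here is a minimal sketch of a tracer that emits one record when a span starts and another when it ends, so a consumer can tell which spans are still live. Everything here (`SpanEvent`, `createTracer`, `liveSpans`, the counter-based ids) is hypothetical and only for illustration; it is not how OpenTelemetry works.

```typescript
// Hypothetical event records: a span is reported twice, once per edge.
type SpanEvent =
  | { type: 'start'; spanId: string; parentId: string | null; name: string; time: number }
  | { type: 'end'; spanId: string; time: number };

// The sink is supplied by the caller, so the tracer itself performs no IO.
type Sink = (event: SpanEvent) => void;

function createTracer(sink: Sink, now: () => number = Date.now) {
  let counter = 0;
  return {
    // Emits the start record immediately and returns the new span id.
    start(name: string, parentId: string | null = null): string {
      const spanId = `span-${++counter}`;
      sink({ type: 'start', spanId, parentId, name, time: now() });
      return spanId;
    },
    // Emits the end record; the span stays "live" until this is called.
    end(spanId: string): void {
      sink({ type: 'end', spanId, time: now() });
    },
  };
}

// A span is live if a start record has been seen without a matching end.
function liveSpans(events: SpanEvent[]): string[] {
  const live = new Set<string>();
  for (const e of events) {
    if (e.type === 'start') live.add(e.spanId);
    else live.delete(e.spanId);
  }
  return Array.from(live);
}
```

Passing `(e) => console.log(JSON.stringify(e))` as the sink would give one JSON object per line, which can be piped anywhere. Nothing in this shape forces a parent to outlive its children: a parent can emit its end record while a child is still live.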

@CMCDragonkai CMCDragonkai added r&d:polykey:core activity 2 Cross Platform Cryptography for JavaScript Platforms r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices discussion Requires discussion bug Something isn't working design Requires design documentation Improvements or additions to documentation duplicate This issue or pull request already exists enhancement New feature or request invalid This doesn't seem right question Further information is requested research Requires research wontfix This will not be worked on help wanted procedure ops and removed r&d:polykey:core activity 2 Cross Platform Cryptography for JavaScript Platforms r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices labels Jul 23, 2022
@CMCDragonkai CMCDragonkai removed their assignment Sep 1, 2024
@Abby010 Abby010 self-assigned this Feb 16, 2025
@Abby010

Abby010 commented Feb 18, 2025

Why do parent spans finish before child spans? Will this always be the case?

@Abby010

Abby010 commented Feb 18, 2025

In our solution, how should spans be structured in JSON?

@Abby010

Abby010 commented Feb 18, 2025

The tracing goes from top to bottom and represents an 'infinite' live visualisation of the current state of the system. Why do we need an 'infinite' live visualisation?

@Abby010

Abby010 commented Feb 18, 2025

Are we visualising both completed and in-progress spans at the same time? If yes, how?

@Abby010

Abby010 commented Feb 18, 2025

How does forking impact trace performance? Will it slow down the system?

@Abby010

Abby010 commented Feb 18, 2025

If we need zoomable levels of detail to filter out information dynamically, how will we support this functionality?

Member Author

Because that's how we can debug how object contexts exist in real time.

Member Author

Tracing isn't going to be super fast. It's fine, we do this for debug reasons - we can optimise this later.

Member Author

Zoomability is a UI concern. Tracing data is logged in full.

@CMCDragonkai
Member Author

Are we visualising both completed and in-progress spans at the same time? If yes, how?

Let's separate issues for collecting data vs visualising data.

@Abby010

Abby010 commented Feb 18, 2025

Functional Requirements

  1. Generate structured JSON spans - Instead of OpenTelemetry’s pretty-printed stderr output, generate structured JSON logs for better processing.

  2. Ensure compatibility with OpenTelemetry viewers (Jaeger, Zipkin, etc.) - The generated spans should be formatted correctly so they can be visualized in Jaeger or other OTLP-compatible tools.

  3. Stream JSON spans via Jaeger’s 4318 OTLP HTTP port - Send the structured JSON directly to Jaeger to provide immediate visualization of spans.

  4. Allow easy visualisation of traces (CLI or GUI tool) - Provide a real-time visualisation option that allows developers to inspect spans dynamically.

  5. Enable live & historical trace views - Ensure that completed spans are stored and retrievable, so historical traces can be analysed alongside real-time data.

  6. Integrate with log collectors (Fluentd, Loki, Elasticsearch, etc.) - Traces should be stored in a structured log system for long-term observability and debugging.

Non Functional Requirements

  1. Low performance overhead - The tracing system should not introduce significant latency to applications.

  2. Scalability - The system should handle high throughput tracing data without bottlenecks.

  3. Real-time processing - The tracing solution should emit spans immediately when they start, rather than waiting for them to finish.

  4. Reliability - Traces must not be lost, even in case of network failures or system crashes.

@CMCDragonkai
Member Author

I want to avoid any IO in or out of tracing. Our library should be pure data structures first, and then allow generic construction of a span. I don't like open tracing spans, but we can be backwards compatible with them.

@CMCDragonkai
Member Author

See the 3 layer cake concept.

@CMCDragonkai
Member Author

Avoid Jaeger or any of the OT ecosystem. I don't like them they suck.

@CMCDragonkai
Member Author

Btw the in-memory format should just be a POJO that can be converted to JSON.

@CMCDragonkai
Member Author

Create a span structure for beginning and end. Use react-ink to set up a CLI that visualises spans top to bottom. Get a preview of this using asciinema, and post the video here.

Member Author

I want you to try and write a simple library right here with a new PR:

  1. Creation of a span and ending of a span.
  2. Forking spans.
  3. Then, in a separate src/bin directory, create a CLI script in TS that uses react-ink to visualise the spans as vertical lines running from top to bottom. It should auto-scroll downwards every second. We can iterate on this.
  4. Note that react-ink takes over the full screen, so it's a TUI. A CLI app would actually just print one line at a time. We should be able to do this as well, similar to the follow function of tail. Try it.
  5. I would like to see this prototype by the end of the week, so you can demonstrate it at the beginning of the next cycle.
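
For items 1, 2 and 4, a minimal sketch of what the pure data side could look like: spans are plain objects, creation, forking and ending are pure functions, and a formatter produces one line per span for tail-style output. All names (`Span`, `createSpan`, `forkSpan`, `endSpan`, `formatLine`) and the line format are hypothetical, chosen only for this sketch.

```typescript
// Hypothetical POJO span; endTime stays null while the span is live.
interface Span {
  id: string;
  parentId: string | null;
  name: string;
  startTime: number;
  endTime: number | null;
}

// Creation: ids and timestamps are passed in, so there is no hidden IO or clock.
function createSpan(
  id: string,
  name: string,
  startTime: number,
  parentId: string | null = null,
): Span {
  return { id, parentId, name, startTime, endTime: null };
}

// Forking: a child is just a new span that records its parent's id.
function forkSpan(parent: Span, id: string, name: string, startTime: number): Span {
  return createSpan(id, name, startTime, parent.id);
}

// Ending returns a new object and leaves the original untouched.
function endSpan(span: Span, endTime: number): Span {
  return { ...span, endTime };
}

// One line per span, suitable for printing in a follow-style CLI.
function formatLine(span: Span): string {
  const state =
    span.endTime === null ? 'live' : `done in ${span.endTime - span.startTime}ms`;
  const parent = span.parentId === null ? '' : ` parent=${span.parentId}`;
  return `${span.startTime} ${span.id} ${span.name}${parent} [${state}]`;
}
```

Because a `Span` holds only strings, numbers and null, `JSON.stringify(span)` round-trips it without any custom serialisation. The TUI and the line-by-line CLI can then both be thin layers over the same data.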

Member Author

@abhishek.mehta you need to link up your github account, at the moment assignments aren't aligned between github and linear.

@Abby010

Abby010 commented Feb 19, 2025

I have already started working on this; however, as I have reached my 24 hour limit for this week, I will be able to continue next week. Thanks

@valyala

valyala commented Feb 20, 2025

Integrate with log collectors (Fluentd, Loki, Elasticsearch, etc.) Traces should be stored in a structured log system for long-term observability and debugging.

Consider also using VictoriaLogs. It is easier to set up and operate than Loki and Elasticsearch, and it usually uses less RAM, CPU and disk space compared to them.

@CMCDragonkai
Member Author

Cool @valyala but this issue is more about specific in-app tracing that shouldn't be tied to any cloud service. We want to separate collection from visualisation from storage.

@CMCDragonkai
Member Author

Most tracing tools like https://nodejs.org/api/tracing.html and chrome://tracing expect a finite dataset; that is, a trace is expected to have a beginning and an end. That's why it's always been "request" driven. OpenTelemetry is just deriving stuff that came before, like in https://github.com/gaogaotiantian/viztracer https://github.com/janestreet/magic-trace https://github.com/kunalb/panopticon and more.

I'm interested in more than just request-driven tracing: live infinite traces (call it continuous tracing, which shows finished and live spans at the same time) that are also correlated. I'm guessing we need zoomable levels of detail and the ability to filter out irrelevant information dynamically.

Open telemetry in particular does not appear to emit a span until it is done. I'd imagine knowing when a span started even if it did not end yet would be useful for live continuous tracing.

@Abby010 the last paragraph is key.

@Abby010

Abby010 commented Feb 23, 2025

Preview of the CLI using react-ink
https://asciinema.org/a/cIfBMyC4ENq1UoQ6z2kFRZqwF

Member Author

What's the status?

@Abby010

Abby010 commented Mar 3, 2025

Status Update

Current Progress:

  • We now have a working visualization (Attaching terminal recordings for reference at the end).
  • The core library has been implemented, and we are now focusing on improving visualization.

Research Done:

✔ Analyzed git log --graph & tree command for box-drawing character alignment.
✔ Explored UTF-8 box characters (│ ├ └ ─) for structured branching.
✔ Reviewed time-based vs. logical event-based sampling logic for implementing switching.
✔ Studied grid-based painting algorithms to properly align and render spans in a structured format.

Next Steps:

  • Implement box-drawing characters for structured TUI visualization (inspired by git log --graph).
  • Add --sample logical vs. --sample 1s switching for time-based vs. logical event ordering.
  • Test and refine span rendering for readability, hierarchy correctness, and terminal compatibility.
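
One possible reading of the `--sample` switch, as a sketch only: `logical` gives every event its own row, while a duration such as `1s` groups events into fixed-width time buckets and keeps empty buckets so that vertical distance stays proportional to elapsed time. `TimedEvent`, `SampleMode`, `parseSampleMode` and `sampleRows` are hypothetical names, and the accepted duration syntax (`<n>s` or `<n>ms`) is an assumption.

```typescript
interface TimedEvent {
  time: number; // milliseconds
  label: string;
}

type SampleMode = { kind: 'logical' } | { kind: 'time'; intervalMs: number };

// Parses the flag value: "logical", or a duration such as "1s" or "250ms".
function parseSampleMode(value: string): SampleMode {
  if (value === 'logical') return { kind: 'logical' };
  const match = /^(\d+)(ms|s)$/.exec(value);
  if (match === null) throw new Error(`Unsupported --sample value: ${value}`);
  const amount = Number(match[1]);
  if (amount <= 0) throw new Error('Sample interval must be positive');
  return { kind: 'time', intervalMs: match[2] === 's' ? amount * 1000 : amount };
}

// Returns the rows to draw; each row lists the labels of the events in it.
function sampleRows(events: TimedEvent[], mode: SampleMode): string[][] {
  const sorted = events.slice().sort((a, b) => a.time - b.time);
  // Logical ordering: one row per event, regardless of how far apart they are.
  if (mode.kind === 'logical') return sorted.map((e) => [e.label]);
  if (sorted.length === 0) return [];
  // Time ordering: one row per interval, including intervals with no events.
  const first = sorted[0].time;
  const last = sorted[sorted.length - 1].time;
  const rowCount = Math.floor((last - first) / mode.intervalMs) + 1;
  const rows: string[][] = [];
  for (let i = 0; i < rowCount; i++) rows.push([]);
  for (const e of sorted) {
    rows[Math.floor((e.time - first) / mode.intervalMs)].push(e.label);
  }
  return rows;
}
```

With events at 0 ms, 100 ms and 2500 ms, logical sampling yields three rows of one event each, while `1s` sampling yields three rows where the first holds two events and the second is empty.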

Timeline:

🎯 Targeting completion by the next sprint meeting - 10th March, 2025

Terminal 1: React-Ink Visualization

  • Preview (https://asciinema.org/a/W1yuT5ZngCE8AFkG1VG9yXfVX) - start from the 10th Second
  • Runs cli.tsx, which uses React-Ink to display spans in a tree-like, real-time interface.
  • Shows each span (e.g., User Request, Order Processing, etc.) as a vertical hierarchy, updating every second with any new or completed spans.

Terminal 2: Simple Tail-Style Output

  • Preview (https://asciinema.org/a/dqgzPEHVjRvERp44RnA3F00kp) - start from the 20th second
  • Runs simple-cli.tsx, which prints raw JSON logs of the spans, similar to tail -f on a log file.
  • Displays span data in an array (one array entry per span) and updates every second to reflect newly created or completed spans.

Terminal 3: Test Script (asciinemaTest.ts)

  • Preview (https://asciinema.org/a/g6um0fQMdHV3co6ThtviRh6oN)
  • Generates spans by calling logger.info(), which under the hood calls openSpan and closeSpan.
  • Simulates real operations like "User Request", "Order Processing", and "Payment Processing", each with delayed completions.
  • Provides the tracing data that Terminals 1 and 2 observe in real-time.

Member Author

@abhishek.mehta you should start to write out the tasks in this issue. Plus your progress should be in the associated feature-branch PR.

Member Author

BTW your viz shows you using ts-node. We already moved away from that; we use tsx. Have a look at the ESM-migrated repos and start writing your scripts following how we write ESM code and scripts/executables. See benches as an example.

Member Author

image.png

Any forking should have a \ to fork out if you're using pure ASCII.
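
A rough sketch of the idea in pure ASCII: each live span owns a column drawn with `|`, and when a span forks, a connector row with `\` is printed just before the child's column appears. `GraphEvent` and `renderAscii` are hypothetical names. This is deliberately simplified: the `\` is always drawn directly to the left of the child's column, so it only lines up with the parent when the child lands in the column immediately to the parent's right.

```typescript
type GraphEvent =
  | { type: 'start'; spanId: string; parentId: string | null }
  | { type: 'end'; spanId: string };

function renderAscii(events: GraphEvent[]): string[] {
  // lanes[i] holds the id of the span drawn in column i, or null if free.
  const lanes: (string | null)[] = [];
  const lines: string[] = [];
  // Draws the current lanes as "|" columns separated by single spaces.
  const drawRow = (): string =>
    lanes.map((id) => (id === null ? ' ' : '|')).join(' ').replace(/\s+$/, '');

  for (const e of events) {
    if (e.type === 'start') {
      // Reuse the first free lane, or open a new one on the right.
      let lane = lanes.indexOf(null);
      if (lane === -1) lane = lanes.push(null) - 1;
      if (e.parentId !== null && lanes.indexOf(e.parentId) !== -1 && lane > 0) {
        // Connector row: existing lanes plus a "\" leading into the new lane.
        const chars = lanes.map((id) => (id === null ? ' ' : '|')).join(' ').split('');
        chars[lane * 2 - 1] = '\\';
        lines.push(chars.join('').replace(/\s+$/, ''));
      }
      lanes[lane] = e.spanId;
      lines.push(drawRow());
    } else {
      // Free the lane, then draw whatever is still live.
      const lane = lanes.indexOf(e.spanId);
      if (lane !== -1) lanes[lane] = null;
      const row = drawRow();
      if (row !== '') lines.push(row);
    }
  }
  return lines;
}
```

For a root span that forks one child and then ends before the child does, this produces four rows: `|`, then `|\`, then `| |`, then a row with only the child's column left.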


3 participants