Productionize Streaming Jobs for Service Dependencies #4590

yurishkuro · 2023-07-18T19:47:42Z

Currently we have two analytics solutions for generating service maps:

Jaeger Analytics Flink
- Real time streaming, requires Kafka.
- More feature rich, includes code for both 1-hop and transitive dependency graphs -- https://www.jaegertracing.io/docs/1.47/features/#topology-graphs
- Aggregates data for a given time window (originally at Uber - 15min) and writes a summary snapshot to storage
- Not easy deployment solution is provided in the repository.
Spark Dependencies
- Batch job that reads all data for a period of time, aggregates, and writes a summary snapshot to storage.
- Does not require Kafka.
- Theoretically can be run as frequently as 15min to produce similar results as Flink jobs above, but the implementation for Cassandra may need to be tweaked for that.
- Does not support transitive dependency graphs.

Objectives:

Ideally we want a single code base that supports both types of service dependencies
The solution needs to be documented, packaged (e.g. published containers) and easy to deploy (e.g. with docker compose or k8s operator)
Supporting both batch (goes directly against span storage) and streaming (reads from Kafka) is nice to have

tronda · 2023-07-19T20:23:40Z

Would one possible implementation to be to use ServiceGraphConnector to create service dependencies or is it not suitable for Jaeger's dependencies diagrams?

mohamedawnallah · 2023-07-22T18:51:44Z

Expression of Interest in this Mentorship Project - Productionize Streaming Jobs for Service Dependencies

Hello everyone,

I am genuinely interested in participating in this project as part of the LFX mentorship program in Q3. I have a strong understanding of the Distributed Tracing domain, having read the entire book Mastering Distributed Tracing.

Additionally, I have relevant experience that I believe aligns well with the objectives of the project.

Specifically, I have installed Jaeger in a production environment using the Kubernetes Operator. Additionally, I have configured Spark jobs to detect one-hop service dependencies in a simple instrumented application, comprising two services. One of these services fetches the IP address from a remote API endpoint, while the other formats the data.

My Instrumented Application Tracing Architecture

My Instrumented Application Architecture

My Instrumented Application in Production Cluster

Questions

I'm interested in the possibility of customizing this project to support batch and streaming processing for handling various service dependencies. However, I have a question: Will this new solution replace the existing Jaeger Analytics Flink and Spark Dependencies, or will it work alongside them?
During my exploration, I came across the jaeger-analytics-java repository. I'm curious to know how it fits into the overall project idea and if it brings additional value to the initiative.

Follow-up research resources

As I prepare to contribute, I would greatly appreciate it if you could recommend any additional resources or documentation to help me better understand this project and its specific requirements.

Looking forward to participating in this exciting endeavor!

yurishkuro · 2023-07-22T21:54:46Z

@mohamedawnallah I think it's worth looking into jaeger-analytics-java as well and deciding how it fits or overlaps with the rest. Ideally I would like to see a single repo / single trace analytics library that can be used with different streaming solutions.

mohamedawnallah · 2023-07-26T11:42:26Z

I wanted to share my progress so far on this issue. I have gained a clear understanding of the jaeger-analytics-java repository and its role in this project context. It serves as a Trace DSL (Domain Specific Language) metrics analytics and I've executed it on Jupyter Notebooks to comprehend its functionality, referring to the helpful Jaeger tracing article on Medium.

Additionally, I have introduced two new analytics metrics for the example hotrod application:

Calculated the Average Duration of Traces
Identified the Most Common Error Types

These metrics have been essential in understanding the Trace DSL API and the implementation of the Gremlin Query/Traversal Language from the Apache Tinkerpop Project.

Furthermore, I have observed that jaeger-analytics-java provides a metric for Service's direct downstream dependencies. To obtain this metric, we need to run the corresponding job that specifically supports 1-hop service dependencies.

mohamedawnallah · 2023-07-26T11:47:34Z

I'm currently considering an implementation approach for this project. One idea is to enhance the Jaeger Analytics Flink repository's deployability, making it easier to set up. By pursuing this approach, we can meet at least the bare minimum requirements for this project to support both service dependencies (1-hop and transitive dependencies) while ensuring a straightforward deployment process.

@yurishkuro, I'd love to hear your thoughts on this!

yurishkuro · 2023-07-27T02:26:45Z

What I am curious about is whether it's possible to consolidate streaming business logic into a library L so that the same library could be used with multiple streaming runtimes, e.g.

Flink source/runtime -> L -> Flink sink
Spark source/runtime  -> L -> Spark sink

Few years ago it wasn't possible because Spark and Flink used different APIs to describe the transformation flows. But since Java Streams were introduced, I was under impression that the UDFs could be expressed in Java Streams and work for both. This is just my assumption, would be good to confirm.

The reason why I think it's useful to have this reusability is because supporting Spark allows offline batch processing, which may be a useful feature for some, not to mention that some organizations are running only Spark and not Flink.

mohamedawnallah · 2023-07-27T18:55:41Z

@yurishkuro I have recently explored the idea of consolidating streaming business logic into a library to make it compatible with multiple streaming runtimes, such as Apache Flink and Apache Spark.

While the Java Stream API might initially seem like a good fit, I discovered that it is designed for processing data within a single JVM, making it suitable for in-memory processing on a single machine. The distributed version of Java Stream API is DStream. For more in-depth information, there is a paper on DStream from the University of York that delves into its specifics and also discusses the issues with Java Stream API: Dstream Paper.

The DStream (Discretized Stream) is employed as a low-level API design in Apache Spark for its streaming capabilities. However, Dstream is not supported in Apache Flink, and it works through microbatching, which doesn't qualify as a true streaming framework. Though Spark is now experimenting with Continuous Streaming Processing, it's still not mature as the streaming capabilities in Apache Flink.

In contrast, Apache Flink simplifies the process with a unified DataStream API, which can handle both batch and streaming processing modes without the need to rewrite code. This feature makes Flink a more flexible choice, especially for organizations that wanna use a single data processing platform with the same API for both batch and stream processing.

In conclusion, while Java Streams might be not suitable for the desired cross-runtime compatibility, Apache Flink's DataStream API offers a promising solution for building reusable streaming and batching business logic that can be deployed seamlessly.

@yurishkuro I also would love to hear your thoughts on this!

mohamedawnallah · 2023-08-21T07:41:39Z

Hey @yurishkuro I'd still like to work on this issue outside the official LFX mentorship. Any thoughts?

yurishkuro · 2023-08-21T15:07:05Z

@mohamedawnallah most of our code is already written for Flink, so it's fine to keep it and package for prod deployment.

mohamedawnallah · 2023-08-21T16:23:11Z

Great so by packaging Jaeger Analytics Flink for production means:

Package both the 1-hop service dependencies (Dependencies Job) and the transitive dependencies (Deep Dependencies Job) into Docker containers using a Docker compose file.
Inject Data Sink Dependencies i.e. Apache Cassandra configurations as command line arguments while running the docker-compose file.
Document how to publish Jaeger Analytics Flink in Production and their relevant containers.

@yurishkuro I'd love to hear your thoughts on this and if there is anything I'm missing

yurishkuro · 2023-08-21T17:39:34Z

Yes, plus (4) set up CI integration tests to validate that those packages are operational.

But on (1), the docker-compose is not the "production" packaging, usually it's just an example & integration test, while the actual packaging is just the runnable Docker images. Another option is to extend the K8S Operator to support deployment of these images too (when used with Kafka, of course).

yurishkuro · 2023-08-21T17:40:43Z

on (2), it would be good to support other backends too, not just Cassandra (at minimum ES / OS).

mohamedawnallah · 2023-08-25T02:13:39Z

Thanks @yurishkuro for your additions. I know "ES" stands for ElasticSearch but What does "OS" stand for in the storage?

yurishkuro · 2023-08-25T03:26:23Z

OpenSearch

mohamedawnallah · 2023-08-25T04:00:43Z

Okay, I'm gonna start working on the issue but I'd like to know if you've any suggestions about communication while working on this issue Also the repository dedicated to this issue is Jaeger Analytics Flink?

yurishkuro · 2023-08-25T04:38:54Z

suggestions about communication while working on this issue

I would recommend creating a proposal / plan of what you plan to do and how. This is not a 1-day project, so the plan should contain multiple milestones. We can copy them as a checklist into the ticket description and tick off as each milestone is reached. This would provide good visibility on the progress.

mohamedawnallah · 2023-08-25T05:14:11Z

Sounds great!! I'm gonna send a proposal soon of what I plan to do and how regards Productionize Streaming Jobs for Service Dependencies project. Would I send it on Slack private messaging?

yurishkuro added the enhancement label Jul 18, 2023

yurishkuro mentioned this issue Jul 18, 2023

Jaeger's LFX Mentorship Sep-Nov-2023 Projects #4457

Closed

yurishkuro added mentorship help wanted Features that maintainers are willing to accept but do not have cycles to implement labels Jul 18, 2023

yurishkuro removed the mentorship label Aug 3, 2023

yurishkuro added the possible mentorship Potential mentorship project label Dec 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Productionize Streaming Jobs for Service Dependencies #4590

Productionize Streaming Jobs for Service Dependencies #4590

yurishkuro commented Jul 18, 2023 •

edited

Loading

tronda commented Jul 19, 2023 •

edited by yurishkuro

Loading

mohamedawnallah commented Jul 22, 2023 •

edited

Loading

yurishkuro commented Jul 22, 2023

mohamedawnallah commented Jul 26, 2023 •

edited

Loading

mohamedawnallah commented Jul 26, 2023 •

edited

Loading

yurishkuro commented Jul 27, 2023

mohamedawnallah commented Jul 27, 2023 •

edited

Loading

mohamedawnallah commented Aug 21, 2023

yurishkuro commented Aug 21, 2023

mohamedawnallah commented Aug 21, 2023 •

edited

Loading

yurishkuro commented Aug 21, 2023

yurishkuro commented Aug 21, 2023

mohamedawnallah commented Aug 25, 2023

yurishkuro commented Aug 25, 2023

mohamedawnallah commented Aug 25, 2023

yurishkuro commented Aug 25, 2023 •

edited

Loading

mohamedawnallah commented Aug 25, 2023 •

edited

Loading

Productionize Streaming Jobs for Service Dependencies #4590

Productionize Streaming Jobs for Service Dependencies #4590

Comments

yurishkuro commented Jul 18, 2023 • edited Loading

tronda commented Jul 19, 2023 • edited by yurishkuro Loading

mohamedawnallah commented Jul 22, 2023 • edited Loading

My Instrumented Application Tracing Architecture

My Instrumented Application in Production Cluster

Questions

Follow-up research resources

yurishkuro commented Jul 22, 2023

mohamedawnallah commented Jul 26, 2023 • edited Loading

mohamedawnallah commented Jul 26, 2023 • edited Loading

yurishkuro commented Jul 27, 2023

mohamedawnallah commented Jul 27, 2023 • edited Loading

mohamedawnallah commented Aug 21, 2023

yurishkuro commented Aug 21, 2023

mohamedawnallah commented Aug 21, 2023 • edited Loading

yurishkuro commented Aug 21, 2023

yurishkuro commented Aug 21, 2023

mohamedawnallah commented Aug 25, 2023

yurishkuro commented Aug 25, 2023

mohamedawnallah commented Aug 25, 2023

yurishkuro commented Aug 25, 2023 • edited Loading

mohamedawnallah commented Aug 25, 2023 • edited Loading

yurishkuro commented Jul 18, 2023 •

edited

Loading

tronda commented Jul 19, 2023 •

edited by yurishkuro

Loading

mohamedawnallah commented Jul 22, 2023 •

edited

Loading

mohamedawnallah commented Jul 26, 2023 •

edited

Loading

mohamedawnallah commented Jul 26, 2023 •

edited

Loading

mohamedawnallah commented Jul 27, 2023 •

edited

Loading

mohamedawnallah commented Aug 21, 2023 •

edited

Loading

yurishkuro commented Aug 25, 2023 •

edited

Loading

mohamedawnallah commented Aug 25, 2023 •

edited

Loading