
feat: add metrics generator for num_calls #286

Merged: 17 commits into main on Nov 29, 2021
Conversation

@kotharironak (Contributor) commented Nov 12, 2021

Description

As part of building the metrics pipeline, and as part of this ticket (hypertrace/hypertrace#294), this PR:

  • adds a metrics generator component for call count
  • extracts the metrics from ingested traces by reading data from raw-service-view-events
  • adds configs for emit and aggregation windowing (see the config sketch after this list)
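For illustration only, a minimal sketch of how windowing configs like these might be read with Typesafe Config; the key names below are hypothetical placeholders, not the actual keys added in this PR:

```java
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
import java.time.Duration;

public class MetricsGeneratorConfigSketch {
  // Hypothetical key names; the real config paths in this PR may differ.
  private static final String AGGREGATION_WINDOW_MS = "metrics.generator.aggregationWindowMs";
  private static final String EMIT_WAIT_TIME_MS = "metrics.generator.emitWaitTimeMs";

  public static void main(String[] args) {
    Config config = ConfigFactory.load();
    // Window over which call counts are aggregated before a point is produced.
    Duration aggregationWindow = Duration.ofMillis(config.getLong(AGGREGATION_WINDOW_MS));
    // Delay before an aggregated point is emitted downstream.
    Duration emitWait = Duration.ofMillis(config.getLong(EMIT_WAIT_TIME_MS));
    System.out.println("aggregation window: " + aggregationWindow + ", emit wait: " + emitWait);
  }
}
```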

Testing

  • added a basic topology test (general shape sketched below)
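For reference, the general shape of such a topology test using Kafka Streams' TopologyTestDriver; the topic name, serdes, and record types below are placeholders and not the actual test in this PR:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.TopologyTestDriver;

class MetricsGeneratorTopologyTestSketch {

  void runOnce(Topology topology) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metrics-generator-test");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:9092");

    try (TopologyTestDriver driver = new TopologyTestDriver(topology, props)) {
      // Pipe a raw-service-view event into the input topic of the topology under test.
      TestInputTopic<String, String> input =
          driver.createInputTopic(
              "raw-service-view-events",
              Serdes.String().serializer(),
              Serdes.String().serializer());
      input.pipeInput("tenant-1", "serialized-view-event");
      // driver.createOutputTopic(...) would then be used to read and assert on
      // the generated metric records.
    }
  }
}
```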

Checklist:

  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • Any dependent changes have been merged and published in downstream modules

@kotharironak (Contributor, Author):

The build fails because of compatibility checks on the new Avro files; I will remove the Avro check.

@kotharironak (Contributor, Author):

@JBAhire Can you help with disabling the Avro compatibility check for this project? See the diff. I guess we will have to merge a separate PR for pull_request_target first to disable that, right?

@kotharironak kotharironak marked this pull request as ready for review November 18, 2021 10:12
@kotharironak kotharironak requested a review from a team November 18, 2021 10:12
@JBAhire (Member) commented Nov 18, 2021

> @JBAhire Can you help with disabling the Avro compatibility check for this project? See the diff. I guess we will have to merge a separate PR for pull_request_target first to disable that, right?

Yup. We will need a separate PR for that one.

@aaron-steinfeld (Contributor):

> @JBAhire Can you help with disabling the Avro compatibility check for this project? See the diff. I guess we will have to merge a separate PR for pull_request_target first to disable that, right?
>
> Yup. We will need a separate PR for that one.

I'd recommend pulling the compatibility checks into a separate job altogether. Require it in the same way, but that way we can override it if needed without code changes - basically what we do with proto.

@aaron-steinfeld (Contributor) left a review comment:

I don't really have the context, but here are a couple of high-level comments:

@@ -0,0 +1,7 @@
@namespace("org.hypertrace.metrics.generator.api")
Contributor:

Why are we still adding Avro messages? I thought we had fully switched to proto for new things, even across Kafka.

Spacing also looks off.

@kotharironak (Contributor, Author) commented Nov 22, 2021:

> Why are we still adding Avro messages?

Are you referring to the whole new pipeline (alerting and metrics), or something else?

I will check with @laxman-traceable if I can directly go with Java-based serialization for the StateStore and get away with Avro messages here.

Contributor:

My understanding - could be wrong - was that we initially used Avro over Kafka because that's what it had support for, and proto everywhere else. Newer versions have equal levels of proto support, so rather than using two different serialization formats, unifying on just proto for all new messages would be more desirable.

@kotharironak (Contributor, Author):

Talked with @laxman-traceable. As per him, Avro goes better with Kafka overall, and, as you said, the newer versions have schema registry support for proto too. But since proto is the main transport layer in this metrics pipeline, he also suggested going with proto for consistency. Changing it to proto.

@kotharironak (Contributor, Author):

Done

protocol MetricIdentityProtocol {
record MetricIdentity {
long timestamp_millis = 0;
union {null, string} metric_key = null;
Contributor:

I don't have much context, but what is a metric key? Where's the tenant ID?

@kotharironak (Contributor, Author):

tenant_id is part of the metric attributes.

@kotharironak (Contributor, Author):

metric_key is a UUID derived from metric_name and the set of the metric's attribute name:value pairs. This is for reducing the serde cost of updating the state store on every input message.
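As an illustration, a minimal sketch of how such a deterministic key could be derived; the helper name and exact key layout are assumptions for the example, not necessarily what this PR implements:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

public final class MetricKeySketch {

  // Builds a deterministic UUID from the metric name plus its attribute
  // name:value pairs; sorting the attributes makes the key order-independent.
  public static UUID metricKey(String metricName, Map<String, String> attributes) {
    StringBuilder builder = new StringBuilder(metricName);
    new TreeMap<>(attributes)
        .forEach((name, value) -> builder.append('|').append(name).append(':').append(value));
    return UUID.nameUUIDFromBytes(builder.toString().getBytes(StandardCharsets.UTF_8));
  }
}
```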

Contributor:

Still trying to understand: would this mean there are no actual attributes available for querying, and that everything needs to be queried by ID only? I couldn't find, for example, all metrics that have attribute A. How would things like filtering work?

@kotharironak (Contributor, Author):

Such global filtering queries are possible in the PromQL language.
{} refers to the filtering expression - https://prometheus.io/docs/prometheus/latest/querying/basics/#instant-vector-selectors

As an example, the query below returns all metrics having the attribute (or label) pod with the value raw-spans-grouper: {pod="raw-spans-grouper"}

A filtering expression with a specific metric_name will filter within the group of those series - metric_name{<label filtering>}.
So, if we are looking for num_calls (a metric) for a given service (an attribute), we can express it as sum(num_calls{service="1234"}).

I had listed out a few mappings here - hypertrace/hypertrace#295 (comment)

Can you give one more example of what you are trying to refer to?

@namespace("org.hypertrace.metrics.generator.api")
protocol MetricProtocol {
record Metric {
union {null, string} name = null;
Contributor:

Is this different than the key? If we're emitting these for every timestamp, wouldn't we want to put as little in the value message as possible? Metadata like name, description, and unit could be stored elsewhere.

Are the attributes holding the numbers? If so, why are they typed string?

@kotharironak (Contributor, Author):

> Is this different than the key?

From this, <metric_name> and the attributes are used to form the key.

So, for a given timestamp window, I store the Metric only once, and it is used when we emit the metric. The rest of the time, for each incoming message, to save serde cost, I use metric_key for aggregating the value in a different store.

> Are the attributes holding the numbers? If so, why are they typed string?

This represents the Prometheus metric name and its labels - https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels

And they are strings.

Contributor:

> So, for a given timestamp window, I store the Metric only once, and it is used when we emit the metric. The rest of the time, for each incoming message, to save serde cost, I use metric_key for aggregating the value in a different store.

Here I meant that we're storing the attributes and name in every stored metric time window, as I understand it. Is that to support querying by attribute and name, not just the ID, as I was asking about in the other comment thread?

> This represents the Prometheus metric name and its labels - https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels

OK, that makes sense, that's what I expected - but where is the actual metric value itself in this message? Or is it elsewhere? The only numeric-typed thing I see is the timestamp.

@kotharironak (Contributor, Author):

As part of this PR, I am aggregating num_calls metrics over a given time window (say 15s). During this aggregation, I am storing the intermediate result in a state store of <MetricIdentity, Value>; while punctuating, I emit this value after converting it to OTLP metrics.
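As a rough illustration of that flow, here is a minimal Kafka Streams punctuator sketch; the store, key and value types, and the toOtlpMetric conversion are placeholders rather than the actual code in this PR:

```java
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

class MetricEmitSketch {
  // Hypothetical store holding the running aggregate per metric identity.
  private KeyValueStore<String, Long> aggregateStore;

  void schedule(ProcessorContext context) {
    // Periodically flush the aggregated values (e.g., every 15s window).
    context.schedule(
        Duration.ofSeconds(15),
        PunctuationType.WALL_CLOCK_TIME,
        timestamp -> {
          try (KeyValueIterator<String, Long> it = aggregateStore.all()) {
            while (it.hasNext()) {
              KeyValue<String, Long> entry = it.next();
              // Convert the aggregated count to an OTLP metric and forward it downstream.
              context.forward(entry.key, toOtlpMetric(entry.key, entry.value, timestamp));
              aggregateStore.delete(entry.key);
            }
          }
        });
  }

  private Object toOtlpMetric(String metricKey, long count, long emitTimestampMillis) {
    // Placeholder for building an OTLP ResourceMetrics data point.
    return count;
  }
}
```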

return new OtlpMetricsSerde.De();
}

public static class Ser implements Serializer<ResourceMetrics> {
Contributor:

Should use real class names if making these public (and make them private if possible)

@kotharironak (Contributor, Author):

will do this.

@kotharironak (Contributor, Author) commented Nov 26, 2021:
Done

@kotharironak (Contributor, Author):

> @JBAhire Can you help with disabling the Avro compatibility check for this project? See the diff. I guess we will have to merge a separate PR for pull_request_target first to disable that, right?
>
> Yup. We will need a separate PR for that one.
>
> I'd recommend pulling the compatibility checks into a separate job altogether. Require it in the same way, but that way we can override it if needed without code changes - basically what we do with proto.

Created this PR - #289 - for this.


@codecov (bot) commented Nov 22, 2021

Codecov Report

Merging #286 (b7b4a84) into main (fcd88a5) will increase coverage by 0.29%.
The diff coverage is 88.30%.


@@             Coverage Diff              @@
##               main     #286      +/-   ##
============================================
+ Coverage     79.58%   79.87%   +0.29%     
- Complexity     1265     1285      +20     
============================================
  Files           112      116       +4     
  Lines          4937     5108     +171     
  Branches        451      458       +7     
============================================
+ Hits           3929     4080     +151     
- Misses          806      820      +14     
- Partials        202      208       +6     
Flag Coverage Δ
unit 79.87% <88.30%> (+0.29%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...hypertrace/metrics/generator/MetricsConstants.java 0.00% <0.00%> (ø)
...hypertrace/metrics/generator/MetricsGenerator.java 74.35% <74.35%> (ø)
...hypertrace/metrics/generator/MetricsProcessor.java 91.30% <91.30%> (ø)
...rtrace/metrics/generator/MetricEmitPunctuator.java 95.16% <95.16%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fcd88a5...b7b4a84.

@kotharironak (Contributor, Author) commented Nov 22, 2021

Will look into the build-and-test-with-mongo and Snyk failures.


attributes.put(SERVICE_ID_ATTR, value.getServiceId());
attributes.put(SERVICE_NAME_ATTR, value.getServiceName());
attributes.put(API_ID, value.getApiId());
attributes.put(API_NAME, value.getApiName());
@kotharironak (Contributor, Author):
Adding status_code and protocol - all the string-related attributes other than IDs.

@kotharironak (Contributor, Author) commented Nov 26, 2021:
Done


import org.apache.kafka.common.serialization.Serializer;
import org.hypertrace.metrics.generator.api.v1.Metric;

public class MetricSerde implements Serde<Metric> {
@kotharironak (Contributor, Author):

For this PR, I have added custom protobuf serdes for Metric, MetricIdentity, and OtlpMetrics. I am facing some issues while using the Kafka Streams protobuf serde - I have logged a ticket: hypertrace/hypertrace#312

Working on this as a follow-up; it will remove these custom classes.
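For context, a minimal sketch of what a hand-rolled protobuf Serde for such a message can look like; the class name is illustrative and error handling is simplified:

```java
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serializer;
import org.hypertrace.metrics.generator.api.v1.Metric;

public class MetricProtoSerdeSketch implements Serde<Metric> {

  @Override
  public Serializer<Metric> serializer() {
    // Protobuf messages serialize directly to their wire format.
    return (topic, metric) -> metric == null ? null : metric.toByteArray();
  }

  @Override
  public Deserializer<Metric> deserializer() {
    return (topic, bytes) -> {
      if (bytes == null) {
        return null;
      }
      try {
        return Metric.parseFrom(bytes);
      } catch (Exception e) {
        throw new RuntimeException("Failed to deserialize Metric", e);
      }
    };
  }
}
```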

@Override
public KeyValue<byte[], ResourceMetrics> transform(String key, RawServiceView value) {
// construct metric attributes & metric
Map<String, String> attributes = new HashMap<>();
Contributor:

As discussed, we are doing application-level aggregation rather than relying on Prometheus's built-in aggregation, since Prometheus polls metrics at a fixed interval instead of consuming them in real time.
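A minimal sketch of that application-level aggregation step, assuming one counter per metric key and a fixed window; all names and the window size are illustrative:

```java
import org.apache.kafka.streams.state.KeyValueStore;

class CallCountAggregatorSketch {
  private static final long WINDOW_MILLIS = 15_000L;
  private final KeyValueStore<String, Long> aggregateStore;

  CallCountAggregatorSketch(KeyValueStore<String, Long> aggregateStore) {
    this.aggregateStore = aggregateStore;
  }

  // Increment the per-window call count for this metric key on every input event.
  void onEvent(String metricKey, long eventTimestampMillis) {
    long windowStart = (eventTimestampMillis / WINDOW_MILLIS) * WINDOW_MILLIS;
    String storeKey = metricKey + "|" + windowStart;
    Long current = aggregateStore.get(storeKey);
    aggregateStore.put(storeKey, current == null ? 1L : current + 1L);
  }
}
```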

@findingrish (Contributor) left a review comment:

Approving, as this is a blocker for subsequent changes; any further changes can be taken up as a follow-up.

@kotharironak (Contributor, Author):

@aaron-steinfeld Please let me know if you have any other comments; I will take care of them in a follow-up.

@kotharironak kotharironak merged commit 503b369 into main Nov 29, 2021
@kotharironak kotharironak deleted the metrics-extractor branch November 29, 2021 07:46
@github-actions (bot)

Unit Test Results

75 files (+1)   75 suites (+1)   1m 9s ⏱️ (+8s)
393 tests (+1)   393 ✔️ (+1)   0 💤 (±0)   0 ❌ (±0)

Results for commit 503b369. Comparison against base commit fcd88a5.
