
Build health metrics dimensions #952

Open
mrh666 opened this issue Sep 23, 2024 · 16 comments
Labels
question Further information is requested

Comments

@mrh666

mrh666 commented Sep 23, 2024

What feature do you want to see added?

@cyrille-leclerc on that page https://plugins.jenkins.io/opentelemetry/ there is a screenshot of kibana https://raw.githubusercontent.com/jenkinsci/opentelemetry-plugin/master/docs/images/kibana_jenkins_overview_dashboard.png with all graphs I need, e.g. job duration, failed steps, long steps, etc. How can we get those metrics exported to Dynatrace?

Upstream changes

No response

Are you interested in contributing this feature?

No response

@mrh666 added the enhancement (New feature or request) label Sep 23, 2024
@cyrille-leclerc
Contributor

cyrille-leclerc commented Sep 24, 2024

Report per pipeline

A solution we are looking at could be to produce a histogram metric per pipeline and per result, inspired by the standardized http.server.request.duration metric.

I was waiting for the OTel CI/CD SIG to standardize such a metric; work is in progress.

⚠️ I'm worried about the cardinality of such a metric, as we could potentially produce 5 × count(pipelines) histogram time series, which is a lot.

@christophe-kamphaus-jemmic have you given thought to such metrics?

The metric could look like

ci.pipeline.run.duration: histogram {
   // Pipeline full name
   // See org.jenkinsci.plugins.workflow.job.WorkflowJob#getFullName()
   ci.pipeline.id="/my-team/my-war/master",
   // see hudson.model.Run#getResult() 
   // SUCCESS, UNSTABLE, FAILURE, NOT_BUILT, ABORTED
   ci.pipeline.run.result="SUCCESS"
}

Report per pipeline step

High-cardinality problems look like an even bigger risk here. I'm wondering if we shouldn't instead solve this with metrics queries on the traces, similar to what TraceQL metrics queries offer.

Controlling cardinality

I'm thinking of helping Jenkins admins control the cardinality of such metrics by enabling allow and deny lists of pipeline names, as we have seen Jenkins instances with thousands of pipelines.

@mrh666 Is this the kind of idea you had in mind?

@mrh666
Author

mrh666 commented Sep 24, 2024

@cyrille-leclerc that's exactly what I have in mind!

I'm worried about the cardinality of such a metric, as we could potentially produce 5 × count(pipelines) histogram time series, which is a lot.

You have reasonable worries about cardinality. In the InfluxDB world it can easily kill DB performance. But just make it optional, something like otel.exporter.otlp.metrics.build_health.enabled.

if we shouldn't instead solve this with metrics queries on the traces, similar to

In the Dynatrace world it's impossible, or close to it. I've dug into such functionality and not achieved any results.

@cyrille-leclerc
Contributor

Thanks @mrh666. Can you please share with us:

  • Total count of pipelines
  • Count of pipelines for which you want performance metrics
  • Could it be possible with allow/deny lists based on regexes to collect metrics only on the pipelines that matter to you?

Same question for build steps.

@mrh666
Author

mrh666 commented Sep 24, 2024

In the current project:
We have 24 pipelines running at the moment.
12 of those require pipeline metrics.

Could it be possible with allow/deny lists based on regexes to collect metrics only on the pipelines that matter to you?

This one is really important!

@cyrille-leclerc
Contributor

cyrille-leclerc commented Oct 1, 2024

Here is a proposal:

  • Allow and deny lists using regexes to specify the job names for which we create a time series, to control cardinality
  • Histogram metric
ci.pipeline.run.duration: unit=second {
   ci.pipeline.id: if (in-allow-list && ! in-deny-list) ?
      hudson.model.Job.getParent().getFullName() :
      "#other#"
   ci.pipeline.run.result: hudson.model.Result
   ci.pipeline.run.completed: hudson.model.Result.isCompleted()

}
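As an illustration, the allow/deny selection sketched above could be implemented with plain java.util.regex. The class and method names here are hypothetical, not the plugin's actual API:

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed cardinality control: a pipeline's full
// name is kept as the ci.pipeline.id attribute value only when it matches the
// allow list and does not match the deny list; everything else is collapsed
// into the single "#other#" time series.
class PipelineIdSelector {
    private final Pattern allowList;
    private final Pattern denyList;

    PipelineIdSelector(String allowRegex, String denyRegex) {
        this.allowList = Pattern.compile(allowRegex);
        this.denyList = Pattern.compile(denyRegex);
    }

    /** Returns the attribute value to record for ci.pipeline.id. */
    String pipelineIdAttribute(String fullName) {
        if (allowList.matcher(fullName).matches() && !denyList.matcher(fullName).matches()) {
            return fullName;
        }
        return "#other#";
    }
}
```

With `new PipelineIdSelector("my-team/.*", ".*test.*")`, the name `my-team/my-war/master` is kept as-is, while anything outside `my-team/` or containing `test` collapses into `#other#`.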


Feedback welcome cc @mrh666

@christophe-kamphaus-jemmic
Contributor

I was waiting for the OTel CI/CD SIG to standardize such a metric

Indeed work in the OTel CI/CD SIG related to metrics is in progress.
We are currently standardizing metrics related to VCS and we plan to follow that up with metrics related to pipelines, queues and agents.
ci.pipeline.run.duration: histogram sounds like a good metric. I will propose it in the SIG.

One issue I can see with using a histogram is that the chosen buckets might not give enough insight to take any action. Some jobs might be of very short duration, while others could take hours or even days to complete.
So how would you define the buckets?
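To make the bucket concern concrete, here is a sketch assuming the metric reused the default explicit bucket boundaries of http.server.request.duration (seconds, topping out at 10s): every run longer than 10 seconds falls into the same overflow bucket, so minute-long and hour-long builds become indistinguishable. The bucketing function below is illustrative, not the SDK's implementation:

```java
// Default explicit bucket boundaries of http.server.request.duration (seconds).
// Any pipeline run longer than 10s lands in the single overflow bucket,
// losing all resolution for typical build durations.
class HistogramBuckets {
    static final double[] DEFAULT_BOUNDARIES =
        {0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10};

    /** Index of the bucket a measurement falls into (length = overflow bucket). */
    static int bucketIndex(double seconds) {
        for (int i = 0; i < DEFAULT_BOUNDARIES.length; i++) {
            if (seconds <= DEFAULT_BOUNDARIES[i]) {
                return i;
            }
        }
        return DEFAULT_BOUNDARIES.length; // overflow bucket
    }
}
```

A 5-minute build and a 2-hour build both land in the overflow bucket, which is why CI-specific boundaries (or exponential histograms) would be needed.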

I had very good success using metrics queries on traces/spans for job duration as well as stage duration, using the steps I introduced in #827 (example use in a pipeline here: #811 (comment)). This allowed me to have very detailed statistics (e.g. average duration per day or per job) and is filterable per job.

I'm thinking of helping Jenkins admins control the cardinality of such metrics by enabling allow and deny lists of pipeline names, as we have seen Jenkins instances with thousands of pipelines.

For sure, cardinality is an issue when the number of time series scales with a dynamic value like the number of jobs managed by Jenkins. It's not as bad as having a separate time series per build, but it still needs to be managed. (The prometheus-plugin has per-build metrics guarded by a checkbox config option.)

Controlling which jobs generate this metric on the Jenkins side is, I think, a very good option.

Alternatively it's also possible to filter later:

  • using metric relabelling rules in Prometheus or a ServiceMonitor
  • with opentelemetry-collector a filterprocessor could be used
    processors:
      filter/ci:
        error_mode: ignore
        metrics:
          metric:
            - 'name == "ci.pipeline.run.duration" and not(IsMatch(attributes["ci.pipeline.id"], "my-otel-pipelines.*"))'

@cyrille-leclerc
Contributor

Cc @miraccan00

@cyrille-leclerc
Contributor

Please use the ci.pipeline.run.duration{ci.pipeline.id="<<pipeline full name>>", ci.pipeline.result="<<SUCCESS, UNSTABLE, FAILURE, NOT_BUILT, ABORTED>>"} histogram metric we have just released.
ℹ Use the otel.instrumentation.jenkins.run.metric.duration.allow_list and otel.instrumentation.jenkins.run.metric.duration.deny_list parameters to specify the pipelines for which you want to capture the run duration; other pipelines will be aggregated in the ci.pipeline.id="#other#" time series.

See documentation https://github.com/jenkinsci/opentelemetry-plugin/blob/main/docs/monitoring-metrics.md#build-duration
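For example, assuming two hypothetical folder names, the configuration could look like:

```properties
# Hypothetical example: capture run duration only for pipelines under two
# folders, excluding anything with "test" in the name; all other pipelines
# are aggregated under ci.pipeline.id="#other#".
otel.instrumentation.jenkins.run.metric.duration.allow_list=folder-a/.*|folder-b/.*
otel.instrumentation.jenkins.run.metric.duration.deny_list=.*test.*
```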

I'm marking your enhancement request as solved. Please open new enhancement requests if needed.

@mrh666
Author

mrh666 commented Nov 28, 2024

@cyrille-leclerc Thank you! I've now been trying to use it with Jenkins. Here is the config line from Jenkins:

Nov 28, 2024 2:40:44 PM FINE io.jenkins.plugins.opentelemetry.jenkins.OpenTelemetryConfigurerComputerListener$OpenTelemetryConfigurerMasterToSlaveCallable

Configure OpenTelemetry SDK with properties: {otel.exporter.otlp.timeout=30000, otel.exporter.otlp.endpoint=https://dynatrace-otel-collector.xxxx.com, otel.exporter.otlp.protocol=http/protobuf, jenkins.url=https://xxxx/, jenkins.version=2.462.1, otel.metrics.exporter=otlp, otel.instrumentation.jenkins.run.metric.duration.allow_list=".*/.*", otel.traces.exporter=otlp, service.instance.id=xxxx, otel.instrumentation.jenkins.remote.span.enabled=true, otel.instrumentation.jenkins.agent.enabled=true, otel.imr.export.interval=30000, otel.instrumentation.jenkins.remoting.enabled=true, otel.exporter.otlp.{signal}.protocol=http/protobuf, otel.exporter.otlp.metrics.temporality.preference=DELTA, otel.java.disabled.resource.providers=io.opentelemetry.instrumentation.resources.ProcessResourceProvider}, resource:{jenkins.opentelemetry.plugin.version=3.1423.v0d1a_2fcd2429, service.name=jenkins, jenkins.computer.name=xxxx, service.namespace=jenkins}

I have changed those parameters many times, but I can't trigger the metric. For example:

otel.instrumentation.jenkins.run.metric.duration.allow_list=".*/.*"
otel.instrumentation.jenkins.run.metric.duration.allow_list=".*"
otel.instrumentation.jenkins.run.metric.duration.allow_list=.*
otel.instrumentation.jenkins.run.metric.duration.allow_list=/.*/

Am I doing something wrong with the regex?
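One possible explanation (an assumption, not confirmed by the logs above): if the surrounding double quotes are not stripped from the property value, they become literal characters in the compiled pattern; and `.*/.*` only matches full names that contain a `/`, which a top-level job's name does not. This can be checked with plain java.util.regex (the job names below are hypothetical):

```java
import java.util.regex.Pattern;

// Checks the attempted allow_list values against hypothetical job full names:
// one inside a folder ("folder/my job") and one top-level ("my job").
class AllowListCheck {
    /** Full-string match, as an allow/deny list would presumably apply it. */
    static boolean matches(String regex, String fullName) {
        return Pattern.matches(regex, fullName);
    }
}
```

With literal quotes, `"\".*/.*\""` matches neither name; `.*/.*` matches only the in-folder name; a bare `.*` matches both.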

And another point: please fix the documentation https://github.com/jenkinsci/opentelemetry-plugin/blob/main/docs/monitoring-metrics.md#build-duration:

You wrote:

Configuration parameters to control the cardinality of the ci.pipeline.id attribute:

    otel.instrumentation.jenkins.run.metric.duration.allow_list: Java regex, default value: $^ (ie match nothing). Example jenkins_folder_a/.*|jenkins_folder_b/.*
    otel.instrumentation.jenkins.run.metric.duration.deny_list: Java regex, default value: $^ (ie match nothing). Example .*test.*

I believe it should be:

Configuration parameters to control the cardinality of the ci.pipeline.id attribute:

    otel.instrumentation.jenkins.run.metric.duration.allow_list: Java regex, default value: ^$ (ie match nothing). Example jenkins_folder_a/.*|jenkins_folder_b/.*
    otel.instrumentation.jenkins.run.metric.duration.deny_list: Java regex, default value: ^$ (ie match nothing). Example .*test.*

@cyrille-leclerc
Contributor

cyrille-leclerc commented Nov 28, 2024

Thanks for testing @mrh666.
Can you please provide a list of example job full display names, including the parent folders, plus the filter you want to apply, so we can test?

For the documentation and ^$ vs $^, please review

https://github.com/jenkinsci/opentelemetry-plugin/pull/993/files#r1863138803
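For what it's worth, in java.util.regex the two orderings behave identically: both `^$` and `$^` are pairs of zero-width anchors, and in an empty input position 0 is simultaneously the start and the end, so both match only the empty string, and therefore match no (non-empty) pipeline name. A quick check:

```java
import java.util.regex.Pattern;

// Demonstrates that "^$" and "$^" are equivalent "match nothing" defaults:
// each matches only the empty string, never a real pipeline name.
class MatchNothing {
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }
}
```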

@mrh666
Author

mrh666 commented Nov 30, 2024

@cyrille-leclerc For example, the simple job: https://xxxxxxx/job/telemetry%20test%20pipe/
And I want metrics from this job only. Something like '^.*telemetry.*$' in the allow_list.

@cyrille-leclerc
Contributor

@mrh666 please just configure:

otel.instrumentation.jenkins.run.metric.duration.allow_list=.*telemetry.*

Don't put ^ or $.
I successfully tested with a pipeline called /telemetry/test pipe

@mrh666
Author

mrh666 commented Dec 5, 2024

@cyrille-leclerc Maybe I'm doing something wrong? Here are the settings:
Image

All metrics are coming to Dynatrace perfectly except ci.pipeline.run.duration:
Image

Versions:
Jenkins 2.462.1
OpenTelemetry API Plugin 1.43.0-38.v1a_9b_53e3f70f
OpenTelemetry Plugin 3.1423.v0d1a_2fcd2429

@cyrille-leclerc
Contributor

Can you verify that you have a histogram metric ci.pipeline.run.duration, at least reporting the attribute ci.pipeline.id="#other#"? I cannot see it in your screenshot.
If you don't see this metric, then the problem is broader than the otel.instrumentation.jenkins.run.metric.duration.allow_list parameter.

@mrh666
Author

mrh666 commented Dec 9, 2024

@cyrille-leclerc
Nope, I can't see the metric ci.pipeline.run.duration in the logs or in the target Dynatrace.
Yes, I can see the attributes you mentioned in the trace:

Dec 09, 2024 10:37:42 AM FINE io.jenkins.plugins.opentelemetry.job.action.AbstractMonitoringAction

Purge span='BUILD telemetry test pipe', spanId=acdda7a6b8db2192, traceId=232f13dd225862520dc9383406c37be1: SpanAndScopes{span=SdkSpan{traceId=232f13dd225862520dc9383406c37be1, spanId=acdda7a6b8db2192, parentSpanContext=ImmutableSpanContext{traceId=232f13dd225862520dc9383406c37be1, spanId=51eff770ee371c5f, traceFlags=01, traceState=ArrayBasedTraceState{entries=[]}, remote=true, valid=true}, name=BUILD telemetry test pipe, kind=SERVER, attributes=AttributesMap{data={type=job, ci.pipeline.run.completed=true, ci.pipeline.name=telemetry test pipe, ci.pipeline.run.committers=[], ci.pipeline.run.cause=[UserIdCause:my@mail], ci.pipeline.run.url=https://xxxxx/job/telemetry%20test%20pipe/197/, ci.pipeline.type=workflow, ci.pipeline.run.result=SUCCESS, ci.pipeline.run.number=197, ci.pipeline.run.durationMillis=1674, ci.pipeline.id=telemetry test pipe}, capacity=128, totalAddedValues=11}, status=ImmutableStatusData{statusCode=OK, description=SUCCESS}, totalRecordedEvents=0, totalRecordedLinks=0, startEpochNanos=1733740660699406857, endEpochNanos=1733740662393067097}, scopes=0, scopeStartThreadName='Executor #-1 for Built-In Node : executing telemetry test pipe #197'}

What could be the cause of it?

@cyrille-leclerc
Contributor

@mrh666 it seems that Dynatrace just introduced support for OTel histogram metrics:
https://www.dynatrace.com/news/blog/opentelemetry-histograms-reveal-patterns-outliers-and-trends/
Could your version precede this improvement?
