Prometheus receiver does not collect _bucket and _sum when explicitly mentioned in scraping config #36060

Closed
xzizka opened this issue Oct 29, 2024 · 2 comments
Labels
bug (Something isn't working), needs triage (New item requiring triage), receiver/prometheus (Prometheus receiver)

Comments

@xzizka

xzizka commented Oct 29, 2024

Component(s)

receiver/prometheus

What happened?

Description

We want to use OTEL for metrics collection in our Kubernetes environment and then forward the metrics to Prometheus. For this we use the Prometheus receiver and the prometheusremotewrite exporter. We found that the Prometheus receiver is not able to scrape some metrics when they are explicitly mentioned in the scraping config, e.g.:

  • kubelet_pleg_relist_duration_seconds_bucket
  • kubelet_pleg_relist_duration_seconds_sum
  • kubelet_pleg_relist_interval_seconds_bucket
  • kubelet_pleg_relist_interval_seconds_sum

The receiver is able to scrape them when a wildcard regex is used, e.g.:

  • kubelet_pleg_relist_.*
  • kubelet_pleg_relist_interval_seconds_.*
  • kubelet_pleg_relist_duration_seconds_.*

On the other hand, we are able to scrape the following metrics:

  • kubelet_pleg_relist_interval_seconds_count
  • kubelet_pleg_relist_duration_seconds_count

when they are explicitly mentioned in the scraping config.

Prometheus Agent (another tool we compare against) is able to scrape all the mentioned metrics without any issue, both with exact names and with a wildcard config.
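
For comparison, a minimal sketch of the keep rule on the Prometheus Agent side; the surrounding kubelet scrape job is elided here and assumed to mirror the collector config shown below, so only the relabel rule itself reflects our setup:

# Prometheus Agent scrape job (sketch; job, TLS and service-discovery settings assumed to match the collector config below)
metric_relabel_configs:
  - source_labels: [__name__]
    regex: kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_sum
    action: keep

With this rule the agent delivers both series, while the same rule in the Prometheus receiver (see variant 3 in "Additional context" below) delivers neither.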

Steps to Reproduce

First of all... These metrics exist within the cluster (output truncated):

user@testvm:~/otel-logs $ for node in node1 node2 node3 node4 node5 node6
do
    echo "--${node}--"
    kubectl get --raw /api/v1/nodes/${node}/proxy/metrics | grep kubelet_pleg_relist
done

--node1--
# HELP kubelet_pleg_relist_duration_seconds [ALPHA] Duration in seconds for relisting pods in PLEG.
# TYPE kubelet_pleg_relist_duration_seconds histogram
kubelet_pleg_relist_duration_seconds_bucket{le="0.005"} 2.967972e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.01"} 3.018779e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.025"} 3.034967e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.05"} 3.035511e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.1"} 3.03555e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.25"} 3.035554e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.5"} 3.035556e+06
kubelet_pleg_relist_duration_seconds_bucket{le="1"} 3.035556e+06
kubelet_pleg_relist_duration_seconds_bucket{le="2.5"} 3.035556e+06
kubelet_pleg_relist_duration_seconds_bucket{le="5"} 3.035556e+06
kubelet_pleg_relist_duration_seconds_bucket{le="10"} 3.035556e+06
kubelet_pleg_relist_duration_seconds_bucket{le="+Inf"} 3.035556e+06
kubelet_pleg_relist_duration_seconds_sum 6549.903733513051
kubelet_pleg_relist_duration_seconds_count 3.035556e+06
# HELP kubelet_pleg_relist_interval_seconds [ALPHA] Interval in seconds between relisting in PLEG.
# TYPE kubelet_pleg_relist_interval_seconds histogram
kubelet_pleg_relist_interval_seconds_bucket{le="0.005"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.01"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.025"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.05"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.1"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.25"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.5"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="1"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="2.5"} 3.035553e+06
kubelet_pleg_relist_interval_seconds_bucket{le="5"} 3.035553e+06
kubelet_pleg_relist_interval_seconds_bucket{le="10"} 3.035553e+06
kubelet_pleg_relist_interval_seconds_bucket{le="+Inf"} 3.035555e+06
kubelet_pleg_relist_interval_seconds_sum 3.0441046878185053e+06
kubelet_pleg_relist_interval_seconds_count 3.035555e+06
--node2--

When I use the following config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-monitoring-collector-conf
  namespace: otel-system
  labels:
    app: opentelemetry
    component: otel-monitoring-collector-conf
data:
  otel-monitoring-collector-config: |
    exporters:
      prometheusremotewrite:
        endpoint: https://prometheus-dev:28080/api/v1/push
        tls:
          insecure_skip_verify: true
        headers: 
          X-Scope-OrgID: k8s-nprod-otel
        external_labels:
          cluster: "k8s-nprod-2856"
          otel_component: "otel-collector"
      debug/metrics:
        verbosity: detailed
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: integrations/kubernetes/kubelet
            scrape_interval: 15s
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            kubernetes_sd_configs:
                - role: node
            metric_relabel_configs:
                - source_labels: [__name__]
                  regex: kubelet_pleg_relist_duration_seconds_bucket
                  action: keep
                - action: labeldrop
                  regex: container_id|id|image_id|uid
            relabel_configs:
                - replacement: kubernetes.default.svc.cluster.local:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/$${1}/proxy/metrics
                  source_labels:
                    - __meta_kubernetes_node_name
                  target_label: __metrics_path__
            scheme: https
            tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: false
                server_name: kubernetes
    processors:
      batch/metrics:
      memory_limiter/metrics:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 20
    extensions:
      health_check:
        endpoint: ${env:MY_POD_IP}:13133
    service:
      extensions: [health_check]
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [memory_limiter/metrics, batch/metrics]
          exporters: [debug/metrics, prometheusremotewrite]

I don't see any output in the debug log from the collector:

user@testvm:~/otel-logs $ kubectl logs -l component=otel-collector --follow | grep pleg
^C
user@testvm:~/otel-logs $

Also, no metrics are available through Grafana.

When I change the config to a wildcard (truncated):

...
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: integrations/kubernetes/kubelet
            scrape_interval: 15s
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            kubernetes_sd_configs:
                - role: node
            metric_relabel_configs:
                - source_labels: [__name__]
                  regex: kubelet_pleg_relist_.*
                  action: keep
                - action: labeldrop
                  regex: container_id|id|image_id|uid
            relabel_configs:
                - replacement: kubernetes.default.svc.cluster.local:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/$${1}/proxy/metrics
                  source_labels:
                    - __meta_kubernetes_node_name
                  target_label: __metrics_path__
            scheme: https
            tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: false
                server_name: kubernetes
    processors:
...

I can see this output in the collector pod logs:

user@testvm:~/otel-logs $ k logs -l component=otel-collector --follow | grep pleg
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
...

And all the metrics are visible in Grafana.

My conclusion

The following metrics can be scraped directly by name:

  • kubelet_pleg_relist_duration_seconds_count
  • kubelet_pleg_relist_interval_seconds_count

The following metrics are collected only if a wildcard is used:

  • kubelet_pleg_relist_duration_seconds_bucket
  • kubelet_pleg_relist_duration_seconds_sum
  • kubelet_pleg_relist_interval_seconds_bucket
  • kubelet_pleg_relist_interval_seconds_sum

These wildcards work correctly and return all the expected metrics:

  • kubelet_pleg_relist_.*
  • kubelet_pleg_relist_interval_seconds_.*
  • kubelet_pleg_relist_duration_seconds_.*

Expected Result

Be able to collect all the metrics explicitly mentioned in the scraping config.

Actual Result

The Prometheus receiver is able to scrape all expected metrics when using a wildcard regex, but not when the exact name is used.

Collector version

0.112.0-amd

Environment information

Environment

Kubernetes 1.29, 1.30

OpenTelemetry Collector configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-monitoring-collector-conf
  namespace: otel-system
  labels:
    app: opentelemetry
    component: otel-monitoring-collector-conf
data:
  otel-monitoring-collector-config: |
    exporters:
      prometheusremotewrite:
        endpoint: https://prometheus-dev:28080/api/v1/push
        tls:
          insecure_skip_verify: true
        headers: 
          X-Scope-OrgID: k8s-nprod-otel
        external_labels:
          cluster: "k8s-nprod-2856"
          otel_component: "otel-collector"
      debug/metrics:
        verbosity: detailed
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: integrations/kubernetes/kubelet
            scrape_interval: 15s
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            kubernetes_sd_configs:
                - role: node
            metric_relabel_configs:
                - source_labels: [__name__]
                  regex: kubelet_pleg_relist_duration_seconds_bucket
                  action: keep
                - action: labeldrop
                  regex: container_id|id|image_id|uid
            relabel_configs:
                - replacement: kubernetes.default.svc.cluster.local:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/$${1}/proxy/metrics
                  source_labels:
                    - __meta_kubernetes_node_name
                  target_label: __metrics_path__
            scheme: https
            tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: false
                server_name: kubernetes
    processors:
      batch/metrics:
      memory_limiter/metrics:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 20
    extensions:
      health_check:
        endpoint: ${env:MY_POD_IP}:13133
    service:
      extensions: [health_check]
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [memory_limiter/metrics, batch/metrics]
          exporters: [debug/metrics, prometheusremotewrite]

Log output

Log output is pasted in the "Steps to Reproduce" section.

Additional context

Moved from open-telemetry/opentelemetry-collector#11533 to this repo.

I did a few more tests with the collection of these metrics.
The OTEL config is the same as mentioned above; the only change is the regex in the metric scraping configuration:

  1. regex: '(kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count)' - both metrics collected
  2. regex: '(kubelet_pleg_relist_duration_seconds_bucket)' - no metric collected
  3. regex: '(kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_sum)' - no metric collected
  4. regex: '(kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubelet_pleg_relist_duration_seconds_sum)' - all 3 metrics collected
  5. regex: '(kubelet_pleg_relist_duration_seconds_sum)' - no metric collected
  6. regex: '(kubelet_pleg_relist_interval_seconds_bucket|kubelet_pleg_relist_interval_seconds_sum)' - no metric collected
  7. regex: '(kubelet_pleg_relist_interval_seconds_sum|kubelet_pleg_relist_interval_seconds_count)' - all kubelet_pleg_relist_interval_seconds_.* metrics collected (3 metrics, not 2)
  8. regex: '(kubelet_pleg_relist_interval_seconds_sum)' - no metric collected
  9. regex: '(kubelet_pleg_relist_interval_seconds_bucket)' - no metric collected
  10. regex: '(kubelet_pleg_relist_interval_seconds_bucket|kubelet_pleg_relist_interval_seconds_count|kubelet_pleg_relist_interval_seconds_sum)' - all 3 metrics collected

Case 7 is especially interesting: scraping is set for 2 metrics, but 3 are scraped. I tried changing the regex to something different several times (which produced different outputs) and then back to variant 7, and it always led to the same 3 metrics even though scraping was configured for only 2.
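
For reference, this is how variant 4 (which collected all three kubelet_pleg_relist_duration_seconds series) looks when dropped into the metric_relabel_configs block of the config above; only the regex differs from the config already posted:

metric_relabel_configs:
    - source_labels: [__name__]
      # variant 4 from the list above: keep all three histogram series
      regex: '(kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubelet_pleg_relist_duration_seconds_sum)'
      action: keep
    - action: labeldrop
      regex: container_id|id|image_id|uid

Across variants 1-10, the pattern seems to be that the _bucket and _sum series only come through when the _count series is also kept.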

@Aneurysm9, @dashpole, do you have any idea?
Thank you

@xzizka xzizka added bug Something isn't working needs triage New item requiring triage labels Oct 29, 2024
@github-actions github-actions bot added the receiver/prometheus Prometheus receiver label Oct 29, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@xzizka
Author

xzizka commented Oct 29, 2024

The OTEL functionality was explained in this post (#36061 (comment)).
It explains the behaviour mentioned in this issue, so I am closing it with a link to the original response. Thank you, @dashpole, for your answer.
