Prometheus receiver does not collect _bucket and _sum when explicitly mentioned in scraping config #36060

Closed
xzizka opened this issue Oct 29, 2024 · 2 comments
Labels
bug (Something isn't working), needs triage (New item requiring triage), receiver/prometheus (Prometheus receiver)

Comments

@xzizka

xzizka commented Oct 29, 2024

Component(s)

receiver/prometheus

What happened?

Description

We want to use OTEL for metrics collection in our Kubernetes environment and then forward the metrics to Prometheus. For this we use the Prometheus receiver and the prometheusremotewrite exporter. We found that the Prometheus receiver is not able to scrape some metrics when they are explicitly mentioned in the scraping config, e.g.:

  • kubelet_pleg_relist_duration_seconds_bucket
  • kubelet_pleg_relist_duration_seconds_sum
  • kubelet_pleg_relist_interval_seconds_bucket
  • kubelet_pleg_relist_interval_seconds_sum

The receiver is able to scrape them when a wildcard regex is used, e.g.:

  • kubelet_pleg_relist_.*
  • kubelet_pleg_relist_interval_seconds_.*
  • kubelet_pleg_relist_duration_seconds_.*

On the other hand, we are able to scrape the following metrics:

  • kubelet_pleg_relist_interval_seconds_count
  • kubelet_pleg_relist_duration_seconds_count

when they are explicitly mentioned in the scraping config.

Prometheus Agent (another tool we compare against) is able to scrape all the mentioned metrics without any issue, both with exact names and with a wildcard config.
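
For comparison, a minimal sketch of the keep rule on the Prometheus Agent side; the surrounding kubelet scrape job is elided here and assumed to mirror the collector config shown below, so only the relabel rule itself reflects our setup:

# Prometheus Agent scrape job (sketch; job, TLS and service-discovery settings assumed to match the collector config below)
metric_relabel_configs:
  - source_labels: [__name__]
    regex: kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_sum
    action: keep

With this rule the agent delivers both series, while the same rule in the Prometheus receiver (see variant 3 in "Additional context" below) delivers neither.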

Steps to Reproduce

First of all... These metrics exist within the cluster (output truncated):

user@testvm:~/otel-logs $ for node in node1 node2 node3 node4 node5 node6
do
    echo "--${node}--"
    kubectl get --raw /api/v1/nodes/${node}/proxy/metrics | grep kubelet_pleg_relist
done

--node1--
# HELP kubelet_pleg_relist_duration_seconds [ALPHA] Duration in seconds for relisting pods in PLEG.
# TYPE kubelet_pleg_relist_duration_seconds histogram
kubelet_pleg_relist_duration_seconds_bucket{le="0.005"} 2.967972e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.01"} 3.018779e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.025"} 3.034967e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.05"} 3.035511e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.1"} 3.03555e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.25"} 3.035554e+06
kubelet_pleg_relist_duration_seconds_bucket{le="0.5"} 3.035556e+06
kubelet_pleg_relist_duration_seconds_bucket{le="1"} 3.035556e+06
kubelet_pleg_relist_duration_seconds_bucket{le="2.5"} 3.035556e+06
kubelet_pleg_relist_duration_seconds_bucket{le="5"} 3.035556e+06
kubelet_pleg_relist_duration_seconds_bucket{le="10"} 3.035556e+06
kubelet_pleg_relist_duration_seconds_bucket{le="+Inf"} 3.035556e+06
kubelet_pleg_relist_duration_seconds_sum 6549.903733513051
kubelet_pleg_relist_duration_seconds_count 3.035556e+06
# HELP kubelet_pleg_relist_interval_seconds [ALPHA] Interval in seconds between relisting in PLEG.
# TYPE kubelet_pleg_relist_interval_seconds histogram
kubelet_pleg_relist_interval_seconds_bucket{le="0.005"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.01"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.025"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.05"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.1"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.25"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="0.5"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="1"} 0
kubelet_pleg_relist_interval_seconds_bucket{le="2.5"} 3.035553e+06
kubelet_pleg_relist_interval_seconds_bucket{le="5"} 3.035553e+06
kubelet_pleg_relist_interval_seconds_bucket{le="10"} 3.035553e+06
kubelet_pleg_relist_interval_seconds_bucket{le="+Inf"} 3.035555e+06
kubelet_pleg_relist_interval_seconds_sum 3.0441046878185053e+06
kubelet_pleg_relist_interval_seconds_count 3.035555e+06
--node2--

When I use the following config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-monitoring-collector-conf
  namespace: otel-system
  labels:
    app: opentelemetry
    component: otel-monitoring-collector-conf
data:
  otel-monitoring-collector-config: |
    exporters:
      prometheusremotewrite:
        endpoint: https://prometheus-dev:28080/api/v1/push
        tls:
          insecure_skip_verify: true
        headers: 
          X-Scope-OrgID: k8s-nprod-otel
        external_labels:
          cluster: "k8s-nprod-2856"
          otel_component: "otel-collector"
      debug/metrics:
        verbosity: detailed
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: integrations/kubernetes/kubelet
            scrape_interval: 15s
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            kubernetes_sd_configs:
                - role: node
            metric_relabel_configs:
                - source_labels: [__name__]
                  regex: kubelet_pleg_relist_duration_seconds_bucket
                  action: keep
                - action: labeldrop
                  regex: container_id|id|image_id|uid
            relabel_configs:
                - replacement: kubernetes.default.svc.cluster.local:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/$${1}/proxy/metrics
                  source_labels:
                    - __meta_kubernetes_node_name
                  target_label: __metrics_path__
            scheme: https
            tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: false
                server_name: kubernetes
    processors:
      batch/metrics:
      memory_limiter/metrics:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 20
    extensions:
      health_check:
        endpoint: ${env:MY_POD_IP}:13133
    service:
      extensions: [health_check]
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [memory_limiter/metrics, batch/metrics]
          exporters: [debug/metrics, prometheusremotewrite]

I don't see any output in the debug log from the collector:

user@testvm:~/otel-logs $ kubectl logs -l component=otel-collector --follow | grep pleg
^C
user@testvm:~/otel-logs $

Also, no metrics are available through Grafana.

When I change the config to a wildcard (truncated):

...
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: integrations/kubernetes/kubelet
            scrape_interval: 15s
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            kubernetes_sd_configs:
                - role: node
            metric_relabel_configs:
                - source_labels: [__name__]
                  regex: kubelet_pleg_relist_.*
                  action: keep
                - action: labeldrop
                  regex: container_id|id|image_id|uid
            relabel_configs:
                - replacement: kubernetes.default.svc.cluster.local:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/$${1}/proxy/metrics
                  source_labels:
                    - __meta_kubernetes_node_name
                  target_label: __metrics_path__
            scheme: https
            tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: false
                server_name: kubernetes
    processors:
...

I can see this output in the collector pod logs:

user@testvm:~/otel-logs $ k logs -l component=otel-collector --follow | grep pleg
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
     -> Name: kubelet_pleg_relist_duration_seconds
     -> Name: kubelet_pleg_relist_interval_seconds
...

And all the metrics are visible in Grafana.

My conclusion

The following metrics can be scraped directly by name:

  • kubelet_pleg_relist_duration_seconds_count
  • kubelet_pleg_relist_interval_seconds_count

The following metrics are collected only if a wildcard is used:

  • kubelet_pleg_relist_duration_seconds_bucket
  • kubelet_pleg_relist_duration_seconds_sum
  • kubelet_pleg_relist_interval_seconds_bucket
  • kubelet_pleg_relist_interval_seconds_sum

These wildcards work correctly and return all the expected metrics:

  • kubelet_pleg_relist_.*
  • kubelet_pleg_relist_interval_seconds_.*
  • kubelet_pleg_relist_duration_seconds_.*

Expected Result

Be able to collect all the metrics explicitly mentioned in the scraping config.

Actual Result

The Prometheus receiver is able to scrape all expected metrics when using a wildcard regex, but not when the exact name is used.

Collector version

0.112.0-amd

Environment information

Environment

Kubernetes 1.29, 1.30

OpenTelemetry Collector configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-monitoring-collector-conf
  namespace: otel-system
  labels:
    app: opentelemetry
    component: otel-monitoring-collector-conf
data:
  otel-monitoring-collector-config: |
    exporters:
      prometheusremotewrite:
        endpoint: https://prometheus-dev:28080/api/v1/push
        tls:
          insecure_skip_verify: true
        headers: 
          X-Scope-OrgID: k8s-nprod-otel
        external_labels:
          cluster: "k8s-nprod-2856"
          otel_component: "otel-collector"
      debug/metrics:
        verbosity: detailed
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: integrations/kubernetes/kubelet
            scrape_interval: 15s
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            kubernetes_sd_configs:
                - role: node
            metric_relabel_configs:
                - source_labels: [__name__]
                  regex: kubelet_pleg_relist_duration_seconds_bucket
                  action: keep
                - action: labeldrop
                  regex: container_id|id|image_id|uid
            relabel_configs:
                - replacement: kubernetes.default.svc.cluster.local:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/$${1}/proxy/metrics
                  source_labels:
                    - __meta_kubernetes_node_name
                  target_label: __metrics_path__
            scheme: https
            tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: false
                server_name: kubernetes
    processors:
      batch/metrics:
      memory_limiter/metrics:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 20
    extensions:
      health_check:
        endpoint: ${env:MY_POD_IP}:13133
    service:
      extensions: [health_check]
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [memory_limiter/metrics, batch/metrics]
          exporters: [debug/metrics, prometheusremotewrite]

Log output

Log output is pasted in the "Steps to Reproduce" section.

Additional context

Moved from open-telemetry/opentelemetry-collector#11533 to this repo.

I did a few more tests with the collection of these metrics.
The OTEL config is the same as mentioned above; the only change is the regex in the metric scraping configuration:

  1. regex: '(kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count)' - both metrics collected
  2. regex: '(kubelet_pleg_relist_duration_seconds_bucket)' - no metric collected
  3. regex: '(kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_sum)' - no metric collected
  4. regex: '(kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubelet_pleg_relist_duration_seconds_sum)' - all 3 metrics collected
  5. regex: '(kubelet_pleg_relist_duration_seconds_sum)' - no metric collected
  6. regex: '(kubelet_pleg_relist_interval_seconds_bucket|kubelet_pleg_relist_interval_seconds_sum)' - no metric collected
  7. regex: '(kubelet_pleg_relist_interval_seconds_sum|kubelet_pleg_relist_interval_seconds_count)' - all kubelet_pleg_relist_interval_seconds_.* metrics collected (3 metrics, not 2)
  8. regex: '(kubelet_pleg_relist_interval_seconds_sum)' - no metric collected
  9. regex: '(kubelet_pleg_relist_interval_seconds_bucket)' - no metric collected
  10. regex: '(kubelet_pleg_relist_interval_seconds_bucket|kubelet_pleg_relist_interval_seconds_count|kubelet_pleg_relist_interval_seconds_sum)' - all 3 metrics collected

Case 7 is especially interesting: scraping is set for 2 metrics, but 3 are scraped. I tried changing the regex to something different several times (which produced different outputs) and then back to variant 7, and it always led to the same 3 metrics even though scraping was configured for only 2.
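
For reference, this is how variant 4 (which collected all three kubelet_pleg_relist_duration_seconds series) looks when dropped into the metric_relabel_configs block of the config above; only the regex differs from the config already posted:

metric_relabel_configs:
    - source_labels: [__name__]
      # variant 4 from the list above: keep all three histogram series
      regex: '(kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubelet_pleg_relist_duration_seconds_sum)'
      action: keep
    - action: labeldrop
      regex: container_id|id|image_id|uid

Across variants 1-10, the pattern seems to be that the _bucket and _sum series only come through when the _count series is also kept.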

@Aneurysm9, @dashpole, do you have any idea?
Thank you

@xzizka xzizka added bug Something isn't working needs triage New item requiring triage labels Oct 29, 2024
@github-actions github-actions bot added the receiver/prometheus Prometheus receiver label Oct 29, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@xzizka
Author

xzizka commented Oct 29, 2024

The OTEL functionality was explained in this post (#36061 (comment)).
It explains the behaviour mentioned in this issue, so I am closing it with a link to the original response. Thank you, @dashpole, for your answer.
