hostmetrics receiver duplicates filesystem metrics on GKE #34512

Closed
tcolgate opened this issue Aug 8, 2024 · 2 comments · Fixed by #34635
Labels
bug (Something isn't working), receiver/hostmetrics

Comments


tcolgate commented Aug 8, 2024

Component(s)

receiver/hostmetrics

What happened?

Description

When running on GKE, system.filesystem.inodes.usage and system.filesystem.usage report duplicate metrics for mountpoint=/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet, along with other pod-specific mountpoints under:

  • /home/kubernetes/containerized_mounter/rootfs/
  • /var/lib/kubelet/pods/
  • /var/lib/kubelet/plugins/

Not all pods have the duplicated data; it appears to be more prevalent on pods that are using CSI plugins.

Steps to Reproduce
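
Rough sketch only: run the collector as a daemonset on a GKE node where pods mount CSI persistent volumes, with the host root filesystem mounted read-only at /hostfs and the hostmetrics filesystem scraper enabled. The following is a condensed excerpt of the full OpenTelemetryCollector configuration further down this issue, not a verified minimal reproduction:

receivers:
  hostmetrics:
    collection_interval: 10s
    root_path: /hostfs      # host root is volume-mounted at /hostfs
    scrapers:
      filesystem: null      # the duplicates appear in the filesystem metrics
exporters:
  debug: {}
service:
  pipelines:
    metrics/hostmetrics:
      receivers:
      - hostmetrics
      exporters:
      - debug

The duplicated datapoints then show up in the debug exporter output for system.filesystem.usage and system.filesystem.inodes.usage.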

Expected Result

Metrics should be collected without duplicates.

Actual Result

One of the detected mountpoints appears twice in the metrics. This then causes issues when metrics are passed to external metrics providers such as Google Managed Prometheus.

...
Descriptor:
     -> Name: system.filesystem.usage                                                                            
     -> Description: Filesystem bytes used.
     -> Unit: By
     -> DataType: Sum
     -> IsMonotonic: false
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> device: Str(/dev/dm-0)
     -> mode: Str(ro)
     -> mountpoint: Str(/)
     -> type: Str(ext2)
     -> state: Str(used)
...
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 16777216
NumberDataPoints #42
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(used)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 7855972352
NumberDataPoints #43
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(free)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 93331124224
NumberDataPoints #44
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(reserved)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 16777216
NumberDataPoints #45
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(used)
...
Descriptor:
     -> Name: system.filesystem.usage
     -> Description: Filesystem bytes used.
     -> Unit: By
     -> DataType: Sum
     -> IsMonotonic: false                                                                                       
     -> AggregationTemporality: Cumulative
...
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 16777216
NumberDataPoints #42
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(used)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 7855972352
NumberDataPoints #43
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(free)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 93331124224
NumberDataPoints #44
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(reserved)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 16777216
NumberDataPoints #45
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(used)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 7855972352
NumberDataPoints #46
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(free)
StartTimestamp: 2024-08-07 10:21:52 +0000 UTC
Timestamp: 2024-08-07 16:21:42.5274022 +0000 UTC
Value: 93331124224
NumberDataPoints #47
Data point attributes:
     -> device: Str(/dev/sda1)
     -> mode: Str(rw)
     -> mountpoint: Str(/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet)
     -> type: Str(ext4)
     -> state: Str(reserved)

When coupled with the googlemanagedprometheus exporter, we get the following error:

{"kind": "exporter", "data_type": "metrics", "name": "debug"}
2024-08-07T16:21:42.771Z	error	exporterhelper/queue_sender.go:90	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "googlemanagedprometheus", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[77] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.; Field timeSeries[31] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.; Field timeSeries[30] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.; Field timeSeries[79] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.; Field timeSeries[78] had an invalid value: Duplicate TimeSeries encountered. Only one point can be written per TimeSeries per request.\nerror details: name = Unknown  desc = total_point_count:200  success_point_count:195  errors:{status:{code:3}  point_count:5}", "dropped_items": 287}
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
	go.opentelemetry.io/collector/exporter@v0.104.0/exporterhelper/queue_sender.go:90
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
	go.opentelemetry.io/collector/exporter@v0.104.0/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
	go.opentelemetry.io/collector/exporter@v0.104.0/internal/queue/consumers.go:43

Collector version

otelcol-contrib version 0.105.0

Environment information

Environment

OS: Google Container-Optimized OS
Compiler: official docker container image ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib@sha256:3ff721e65733a9c2d94e81cfb350e76f1cd218964d5608848e2e73293ea88114

OpenTelemetry Collector configuration

# slightly trimmed down
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-node-test
  namespace: kube-system
spec:
  args:
    feature-gates: exporter.googlemanagedpromethues.intToDouble,-component.UseLocalHostAsDefaultHost
  config:
    exporters:
      debug: {}
    processors:
      batch:
        send_batch_max_size: 11000
        send_batch_size: 10000
        timeout: 5s
      k8sattributes:
        auth_type: serviceAccount
        extract:
          metadata:
          - k8s.pod.name
          - k8s.pod.uid
          - k8s.namespace.name
          - k8s.node.name
          - k8s.pod.start_time
          - k8s.container.name
        filter:
          node_from_env_var: NODE_NAME
        passthrough: false
        pod_association:
        - sources:
          - from: resource_attribute
            name: k8s.pod.uid
          - from: resource_attribute
            name: k8s.namespace.name
          - from: resource_attribute
            name: k8s.pod.name
          - from: resource_attribute
            name: k8s.container.name
        - sources:
          - from: connection
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
      resource:
        attributes:
        - action: insert
          key: environment
          value: staging
        - action: insert
          key: k8s.node.name
          value: ${env:NODE_NAME}
        - action: insert
          key: k8s.namespace.name
          value: ${env:NAMESPACE}
      resource/hostmetrics:
        attributes:
        - action: insert
          key: job
          value: otel-node-collector
        - action: insert
          key: namespace
          value: ${env:NAMESPACE}
      resourcedetection/gcp:
        detectors:
        - gcp
        override: false
        timeout: 2s
      transform/hostmetrics:
        error_mode: ignore
        metric_statements:
        - context: resource
          statements:
          - set(attributes["node"], attributes["k8s.node.name"])
          - set(attributes["pod"], attributes["k8s.pod.name"])
          - set(attributes["container"], attributes["k8s.container.name"])
      transform/metrics:
        metric_statements:
        - context: datapoint
          statements:
          - set(attributes["exported_location"], attributes["location"])
          - delete_key(attributes, "location")
          - set(attributes["exported_cluster"], attributes["cluster"])
          - delete_key(attributes, "cluster")
          - set(attributes["exported_namespace"], attributes["namespace"])
          - delete_key(attributes, "namespace")
          - set(attributes["exported_instance"], attributes["instance"])
          - delete_key(attributes, "instance")
          - set(attributes["exported_project_id"], attributes["project_id"])
          - delete_key(attributes, "project_id")
          - set(attributes["exported_job"], attributes["job"])
          - delete_key(attributes, "job")
    receivers:
      hostmetrics:
        collection_interval: 10s
        root_path: /hostfs
        scrapers:
          cpu: null
          disk: null
          filesystem: null
          load: null
          memory: null
          network: null
    service:
      pipelines:
        metrics/hostmetrics:
          exporters:
          - debug
          processors:
          - resource/hostmetrics
          - resourcedetection/gcp
          - resource
          - filter/noiseymetrics
          - transform/hostmetrics
          receivers:
          - hostmetrics
          - kubeletstats
  daemonSetUpdateStrategy: {}
  deploymentUpdateStrategy: {}
  env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: status.podIP
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
  - name: NAMESPACE
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.namespace
  image: europe-west6-docker.pkg.dev/cerbos-registry/spitfire/imported/ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib@sha256:3ff721e65733a9c2d94e81cfb350e76f1cd218964d5608848e2e73293ea88114
  ingress:
    route: {}
  ipFamilyPolicy: SingleStack
  managementState: managed
  mode: daemonset
  observability:
    metrics: {}
  podDisruptionBudget:
    maxUnavailable: 1
  podDnsConfig: {}
  priorityClassName: system-node-critical
  replicas: 1
  resources: {}
  securityContext:
    runAsGroup: 0
    runAsUser: 0
  serviceAccount: kube-system-otel
  tolerations:
  - effect: NoSchedule
    operator: Exists
  upgradeStrategy: automatic
  volumeMounts:
  - mountPath: /var/lib/otelcol
    name: varlibotelcol
  - mountPath: /etc/prometheus/certs
    name: tls-assets
    readOnly: true
  - mountPath: /hostfs
    mountPropagation: HostToContainer
    name: hostfs
    readOnly: true
  volumes:
  - hostPath:
      path: /var/lib/otelcol
      type: DirectoryOrCreate
    name: varlibotelcol
  - name: tls-assets
    projected:
      defaultMode: 420
      sources:
      - secret:
          name: prometheus-otel-prom-config-tls-assets-0
  - hostPath:
      path: /
    name: hostfs
status:
  image: europe-west6-docker.pkg.dev/cerbos-registry/spitfire/imported/ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib@sha256:3ff721e65733a9c2d94e81cfb350e76f1cd218964d5608848e2e73293ea88114
  scale:
    selector: app.kubernetes.io/component=opentelemetry-collector,app.kubernetes.io/instance=kube-system.otel-node,app.kubernetes.io/managed-by=opentelemetry-operator,app.kubernetes.io/name=otel-node-collector,app.kubernetes.io/part-of=opentelemetry,app.kubernetes.io/version=3ff721e65733a9c2d94e81cfb350e76f1cd218964d5608848e2e73293ea8811
  version: 0.105.0

Log output

See "what happened"

Additional context

No response

tcolgate added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Aug 8, 2024

github-actions bot commented Aug 8, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@tcolgate
Contributor Author

By way of further debugging: checking /proc/1/mountinfo (used by the imported shirou/gopsutil library) and looking for one of the duplicated .../globalmount mountpoints, we see:

/ # grep 095e/globalmount /proc/1/mountinfo
10048 9964 8:64 / /hostfs/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/fd92a65917e16239faa6804bb34ba2adc94a7b432062ba3933ebae386eaa095e/globalmount rw,relatime master:2417 - ext4 /dev/sde rw
10238 10091 8:64 / /hostfs/home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/fd92a65917e16239faa6804bb34ba2adc94a7b432062ba3933ebae386eaa095e/globalmount rw,relatime master:2417 - ext4 /dev/sde rw
10415 10288 8:64 / /hostfs/var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/fd92a65917e16239faa6804bb34ba2adc94a7b432062ba3933ebae386eaa095e/globalmount rw,relatime master:2417 - ext4 /dev/sde rw
10582 10282 8:64 / /hostfs/var/lib/kubelet/plugins/kubernetes.io/csi/pd.csi.storage.gke.io/fd92a65917e16239faa6804bb34ba2adc94a7b432062ba3933ebae386eaa095e/globalmount rw,relatime master:2417 - ext4 /dev/sde rw

The mounts are paths that are mounted to the same location but (I think) under different namespaces, presumably the same data mounted into two different pods.

I think it would be valid to only export the metrics once per unique path (they should all have the same filesystem-level metrics). Though, equally, it's not obvious that metrics for these mounts are useful at all.

I'm working around the issue locally by dropping metrics for these paths (there's a good chance I'd drop them anyway, as they aren't terribly useful), but fixing the duplication in the hostmetrics receiver seems fair.
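
The exact filter used for that workaround isn't shown in this issue. A minimal sketch of one way to drop these datapoints with the filter processor (the filter/dropdupemounts name and the regex patterns are illustrative, assuming the affected paths are the ones listed above):

processors:
  filter/dropdupemounts:
    error_mode: ignore
    metrics:
      datapoint:
      # drop filesystem datapoints for the duplicated, namespace-level mounts
      - 'IsMatch(attributes["mountpoint"], "^/home/kubernetes/containerized_mounter/rootfs/")'
      - 'IsMatch(attributes["mountpoint"], "^/var/lib/kubelet/(pods|plugins)/")'

The processor would then be added to the metrics/hostmetrics pipeline ahead of the exporter. Excluding these paths in the filesystem scraper itself (via its exclude_mount_points setting) would be an alternative.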

tcolgate added a commit to tcolgate/opentelemetry-collector-contrib that referenced this issue Aug 13, 2024
Mountpoints can be reported multiple times for each mount into a
namespace. This causes duplicate metrics which causes issues with
some exporters. Each instance of the mountpoint will have identical
metrics, so it is safe to ignore repeated mountpoints.

Closes open-telemetry#34512
tcolgate added a commit to tcolgate/opentelemetry-collector-contrib that referenced this issue Aug 15, 2024
atoulme removed the needs triage (New item requiring triage) label on Oct 2, 2024
jmichalek132 pushed a commit to jmichalek132/opentelemetry-collector-contrib that referenced this issue Oct 10, 2024
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this issue Dec 17, 2024