Possible memory leak in Alloy #670

Closed
madaraszg-tulip opened this issue Apr 24, 2024 · 9 comments
@madaraszg-tulip (Contributor)

What's wrong?

I tried migrating from grafana-agent to Alloy while keeping the flow configuration. Where grafana-agent was using ~1.1 GB of memory, Alloy needs much, much more, and the pod regularly hits OOM.

The instances between 15:00 and 16:30 ran without GOMEMLIMIT, and with higher memory limits.
Currently running with the following settings:

  extraEnv:
    - name: GOMEMLIMIT
      value: 2000MiB
  resources:
    requests:
      memory: 2500Mi
      cpu: 1
    limits:
      memory: 2500Mi

[image: memory usage graph]

We use grafana-agent / Alloy only for downsampling traces and, additionally, for generating span metrics and a service graph. The config is attached below. This config has worked stably with grafana-agent, with minimal differences: there we used a batch processor in front of the servicegraph connector to work around the missing metrics_flush_interval option (a rough sketch of that workaround follows below).
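
For context, a minimal sketch of that earlier grafana-agent workaround, assuming the batch processor's standard timeout and send_batch_size options; the pre_servicegraph label is illustrative, not the exact name from our old config:

otelcol.processor.batch "pre_servicegraph" {
  // Batch incoming spans so the servicegraph connector effectively
  // flushes its metrics on the batch cadence instead of per span.
  timeout         = "60s"
  send_batch_size = 8192

  output {
    traces = [otelcol.connector.servicegraph.graph.input]
  }
}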

Steps to reproduce

Deploy Alloy in Kubernetes with Helm, as a drop-in replacement for grafana-agent.

Feed it traces from mimir/tempo/loki/grafana/prometheus.

Watch memory usage grow until the pod is killed with OOM.

System information

Linux 5.10.209 aarch64 on EKS

Software version

Grafana Alloy v1.0.0

Configuration

tracing {
  sampling_fraction = 1.0

  write_to = [
    otelcol.processor.k8sattributes.enrich.input,
    otelcol.processor.transform.prepare_spanmetrics.input,
    otelcol.connector.servicegraph.graph.input,
  ]
}

otelcol.receiver.otlp "main" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }

  http {
    endpoint = "0.0.0.0:4318"
  }

  output {
    traces = [
      otelcol.processor.k8sattributes.enrich.input,
      otelcol.processor.transform.prepare_spanmetrics.input,
      otelcol.connector.servicegraph.graph.input,
    ]
  }
}

otelcol.receiver.jaeger "main" {
  protocols {
    grpc {
      endpoint = "0.0.0.0:14250"
    }

    thrift_http {
      endpoint = "0.0.0.0:14268"
    }

    thrift_compact {
      endpoint = "0.0.0.0:6831"
    }
  }

  output {
    traces = [
      otelcol.processor.k8sattributes.enrich.input,
      otelcol.processor.transform.prepare_spanmetrics.input,
      otelcol.connector.servicegraph.graph.input,
    ]
  }
}

otelcol.processor.k8sattributes "enrich" {
  output {
    traces = [otelcol.processor.tail_sampling.trace_downsample.input]
  }
}

otelcol.connector.servicegraph "graph" {
  dimensions = ["http.method", "db.system"]

  metrics_flush_interval = "60s"

  store {
    ttl = "30s"
  }

  output {
    metrics = [otelcol.processor.attributes.manual_tagging.input]
  }
}

otelcol.processor.transform "prepare_spanmetrics" {
  error_mode = "ignore"

  trace_statements {
    context = "resource"
    statements = [
      `keep_keys(attributes, ["service.name"])`,
    ]
  }

  output {
    traces = [otelcol.connector.spanmetrics.stats.input]
  }
}

otelcol.connector.spanmetrics "stats" {
  histogram {
    unit = "ms"
    exponential {
    }
  }

  exemplars {
    enabled = true
  }

  aggregation_temporality = "CUMULATIVE"
  namespace = "traces_spanmetrics_"
  metrics_flush_interval = "60s"

  output {
    metrics = [otelcol.processor.attributes.manual_tagging.input]
  }
}

otelcol.processor.tail_sampling "trace_downsample" {
  policy {
    name = "include-all-errors"
    type = "status_code"

    status_code {
      status_codes = ["ERROR"]
    }
  }

  policy {
    name = "include-all-slow-traces"
    type = "latency"

    latency {
      threshold_ms = 5000
    }
  }

  policy {
    name = "include-all-diagnostic-mode-traces"
    type = "boolean_attribute"

    boolean_attribute {
      key   = "diagnostics_mode"
      value = true
    }
  }

  policy {
    name = "downsample-all-others"
    type = "probabilistic"

    probabilistic {
      sampling_percentage = 1
    }
  }

  output {
    traces = [otelcol.processor.attributes.manual_tagging.input]
  }
}

otelcol.processor.attributes "manual_tagging" {

  action {
    action = "insert"
    key    = "garden"
    value  = "global"
  }


  action {
    action = "insert"
    key    = "generated_by"
    value  = "grafana_alloy"
  }

  action {
    action = "insert"
    key    = "grafana_alloy_hostname"
    value  = constants.hostname
  }

  output {
    traces = [otelcol.processor.batch.main.input]
    metrics = [
      otelcol.processor.batch.main.input,
    ]
  }
}

otelcol.processor.batch "main" {
  output {
    metrics = [otelcol.exporter.prometheus.global.input]
    traces  = [otelcol.exporter.otlphttp.tempo.input]
  }
}

otelcol.exporter.otlphttp "tempo" {
  client {
    endpoint = "http://tempo-distributor:4318"
  }
  sending_queue {
    queue_size = 20000
  }
  retry_on_failure {
    enabled          = true
    max_elapsed_time = "2h"
    max_interval     = "1m"
  }
}

otelcol.exporter.prometheus "global" {
  include_target_info = false
  resource_to_telemetry_conversion = true
  gc_frequency = "1h"
  forward_to = [prometheus.remote_write.global.receiver]
}

prometheus.remote_write "global" {
  endpoint {
    url = "http://mimir-distributor:8080/api/v1/push"
    send_native_histograms = true
  }
}

Logs

Containers:
  alloy:
    Container ID:  containerd://543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84
    Image:         docker.io/grafana/alloy:v1.0.0
    Image ID:      docker.io/grafana/alloy@sha256:21248ad12831ad8f7279eb40ecd161b2574c2194ca76e7413996666d09beef6c
    Ports:         12345/TCP, 4317/TCP, 4318/TCP, 14250/TCP, 14268/TCP, 6831/UDP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/UDP
    Args:
      run
      /etc/alloy/config.alloy
      --storage.path=/tmp/alloy
      --server.http.listen-addr=0.0.0.0:12345
      --server.http.ui-path-prefix=/
      --stability.level=generally-available
    State:          Running
      Started:      Wed, 24 Apr 2024 20:46:12 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 24 Apr 2024 20:37:28 +0200
      Finished:     Wed, 24 Apr 2024 20:45:39 +0200
    Ready:          True
    Restart Count:  3
    Limits:
      memory:  2500Mi
    Requests:
      cpu:      1
      memory:   2500Mi
    Readiness:  http-get http://:12345/-/ready delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ALLOY_DEPLOY_MODE:  helm
      HOSTNAME:            (v1:spec.nodeName)
      GOMEMLIMIT:         2000MiB
    Mounts:
      /etc/alloy from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4b49j (ro)
  config-reloader:
    Container ID:  containerd://ebf2f2409762dccace3af437344667fc73318507e59f6ab38f217813a91224eb
    Image:         ghcr.io/jimmidyson/configmap-reload:v0.12.0
    Image ID:      ghcr.io/jimmidyson/configmap-reload@sha256:a7c754986900e41fc47656bdc8dfce33227112a7cce547e0d9ef5d279f4f8e99
    Port:          <none>
    Host Port:     <none>
    Args:
      --volume-dir=/etc/alloy
      --webhook-url=http://localhost:12345/-/reload
    State:          Running
      Started:      Wed, 24 Apr 2024 20:20:47 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        1m
      memory:     5Mi
    Environment:  <none>
    Mounts:
      /etc/alloy from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4b49j (ro)
madaraszg-tulip added the bug (Something isn't working) label on Apr 24, 2024
@madaraszg-tulip (Contributor, author)

Additional logs from the OOM killer:

[13349.644248] alloy invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=679
[13349.645913] CPU: 1 PID: 77501 Comm: alloy Not tainted 5.10.209-198.858.amzn2.aarch64 #1
[13349.647382] Hardware name: Amazon EC2 m7g.large/, BIOS 1.0 11/1/2018
[13349.648527] Call trace:
[13349.648976]  dump_backtrace+0x0/0x204
[13349.649715]  show_stack+0x1c/0x24
[13349.650387]  dump_stack+0xe4/0x12c
[13349.651052]  dump_header+0x4c/0x1f0
[13349.651769]  oom_kill_process+0x24c/0x250
[13349.652594]  out_of_memory+0xdc/0x344
[13349.653326]  mem_cgroup_out_of_memory+0x130/0x148
[13349.654279]  try_charge+0x55c/0x5cc
[13349.654967]  mem_cgroup_charge+0x80/0x240
[13349.655772]  do_anonymous_page+0xb8/0x574
[13349.656580]  handle_pte_fault+0x1a0/0x218
[13349.657379]  __handle_mm_fault+0x1e0/0x380
[13349.658210]  handle_mm_fault+0xcc/0x230
[13349.659000]  do_page_fault+0x14c/0x410
[13349.659723]  do_translation_fault+0xac/0xd0
[13349.660564]  do_mem_abort+0x44/0xa0
[13349.661275]  el0_da+0x40/0x78
[13349.661888]  el0_sync_handler+0xd8/0x120
[13349.662744] memory: usage 2560000kB, limit 2560000kB, failcnt 0
[13349.663826] memory+swap: usage 2560000kB, limit 2560000kB, failcnt 22450
[13349.665033] kmem: usage 8736kB, limit 9007199254740988kB, failcnt 0
[13349.666153] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod71cedea9_e04a_4bb8_b615_6b4e27698c1c.slice/cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope:
[13349.666998] anon 2612391936
[13349.666998] file 135168
[13349.666998] kernel_stack 147456
[13349.666998] percpu 0
[13349.666998] sock 0
[13349.666998] shmem 0
[13349.666998] file_mapped 0
[13349.666998] file_dirty 0
[13349.666998] file_writeback 0
[13349.666998] anon_thp 0
[13349.666998] inactive_anon 2612256768
[13349.666998] active_anon 0
[13349.666998] inactive_file 98304
[13349.666998] active_file 0
[13349.666998] unevictable 0
[13349.666998] slab_reclaimable 1049680
[13349.666998] slab_unreclaimable 0
[13349.666998] slab 1049680
[13349.666998] workingset_refault_anon 0
[13349.666998] workingset_refault_file 0
[13349.666998] workingset_activate_anon 0
[13349.666998] workingset_activate_file 0
[13349.666998] workingset_restore_anon 0
[13349.666998] workingset_restore_file 0
[13349.666998] workingset_nodereclaim 0
[13349.666998] pgfault 3747711
[13349.666998] pgmajfault 0
[13349.666998] pgrefill 0
[13349.687594] Tasks state (memory values in pages):
[13349.688440] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[13349.690105] [  77488]     0 77488  1440016   668293  7532544        0           679 alloy
[13349.691581] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod71cedea9_e04a_4bb8_b615_6b4e27698c1c.slice/cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod71cedea9_e04a_4bb8_b615_6b4e27698c1c.slice/cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope,task=alloy,pid=77488,uid=0
[13349.701478] Memory cgroup out of memory: Killed process 77488 (alloy) total-vm:5760064kB, anon-rss:2550676kB, file-rss:122496kB, shmem-rss:0kB, UID:0 pgtables:7356kB oom_score_adj:679

@mattdurham (Collaborator)

Can you share a heap pprof dump (curl http://localhost:12345/debug/pprof/heap -o heap.pprof), either here or via DM on the community Slack?

@madaraszg-tulip (Contributor, author)

I was just collecting pprof outputs and trying to compare them (this is the first time I'm looking at pprof). Unfortunately I've overwritten the raw dumps; here are two PNGs generated from them.

  1. From Alloy, a few seconds before it was OOM-killed:

[image: alloy-pprof]

  2. From grafana-agent, in the prod environment, with the same type of feed but higher traffic:

[image: grafana-agent-pprof]

@madaraszg-tulip (Contributor, author)

Turning exemplars off in the spanmetrics connector seems to have stabilized the memory consumption:
[image: memory usage graph after disabling exemplars]
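
For reference, a minimal sketch of that change; it is simply the spanmetrics connector from the config above with exemplars switched off, everything else unchanged:

otelcol.connector.spanmetrics "stats" {
  histogram {
    unit = "ms"
    exponential { }
  }

  // Turning exemplars off is what appears to have stabilized memory usage here.
  exemplars {
    enabled = false
  }

  aggregation_temporality = "CUMULATIVE"
  namespace               = "traces_spanmetrics_"
  metrics_flush_interval  = "60s"

  output {
    metrics = [otelcol.processor.attributes.manual_tagging.input]
  }
}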

@wildum (Contributor) commented Apr 25, 2024

I don't see the spanmetrics connector in the pprof of grafana-agent. Are you running the exact same configs?

@ptodev (Contributor) commented Apr 25, 2024

This looks related to an issue in the OTel repo, which is fixed in v0.99. We will need to update the OTel dependency in Alloy to pick up the fix.

@madaraszg-tulip (Contributor, author)

> I don't see the spanmetrics connector in the pprof of grafana-agent. Are you running the exact same configs?

Regarding the spanmetrics pipeline: yes, same config. I assume it was among the nodes dropped from the graph because of its low memory impact.

@github-actions (bot)

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

@ptodev (Contributor) commented May 29, 2024

This issue should now be resolved: Alloy now uses an OTel version that includes the bugfix mentioned above.

ptodev closed this as completed on May 29, 2024
github-actions bot locked this as resolved and limited conversation to collaborators on Jun 29, 2024