Possible memory leak in Alloy #670

Closed
madaraszg-tulip opened this issue Apr 24, 2024 · 9 comments
@madaraszg-tulip (Contributor)

What's wrong?

I tried migrating from grafana-agent to Alloy while keeping the flow configuration. Where grafana-agent was using ~1.1 GB of memory, Alloy needs much, much more, and the pod regularly hits OOM.

The instances between 15:00 and 16:30 ran without GOMEMLIMIT, and with higher memory limits.
Currently running with the following settings:

  extraEnv:
    - name: GOMEMLIMIT
      value: 2000MiB
  resources:
    requests:
      memory: 2500Mi
      cpu: 1
    limits:
      memory: 2500Mi

[image: memory usage graph]

We use grafana-agent / Alloy only for downsampling traces and, additionally, for generating span metrics and a service graph. The config is attached below. This config has worked stably with grafana-agent, with minimal differences: there we used a batch processor in front of the servicegraph connector to work around the missing metrics_flush_interval option (a rough sketch of that workaround follows below).
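
For context, a minimal sketch of that earlier grafana-agent workaround, assuming the batch processor's standard timeout and send_batch_size options; the pre_servicegraph label is illustrative, not the exact name from our old config:

otelcol.processor.batch "pre_servicegraph" {
  // Batch incoming spans so the servicegraph connector effectively
  // flushes its metrics on the batch cadence instead of per span.
  timeout         = "60s"
  send_batch_size = 8192

  output {
    traces = [otelcol.connector.servicegraph.graph.input]
  }
}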

Steps to reproduce

Deploy Alloy in Kubernetes with Helm, as a drop-in replacement for grafana-agent.

Feed it traces from mimir/tempo/loki/grafana/prometheus.

Watch memory usage grow until the pod is killed with OOM.

System information

Linux 5.10.209 aarch64 on EKS

Software version

Grafana Alloy v1.0.0

Configuration

tracing {
  sampling_fraction = 1.0

  write_to = [
    otelcol.processor.k8sattributes.enrich.input,
    otelcol.processor.transform.prepare_spanmetrics.input,
    otelcol.connector.servicegraph.graph.input,
  ]
}

otelcol.receiver.otlp "main" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }

  http {
    endpoint = "0.0.0.0:4318"
  }

  output {
    traces = [
      otelcol.processor.k8sattributes.enrich.input,
      otelcol.processor.transform.prepare_spanmetrics.input,
      otelcol.connector.servicegraph.graph.input,
    ]
  }
}

otelcol.receiver.jaeger "main" {
  protocols {
    grpc {
      endpoint = "0.0.0.0:14250"
    }

    thrift_http {
      endpoint = "0.0.0.0:14268"
    }

    thrift_compact {
      endpoint = "0.0.0.0:6831"
    }
  }

  output {
    traces = [
      otelcol.processor.k8sattributes.enrich.input,
      otelcol.processor.transform.prepare_spanmetrics.input,
      otelcol.connector.servicegraph.graph.input,
    ]
  }
}

otelcol.processor.k8sattributes "enrich" {
  output {
    traces = [otelcol.processor.tail_sampling.trace_downsample.input]
  }
}

otelcol.connector.servicegraph "graph" {
  dimensions = ["http.method", "db.system"]

  metrics_flush_interval = "60s"

  store {
    ttl = "30s"
  }

  output {
    metrics = [otelcol.processor.attributes.manual_tagging.input]
  }
}

otelcol.processor.transform "prepare_spanmetrics" {
  error_mode = "ignore"

  trace_statements {
    context = "resource"
    statements = [
      `keep_keys(attributes, ["service.name"])`,
    ]
  }

  output {
    traces = [otelcol.connector.spanmetrics.stats.input]
  }
}

otelcol.connector.spanmetrics "stats" {
  histogram {
    unit = "ms"
    exponential {
    }
  }

  exemplars {
    enabled = true
  }

  aggregation_temporality = "CUMULATIVE"
  namespace = "traces_spanmetrics_"
  metrics_flush_interval = "60s"

  output {
    metrics = [otelcol.processor.attributes.manual_tagging.input]
  }
}

otelcol.processor.tail_sampling "trace_downsample" {
  policy {
    name = "include-all-errors"
    type = "status_code"

    status_code {
      status_codes = ["ERROR"]
    }
  }

  policy {
    name = "include-all-slow-traces"
    type = "latency"

    latency {
      threshold_ms = 5000
    }
  }

  policy {
    name = "include-all-diagnostic-mode-traces"
    type = "boolean_attribute"

    boolean_attribute {
      key   = "diagnostics_mode"
      value = true
    }
  }

  policy {
    name = "downsample-all-others"
    type = "probabilistic"

    probabilistic {
      sampling_percentage = 1
    }
  }

  output {
    traces = [otelcol.processor.attributes.manual_tagging.input]
  }
}

otelcol.processor.attributes "manual_tagging" {

  action {
    action = "insert"
    key    = "garden"
    value  = "global"
  }


  action {
    action = "insert"
    key    = "generated_by"
    value  = "grafana_alloy"
  }

  action {
    action = "insert"
    key    = "grafana_alloy_hostname"
    value  = constants.hostname
  }

  output {
    traces = [otelcol.processor.batch.main.input]
    metrics = [
      otelcol.processor.batch.main.input,
    ]
  }
}

otelcol.processor.batch "main" {
  output {
    metrics = [otelcol.exporter.prometheus.global.input]
    traces  = [otelcol.exporter.otlphttp.tempo.input]
  }
}

otelcol.exporter.otlphttp "tempo" {
  client {
    endpoint = "http://tempo-distributor:4318"
  }
  sending_queue {
    queue_size = 20000
  }
  retry_on_failure {
    enabled          = true
    max_elapsed_time = "2h"
    max_interval     = "1m"
  }
}

otelcol.exporter.prometheus "global" {
  include_target_info = false
  resource_to_telemetry_conversion = true
  gc_frequency = "1h"
  forward_to = [prometheus.remote_write.global.receiver]
}

prometheus.remote_write "global" {
  endpoint {
    url = "http://mimir-distributor:8080/api/v1/push"
    send_native_histograms = true
  }
}

Logs

Containers:
  alloy:
    Container ID:  containerd://543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84
    Image:         docker.io/grafana/alloy:v1.0.0
    Image ID:      docker.io/grafana/alloy@sha256:21248ad12831ad8f7279eb40ecd161b2574c2194ca76e7413996666d09beef6c
    Ports:         12345/TCP, 4317/TCP, 4318/TCP, 14250/TCP, 14268/TCP, 6831/UDP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/UDP
    Args:
      run
      /etc/alloy/config.alloy
      --storage.path=/tmp/alloy
      --server.http.listen-addr=0.0.0.0:12345
      --server.http.ui-path-prefix=/
      --stability.level=generally-available
    State:          Running
      Started:      Wed, 24 Apr 2024 20:46:12 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 24 Apr 2024 20:37:28 +0200
      Finished:     Wed, 24 Apr 2024 20:45:39 +0200
    Ready:          True
    Restart Count:  3
    Limits:
      memory:  2500Mi
    Requests:
      cpu:      1
      memory:   2500Mi
    Readiness:  http-get http://:12345/-/ready delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ALLOY_DEPLOY_MODE:  helm
      HOSTNAME:            (v1:spec.nodeName)
      GOMEMLIMIT:         2000MiB
    Mounts:
      /etc/alloy from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4b49j (ro)
  config-reloader:
    Container ID:  containerd://ebf2f2409762dccace3af437344667fc73318507e59f6ab38f217813a91224eb
    Image:         ghcr.io/jimmidyson/configmap-reload:v0.12.0
    Image ID:      ghcr.io/jimmidyson/configmap-reload@sha256:a7c754986900e41fc47656bdc8dfce33227112a7cce547e0d9ef5d279f4f8e99
    Port:          <none>
    Host Port:     <none>
    Args:
      --volume-dir=/etc/alloy
      --webhook-url=http://localhost:12345/-/reload
    State:          Running
      Started:      Wed, 24 Apr 2024 20:20:47 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        1m
      memory:     5Mi
    Environment:  <none>
    Mounts:
      /etc/alloy from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4b49j (ro)
madaraszg-tulip added the bug (Something isn't working) label on Apr 24, 2024
@madaraszg-tulip (Contributor, author)

Additional logs from the OOM killer:

[13349.644248] alloy invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=679
[13349.645913] CPU: 1 PID: 77501 Comm: alloy Not tainted 5.10.209-198.858.amzn2.aarch64 #1
[13349.647382] Hardware name: Amazon EC2 m7g.large/, BIOS 1.0 11/1/2018
[13349.648527] Call trace:
[13349.648976]  dump_backtrace+0x0/0x204
[13349.649715]  show_stack+0x1c/0x24
[13349.650387]  dump_stack+0xe4/0x12c
[13349.651052]  dump_header+0x4c/0x1f0
[13349.651769]  oom_kill_process+0x24c/0x250
[13349.652594]  out_of_memory+0xdc/0x344
[13349.653326]  mem_cgroup_out_of_memory+0x130/0x148
[13349.654279]  try_charge+0x55c/0x5cc
[13349.654967]  mem_cgroup_charge+0x80/0x240
[13349.655772]  do_anonymous_page+0xb8/0x574
[13349.656580]  handle_pte_fault+0x1a0/0x218
[13349.657379]  __handle_mm_fault+0x1e0/0x380
[13349.658210]  handle_mm_fault+0xcc/0x230
[13349.659000]  do_page_fault+0x14c/0x410
[13349.659723]  do_translation_fault+0xac/0xd0
[13349.660564]  do_mem_abort+0x44/0xa0
[13349.661275]  el0_da+0x40/0x78
[13349.661888]  el0_sync_handler+0xd8/0x120
[13349.662744] memory: usage 2560000kB, limit 2560000kB, failcnt 0
[13349.663826] memory+swap: usage 2560000kB, limit 2560000kB, failcnt 22450
[13349.665033] kmem: usage 8736kB, limit 9007199254740988kB, failcnt 0
[13349.666153] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod71cedea9_e04a_4bb8_b615_6b4e27698c1c.slice/cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope:
[13349.666998] anon 2612391936
[13349.666998] file 135168
[13349.666998] kernel_stack 147456
[13349.666998] percpu 0
[13349.666998] sock 0
[13349.666998] shmem 0
[13349.666998] file_mapped 0
[13349.666998] file_dirty 0
[13349.666998] file_writeback 0
[13349.666998] anon_thp 0
[13349.666998] inactive_anon 2612256768
[13349.666998] active_anon 0
[13349.666998] inactive_file 98304
[13349.666998] active_file 0
[13349.666998] unevictable 0
[13349.666998] slab_reclaimable 1049680
[13349.666998] slab_unreclaimable 0
[13349.666998] slab 1049680
[13349.666998] workingset_refault_anon 0
[13349.666998] workingset_refault_file 0
[13349.666998] workingset_activate_anon 0
[13349.666998] workingset_activate_file 0
[13349.666998] workingset_restore_anon 0
[13349.666998] workingset_restore_file 0
[13349.666998] workingset_nodereclaim 0
[13349.666998] pgfault 3747711
[13349.666998] pgmajfault 0
[13349.666998] pgrefill 0
[13349.687594] Tasks state (memory values in pages):
[13349.688440] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[13349.690105] [  77488]     0 77488  1440016   668293  7532544        0           679 alloy
[13349.691581] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod71cedea9_e04a_4bb8_b615_6b4e27698c1c.slice/cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod71cedea9_e04a_4bb8_b615_6b4e27698c1c.slice/cri-containerd-543b6573ecc932c15929c6ac2508ebcb72a52eddf24475590afbe13fe4bdcc84.scope,task=alloy,pid=77488,uid=0
[13349.701478] Memory cgroup out of memory: Killed process 77488 (alloy) total-vm:5760064kB, anon-rss:2550676kB, file-rss:122496kB, shmem-rss:0kB, UID:0 pgtables:7356kB oom_score_adj:679

@mattdurham (Collaborator)

Can you share a heap pprof dump (curl http://localhost:12345/debug/pprof/heap -o heap.pprof), either here or via DM on the community Slack?

@madaraszg-tulip (Contributor, author)

I was just collecting pprof outputs and trying to compare them (this is the first time I'm looking at pprof). Unfortunately I've overwritten the raw dumps; here are two PNGs generated from them.

  1. From Alloy, a few seconds before it was OOM-killed:

[image: alloy-pprof]

  2. From grafana-agent, in the prod environment, with the same type of feed but higher traffic:

[image: grafana-agent-pprof]

@madaraszg-tulip (Contributor, author)

Turning exemplars off in the spanmetrics connector seems to have stabilized the memory consumption:
[image: memory usage graph after disabling exemplars]
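
For reference, a minimal sketch of that change; it is simply the spanmetrics connector from the config above with exemplars switched off, everything else unchanged:

otelcol.connector.spanmetrics "stats" {
  histogram {
    unit = "ms"
    exponential { }
  }

  // Turning exemplars off is what appears to have stabilized memory usage here.
  exemplars {
    enabled = false
  }

  aggregation_temporality = "CUMULATIVE"
  namespace               = "traces_spanmetrics_"
  metrics_flush_interval  = "60s"

  output {
    metrics = [otelcol.processor.attributes.manual_tagging.input]
  }
}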

@wildum (Contributor) commented Apr 25, 2024

I don't see the spanmetrics connector in the pprof of grafana-agent. Are you running the exact same configs?

@ptodev (Contributor) commented Apr 25, 2024

This looks related to an issue in the OTel repo, which is fixed in v0.99. We will need to update the OTel dependency in Alloy to pick up the fix.

@madaraszg-tulip (Contributor, author)

> I don't see the spanmetrics connector in the pprof of grafana-agent. Are you running the exact same configs?

Regarding the spanmetrics pipeline: yes, same config. I assume it was among the nodes dropped from the graph because of its low memory impact.

@github-actions (bot)

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

@ptodev (Contributor) commented May 29, 2024

This issue should now be resolved: Alloy now uses an OTel version that includes the bugfix mentioned above.

ptodev closed this as completed on May 29, 2024
github-actions bot locked this as resolved and limited conversation to collaborators on Jun 29, 2024