Stabilize broker components #1265
Labels: area/broker, kind/bug, priority/1, release/2
Describe the bug
Under certain setups and at scale, the broker components become unstable. Here are a few problems I have observed.
Setup: 1 broker, 100 triggers (subscribing to the same event), 100 ingress qps.
Problem 1: kube-dns stopped responding. Errors from the fanout pods look like:
{"level":"warn","ts":"2020-06-10T18:54:20.552Z","logger":"broker-fanout","caller":"deliver/processor.go:118","msg":"target delivery failed","commit":"a12cfb0","target":"loadtest/testbroker/trigger-actor-57","error":"Post \"http://actor-57.loadtest.svc.cluster.local/\": dial tcp: lookup actor-57.loadtest.svc.cluster.local on 10.4.0.10:53: dial udp 10.4.0.10:53: operation was canceled"}
Problem 2: unstable memory usage causes pods to be OOM-killed. When I increased ingress qps from 10 to 100, there was a surge in memory usage. With our current memory limit, the existing fanout pods got killed instantly (before HPA could kick in), which ended up in CrashLoopBackOff. However, after I removed the memory limit and repeated the same process, the fanout pods eventually settled below our limit once the initial memory spike passed.
Problem 3: The number of triggers to fan out to (which we can't control) dominates memory usage. Ideally we want higher pubsub pull concurrency for better throughput. However, with even a few more triggers, the memory cost is multiplied by the number of triggers, since each pulled event is delivered to every trigger. With a memory limit, this can easily lead to OOM and cause Problem 2. But if we limit pull concurrency, we run the risk of low throughput even when there are not many triggers.
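For reference, pull concurrency and in-flight memory in the Go Pub/Sub client are bounded through ReceiveSettings. The snippet below is a minimal sketch, not the broker's actual code; the project ID, subscription name, limits, and handler are illustrative assumptions.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()

	// Hypothetical project and subscription names, for illustration only.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatalf("pubsub.NewClient: %v", err)
	}
	defer client.Close()

	sub := client.Subscription("broker-fanout-sub")

	// Bound pull concurrency and in-flight bytes. Lower values reduce the
	// memory spike when each pulled event fans out to many triggers, at the
	// cost of throughput when there are few triggers.
	sub.ReceiveSettings.MaxOutstandingMessages = 100            // concurrent unacked messages
	sub.ReceiveSettings.MaxOutstandingBytes = 100 * 1024 * 1024 // cap on buffered payload bytes
	sub.ReceiveSettings.NumGoroutines = 1                       // streams pulling from the subscription

	err = sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
		// Fan out to triggers here (omitted); ack on success.
		msg.Ack()
	})
	if err != nil {
		log.Fatalf("Receive: %v", err)
	}
}
```

Whatever values are chosen, the trade-off described above remains: the per-message fanout cost scales with the number of triggers, so a fixed concurrency limit either wastes throughput with few triggers or risks OOM with many.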
Proposed solutions
Increase MaxIdleConns of the delivery HTTP client to improve connection reuse (so that less load is put on DNS).
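A minimal sketch of what tuning the delivery client's transport could look like; the specific values, the MaxIdleConnsPerHost setting, and the constructor name are assumptions, not the project's actual configuration.

```go
package delivery

import (
	"net/http"
	"time"
)

// newDeliveryClient is a hypothetical constructor showing how an HTTP client
// could be tuned to reuse connections and reduce DNS lookups per delivery.
func newDeliveryClient() *http.Client {
	transport := &http.Transport{
		// Allow many idle connections to be kept across all targets...
		MaxIdleConns: 1000,
		// ...and per target host, so repeated deliveries to the same trigger
		// reuse an existing connection instead of dialing (and resolving
		// the service name) again.
		MaxIdleConnsPerHost: 100,
		IdleConnTimeout:     90 * time.Second,
	}
	return &http.Client{
		Transport: transport,
		Timeout:   30 * time.Second,
	}
}
```

With connection reuse in place, each delivery to an already-connected trigger skips the kube-dns lookup entirely, which should relieve the pressure seen in Problem 1.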