
Stabilize broker components #1265

Closed
2 of 3 tasks
yolocs opened this issue Jun 11, 2020 · 3 comments

Comments

yolocs (Member) commented on Jun 11, 2020

Describe the bug
Under a certain setup and scale, the broker components become unstable. Here are a few problems I have observed.

Setup: 1 broker, 100 triggers (subscribing to the same event), 100 ingress qps.

Problem 1: kube-dns stopped responding. Errors from the fanout pods look like:

```
{"level":"warn","ts":"2020-06-10T18:54:20.552Z","logger":"broker-fanout","caller":"deliver/processor.go:118","msg":"target delivery failed","commit":"a12cfb0","target":"loadtest/testbroker/trigger-actor-57","error":"Post \"http://actor-57.loadtest.svc.cluster.local/\": dial tcp: lookup actor-57.loadtest.svc.cluster.local on 10.4.0.10:53: dial udp 10.4.0.10:53: operation was canceled"}
```

Problem 2: unstable memory usage causing pods to be killed for OOM. When I increased ingress qps from 10 to 100, there was a surge in memory usage. With our current memory limit, the existing fanout pods were killed instantly (before HPA could kick in), which ended up in CrashLoopBackOff. However, after I removed the memory limit and repeated the same process, the fanout pods eventually settled at less memory than our limit once the initial spike passed.

Problem 3: the number of triggers to fan out to (which we can't control) dominates memory usage. Ideally we want higher Pub/Sub pull concurrency for better throughput. However, with even a modest number of triggers, the memory cost is multiplied by at least the number of triggers. With a memory limit in place, this can easily lead to OOM and cause problem 2. But if we limit pull concurrency, we run the risk of low throughput even when there are not many triggers.
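For reference, the pull concurrency knobs in question are the Go Pub/Sub client's ReceiveSettings. A minimal sketch, with placeholder project/subscription names and illustrative numbers (not what fanout currently sets), of how they bound per-pod memory:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	// "my-project" and "broker-fanout-sub" are placeholders for this sketch.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	sub := client.Subscription("broker-fanout-sub")
	// Bound how much work a single pod pulls at once. Each outstanding message
	// fans out to every trigger, so the effective memory cost grows roughly as
	// MaxOutstandingMessages x (# of triggers).
	sub.ReceiveSettings.MaxOutstandingMessages = 100
	sub.ReceiveSettings.MaxOutstandingBytes = 100 * 1024 * 1024 // 100 MiB of raw payload
	sub.ReceiveSettings.NumGoroutines = 1

	if err := sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// deliver to triggers here, then ack
		m.Ack()
	}); err != nil {
		log.Fatal(err)
	}
}
```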

Proposed solutions

  • I believe users can modify the scale of kube-dns in the kube-system namespace
  • Increase MaxIdleConns of the delivery HTTP client to improve connection reuse, so that less load is put on DNS (see the sketch after this list)
  • Remove all memory limits or set them really high (e.g. 5Gi)
    • The GKE guideline now suggests setting memory limit = request (we could potentially make an exception)
    • This could be controversial if we end up setting the request as high as the limit
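A rough sketch of the MaxIdleConns change mentioned above (values are illustrative, not a proposal for the actual defaults):

```go
package deliver

import (
	"net/http"
	"time"
)

// newDeliveryClient returns an HTTP client that keeps enough idle connections
// around so repeated deliveries to the same trigger reuse TCP connections
// instead of forcing a kube-dns lookup and a new dial for every request.
func newDeliveryClient() *http.Client {
	t := http.DefaultTransport.(*http.Transport).Clone()
	t.MaxIdleConns = 1000       // default is 100 across all hosts
	t.MaxIdleConnsPerHost = 100 // default is only 2 per host
	t.IdleConnTimeout = 90 * time.Second
	return &http.Client{Transport: t}
}
```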
@yolocs added the kind/bug label on Jun 11, 2020
@yolocs added this to the Backlog milestone on Jun 11, 2020
@yolocs added the area/broker, priority/1, and release/2 labels on Jun 11, 2020
liu-cong (Contributor) commented:

cc

grantr (Contributor) commented on Jun 11, 2020

Thank you for doing this research @yolocs! Great to have this clear breakdown of the issues.

My initial thoughts:

Problem 1: This seems like a bug. We should be reusing connections and only talking to DNS when a new connection is made. Hopefully a fix is possible.

Problem 2: Tricky. It seems like we need more active memory management and backpressure to avoid this. Or maybe consider using a Knative Service? Controversial! But we need stable memory usage even with Serving, so let's try fixing that first.
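(To make the backpressure idea concrete, and this is only a hypothetical sketch rather than the current fanout code: bound in-flight deliveries with a weighted semaphore, so memory tracks the bound instead of the incoming qps.)

```go
package fanout

import (
	"context"

	"golang.org/x/sync/semaphore"
)

// deliverAll fans an event out to targets while holding at most maxInFlight
// deliveries in memory; additional work blocks until a slot frees up instead
// of growing the heap.
func deliverAll(ctx context.Context, targets []string, maxInFlight int64,
	deliver func(context.Context, string) error) error {
	sem := semaphore.NewWeighted(maxInFlight)
	for _, t := range targets {
		if err := sem.Acquire(ctx, 1); err != nil {
			return err // context canceled
		}
		go func(target string) {
			defer sem.Release(1)
			_ = deliver(ctx, target) // retries/error handling omitted in this sketch
		}(t)
	}
	// Acquiring the full weight waits for all in-flight deliveries to finish.
	return sem.Acquire(ctx, maxInFlight)
}
```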

Problem 3: There are probably some improvements we can make to per-trigger memory usage, but this seems like a fundamental horizontal scaling issue that we'll be working on for a long time. In our initial design doc we wanted to support 1000 triggers, so we should at least hit that target. The BrokerCell might be able to do some simple vertical scaling calculations based on the number of triggers.

yolocs (Member, Author) commented on Aug 4, 2020

More concrete issues (e.g. #1548, #1550, #1511, #1540) are replacing this one.

@yolocs closed this as completed on Aug 4, 2020