
Stabilize broker components #1265

Closed
2 of 3 tasks
yolocs opened this issue Jun 11, 2020 · 3 comments

Comments

yolocs (Member) commented on Jun 11, 2020

Describe the bug
Under a certain setup and scale, the broker components become unstable. Here are a few problems I have observed.

Setup: 1 broker, 100 triggers (subscribing to the same event), 100 ingress qps.

Problem 1: kube-dns stopped responding. Errors from the fanout pods look like:

```
{"level":"warn","ts":"2020-06-10T18:54:20.552Z","logger":"broker-fanout","caller":"deliver/processor.go:118","msg":"target delivery failed","commit":"a12cfb0","target":"loadtest/testbroker/trigger-actor-57","error":"Post \"http://actor-57.loadtest.svc.cluster.local/\": dial tcp: lookup actor-57.loadtest.svc.cluster.local on 10.4.0.10:53: dial udp 10.4.0.10:53: operation was canceled"}
```

Problem 2: unstable memory usage causing pods to be killed for OOM. When I increased ingress qps from 10 to 100, there was a surge in memory usage. With our current memory limit, the existing fanout pods were killed instantly (before HPA could kick in), which ended up in CrashLoopBackOff. However, after I removed the memory limit and repeated the same process, the fanout pods eventually settled at less memory than our limit once the initial spike passed.

Problem 3: the number of triggers to fan out to (which we can't control) dominates memory usage. Ideally we want higher Pub/Sub pull concurrency for better throughput. However, with even a modest number of triggers, the memory cost is multiplied by at least the number of triggers. With a memory limit in place, this can easily lead to OOM and cause problem 2. But if we limit pull concurrency, we run the risk of low throughput even when there are not many triggers.
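For reference, the pull concurrency knobs in question are the Go Pub/Sub client's ReceiveSettings. A minimal sketch, with placeholder project/subscription names and illustrative numbers (not what fanout currently sets), of how they bound per-pod memory:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	// "my-project" and "broker-fanout-sub" are placeholders for this sketch.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	sub := client.Subscription("broker-fanout-sub")
	// Bound how much work a single pod pulls at once. Each outstanding message
	// fans out to every trigger, so the effective memory cost grows roughly as
	// MaxOutstandingMessages x (# of triggers).
	sub.ReceiveSettings.MaxOutstandingMessages = 100
	sub.ReceiveSettings.MaxOutstandingBytes = 100 * 1024 * 1024 // 100 MiB of raw payload
	sub.ReceiveSettings.NumGoroutines = 1

	if err := sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// deliver to triggers here, then ack
		m.Ack()
	}); err != nil {
		log.Fatal(err)
	}
}
```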

Proposed solutions

  • I believe users can modify the scale of kube-dns in the kube-system namespace
  • Increase MaxIdleConns of the delivery HTTP client to improve connection reuse, so that less load is put on DNS (see the sketch after this list)
  • Remove all memory limits or set them really high (e.g. 5Gi)
    • The GKE guideline now suggests setting memory limit = request (we could potentially make an exception)
    • This could be controversial if we end up setting the request as high as the limit
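A rough sketch of the MaxIdleConns change mentioned above (values are illustrative, not a proposal for the actual defaults):

```go
package deliver

import (
	"net/http"
	"time"
)

// newDeliveryClient returns an HTTP client that keeps enough idle connections
// around so repeated deliveries to the same trigger reuse TCP connections
// instead of forcing a kube-dns lookup and a new dial for every request.
func newDeliveryClient() *http.Client {
	t := http.DefaultTransport.(*http.Transport).Clone()
	t.MaxIdleConns = 1000       // default is 100 across all hosts
	t.MaxIdleConnsPerHost = 100 // default is only 2 per host
	t.IdleConnTimeout = 90 * time.Second
	return &http.Client{Transport: t}
}
```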
@yolocs added the kind/bug label on Jun 11, 2020
@yolocs added this to the Backlog milestone on Jun 11, 2020
@yolocs added the area/broker, priority/1, and release/2 labels on Jun 11, 2020
liu-cong (Contributor) commented:

cc

grantr (Contributor) commented on Jun 11, 2020

Thank you for doing this research @yolocs! Great to have this clear breakdown of the issues.

My initial thoughts:

Problem 1: This seems like a bug. We should be reusing connections and only talking to DNS when a new connection is made. Hopefully a fix is possible.

Problem 2: Tricky. It seems like we need more active memory management and backpressure to avoid this. Or maybe consider using a Knative Service? Controversial! But we need stable memory usage even with Serving, so let's try fixing that first.
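(To make the backpressure idea concrete, and this is only a hypothetical sketch rather than the current fanout code: bound in-flight deliveries with a weighted semaphore, so memory tracks the bound instead of the incoming qps.)

```go
package fanout

import (
	"context"

	"golang.org/x/sync/semaphore"
)

// deliverAll fans an event out to targets while holding at most maxInFlight
// deliveries in memory; additional work blocks until a slot frees up instead
// of growing the heap.
func deliverAll(ctx context.Context, targets []string, maxInFlight int64,
	deliver func(context.Context, string) error) error {
	sem := semaphore.NewWeighted(maxInFlight)
	for _, t := range targets {
		if err := sem.Acquire(ctx, 1); err != nil {
			return err // context canceled
		}
		go func(target string) {
			defer sem.Release(1)
			_ = deliver(ctx, target) // retries/error handling omitted in this sketch
		}(t)
	}
	// Acquiring the full weight waits for all in-flight deliveries to finish.
	return sem.Acquire(ctx, maxInFlight)
}
```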

Problem 3: There are probably some improvements we can make to per-trigger memory usage, but this seems like a fundamental horizontal scaling issue that we'll be working on for a long time. In our initial design doc we wanted to support 1000 triggers, so we should at least hit that target. The BrokerCell might be able to do some simple vertical scaling calculations based on the number of triggers.

yolocs (Member, Author) commented on Aug 4, 2020

More concrete issues (e.g. #1548, #1550, #1511, #1540) are replacing this one.

@yolocs closed this as completed on Aug 4, 2020