This repository has been archived by the owner on Jun 19, 2022. It is now read-only.

Tune broker components #1269

Merged · 3 commits merged into google:master from b/reuseconn on Jun 12, 2020
Conversation

@yolocs (Member) commented on Jun 11, 2020

Some mitigation for #1265

Proposed Changes

  • Adjusted broker components' memory resources
    • Increased the ingress memory limit to 1000Mi
    • Set memory limits much higher than the requests
  • Adjusted HPA
    • The target average memory usage for fanout/retry is set to half of the limit (this hopefully helps mitigate problems from surging memory usage)
    • Lowered the max replicas for fanout/retry, as I've seen a higher DNS error rate when there are more than 10 replicas
  • Relaxed the liveness probe timeout (should help reduce some noise)
  • Overrode MaxIdleConns for the delivery HTTP client (should help reuse connections and reduce the chance of DNS errors); a rough sketch follows this list
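
For concreteness, here is a minimal, self-contained sketch of the delivery HTTP client override described above. The package name and file layout are illustrative, not the exact file from this PR; the values mirror the diff reviewed below, and the per-host limit is tuned further in the review thread.

```go
// Sketch only; illustrative package name, not the exact file in this PR.
package delivery

import (
	"net/http"

	"go.opencensus.io/plugin/ochttp"
)

// DefaultHTTPClient is the shared client used for event delivery. Raising the
// idle-connection limits lets deliveries reuse existing TCP connections instead
// of re-dialing (and re-resolving DNS) for every request.
var DefaultHTTPClient = &http.Client{
	Transport: &ochttp.Transport{
		Base: &http.Transport{
			MaxIdleConns:        1000,
			MaxIdleConnsPerHost: 1000, // revised during review; see the thread below
		},
	},
}
```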

Release Note

- Adjusted broker components' memory resources
  - Increased the ingress memory limit to 1000Mi
  - Set memory limits much higher than the requests
- Adjusted HPA
  - The target average memory usage for fanout/retry is set to half of the limit (this hopefully helps mitigate problems from surging memory usage)
  - Lowered the max replicas for fanout/retry, as a higher DNS error rate was seen with more than 10 replicas
- Relaxed the liveness probe timeout (should help reduce some noise)
- Overrode `MaxIdleConns` for the delivery HTTP client (should help reuse connections and reduce the chance of DNS errors)

Docs

@knative-prow-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yolocs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@googlebot added the cla: yes (override cla status due to multiple authors bug) label on Jun 11, 2020
@yolocs (Member, Author) commented on Jun 11, 2020:

/retest

```diff
@@ -37,6 +38,11 @@ var (
 	DefaultHTTPClient = &http.Client{
 		Transport: &ochttp.Transport{
+			Base: &http.Transport{
+				MaxIdleConns:        1000,
+				MaxIdleConnsPerHost: 1000,
```
Contributor commented on this hunk:
Since we generally have more than one target, I think this value should be smaller than MaxIdleConns.

I suggest the following numbers. @grantr any thoughts on this?

```go
MaxIdleConns: 1000,
// Assume an average of 20 triggers, just a guess. This makes sure idle conns are more fairly distributed across targets.
MaxIdleConnsPerHost: 50,
// Also set max conns per target to limit the resource usage per target. This should also help limit the memory usage.
MaxConnsPerHost: 100,
```

yolocs (Member, Author) replied:

Open to suggestions. Some of my thinking:

  1. We don't have to guarantee fairness here. When you have 100 triggers but only one is actively receiving events, I see no reason not to let it take more idle connections. It should naturally evolve to "active triggers have more idle conns and less active ones have fewer". Although it might be worth setting MaxIdleConnsPerHost a little lower than MaxIdleConns: under high concurrency I occasionally saw failures to put a new idle conn for a host because it was hitting the max idle conns.

  2. Setting MaxConnsPerHost should effectively limit resource usage, but there is a balance between this and maximizing throughput, especially since we don't have "queue length based" autoscaling.

Contributor replied:

RE #1: We need to find a good balance point. A high MaxIdleConnsPerHost handles the "small number of active triggers" scenario well, but not so well when there are many active triggers. So I took a guess of 20 active triggers on average; that might be too big. Ideally we could make this configurable, but we have to pick a guess for now. How about setting MaxIdleConnsPerHost to 100 or 200?

RE #2: Agreed it's a tradeoff. Since we can run into OOM, setting this will help. Maybe something like 200 is a good balance point?

yolocs (Member, Author) replied:

I'm thinking a bigger number like 500 for both MaxIdleConnsPerHost and MaxConnsPerHost. Since I now set a high memory limit with the HPA target at 50%, that should help with the OOM problem (not always, but it should help some). Thoughts?

Contributor replied:

Is it better to have MaxIdleConnsPerHost < MaxConnsPerHost? That gives other triggers more room to maintain some idle connections. The tradeoff is that it's less efficient for handling surges.

I don't think we can come up with perfect numbers; we may need to tune them later anyway. I will lgtm whatever you come up with :)

yolocs (Member, Author) replied:

Makes sense. Set them both to 500 now.
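
To make the outcome concrete, the Base transport would end up roughly like this. This is a sketch that reuses the client shape from the diff above; only the connection limits reflect what was agreed in this thread.

```go
// Sketch of the settings as agreed above (same client shape as the diff).
var DefaultHTTPClient = &http.Client{
	Transport: &ochttp.Transport{
		Base: &http.Transport{
			MaxIdleConns: 1000, // large overall idle pool so deliveries reuse connections
			// Per-host limits settled at 500: below MaxIdleConns so one target
			// cannot hold every idle connection, yet high enough for a few hot
			// triggers; MaxConnsPerHost also caps per-target resource usage.
			MaxIdleConnsPerHost: 500,
			MaxConnsPerHost:     500,
		},
	},
}
```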

@liu-cong (Contributor) commented:

/lgtm

@knative-prow-robot merged commit 1d5ef63 into google:master on Jun 12, 2020
@yolocs deleted the b/reuseconn branch on June 12, 2020 at 19:28
Labels: approved, cla: yes (override cla status due to multiple authors bug), lgtm, size/M