
Make pull-based receive adapter HA and scalable #3157

Closed
lionelvillard opened this issue May 18, 2020 · 23 comments
Assignees
Labels
area/api, area/sources, kind/feature-request, lifecycle/stale, priority/important-longterm
Milestone

Comments

@lionelvillard
Member

The current multi-tenant receive adapter architecture does not scale, which is a problem particularly for direct event delivery (from a source to a service).

Some background context:

  • The original receive adapter architecture scales, but at the cost of consuming a lot of resources, and it does not scale down to zero.
  • For PingSources specifically, the proposed CronJob and SinkBinding model scales (arguably, since the more CronJobs are scheduled, the less accurate they become), but it consumes a lot of resources. It does scale down to zero, though.
  • The multi-tenant receive adapter architecture is designed to reduce the resource inefficiency inherent in the above architectures, at the cost of not being scalable.
  • For the record, here is the general requirement to make sources scalable: Make Knative eventing sources more serverless and scalable #2153

The multi-tenant receive adapter does not scale, and there is a limit on the number of in-flight requests a single pod can handle. As an example, 100,000 in-flight requests require about 3GB and 3-4 (or more) CPUs. The number of in-flight requests can be very high due to direct delivery to synchronous and potentially long-running services: 100,000 in-flight requests corresponds to 10,000 PingSources each sending one event per minute to a service that takes 10 minutes to respond. Most importantly, latency increases proportionally to the number of in-flight requests (about 1-2 seconds per 500 requests).
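For reference, a back-of-the-envelope check of that number using Little's law, with the figures from the paragraph above:

```go
package main

import "fmt"

func main() {
	// Back-of-the-envelope check (Little's law):
	// in-flight requests ≈ arrival rate × time spent in the system.
	const pingSources = 10000 // PingSources, each firing one event per minute
	const eventsPerMinute = pingSources * 1
	const sinkMinutes = 10 // a synchronous sink that takes ~10 minutes to respond

	fmt.Println(eventsPerMinute * sinkMinutes) // 100000 concurrent in-flight requests
}
```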

I can think of two solutions:

  • static partitioning: the controller (e.g. the PingSource controller) assigns CRs to a particular receive adapter. The difficulty is coming up with a partitioning function that does not waste resources, and rebalancing is a challenge (e.g. how to avoid duplicate events). A minimal hash-based sketch follows this list.
  • load balancing via an event forwarding proxy: events are sent to a stateless event proxy service. This service distributes requests to the backend pods, whose sole purpose is to forward events to the actual target. One question is which load balancing strategy to use. By default, k8s load balancing is random, but in this particular case an IPVS-based load balancer with a least-connection strategy seems to be the right solution. Maybe random distribution works, but what happens when a request is sent to an overloaded pod?
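For the static partitioning option, a minimal sketch of a naive hash-based assignment of CRs to receive-adapter buckets; the function name and bucket count are purely illustrative:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucketFor assigns a CR (identified by namespace/name) to one of n receive-adapter
// buckets. It is stable only while n stays fixed: changing n reassigns most keys,
// which is exactly the rebalancing / duplicate-event problem mentioned above.
func bucketFor(namespacedName string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(namespacedName))
	return h.Sum32() % n
}

func main() {
	fmt.Println(bucketFor("default/ping-every-minute", 4))
}
```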

TL;DR I'm leaning towards adding a general-purpose event forwarding service to help with sending events to potentially long-running, synchronous destinations. This service can be used internally by sources. For instance, the PingSource controller can decide to forward events to the proxy when there are more than 500 direct PingSources. Potentially all multi-tenant sources could benefit from it.
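For the forwarding proxy option, a minimal sketch of what the data plane could look like, assuming a hypothetical K-Target header that carries the final sink URL (that header is not an existing Knative contract):

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Generous timeout: the whole point is that sinks may hold a request open for minutes.
	client := &http.Client{Timeout: 15 * time.Minute}

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Hypothetical header carrying the final sink URL; purely illustrative.
		target := r.Header.Get("K-Target")
		if target == "" {
			http.Error(w, "missing K-Target header", http.StatusBadRequest)
			return
		}
		req, err := http.NewRequestWithContext(r.Context(), http.MethodPost, target, r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		req.Header = r.Header.Clone() // pass CloudEvents (ce-*) headers through unchanged

		resp, err := client.Do(req)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		w.WriteHeader(resp.StatusCode) // best effort: just relay the sink's status code
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Because the proxy is stateless, each pod only holds open connections for its share of the in-flight requests, so scaling the deployment horizontally spreads the connection load.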

I'd like to gather people's thoughts on this issue, so please add your comments below.

@lionelvillard
Member Author

Additional note: it seems that a Knative Service is the right solution for the proxy. The only possibility is to implement the proxy in knative-sandbox and to have a configuration parameter named event-forwarding-proxy in eventing.

@lionelvillard
Member Author

Additional note 2: direct event delivery does not have to be reliable (no guarantees). Think of the proxy as a trimmed-down version of the IMC, without the fan-out and without the channel/subscription.

@slinkydeveloper
Contributor

slinkydeveloper commented May 19, 2020

I have some questions about some specific things I don't get:

  • About static partitioning: you cannot watch the same CRD instances from several controllers, unless you find some way to mutually exclude them (e.g. a "third-party controller" applies a label that partitions the CRs, assigning each one to a specific receive adapter instance). Even in that case, this sounds quite hard and convoluted to implement, as you underlined.
  • About load balancing and Additional note 2: when I read this thread I had this exact feeling. When you have such a complex use case that requires scaling, would it be better to just tell users to "create a channel in the middle"? Is there any specific reason why you want to optimize this direct source -> sink use case?

@lionelvillard
Member Author

  • About static partitioning: the idea is to let the controller assign CRs to a given receive adapter. I agree it's hard and convoluted, and that's why I don't like this solution. I just put it there for the record.
  • The user does not necessarily care about scaling direct event delivery; they just want it to work. However, in a multi-tenant environment, the operator does care about it. They can't ask the user to "create a channel in the middle" because the mt receive adapter is busy handling requests from other tenants.

@slinkydeveloper does this answer your questions?

@slinkydeveloper
Contributor

However, in a multi-tenant environment, the operator does care about it. They can't ask the user to "create a channel in the middle" because the mt receive adapter is busy handling requests from other tenants.

I think I lack some context to fully understand the implications and provide some meaningful feedback here 😄 Do you have any document explaining this + the mt architecture and the reasons behind it?

@lionelvillard
Member Author

I'll have to dig up some old issues/PRs I guess :-)

In a nutshell, the multi-tenant receive adapters are more resource-efficient compared to their single-tenant equivalents. For instance, most of the time the stping receive adapter is idle, wasting resources. Instead, let's have one not-so-idle mtping receive adapter handling hundreds or thousands of PingSources. That's a win.

Note that this is not only for PingSources; it applies to other sources (like GitHub) too.

However, what we lose with the multi-tenant receive adapters is scalability, hence this proposal.

@matzew
Member

matzew commented May 20, 2020

Isn't this more like some "ksvc" acting as a source, with some sink injected?

@lionelvillard
Member Author

@matzew yes.

@duglin

duglin commented May 22, 2020

Note that if knServices supported async natively (meaning returning a 202 quickly), then I think it would lessen the need for this, because the clients (event sources) would get their outbound connection resources freed up quickly, while the sink (the ksvc) would still scale up based on processing lots of async requests rather than being limited to scaling only on inbound requests (which are gone once the 202 is returned).
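A minimal sketch of what such an async sink could look like with a plain net/http handler; this is not an existing knService feature, just an illustration of the 202 pattern:

```go
package main

import (
	"io/ioutil"
	"log"
	"net/http"
	"time"
)

// process stands in for the sink's long-running business logic.
func process(event []byte) {
	time.Sleep(10 * time.Minute) // e.g. the 10-minute service from the example above
	log.Printf("processed %d bytes", len(event))
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		body, err := ioutil.ReadAll(r.Body) // buffer the event so the connection can be released
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.WriteHeader(http.StatusAccepted) // 202: the source's outbound connection is freed right away
		go process(body)                   // the long-running work continues asynchronously
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```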

@n3wscott
Contributor

You should inject a channel between the source and the n consumers.

@lionelvillard
Member Author

I agree with injecting a channel. I wonder if the SinkBinding controller and the SinkBinding library are the right place to do this. WDYT @n3wscott?

@n3wscott
Contributor

I would not overload SinkBinding to do this. First, how would you tell the control plane who a subscriber should be? I think this comes back to the idea of leveraging subscriptions to any object. I am not sure why this needs to be a magic layer in Knative, to be honest; we have the building blocks to provide this, and Flow would have been the object at the highest level of the model to enable what you are asking for. Perhaps it is time to think about Flow again?

TL;DR: I think this request should be solved with a new layer or abstraction, not by overloading the objects in the current model with complexity.

@lionelvillard
Member Author

I don't think we need a new abstraction. Direct delivery is IMHO a good abstraction; we (the Knative authors) "just" need to make the implementation efficient and scalable (which might sound like magic).

I agree we do have (almost?) all the building blocks we need. We just need to figure out how to put them together so that direct delivery scales out-of-the-box.

@lionelvillard
Member Author

For the record, PingSource is special because of its bursty and predictable traffic behavior. For schedules below a certain threshold (e.g. 2 minutes), connections should be kept open.

@lberk added the priority/important-longterm and area/api labels Jun 1, 2020
@duglin

duglin commented Jun 16, 2020

It feels a bit like this issue is trying to cover many different topics and I'm getting confused. If it's just me, then ignore this comment:

The current multi-tenant receive adapter architecture does not scale, which is a problem particularly for direct event delivery (from a source to a service).

Some questions:

  • are we talking about push or pull model? e.g. GitHub vs Kafka model
  • are we talking about scaling the adapter due to the size/amount of incoming events?
  • are we talking about how to deal with outbound connections (adapters -> sink), due to those connections potentially being long-lived and the risk of running out of available connections?
  • are we talking about how to partition CRs across adapters that also need to scale (either due to the incoming events or due to the large # of CRs)?
  • where does "direct event delivery" fit into any of this discussion? (aside from the outbound connection aspect mentioned above)
  • or is this focused on something totally different?

If we can narrow down which problem we're trying to solve I think it would help focus the discussion.

@lionelvillard
Member Author

Some questions:

* are we talking about push or pull model? e.g. GitHub vs Kafka model

It's more an issue for pull model.

* are we talking about scaling the adapter due to the size/amount of incoming events?

Not in this issue. I'm trying to reduce the scope of this overarching issue a bit: #2153.

* are we talking about how to deal with outbound connections (adapters -> sink) due to those connections potentially being long-lived and we might run out of available connections?

Yes. This issue applies to multi-tenant receive adapters that can handle many outbound connections.

* are we talking about how to partition CRs across adapters that need to also scale (either due to the incoming events or due to the large # of CRs) ?

That's one possible solution. The one I'm favoring right now.

* where does "direct event delivery" fit into any of this discussion? (aside from the outbound connection aspect mentioned above)

This is an attempt to reduce the scope of this discussion.

* or is this focused on something totally different?

If we can narrow down which problem we're trying to solve I think it would help focus the discussion.

Yes. On the other hand, there is an opportunity to solve the big scalability and HA issue (sorry, adding something else to the table) once and for all. Maybe too ambitious?

@lionelvillard changed the title from "Make direct event delivery scalable" to "Make pull-based receive adapter HA and scalable" Jun 17, 2020
@lionelvillard
Member Author

I updated the title to better reflect what this issue is about.

I'm working on a PoC heavily influenced by the Per-Reconciler Leader Election Feature Track. The initial plan is:

  • Add bucket-based leader election in the adapter shared main. Leader election is optional (the mt GitHub adapter is already HA and scalable). Use the ConfigMap resource lock (a minimal sketch is shown below).
  • Eventually: improve the leader-election algorithm provided by the k8s client SDK to also be push-based, in order to reduce downtime.
  • Refactor mtping:
    • use the adapter shared main (see the mt GitHub adapter for an example)
    • use ConfigMaps containing PingSource schedules, instead of informers
    • implement static partitioning and autoscaling in the PingSource controller

I haven't looked at other pull-based sources like KafkaSource or PrometheusSource in great detail, but I don't see a reason why the leader-election approach shouldn't work for scaling them. It should work for CouchDB.
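For reference, a minimal sketch of the bucket-based ConfigMap leader election from the plan above, using the client-go leaderelection package; the namespace, ConfigMap name, and bucket layout are illustrative only:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One ConfigMap lock per bucket; the namespace and names here are illustrative only.
	lock := &resourcelock.ConfigMapLock{
		ConfigMapMeta: metav1.ObjectMeta{Namespace: "knative-eventing", Name: "mtping-bucket-0"},
		Client:        client.CoreV1(),
		LockConfig:    resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the pod holding this lock serves the PingSources assigned to bucket 0.
				log.Print("became leader for bucket 0")
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				log.Print("lost leadership for bucket 0; stop serving it")
			},
		},
	})
}
```

The adapter would run one such election per bucket, and only the replica holding a bucket's lock serves the PingSources assigned to that bucket.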

If you are interested in this topic and want to help, let me know!

@slinkydeveloper
Contributor

ah ok so you want to use leader election in order to perform static partitioning?

@lionelvillard
Member Author

yes :-)

@lionelvillard
Member Author

The k8s leader election implementation is buggy (kubernetes/kubernetes#91942) and not ideal. We need a way to plug in alternative implementations.

@lionelvillard
Member Author

Targeting 0.16 for mtping.

/milestone 0.16

@knative-prow-robot
Contributor

@lionelvillard: You must be a member of the knative/knative-milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your and have them propose you as an additional delegate for this responsibility.

In response to this:

Targeting 0.16 for mtping.

/milestone 0.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions bot added the lifecycle/stale label Dec 14, 2020