
Implement Weighted Random Selection Queue (WRSQ) load balancing #14597

Open
mattklein123 opened this issue Jan 7, 2021 · 17 comments

@mattklein123 (Member) commented Jan 7, 2021

See #14360 (comment) and tonya11en#1 (comment). This will require #14569 to be merged and #14360 to be resolved.

We should use WRSQ for:

  • Weighted priority picking
  • Weighted locality picking
  • WRR host picking

We will still use EDF for WLR host picking or any other future picking in which weights can and do rapidly change.
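
For anyone landing here without context, the core primitive here is weighted random selection: pick an object with probability proportional to its weight. The sketch below only illustrates that primitive (the class and method names are made up for the example); it is not the WRSQ scheduler from #14681, which builds on this idea but differs in the details.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Minimal weighted random selection: sample a point in [0, total_weight) and
// binary-search the cumulative-weight table to find the owning object.
template <class T> class WeightedRandomPicker {
public:
  void add(T obj, uint64_t weight) {
    total_weight_ += weight;
    cumulative_.push_back(total_weight_);
    objects_.push_back(std::move(obj));
  }

  const T& pick(std::mt19937_64& rng) const {
    assert(total_weight_ > 0);
    std::uniform_int_distribution<uint64_t> dist(0, total_weight_ - 1);
    const uint64_t point = dist(rng);
    // The first cumulative weight strictly greater than `point` is the winner,
    // so an object with weight w is chosen with probability w / total_weight.
    const auto it = std::upper_bound(cumulative_.begin(), cumulative_.end(), point);
    return objects_[it - cumulative_.begin()];
  }

private:
  std::vector<uint64_t> cumulative_;
  std::vector<T> objects_;
  uint64_t total_weight_{0};
};
```

In this naive form, any weight change forces the cumulative table to be rebuilt, which is one way to see why EDF stays attractive for pickers whose weights change rapidly.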

cc @tonya11en @antoniovicente @htuch @snowp

@mattklein123 (Member, Author)

@tonya11en is this something that you would like to implement or should I look at implementing it?

@tonya11en (Member)

@mattklein123 I'd like to implement it, but I probably won't be able to send a patch out until the weekend. I don't feel strongly about it, so if you can knock it out sooner, go for it and I'll be happy to review.

@mattklein123 (Member, Author)

@tonya11en go for it. I'll assign it over to you and assume you're working on it for now.

@htuch (Member) commented Jan 8, 2021

Can we guard the existing behavior with a runtime flag for at least a period of time? I agree with the direction here, but I would like to make sure we move carefully when changing LB behaviors. In separate discussions with @antoniovicente, I think we want substantial changes to affinity balancing (i.e. changes that break hash consistency) to become new LB algorithms so that we can safely roll out and migrate.

@mattklein123 (Member, Author)

Yes, the plan would be to feature flag for sure.

I think we want substantial changes to affinity balancing (i.e. changes that break hash consistency) to become new LB algorithms so that we can safely roll out and migrate.

Sorry, can you clarify?

@htuch (Member) commented Jan 8, 2021

I think we want substantial changes to affinity balancing (i.e. changes that break hash consistency) to become new LB algorithms so that we can safely roll out and migrate.

Sorry, can you clarify?

Let's say we modify the ring hash LB and change the hash algorithm, so that a newly rolled out binary would not be consistent with the hashing of a previous Envoy running the ring hash LB. I think the safe thing to do here would be to retain the existing algorithm and allow the fleet to flip to the new hash function once the rollout completes. This could be done with a runtime flag, but it might be more convenient to treat it as a new LB algorithm as well.
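
To make the rollout hazard concrete, here is a toy, self-contained illustration; the hash functions and "ring" are placeholders, not Envoy's ring hash implementation. The same key can land on different hosts under the old and new hash, so a fleet running a mix of binaries disagrees until the rollout converges and the flag is flipped.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

using HashFn = uint64_t (*)(const std::string&);

// Stand-ins for an "old" and a "new" hash algorithm.
uint64_t hash_v1(const std::string& key) { return std::hash<std::string>{}(key); }
uint64_t hash_v2(const std::string& key) { return std::hash<std::string>{}(key + "|v2"); }

// Stand-in for a ring lookup: bucket the hash over the host list.
const std::string& pickHost(const std::vector<std::string>& hosts, HashFn hash,
                            const std::string& key) {
  return hosts[hash(key) % hosts.size()];
}

int main() {
  const std::vector<std::string> hosts{"host-a", "host-b", "host-c", "host-d"};
  const std::string key = "user-1234";
  // An Envoy still on the old hash and one on the new hash can route the same
  // key to different hosts; this is the rehash event being discussed.
  std::cout << pickHost(hosts, hash_v1, key) << " vs "
            << pickHost(hosts, hash_v2, key) << "\n";
}
```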

@mattklein123 (Member, Author)

This could be done with a runtime flag, but it might be more convenient to treat it as a new LB algorithm as well.

Yeah, this makes sense. We should talk more about this offline to figure out the best way of handling it. We have done some hash algorithm changes at Lyft, and they are pretty painful to roll out. It would be nice to do something better if possible.

@antoniovicente (Contributor)

I assume efficiency improvements to hash balancers will involve the introduction of new types. Changing the algorithm used via config would be a big challenge, since it would likely break a lot of things while the config in production hasn't fully converged to the new state.

@htuch (Member) commented Jan 11, 2021

@antoniovicente how would you suggest changing the algorithm (or version of the algorithm)? Is there some way to do this in production that wouldn't involve some period of inconsistency?

@yishuT commented Feb 7, 2021

Is there a wiki page or paper about WRSQ?

@tonya11en (Member)

Is there a wiki page or paper about WRSQ?

@yishuT I dreamed this one up, so unfortunately I don't have any reference material beyond what I wrote in the PR for the scheduler #14681. I'm happy to answer any questions you may have, to the best of my ability.

@yishuT commented Feb 7, 2021

Is there a wiki page or paper about WRSQ?

@yishuT I dreamed this one up, so unfortunately I don't have any reference material beyond what I wrote in the PR for the scheduler #14681. I'm happy to answer any questions you may have, to the best of my ability.

Ah wow, nice! Thanks for the pointer. I will read through the PR.

@antoniovicente (Contributor)

@antoniovicente how would you suggest changing the algorithm (or version of the algorithm)? Is there some way to do this in production that wouldn't involve some period of inconsistency?

I don't think there's a way to accomplish this without a user-visible rehash event.

@jmarantz (Contributor)

/sub

@jmarantz (Contributor)

With #14681 merged, what's the next step to resolving this bug?

@tonya11en (Member)

With #14681 merged, what's the next step to resolving this bug?

To address the issue that motivated this, I'll swap out the locality scheduler for the WRSQ scheduler. I'll guard it with a feature flag that can revert back to EDF in case it ends up causing unforeseen problems.

Assuming we have good results, I'll look into adding a config option in the common LB config.
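
As a rough sketch of what that guard might look like (the types, names, and flag plumbing below are invented for illustration; this is not Envoy's actual Scheduler interface or the real flag), one option is a small factory that falls back to EDF when the flag is off:

```cpp
#include <iostream>
#include <memory>
#include <string>

// Placeholder interface standing in for whatever the locality scheduler needs.
struct LocalitySchedulerSketch {
  virtual ~LocalitySchedulerSketch() = default;
  virtual std::string name() const = 0;
};

struct EdfSketch : LocalitySchedulerSketch {
  std::string name() const override { return "edf"; }
};

struct WrsqSketch : LocalitySchedulerSketch {
  std::string name() const override { return "wrsq"; }
};

// Feature-flag-guarded construction: flipping the flag off restores the
// current EDF behavior without a code change.
std::unique_ptr<LocalitySchedulerSketch> makeLocalityScheduler(bool wrsq_enabled) {
  if (wrsq_enabled) {
    return std::make_unique<WrsqSketch>();
  }
  return std::make_unique<EdfSketch>();
}

int main() {
  // In Envoy the boolean would come from a runtime feature flag lookup; here it
  // is a literal so the sketch stands alone.
  std::cout << makeLocalityScheduler(/*wrsq_enabled=*/true)->name() << "\n";
}
```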

@tonya11en (Member) commented Aug 26, 2021

FYI, I'm putting finishing touches on the final PR and will send it out in the next couple of days.

Update (9/2/2021):
Many tests relied on EDF's deterministic behavior, so I'm making changes to the tests. It's taking a bit longer than expected, but I'm making incremental progress. Hopefully it won't be much longer.
