
improve partition assignment balance (#93) #97

Merged Feb 10, 2021 · 3 commits

Conversation

@bobh66 (Contributor) commented Feb 9, 2021


Description

Modifies the partition assignment logic to produce a more balanced distribution of partitions across clients. Specifically:

  • when table_standby_replicas is set to 0 and the number of partitions is greater than or equal to the number of clients, the partitioner calculates a "minimum" capacity for each client on re-assignment: the smallest number of partitions each client should hold in a balanced distribution. Partitions beyond that minimum are removed from existing clients, ensuring there are enough unassigned partitions available to assign to every client.

  • the partitioner sorts the "candidate" clients by the number of partitions (active or standby) each one currently has, then round-robins over that set, so clients with fewer partitions receive new partitions before clients with more, as sketched below. This only works reliably when table_standby_replicas is 0, because the logic that "promotes" standby partition assignments to active pre-empts the balancing logic.
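The description above corresponds roughly to the following flow. This is only an illustrative sketch of the balancing idea, not the PR's actual assignor code: the `rebalance` function, the `assignments` mapping, and the client/partition representations are hypothetical stand-ins.

```python
from collections import deque
from typing import Dict, Set


def rebalance(assignments: Dict[str, Set[int]],
              partitions: Set[int]) -> Dict[str, Set[int]]:
    """Strip clients down to the "minimum" capacity, then round-robin
    the freed partitions over the clients holding the fewest."""
    minimum = len(partitions) // len(assignments)

    # Step 1: remove partitions from clients holding more than the minimum,
    # so there are enough unassigned partitions to give every client work.
    unassigned = set(partitions)
    for owned in assignments.values():
        owned &= partitions          # drop partitions that no longer exist
        while len(owned) > minimum:
            owned.pop()              # arbitrary choice of which excess partition to shed
        unassigned -= owned

    # Step 2: sort candidate clients by how many partitions they already
    # hold, then round-robin the unassigned partitions over them so the
    # emptiest clients are filled first.
    candidates = deque(sorted(assignments, key=lambda c: len(assignments[c])))
    for partition in sorted(unassigned):
        client = candidates.popleft()
        assignments[client].add(partition)
        candidates.append(client)
    return assignments
```

For example, with 4 partitions all owned by one of 4 clients, step 1 strips the loaded client down to one partition and step 2 hands one freed partition to each empty client.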

@jkgenser

@bobh66: Super excited for this. I came across this recently when scaling from 1 to 8 workers for an 8-partition topic. It's unfortunate that workers 5, 6, and I think 7 do not get a partition; only once 8/8 workers are up does the load split to 1 partition per worker. This will enable more granular scaling with no standby replicas. Separately, it looks like in the next release we'll be able to use one shared RocksDB file for multiple workers, which reduces the need for standby replicas in this configuration.

Of course, people with standby replicas in a sharded environment might want to use them.

@bobh66 (Contributor, Author) commented Feb 10, 2021

@jkgenser Note that the changes will only remove partitions from existing pods when table_standby_replicas is 0, and the default is 1, so you have to "activate" this change by setting it to 0. If you are using tables then I would not set table_standby_replicas to 0 unless you don't need the table data to persist across a restart or be accessed by multiple pods.
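For reference, a minimal sketch of opting in on the app definition (app id and broker URL are placeholders):

```python
import faust

# Placeholder app id and broker URL; the relevant part is setting
# table_standby_replicas to 0 (the default is 1), which enables the
# rebalancing behaviour described in this PR.
app = faust.App(
    'example-app',
    broker='kafka://localhost:9092',
    table_standby_replicas=0,
)
```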

There is a lot more complexity involved when trying to combine the need for balance with the need for "sticky" and "standby" partitions, so this is the first step to handle the simple case where no standby partitions are involved.

@jkgenser commented Feb 10, 2021

Makes sense. I'm planning on having multiple workers share the same DATA_DIR by bind-mounting the same directory into all workers, even though only one worker will own a given RocksDB partition at a time. I think this would be the primary case for using your new assignment logic.
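A minimal sketch of that setup on the app side, assuming a bind-mounted /shared/faust-data directory (the path, app id, and broker URL are placeholders):

```python
import faust

# Every worker points datadir at the same bind-mounted directory; only the
# worker that currently owns a table partition writes that partition's
# RocksDB files, so the directory can be shared in this scheme.
app = faust.App(
    'example-app',
    broker='kafka://localhost:9092',
    store='rocksdb://',
    datadir='/shared/faust-data',
    table_standby_replicas=0,
)
```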
