Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SimpleProducer randomization of initial round robin ordering #139

Merged
merged 4 commits into from
May 7, 2014
Merged

SimpleProducer randomization of initial round robin ordering #139

merged 4 commits into from
May 7, 2014

Conversation

alexcb
Copy link
Contributor

@alexcb alexcb commented Mar 12, 2014

Hello Folks,

I've noticed that the simple producer will always publish to partition 0, then 1, etc.

To achieve a better uniform distribution of messages per partition, would you consider
randomizing the list of partitions first? This would be especially helpful for cases
where the publisher only publishes a single message before destroying the SimpleProducer instance.

of partitions to prevent the first message from always being published
to partition 0.
@alexcb
Copy link
Contributor Author

alexcb commented Mar 12, 2014

I'm having troubles getting the integration test running in order to verify this on my own; however, I suspect it's due to the fact the test assumes that the offset will have been committed against partition 0, except now it's been randomized.

the initial round-robin ordering of partitions have been randomized
@alexcb
Copy link
Contributor Author

alexcb commented Mar 12, 2014

I've modified the integration tests to work with the random round-robin change. Please let me know if anything else is needed.

@alexcb
Copy link
Contributor Author

alexcb commented Mar 31, 2014

bump.. any thoughts on this?

@dpkp
Copy link
Owner

dpkp commented Mar 31, 2014

I agree that starting on partition 0 every time is a problem in many circumstances. But I wonder if this would be better as a different class? Perhaps RandomProducer ? or RandomCycleProducer?

@alexcb
Copy link
Contributor Author

alexcb commented Mar 31, 2014

Are there any use-cases where always starting with partition 0 is useful?

@dpkp
Copy link
Owner

dpkp commented Mar 31, 2014

easier for people who don't understand kafka yet to grok, that's about it. I'm not sure that the SimpleProducer should be expected to solve the even distribution problem. FWIW, at work we address this problem internally by using a custom producer class.

@alexcb
Copy link
Contributor Author

alexcb commented Apr 1, 2014

Thanks for the response,

I'm a bit puzzled as to what the SimpleProducer should be used for. If it's merely there as an example then perhaps it should be more appropriately named ExampleProducer as a precaution against using it in production.

If the intended use-case is to consistently publish to the first batch of messages to the first partition, then adding comments on how it should be used would be helpful.

If the issue is with the implementation of randomly shuffling the list of partitions, then perhaps we could just randomize the first initial starting partition. e.g. if partitions = [0,1,2,3], we could randomly pick to start on partition 2, then cycle 3,4,0,1...

As an aside, I like the idea of a RandomProducer which could just randomly pick a partition each time: return random.choice(self.client.topic_partitions[topic])

@dpkp
Copy link
Owner

dpkp commented Apr 1, 2014

I think originally SimpleProducer was intended to mirror the SimpleConsumer in, well, simplicity. SimpleConsumer was a mirror of the similarly named class in the reference java client. But the java client has no SimpleProducer (partly because production is always partition-specific, not topic), so I think the api was sort of made up.

Nonetheless, perhaps a suitable compromise would be to include an optional parameter like 'random_start' that, if set to True, would randomly select the first partition but then cycle as normal from there.

@wizzat
Copy link
Collaborator

wizzat commented Apr 1, 2014

What is the down side of having the initial partition selection be
randomized? Python tends to favor a "batteries included" approach, and I
can't imagine when the desired behavior of a "simple producer" would be to
always start with partition 0 - though I can think of a lot of ways where
this would not be the desired behavior. Remember: not everyone cares about
or even wants to know about Kafka internals - and no it's not a requirement
to send messages to a queue.

On Mon, Mar 31, 2014 at 7:54 PM, Dana Powers notifications@github.comwrote:

I think originally SimpleProducer was intended to mirror the
SimpleConsumer in, well, simplicity. SimpleConsumer was a mirror of the
similarly named class in the reference java client. But the java client has
no SimpleProducer (partly because production is always partition-specific,
not topic), so I think the api was sort of made up.

Nonetheless, perhaps a suitable compromise would be to include an optional
parameter like 'random_start' that, if set to True, would randomly select
the first partition but then cycle as normal from there.

Reply to this email directly or view it on GitHubhttps://github.com//pull/139#issuecomment-39166091
.

@alexcb
Copy link
Contributor Author

alexcb commented Apr 1, 2014

I'm up for adding a random_start boolean to the SimpleConsumer if that's what it'll take; however, I'm more inclined to agree with @wizzat in terms of the batteries-included sensible defaults (i.e. random_start=true) or better yet remove the option completely if there's no valid use-case for always starting with partition 0.

I can code up a prototype and update this pull request.

Cheers,
Alex

alexcb added 2 commits April 1, 2014 18:07
…ion of the sorted list of partition rather than completely randomizing the initial ordering before round-robin cycling the partitions
…ation of the initial partition messages are published to
@alexcb
Copy link
Contributor Author

alexcb commented Apr 2, 2014

I've added in a random_start parameter which defaults to false for the time being. Please let me know if anything else is needed.

@alexcb
Copy link
Contributor Author

alexcb commented Apr 15, 2014

bump... Any thoughts on this one?

I think the current SimpleProducer suffers from a thundering herd style problem and would like to see it mitigated. If anyone else has a different implementation I'm completely fine with that.

@wizzat
Copy link
Collaborator

wizzat commented Apr 15, 2014

I think this should be merged as of commit a81be57 (the initial pass with full pseudorandom partition order and passing tests)

@wizzat
Copy link
Collaborator

wizzat commented Apr 15, 2014

Just so everyone is on the same page: any process which starts up and sends a single message to Kakfa will currently overload partition 0 to the exclusion of all other partitions. A common use case of this would be web servers/microframeworks and many command line utilities. The default of not randomizing the partitions by default is literally dangerous.

@mumrah
Copy link
Collaborator

mumrah commented Apr 15, 2014

+1 looks good to me. If there are no objections I will commit this later this week.

@alexcb
Copy link
Contributor Author

alexcb commented May 6, 2014

bump...Any news on this?

There's currently randomization used when picking a host to connect to:
https://github.com/mumrah/kafka-python/blob/master/kafka/conn.py#L16 which was added by @mrtheb if I'm not mistaken.

It's not that different from wanting to randomize which partition is initially selected for publishing messages to.

dpkp added a commit that referenced this pull request May 7, 2014
SimpleProducer randomization of initial round robin ordering
@dpkp dpkp merged commit 3b18043 into dpkp:master May 7, 2014
wbarnha added a commit to kemalty/kafka-python that referenced this pull request Mar 14, 2024
Co-authored-by: drewdogg <drewdogg@users.noreply.github.com>
bradenneal1 pushed a commit to bradenneal1/kafka-python that referenced this pull request May 16, 2024
Co-authored-by: drewdogg <drewdogg@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants