
Strimzi support for Horizontal Pod Autoscaling #3331

Closed · sumukhballal opened this issue Jul 16, 2020 · 5 comments

sumukhballal commented Jul 16, 2020

Is your feature request related to a problem? Please describe.

I am in the process of integrating Strimzi Kafka with the Horizontal Pod Autoscaling (HPA) feature that Kubernetes offers. I ran into some issues but was eventually successful. I saw that the roadmap does not yet feature HPA support, but I went ahead and tried to implement it anyway.

Describe the solution you'd like

An eventual solution would be to integrate HPA out of the box, i.e. have an HPA section in the Kafka kind YAML where its properties can be set.
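
Something like the following sketch, where the hpa block is purely hypothetical (it does not exist in the Strimzi API) and only illustrates the shape such a section could take:

# Hypothetical only: Strimzi's Kafka CRD has no hpa section today.
apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: kafka-cluster
spec:
  kafka:
    replicas: 3
    hpa:                                  # hypothetical field
      minReplicas: 1
      maxReplicas: 5
      targetCPUUtilizationPercentage: 60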

Describe alternatives you've considered

I noticed that Strimzi needs the resources section of both containers populated, i.e. spec.kafka.resources as well as spec.kafka.tlsSidecar.resources in the Kafka kind resource. Once this was set, I deployed the backing services, i.e. the Metrics Server and the HPA resource itself.
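
For reference, a minimal sketch of this setup, assuming the Strimzi v1beta1 API and the autoscaling/v1 HPA current at the time; the resource values are illustrative, and only the names kafka-cluster, kafka-cluster-kafka and kafka-hpa come from the output below:

# Kafka CR excerpt: the HPA computes CPU utilization relative to requests,
# so both containers in the broker pod need resources set.
apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: kafka-cluster
spec:
  kafka:
    replicas: 3
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 4Gi
    tlsSidecar:
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 200m
          memory: 256Mi
---
# HPA targeting the StatefulSet that the operator creates for the brokers.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: kafka-cluster-kafka
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 60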

A major pain point is that when the HPA backs the StatefulSet, for example, and the threshold is breached, the HPA controller starts up a new pod from the reference StatefulSet pod spec. But this causes an issue: the TLS sidecar of the new pod does not get its certificates loaded. For example:

NAME        REFERENCE                         TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
kafka-hpa   StatefulSet/kafka-cluster-kafka   116%/60%   1         5         3          15m

116%/60% means a new pod will be spun up, which is kafka-cluster-kafka-3:

kafka-cluster-kafka-3 0/2 CrashLoopBackOff 1 28s

Logs:

Initializing service [zookeeper-2181]
/etc/tls-sidecar/kafka-brokers/kafka-cluster-kafka-3.key: No such file or directory (2)
Service [zookeeper-2181]: Failed to initialize SSL context

I assume this file is created by the Strimzi Kafka operator, which still has the previous value for the replica count, i.e.:

NAME            DESIRED KAFKA REPLICAS   DESIRED ZK REPLICAS
kafka-cluster   3                        3

Notice how kafka-cluster still shows 3 desired Kafka replicas; that needs to change as well.

For now I mitigated this by writing my own operator, which updates the Kafka kind CR with the new replica count; the Strimzi operator then takes care of the rest and everything works. At the moment the HPA cannot target custom CRDs directly, and unless I start contributing to the HPA project I do not think we will see any progress there. So the only customization that can be done is for something to update the Kafka CR with the new replica count every time an HPA change is seen.
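
In other words, the HPA's desired replica count has to be propagated into the Kafka CR rather than the StatefulSet. As a sketch, the custom operator effectively applies a merge patch like this (file name and replica value are illustrative):

# kafka-replicas-patch.yaml -- apply with:
#   kubectl patch kafka kafka-cluster --type merge -p "$(cat kafka-replicas-patch.yaml)"
# The cluster operator then reconciles the StatefulSet, certificates, etc.
spec:
  kafka:
    replicas: 4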

Thanks

sumukhballal changed the title from "[Enhancement] ..." to "Strimzi support for Horizontal Pod Autoscaling" on Jul 16, 2020
scholzj (Member) commented Jul 16, 2020

If you want to autoscale the Kafka brokers - and I do not think it is necessarily a good idea - you would need to do it by changing .spec.kafka.replicas in the Kafka CR. You cannot do it just by scaling the StatefulSet. If you scale only the StatefulSet, you would be missing a lot of things, such as the certificates (as you already found out), and the operator would also revert the change back. So you need to do it in the Kafka CR.

I added the scale subresource to some of our CRs to make them easier to autoscale, but not to Kafka, since we did not really see any use case there with @tombentley and @samuel-hawker, who IIRC were involved in the discussion. But in theory we can reconsider it to make this easier, if it helps.
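
For context, exposing a CR to the HPA means declaring the scale subresource in its CRD, which maps a /scale endpoint onto paths in the resource. A sketch of the relevant CRD excerpt, with hypothetical paths (one reason this is awkward for Kafka is that the CR carries two replica counts):

# CRD excerpt (apiextensions.k8s.io/v1): with this in place, the HPA and
# kubectl scale can target the custom resource directly.
# The paths below are hypothetical, not Strimzi's actual layout.
subresources:
  scale:
    specReplicasPath: .spec.kafka.replicas
    statusReplicasPath: .status.replicas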

@samuel-hawker (Contributor)

Yes, if I recall correctly, we decided that auto-scaling was difficult due to having multiple things to scale independently (both Kafka and ZooKeeper), and the question of how you would want to ratio them (if at all).

As for use cases, I suspect that for a production cluster you would not want to autoscale your brokers; more likely you would scale your topics, distributing their partitions across more brokers, etc.

As Jakub said above, I am also interested in your use case. Could you provide some details on why autoscaling would be preferable to simply having a larger cluster and scaling as topic sizes increase, via partition distribution?

scholzj (Member) commented Jul 16, 2020

I do not think that not scaling ZooKeeper is an issue in most cases. But I think the rest is still an issue - although with Cruise Control support it is a bit easier, I would personally still not use it. But having the scale subresource would make it easier for people to play with it, and maybe one day ...
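
For reference, the Cruise Control integration mentioned here is driven through a KafkaRebalance custom resource; a minimal sketch (an empty spec asks Cruise Control for a proposal using its default goals, which is then approved with the strimzi.io/rebalance=approve annotation):

apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaRebalance
metadata:
  name: my-rebalance                  # illustrative name
  labels:
    strimzi.io/cluster: kafka-cluster
spec: {}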

@agrubb86
@scholzj @samuel-hawker Even though this issue was closed by the original poster, I came across it and wanted to provide my use case, as this is exactly what I was trying to figure out how to do with my Strimzi cluster. I work for a mobile advertising company and our traffic is very time-dependent; we can have 5x as much traffic at 10pm as at 10am, all of which is recorded through Kafka. Generally, though, the peak traffic window lasts around 6-8 hours, and we were looking for a way to automatically add and remove brokers based on traffic levels (or CPU usage) to save on costs. Currently we are looking at implementing this via traffic detection external to Strimzi, but naturally a native implementation would be much preferable.

Thanks for your time!

scholzj (Member) commented Feb 11, 2021

Well, you need to consider the whole lifecycle ... adding or removing brokers is the easiest part of it:

  • How do you find out that you need to scale up or down, and how do you know that scaling the brokers will really help (e.g. that the bottleneck is not the partition count, etc.)?
  • How do you rebalance the cluster, i.e. move data onto the new brokers after scale-up or off the old brokers before scale-down?

I do not know how you use Kafka. But do you really have enough time to do the rebalances while scaling up and down during your time window? A rebalance can take a long time depending on how much data you have in the cluster. It might work for you, but it does not sound like your window is exactly large enough for this.

If you think you really have everything that is needed and the only missing piece is the scale up/down itself, I think we can definitely look at adding it to Strimzi.
