Add a default memory/CPU limit for operand #872

Closed
jpkrohling opened this issue Jan 23, 2020 · 17 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), question (Further information is requested)

Comments

@jpkrohling
Contributor

jpkrohling commented Jan 23, 2020

Now that the dynamic queue size feature got merged into Jaeger, we could think about adding a memory limit to the Collector's deployment. This way, we could enable the autoscale feature by default, while having a sensible default.

I think we could start small, with 128MiB: with tracegen, that would result in a queue size of 428810 spans, as its average span size is 313 bytes.

A sample Quarkus application with a JAX-RS server span, a JAX-RS client span, and an explicit span with one tag, calling a second application that follows the same pattern, yields a similar average (305 bytes), with the queue size adjusted to 440058.
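For reference, a minimal sketch of the arithmetic behind those numbers, assuming the dynamic queue simply divides its memory budget by the observed average span size (the helper name below is ours, not Jaeger's):

```go
package main

import "fmt"

// queueCapacity approximates the dynamic queue size: how many spans of the
// observed average size fit into the given memory budget.
func queueCapacity(memoryBytes, avgSpanSizeBytes int) int {
	if avgSpanSizeBytes == 0 {
		return 0
	}
	return memoryBytes / avgSpanSizeBytes
}

func main() {
	const budget = 128 * 1024 * 1024 // 128MiB

	fmt.Println(queueCapacity(budget, 313)) // 428810 (tracegen)
	fmt.Println(queueCapacity(budget, 305)) // 440058 (Quarkus sample)
}
```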

@ghost added the needs-triage (New issues, in need of classification) label on Jan 23, 2020
@jpkrohling added the enhancement (New feature or request) label and removed the needs-triage (New issues, in need of classification) label on Jan 23, 2020
@jpkrohling
Contributor Author

@objectiser, @pavolloffay, @rubenvp8510, what do you think?

@objectiser
Contributor

Sounds like a good idea to me. Would be good to run performance/scaling tests though.

@jkandasa
Member

@jpkrohling if the user limits the collector's memory to <= 128MiB, this will lead to an OOM exception, right?

@jpkrohling
Contributor Author

Right, if the container's limit is set to the same size as the max queue size, then it might eventually lead to OOM. Ideally, we could set the container's limit to 256 MiB and the max queue size to 128 MiB.

@objectiser
Contributor

@jpkrohling I thought the dynamic queue size enhancement would avoid such situations?

@jpkrohling
Contributor Author

jpkrohling commented Jan 24, 2020

The dynamic queue will calculate the number of items that fit into the queue based on the observed span size. The amount of memory to allocate is still up to the user, so if the user sets 128 MiB for the queue alone and 128 MiB for the container, there will be no room left for the other parts.

In earlier iterations of the PR, we did have logic that calculated the memory size based on the host's memory, but we decided to remove it, as it was too much magic.

@jkandasa
Member

@jpkrohling I think we can get the memory limit from the resources, if any are defined in the CR. If nothing is defined, we can go with the default options. Otherwise, take the defined memory limit and, if the dynamic queue memory is not defined, allocate some percentage of it to the dynamic queue, for example:
80% of memory if total memory >= 1GiB
70% of memory if total memory >= 768MiB
60% of memory if total memory >= 512MiB
55% of memory if total memory >= 256MiB
50% of memory if total memory <= 256MiB
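A minimal Go sketch of that table, assuming the container limit is known in MiB; the function name is ours, and since the table overlaps at exactly 256MiB, the sketch resolves that case in favor of 55%:

```go
package collector

// queueMemoryMiB derives the dynamic queue memory from the container's memory
// limit, following the percentages listed above (1GiB treated as 1024MiB).
func queueMemoryMiB(containerLimitMiB int64) int64 {
	switch {
	case containerLimitMiB >= 1024:
		return containerLimitMiB * 80 / 100
	case containerLimitMiB >= 768:
		return containerLimitMiB * 70 / 100
	case containerLimitMiB >= 512:
		return containerLimitMiB * 60 / 100
	case containerLimitMiB >= 256:
		return containerLimitMiB * 55 / 100
	default: // below 256MiB
		return containerLimitMiB * 50 / 100
	}
}
```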

Is it a good idea to limit the container's memory, if not defined in the CR?

@jpkrohling
Contributor Author

Is it a good idea to limit the container's memory, if not defined in the CR?

I think that's the main question. If we do specify a default limit, like 256 MiB, then we can set the queue size in memory as well and add the autoscaler. Without a default limit, we can't know which value to set the queue size to, and there's no way for the pod autoscaler to scale once "90% of the memory" is used up. If our default isn't suitable for a given instance, it can be explicitly set in the CR for that instance.
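A rough sketch of that defaulting rule, using the upstream Kubernetes API types; the 256Mi value is just the figure floated in this thread, and the function name is ours:

```go
package defaults

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// applyDefaultMemoryLimit fills in a memory limit only when the CR did not
// specify one, so an explicit value in the CR always wins.
func applyDefaultMemoryLimit(res *corev1.ResourceRequirements) {
	if res.Limits == nil {
		res.Limits = corev1.ResourceList{}
	}
	if _, ok := res.Limits[corev1.ResourceMemory]; !ok {
		res.Limits[corev1.ResourceMemory] = resource.MustParse("256Mi")
	}
}
```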

@pavolloffay
Member

Now that the dynamic queue size feature got merged into Jaeger, we could think about adding a memory limit to the Collector's deployment. This way, we could enable the autoscale feature by default, while having a sensible default.

Q: If there are no default resource values specified in the CR, how much memory/CPU is assigned?

Q: Why can't autoscaling be used without specifying the default resources?

@jpkrohling
Contributor Author

Q: If there are no default resource values specified in the CR, how much memory/CPU is assigned?

That's something to discuss. I suggested 256MiB for memory and 200m CPU, but it's really open so far.

Q: Why can't autoscaling be used without specifying the default resources?

It's not that it requires a default, it's more that it requires a value. When no value is set by the CR, we need to have a default to be used instead.

@pavolloffay
Member

That's something to discuss. I suggested 256MiB for memory and 200m CPU, but it's really open so far.

I mean, what is the current behavior when the resources are not defined in the Jaeger CR? How many resources are allocated to Jaeger?

@pavolloffay
Member

pavolloffay commented Jan 27, 2020

It's not that it requires a default, it's more that it requires a value. When no value is set by the CR, we need to have a default to be used instead.

Then I do not understand how autoscaling works. If we expose a metric showing the amount of allocated memory, then we might also get a metric which shows the pod's total memory. Can't the autoscaler work with these two?

@jpkrohling
Contributor Author

I mean, what is the current behavior when the resources are not defined in the Jaeger CR? How many resources are allocated to Jaeger?

I don't think there's a default Kubernetes limit in that case. If the pod is consuming more memory than the node can allocate, it will just crash with an OOM. When a limit is set, the pod won't be scheduled on low-memory nodes, which is why it's always a good idea to set the minimum memory requirements for every workload.

If we expose a metric showing the amount of allocated memory, then we might also get a metric which shows the pod's total memory. Can't the autoscaler work with these two?

If a limit isn't set, there's nothing like "pod's total memory".
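For context, a memory-based HPA for the collector Deployment would look roughly like the sketch below, written against the autoscaling/v2beta2 API that was current at the time; the object names and the 90% target are illustrative only. For resource metrics like this, the HPA needs the pods' resources to be set one way or another, which is why a default matters.

```go
package autoscaling

import (
	autoscalingv2beta2 "k8s.io/api/autoscaling/v2beta2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newCollectorHPA builds an autoscaler that scales the collector Deployment
// on average memory utilization. Names and target values are examples only.
func newCollectorHPA() *autoscalingv2beta2.HorizontalPodAutoscaler {
	target := int32(90)
	return &autoscalingv2beta2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "example-collector"},
		Spec: autoscalingv2beta2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2beta2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "example-collector",
			},
			MaxReplicas: 10,
			Metrics: []autoscalingv2beta2.MetricSpec{{
				Type: autoscalingv2beta2.ResourceMetricSourceType,
				Resource: &autoscalingv2beta2.ResourceMetricSource{
					Name: corev1.ResourceMemory,
					Target: autoscalingv2beta2.MetricTarget{
						Type:               autoscalingv2beta2.UtilizationMetricType,
						AverageUtilization: &target,
					},
				},
			}},
		},
	}
}
```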

@pavolloffay
Member

Thanks @jpkrohling, this makes it clear to me. I thought that k8s would allocate a default limit if nothing is provided.

@jpkrohling added the question (Further information is requested) label on Mar 9, 2020
@jpkrohling
Contributor Author

@jkandasa do you have some free cycles to run some experiments? When preparing a demo for HPA, I found that the optimal configuration for 10 tracegen clients was a collector with 128Mi requested, with a limit of 512Mi. The queue-size-memory was 64Mi (half of the 128Mi request). This would make the collector scale up to 10 replicas and still process data.

Ideally, we would test this scenario (or a variant) with @jkandasa's performance test framework, and check how many spans/second we can achieve without constantly dropping spans.
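Expressed with the core Kubernetes types, that configuration would look roughly like this (the variable name is ours; the 64Mi queue memory is passed to the collector separately via its queue-size-memory option):

```go
package collector

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// demoCollectorResources mirrors the setup that worked for 10 tracegen
// clients: 128Mi requested, a 512Mi limit, a 64Mi dynamic queue, and an HPA
// allowed to scale the collector up to 10 replicas.
var demoCollectorResources = corev1.ResourceRequirements{
	Requests: corev1.ResourceList{
		corev1.ResourceMemory: resource.MustParse("128Mi"),
	},
	Limits: corev1.ResourceList{
		corev1.ResourceMemory: resource.MustParse("512Mi"),
	},
}
```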

@jkandasa
Member

@jpkrohling I do not have any lab machines to test this. For now, I have only OCP in the OpenStack environment.
At this moment I am a bit busy with other tasks, so there might be a delay from my end.

I found that the optimal configuration for 10 tracegen clients was a collector with 128Mi requested, with a limit of 512Mi.

Is 512Mi the maximum resource limit?

@jpkrohling
Contributor Author

Yes, if we specify that 512Mi is the upper boundary, Kubernetes will kill the pod once this limit is reached.

@jpkrohling added the good first issue (Good for newcomers) label on Apr 12, 2021
@jpkrohling changed the title from "Add a default memory/CPU limit" to "Add a default memory/CPU limit for operand" on Jul 21, 2021