Add a default memory/CPU limit for operand #872

Closed
jpkrohling opened this issue Jan 23, 2020 · 17 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), question (Further information is requested)

Comments

@jpkrohling
Contributor

jpkrohling commented Jan 23, 2020

Now that the dynamic queue size feature got merged into Jaeger, we could think about adding a memory limit to the Collector's deployment. This way, we could enable the autoscale feature by default, while having a sensible default.

I think we could start small, with 128MiB: with tracegen, that would result in a queue size of 428810 spans, as its average span size is 313 bytes.

A sample Quarkus application with a JAX-RS server span, a JAX-RS client span, and an explicit span with one tag, calling a second application that follows the same pattern, yields a similar average (305 bytes), with the queue size adjusted to 440058.
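For reference, a minimal sketch of the arithmetic behind those numbers, assuming the dynamic queue simply divides its memory budget by the observed average span size (the helper name below is ours, not Jaeger's):

```go
package main

import "fmt"

// queueCapacity approximates the dynamic queue size: how many spans of the
// observed average size fit into the given memory budget.
func queueCapacity(memoryBytes, avgSpanSizeBytes int) int {
	if avgSpanSizeBytes == 0 {
		return 0
	}
	return memoryBytes / avgSpanSizeBytes
}

func main() {
	const budget = 128 * 1024 * 1024 // 128MiB

	fmt.Println(queueCapacity(budget, 313)) // 428810 (tracegen)
	fmt.Println(queueCapacity(budget, 305)) // 440058 (Quarkus sample)
}
```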

@ghost added the needs-triage (New issues, in need of classification) label on Jan 23, 2020
@jpkrohling added the enhancement (New feature or request) label and removed the needs-triage (New issues, in need of classification) label on Jan 23, 2020
@jpkrohling
Contributor Author

@objectiser, @pavolloffay, @rubenvp8510, what do you think?

@objectiser
Contributor

Sounds like a good idea to me. Would be good to run performance/scaling tests though.

@jkandasa
Member

@jpkrohling if the user limits the collector's memory to <= 128MiB, this will lead to an OOM exception, right?

@jpkrohling
Contributor Author

Right, if the container's limit is set to the same size as the max queue size, then it might eventually lead to OOM. Ideally, we could set the container's limit to 256 MiB and the max queue size to 128 MiB.

@objectiser
Contributor

@jpkrohling I thought the dynamic queue size enhancement would avoid such situations?

@jpkrohling
Contributor Author

jpkrohling commented Jan 24, 2020

The dynamic queue will calculate the number of items that fit into the queue based on the observed span size. The amount of memory to allocate is still up to the user, so if the user sets 128 MiB for the queue alone and 128 MiB for the container, there will be no room left for the other parts.

In earlier iterations of the PR, we did have logic that calculated the memory size based on the host's memory, but we decided to remove it, as it was too much magic.

@jkandasa
Member

@jpkrohling I think we can get the memory limit from the resources, if any are defined in the CR. If nothing is defined, we can go with the default options. Otherwise, take the defined memory limit and, if the dynamic queue memory is not defined, allocate some percentage of it to the dynamic queue, for example:
80% of memory if total memory >= 1GiB
70% of memory if total memory >= 768MiB
60% of memory if total memory >= 512MiB
55% of memory if total memory >= 256MiB
50% of memory if total memory <= 256MiB
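A minimal Go sketch of that table, assuming the container limit is known in MiB; the function name is ours, and since the table overlaps at exactly 256MiB, the sketch resolves that case in favor of 55%:

```go
package collector

// queueMemoryMiB derives the dynamic queue memory from the container's memory
// limit, following the percentages listed above (1GiB treated as 1024MiB).
func queueMemoryMiB(containerLimitMiB int64) int64 {
	switch {
	case containerLimitMiB >= 1024:
		return containerLimitMiB * 80 / 100
	case containerLimitMiB >= 768:
		return containerLimitMiB * 70 / 100
	case containerLimitMiB >= 512:
		return containerLimitMiB * 60 / 100
	case containerLimitMiB >= 256:
		return containerLimitMiB * 55 / 100
	default: // below 256MiB
		return containerLimitMiB * 50 / 100
	}
}
```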

Is it a good idea to limit the container's memory, if not defined in the CR?

@jpkrohling
Contributor Author

Is it a good idea to limit the container's memory, if not defined in the CR?

I think that's the main question. If we do specify a default limit, like 256 MiB, then we can set the queue size in memory as well and add the autoscaler. Without a default limit, we can't know which value to set the queue size to, and there's no way for the pod autoscaler to scale once "90% of the memory" is used up. If our default isn't suitable for a given instance, it can be explicitly set in the CR for that instance.
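A rough sketch of that defaulting rule, using the upstream Kubernetes API types; the 256Mi value is just the figure floated in this thread, and the function name is ours:

```go
package defaults

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// applyDefaultMemoryLimit fills in a memory limit only when the CR did not
// specify one, so an explicit value in the CR always wins.
func applyDefaultMemoryLimit(res *corev1.ResourceRequirements) {
	if res.Limits == nil {
		res.Limits = corev1.ResourceList{}
	}
	if _, ok := res.Limits[corev1.ResourceMemory]; !ok {
		res.Limits[corev1.ResourceMemory] = resource.MustParse("256Mi")
	}
}
```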

@pavolloffay
Member

Now that the dynamic queue size feature got merged into Jaeger, we could think about adding a memory limit to the Collector's deployment. This way, we could enable the autoscale feature by default, while having a sensible default.

Q: If there are no default resource values specified in the CR, how much memory/CPU is assigned?

Q: Why can't autoscaling be used without specifying the default resources?

@jpkrohling
Contributor Author

Q: If there are no default resource values specified in the CR, how much memory/CPU is assigned?

That's something to discuss. I suggested 256MiB for memory and 200m CPU, but it's really open so far.

Q: Why can't autoscaling be used without specifying the default resources?

It's not that it requires a default, it's more that it requires a value. When no value is set by the CR, we need to have a default to be used instead.

@pavolloffay
Member

That's something to discuss. I suggested 256MiB for memory and 200m CPU, but it's really open so far.

I mean, what is the current behavior when the resources are not defined in the Jaeger CR? How many resources are allocated to Jaeger?

@pavolloffay
Member

pavolloffay commented Jan 27, 2020

It's not that it requires a default, it's more that it requires a value. When no value is set by the CR, we need to have a default to be used instead.

Then I do not understand how autoscaling works. If we expose a metric showing the amount of allocated memory, then we might also get a metric which shows the pod's total memory. Can't the autoscaler work with these two?

@jpkrohling
Contributor Author

I mean, what is the current behavior when the resources are not defined in the Jaeger CR? How many resources are allocated to Jaeger?

I don't think there's a default Kubernetes limit in that case. If the pod is consuming more memory than the node can allocate, it will just crash with an OOM. When a limit is set, the pod won't be scheduled on low-memory nodes, which is why it's always a good idea to set the minimum memory requirements for every workload.

If we expose a metric showing the amount of allocated memory, then we might also get a metric which shows the pod's total memory. Can't the autoscaler work with these two?

If a limit isn't set, there's nothing like "pod's total memory".
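For context, a memory-based HPA for the collector Deployment would look roughly like the sketch below, written against the autoscaling/v2beta2 API that was current at the time; the object names and the 90% target are illustrative only. For resource metrics like this, the HPA needs the pods' resources to be set one way or another, which is why a default matters.

```go
package autoscaling

import (
	autoscalingv2beta2 "k8s.io/api/autoscaling/v2beta2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newCollectorHPA builds an autoscaler that scales the collector Deployment
// on average memory utilization. Names and target values are examples only.
func newCollectorHPA() *autoscalingv2beta2.HorizontalPodAutoscaler {
	target := int32(90)
	return &autoscalingv2beta2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "example-collector"},
		Spec: autoscalingv2beta2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2beta2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "example-collector",
			},
			MaxReplicas: 10,
			Metrics: []autoscalingv2beta2.MetricSpec{{
				Type: autoscalingv2beta2.ResourceMetricSourceType,
				Resource: &autoscalingv2beta2.ResourceMetricSource{
					Name: corev1.ResourceMemory,
					Target: autoscalingv2beta2.MetricTarget{
						Type:               autoscalingv2beta2.UtilizationMetricType,
						AverageUtilization: &target,
					},
				},
			}},
		},
	}
}
```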

@pavolloffay
Member

Thanks @jpkrohling, this makes it clear to me. I thought that k8s would allocate a default limit if nothing is provided.

@jpkrohling added the question (Further information is requested) label on Mar 9, 2020
@jpkrohling
Contributor Author

@jkandasa do you have some free cycles to run some experiments? When preparing a demo for HPA, I found that the optimal configuration for 10 tracegen clients was a collector with 128Mi requested, with a limit of 512Mi. The queue-size-memory was 64Mi (half of the 128Mi request). This would make the collector scale up to 10 replicas and still process data.

Ideally, we would test this scenario (or a variant) with @jkandasa's performance test framework, and check how many spans/second we can achieve without constantly dropping spans.
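Expressed with the core Kubernetes types, that configuration would look roughly like this (the variable name is ours; the 64Mi queue memory is passed to the collector separately via its queue-size-memory option):

```go
package collector

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// demoCollectorResources mirrors the setup that worked for 10 tracegen
// clients: 128Mi requested, a 512Mi limit, a 64Mi dynamic queue, and an HPA
// allowed to scale the collector up to 10 replicas.
var demoCollectorResources = corev1.ResourceRequirements{
	Requests: corev1.ResourceList{
		corev1.ResourceMemory: resource.MustParse("128Mi"),
	},
	Limits: corev1.ResourceList{
		corev1.ResourceMemory: resource.MustParse("512Mi"),
	},
}
```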

@jkandasa
Member

@jpkrohling I do not have any lab machines to test this. For now, I have only OCP in the OpenStack environment.
At this moment I am a bit busy with other tasks, so there might be a delay from my end.

I found that the optimal configuration for 10 tracegen clients was a collector with 128Mi requested, with a limit of 512Mi.

Is 512Mi the maximum resource limit?

@jpkrohling
Contributor Author

Yes, if we specify that 512Mi is the upper boundary, Kubernetes will kill the pod once this limit is reached.

@jpkrohling added the good first issue (Good for newcomers) label on Apr 12, 2021
@jpkrohling changed the title from "Add a default memory/CPU limit" to "Add a default memory/CPU limit for operand" on Jul 21, 2021