Reported CPU usage is confusing #1194

Closed
CAFxX opened this issue Aug 7, 2017 · 19 comments

Comments

@CAFxX

CAFxX commented Aug 7, 2017

Command

cf app my_app

What occurred

The CPU usage is displayed as a percentage, but it is not defined anywhere what the percentage refers to. Users are puzzled by this: they often assume that the range is [0,100]%, whereas in reality the range is [0,cores*100]%, where cores is a number they have no real visibility into or control over, since it is operator-defined.

In addition, this number does not reflect the CPU quotas at all, so whether e.g. "25%" means the application has a lot of idle resources or it is actually CPU-starved depends on:

  • the number of cores on the cell (operator controlled, user visible)
  • the container size (user controlled)
  • the mem->cpu quota mapping (operator controlled, non user visible)
  • other applications on the same cell (operator controlled, non user visible)

This means that currently the displayed CPU usage is basically not "actionable" at all, i.e. it doesn't tell users much about what's going on inside their application.
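
To make the ambiguity concrete, here is a minimal sketch in Python (not part of the cf CLI) that converts the raw percentage into a percentage of a container's CPU entitlement. The function name and the `entitled_cores` parameter are hypothetical: as described above, the real mem->cpu quota mapping is operator-defined and not visible to users, so the two entitlements used below are invented purely for illustration.

```python
def percent_of_entitlement(raw_percent: float, entitled_cores: float) -> float:
    """Convert a raw CPU reading (100.0 == one full core) into a
    percentage of the container's entitlement.

    entitled_cores is a hypothetical input: the share of cores the
    container is guaranteed. `cf app` does not expose such a value.
    """
    if entitled_cores <= 0:
        raise ValueError("entitled_cores must be positive")
    # raw_percent / 100 is the usage in cores; dividing by the entitlement
    # and multiplying by 100 again simplifies to raw_percent / entitled_cores.
    return raw_percent / entitled_cores


# The same displayed "25%" under two assumed entitlements:
for entitled in (1.0, 0.125):
    relative = percent_of_entitlement(25.0, entitled)
    print(f"entitled to {entitled} cores: {relative:.0f}% of entitlement")
```

Under these assumptions the same "25%" is either mostly idle (25% of entitlement) or well over quota (200% of entitlement), which is why the raw number alone is hard to act on.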

An extreme example: two instances that should have the same CPU usage, because they handle the same workload, report:

  • instance 1: 200%
  • instance 2: 50%

Which instance is working correctly? Which one is not?

  • if we assume that the two instances are on two non-overloaded cells, most likely instance 1 has some issues, because it uses 4 times as much CPU for the same workload
  • if we know that instance 2 is on an overloaded cell, and is therefore CPU-starved by other applications (i.e. instance 2 would like to use resources over its quota), then the problem is with instance 2

cf app myapp
Showing health and status for app myapp in org myorg / space myspace as me@example.com...
name:              myapp
requested state:   started
instances:         5/5
usage:             2G x 5 instances
routes:            myapp.example.com
last uploaded:     Thu 30 Mar 10:30:01 JST 2017
stack:             cflinuxfs2
buildpack:         https://github.com/cloudfoundry/java-buildpack.git
      state     since                  cpu      memory         disk           details
#0    running   2017-07-28T11:26:45Z   22.7%    1.3G of 2G     200.6M of 1G
#1    running   2017-08-04T16:43:52Z   17.3%    1.3G of 2G     200.6M of 1G
#2    running   2017-08-02T04:15:21Z   19.5%    1.3G of 2G     200.6M of 1G
#3    running   2017-08-07T04:15:21Z   20.1%    775.1M of 2G   200.6M of 1G
#4    running   2017-08-05T16:28:55Z   155.2%   1.2G of 2G     200.6M of 1G

What you expected to occur

CPU usage should be relative to the CPU quota assigned to the container, so that 100% maps to "100% of the allocated quota". (Alternatively, the allocated CPU quota should be reported together with CPU usage, as is already done for memory and disk.)

Instances running over 100% of the allocated quota should be highlighted in red because they are using best-effort resources that are not guaranteed to be available. User documentation should be updated to make this clear.
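
A rough sketch of the proposed presentation, in Python rather than the CLI's own code: each instance's CPU is shown relative to its quota, and anything above 100% is flagged. The `InstanceCpu` type, the `render` helper, the text marker standing in for colour, and the 0.5-core entitlement are all assumptions made for illustration; only the two raw percentages are taken from the `cf app` output above (instances #0 and #4).

```python
from dataclasses import dataclass


@dataclass
class InstanceCpu:
    index: int
    raw_percent: float     # value as shown today, 100.0 == one full core
    entitled_cores: float  # hypothetical: guaranteed share of cores


def render(inst: InstanceCpu) -> str:
    """Format CPU as a percentage of quota, flagging best-effort usage."""
    relative = inst.raw_percent / inst.entitled_cores
    flag = "  <- over quota (best-effort)" if relative > 100.0 else ""
    return f"#{inst.index}  {relative:6.1f}% of quota{flag}"


# Raw values from instances #0 and #4; the 0.5-core entitlement is invented.
for inst in (InstanceCpu(0, 22.7, 0.5), InstanceCpu(4, 155.2, 0.5)):
    print(render(inst))
```

With a real entitlement value in place of the invented one, a flagged instance would be one relying on resources that are not guaranteed to stay available.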

CLI Version

6.29.0

CC API Endpoint Version

2.74

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/149996510

The labels on this github issue will be updated when the story is started.

@dkoper

dkoper commented Aug 7, 2017

Hi @CAFxX

Thanks for this feedback.
Where in the user documentation do you think this explanation would fit well and is easy to find? I'm sure the Docs team would appreciate a pull request from you.

Red in CLI output is reserved for errors that warrant user attention. I don't see how a usage of >100% due to best-effort resources requires attention?
Note that the CLI is just displaying the numbers returned by CC. I can CC @zrob on this issue, but it might be more efficient to submit a feature request or issue to CAPI's issue tracker?

Regards,
Dies Koper
CF CLI PM

@CAFxX
Author

CAFxX commented Aug 9, 2017

Where in the user documentation do you think this explanation would fit well and is easy to find? I'm sure the Docs team would appreciate a pull request from you.

I'm not sure what exactly we should document... The point I'm trying to make is that the CPU usage number right now is not very useful from a user perspective and that we should replace it with something more actionable. Put another way, this is not a documentation bug.

(if I really have to answer the question about where to put the documentation: I think I would argue that the most discoverable place for such documentation is cf app -help and the online docs - since I don't have access to the access stats of the docs I can't really argue whether priority should be given to the cli docs or the cf docs)

I don't see how a usage of >100% due to best-effort resources requires attention?

Because an application that is routinely using over 100% capacity definitely needs scaling up/out or it is at risk of failing when the best-effort resources become unavailable for factors completely outside the application itself (e.g. because a different application on the same cell starts spinning).

Red in CLI output is reserved for errors that warrant user attention.

This may not be an "error" but as I argued above it should definitely warrant not just user attention but likely user action (scale out/up). In any case red was just a suggestion, the point is that it should be highlighted to suggest that it warrants user attention/action.

Note that the CLI is just displaying the numbers returned by CC. I can CC @zrob on this issue, but it might be more efficient to submit a feature request or issue to CAPI's issue tracker?

This is a long-standing issue with the CF project. When something as big as a user story (e.g. this ticket) is reported, its scope is normally not limited to a single component (e.g. only cli or only capi) but spans many components (e.g. in this case probably at least cli and capi, maybe diego as well). But AFAIK there's no single issue tracker that can be used for this, so we're forced to open the issue, at least at first, on a somewhat arbitrary component (normally the one where the problem "surfaces" the most).

So yes, eventually this will have to propagate also to capi and likely beyond. But if we don't agree that the issue is also in the CLI (again, mostly because it's where it's more likely to surface) then there's no point in opening the corresponding issues in the other components.

@CAFxX
Author

CAFxX commented Aug 30, 2017

Updated the "what to expect" section to make it clear that "over 100%" refers to the allocated quota discussed beforehand, not to the currently displayed number.

@dkoper

dkoper commented Sep 4, 2017

I think I understand the issue(s), but I'm not sure how to proceed. I don't have any stats on how/whether users refer to the cpu stats.
As you mentioned, the displayed CPU usage is currently not "actionable" at all. Even if, in its current state, it warrants user attention/action, there doesn't seem to be much we can do on the CLI side alone.
The CF Dev mailing list should be a better place to get the input from all relevant PMs and support from other users to prioritise an exploration around this?
Apologies for not taking ownership and driving this on your behalf: it seems you understand the issue much better and have ideas on how it should be, so it should be more effective if you lead the conversation with the relevant teams and users.

@XenoPhex
Contributor

Adding a link to this thread for additional context: https://lists.cloudfoundry.org/g/cf-dev/topic/16273332

@Callisto13

The Garden team (container runtime) have started work on a track which will:

  • change how CPU sharing is handled
  • produce a new metric which will (hopefully) make more sense in CLI metrics

You can check out our progress in our tracker by searching for the better-cpu-sharing tag.

@abbyachau
Contributor

Hi @Callisto13 @julz, could you please provide an update on this issue? Thanks.

@Callisto13

Hi @abbyachau

From cf-deployment v3.6.0 (garden-runc-release 1.16.3) operators can deploy with operations/experimental/set-cpu-weight.yml which will turn on the new way CPU shares are calculated in Garden.

From cf-deployment v7.8.0 (garden-runc-release 1.19.0), having deployed with the same ops file mentioned above, CLI users can install the CPU Entitlement Plugin to get accurate CPU metrics for their apps.

Please note that this is all still highly experimental and the user experience could yet change.

@abbyachau
Contributor

Thanks @Callisto13! Appreciate the update.

@CAFxX if you are not already on it and are able to update to the aforementioned version of cf-deployment, please let us know what you think of the plugin. Also let us know if you are happy to close this GitHub issue, as there doesn't appear to be anything the CLI team can do at the moment until the plugin gains more traction for feedback and the Garden team are able to iterate on it.

@CAFxX
Author

CAFxX commented Apr 26, 2019

I'll defer to @giner as, unfortunately, I'm not working on CF anymore.

@abbyachau
Contributor

cc @emalm for visibility.

@gsiener please let us know if you've been able to try the plugin. Thanks.

@gsiener

gsiener commented Aug 21, 2019

I think the request was intended for @giner. Thanks

@heyjcollins
Contributor

With the GA of the v7 CLI, we're no longer actively developing against the v6 line. In the interest of the overall hygiene of the CLI project, we're closing this issue.
If this issue is still occurring in v7 please feel free to comment and re-open. Thank you!

@b10s

b10s commented Jul 14, 2020

@heyjcollins hi,

Is the percentage in v7 in the range [0,100] or [0, cores*100], and is it relative to the CPU quota assigned to an app?

@a-b
Member

a-b commented Jul 14, 2020

The CLI reports whatever CAPI reports back to us (see https://v3-apidocs.cloudfoundry.org/version/3.86.0/index.html#the-process-stats-object). You may want to reach out to CAPI for more insights: https://cloudfoundry.slack.com/archives/C07C04W4Q
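
A rough sketch of what "reports whatever CAPI reports" means for a client, assuming the process stats object looks roughly like the hand-written payload below. The field names (`usage.cpu`, `usage.mem`, `mem_quota`, `disk_quota`) follow my reading of the linked docs and should be checked against them; the values are invented, and treating `cpu` as a fraction where 1.0 equals one full core is an assumption, as is the `summarize` helper itself.

```python
import json
from typing import List

# Trimmed, hand-written sample; not captured from a real foundation.
SAMPLE = """
{"resources": [
  {"type": "web", "index": 0, "state": "RUNNING",
   "usage": {"cpu": 0.227, "mem": 1395864371, "disk": 210344346},
   "mem_quota": 2147483648, "disk_quota": 1073741824}
]}
"""


def summarize(payload: str) -> List[str]:
    """Render one line per instance from a process-stats-like payload."""
    lines = []
    for inst in json.loads(payload)["resources"]:
        usage = inst["usage"]
        # Assumption: cpu is a fraction of one core, so 0.227 -> 22.7%.
        cpu_percent = usage["cpu"] * 100
        lines.append(
            f"#{inst['index']} cpu {cpu_percent:.1f}% (raw), "
            f"mem {usage['mem']} of {inst['mem_quota']} bytes"
        )
    return lines


print("\n".join(summarize(SAMPLE)))
```

In this sample, memory comes with a quota to compare against while CPU is a bare number, which is the asymmetry the issue is about; a client could only render "X% of quota" for CPU if the API supplied the corresponding entitlement.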

@univ0298

@heyjcollins I don't believe this should have been closed with V7

@XenoPhex
Contributor

XenoPhex commented Aug 24, 2020

@univ0298 I think what Josh is trying to say is that the CF CLI displays an unmodified version of what the CF [V3] API provides. In general, the CF CLI should not modify the contents of that data when presenting it to the user.

If an adjustment should be made, then it should be done on the API side so it will be consistent with all API clients for the foundation. So it's better to file a ticket against the Cloud Controller instead of the CLI.

@ywei2017

+1 for the issue. It is especially unfortunate given how long ago the issue was raised, and that it is still not addressed.
