
Warn if CPU usage is too high (>90%) #1161 #1236

Merged
6 commits merged into master on Jan 22, 2020

Conversation

cyberw
Collaborator

@cyberw cyberw commented Jan 20, 2020

No description provided.

@cyberw cyberw requested a review from heyman January 20, 2020 10:51
@codecov

codecov bot commented Jan 20, 2020

Codecov Report

Merging #1236 into master will increase coverage by 0.79%.
The diff coverage is 76%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1236      +/-   ##
==========================================
+ Coverage   78.64%   79.43%   +0.79%     
==========================================
  Files          20       20              
  Lines        1962     2086     +124     
  Branches      312      371      +59     
==========================================
+ Hits         1543     1657     +114     
- Misses        333      353      +20     
+ Partials       86       76      -10
Impacted Files               Coverage Δ
locust/main.py               34.28% <0%> (-0.15%) ⬇️
locust/runners.py            80.9% <79.16%> (+2.6%) ⬆️
locust/core.py               87.28% <0%> (-0.43%) ⬇️
locust/contrib/fasthttp.py   92.59% <0%> (+1.84%) ⬆️


@cyberw
Collaborator Author

cyberw commented Jan 20, 2020

In the future we could (should):

  • make the threshold configurable
  • add GUI support

But let's leave that for another day...

if current_cpu > 90:
    self.cpu_threshold_exceeded = True
    logging.warning("Loadgen CPU usage above 90%! This may constrain your throughput and even give inconsistent response time measurements! See https://docs.locust.io/en/stable/running-locust-distributed.html for how to distribute the load over multiple CPU cores or machines")
gevent.sleep(2.0)
Member

Maybe we should increase the interval to something like 10 seconds? That would prevent a warning if you have a single task that does something really CPU heavy for a few seconds, and it shouldn't really matter if you get a warning after 2 or 10 seconds.

Also, do we know how much overhead (if any) Process.cpu_percent() introduces?

@cyberw cyberw (Collaborator Author) Jan 20, 2020

I was thinking about increasing the interval, but decided against it, because even a short spike can cause incorrect measurements. I figured it is better to warn a few times too many than a few times too few...

I don't know exactly how much overhead it introduces, but I did a few tests and the call itself never took more than 0.2 ms to run on my 2018 MacBook Pro. I think it's just reading a counter somewhere...

If you want, I can bump the interval to 5 seconds.
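For reference, a rough back-of-the-envelope check of that overhead (not part of the PR; just an illustrative micro-benchmark, assuming psutil is installed):

import time
import psutil

process = psutil.Process()
process.cpu_percent()  # the first call primes the counters and returns 0.0

calls = 1000
start = time.perf_counter()
for _ in range(calls):
    process.cpu_percent()
elapsed = time.perf_counter() - start
print("average cost per cpu_percent() call: %.3f ms" % (elapsed / calls * 1000))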

Member

Ok, five seconds sounds good!
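For context, a minimal sketch of what such a monitoring greenlet could look like with the agreed 5-second interval (CPU_MONITOR_INTERVAL and CPU_WARNING_THRESHOLD are illustrative names, not necessarily what the PR ends up using):

import logging
import gevent
import psutil

CPU_MONITOR_INTERVAL = 5.0  # seconds, per the discussion above
CPU_WARNING_THRESHOLD = 90  # percent

def monitor_cpu():
    # Periodically sample this process's CPU usage and warn if it is too high.
    process = psutil.Process()
    while True:
        current_cpu = process.cpu_percent()
        if current_cpu > CPU_WARNING_THRESHOLD:
            logging.warning(
                "Loadgen CPU usage above 90%! This may constrain your throughput "
                "and even give inconsistent response time measurements!"
            )
        gevent.sleep(CPU_MONITOR_INTERVAL)

# spawned once per runner, e.g.:
# gevent.spawn(monitor_cpu)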

@heyman
Member

heyman commented Jan 20, 2020

When implementing this in the web UI, I think it would make sense to add the current CPU usage for each node to the "Slaves" tab. Therefore, maybe it would make sense to add cpu_usage to locust.runners.SlaveNode, and then have all the logic for emitting warnings run in the master?

@cyberw
Collaborator Author

cyberw commented Jan 20, 2020

When implementing this in the web UI, I think it would make sense to add the current CPU usage for each node to the "Slaves" tab. Therefore, maybe it would make sense to add cpu_usage to locust.runners.SlaveNode, and then have all the logic for emitting warnings run in the master?

Hmm... yes, having the logic on the master side makes sense.

The only problem is that if we want to log a value instead of just a flag of whether the threshold was exceeded, then we may need to synchronize sending with measuring (or we might detect high usage but never send that particular metric, and thus miss a spike).

To fix this (without adding complex logic) we could move the CPU checking for slaves into the heartbeat method and only run cpu_monitoring_greenlet on the master and in standalone mode?

I can make it so that we only update the cpu usage on every 5th heartbeat.

@heyman
Member

heyman commented Jan 20, 2020

Since the heartbeat interval is less than the measurement interval (at least if we go with 5 seconds), I think it should be fine to have the cpu_monitoring_greenlet store the value on the runner instance, and then include that value in the heartbeat message.

@cyberw
Collaborator Author

cyberw commented Jan 20, 2020

ok, I'll do it that way!
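A rough sketch of that division of labour (hypothetical method bodies; attribute and message-field names are illustrative, and helpers like Message, HEARTBEAT_INTERVAL and self.client come from the surrounding runner code):

# --- slave side: the monitor greenlet just records the latest sample on the runner
def monitor_cpu(self):
    process = psutil.Process()
    while True:
        self.current_cpu_usage = process.cpu_percent()
        gevent.sleep(CPU_MONITOR_INTERVAL)

# --- slave side: the existing heartbeat loop piggybacks that value
def heartbeat(self):
    while True:
        self.client.send(Message("heartbeat", {
            "state": self.slave_state,
            "current_cpu_usage": self.current_cpu_usage,
        }, self.client_id))
        gevent.sleep(HEARTBEAT_INTERVAL)

# --- master side: update the SlaveNode and warn once if any node exceeds 90%
def on_heartbeat(self, client_id, data):
    node = self.clients[client_id]
    node.cpu_usage = data["current_cpu_usage"]
    if node.cpu_usage > 90 and not self.slave_cpu_threshold_exceeded:
        self.slave_cpu_threshold_exceeded = True
        logging.warning("Slave %s exceeded CPU threshold (>90%%)", client_id)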

…ave and check on the master if the threshold was exceeded.
@cyberw
Collaborator Author

cyberw commented Jan 20, 2020

LGTY? I haven't re-tested distributed yet, but I'll do it before I merge...

self.current_cpu_usage = 0
self.cpu_threshold_exceeded = False
self.slave_cpu_threshold_exceeded = False
gevent.spawn(self.monitor_cpu)
Member

Hmm, since we don't store any reference to this greenlet, it won't get killed. We should probably change so that the LocalLocustRunner.greenlet is a gevent.pool.Group instance (just like MasterLocustRunner and SlaveLocustRunner) and then make sure the CPU monitor greenlet is spawned from this group.

Collaborator Author

Hmm... I'm not sure how to do that. LocustRunner is supposed to be a singleton, so it shouldn't really matter, right? (not that we want to be sloppy :)

If you think this is important, would you mind taking a look yourself?

Member

At the moment, it's a singleton when started normally (through main.py). But we do create multiple runner instances within the tests.

Also, it's been proposed (and I think it's a good idea) to work towards an API where one can run Locust programmatically, in which case I think the design will be much cleaner if we group the spawned greenlets together (so that one can join them) and make sure they are killed together.

I can definitely take a look!

Collaborator Author

Thanks!

Member

I've now changed it so that we spawn all greenlets through the runner instance's greenlet attribute, which is a gevent.pool.Group instance.

I also added a test for the CPU warning (and changed the code to use a constant for the monitoring interval, so that I could decrease the run time of the test).
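For illustration, the lifecycle pattern described above boils down to something like this simplified sketch (the constant and the empty monitor body are placeholders, not the final implementation):

import gevent
import gevent.pool
import psutil

CPU_MONITOR_INTERVAL = 5.0  # extracted into a constant so tests can shrink it

class LocalLocustRunner:
    def __init__(self):
        # One Group for all of the runner's greenlets, mirroring what
        # MasterLocustRunner and SlaveLocustRunner already do.
        self.greenlet = gevent.pool.Group()
        self.current_cpu_usage = 0
        self.greenlet.spawn(self.monitor_cpu)

    def monitor_cpu(self):
        process = psutil.Process()
        while True:
            self.current_cpu_usage = process.cpu_percent()
            gevent.sleep(CPU_MONITOR_INTERVAL)

    def quit(self):
        # Killing the group also takes the CPU monitor greenlet down with it,
        # which matters when multiple runner instances are created in tests.
        self.greenlet.kill(block=True)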

cyberw and others added 4 commits January 21, 2020 08:55
…ending whether the threshold has been exceeded or not.
…t() method.

We achieve this by spawning all greenlets using the runner’s greenlet attribute which is an instance of gevent.pool.Group().
@cyberw cyberw merged commit 680cf52 into master Jan 22, 2020