[BUG] CPU credits drain on B1ls after update to 2.10.0.3 #2940
Comments
@gregolsky can you confirm that the impacted VM is running 2.10.0.3 and not 2.10.0.2? You mentioned "Distro and Version: Ubuntu 22.04", but the logs indicate 2.10.0.3. Can you also send the following logs to nnandigam@microsoft.com? Alternatively, you can run the log collector, which collects our agent logs; the command below only works if the 2.8.0.11 egg is still on the VM.
|
Sorry, I fixed the version: it's 2.10.0.3 for sure. I'll collect the info for you and send it tomorrow. We have multiple VMs affected this way; for each one the CPU increase is correlated with |
@nagworld9 I sent you the files. |
@gregolsky Comparing the logs, I don't see the agent doing anything different in 2.10.0.3 than in 2.9.1.1. By the way, how do you compute the agent's CPU usage, and how was the graph above produced? Is there any chance you are combining the last computed 2.9.1.1 value with the new 2.10.0.3 value after the update? Do you look at the CPU usage reported for the agent's cgroup? If so, what does the output of the commands below look like? I want to check that there is no unknown process in the cgroup.
systemd-cgls --unit system.slice --all
pstree -p
As for a downgrade, today we don't have an option to downgrade.
You wrote: "after update we see CPU usage increase by 2%, which then causes CPU credits drain on small burstable instances e.g. B1ls where 5% is the baseline"
In terms of percentage, it's not a significant bump. When you say 5% is the baseline, how is this value calculated and who defines it? |
Azure defines this. B1ls is the smallest instance type in the B-series.
For that instance, 2% is a significant bump.
I just correlated the time of the average CPU usage bump with the graph on Azure.
It matches on multiple instances.
|
https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-b-series-burstable
This bug renders B1ls unusable: once CPU usage stays over 5% for some time,
the VM consumes all of its CPU credits. These instances need to stay below 5%
most of the time to be usable.
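To make the credit drain concrete, here is a minimal sketch of the arithmetic. It assumes the simplified model that one CPU credit equals one vCPU at 100% for one minute, so a single-vCPU VM banks or spends `(baseline - usage) / 100 * 60` credits per hour, and it reads the "2%" in this thread as two percentage points. The function names, the 4% and 6% usage figures, and the 30-credit balance are illustrative assumptions, not values taken from the affected VMs.

```python
def net_credits_per_hour(avg_cpu_pct, baseline_pct=5.0):
    """Credits banked (+) or spent (-) per hour on a single-vCPU VM.

    Simplified model: one credit = one vCPU at 100% for one minute.
    """
    return (baseline_pct - avg_cpu_pct) / 100.0 * 60.0


def hours_until_drained(banked_credits, avg_cpu_pct, baseline_pct=5.0):
    """Hours until the balance reaches zero, or None if it never drains."""
    net = net_credits_per_hour(avg_cpu_pct, baseline_pct)
    if net >= 0:
        return None  # at or below the baseline: credits accumulate or hold
    return banked_credits / -net


if __name__ == "__main__":
    # Hypothetical figures: 4% idle usage before the update, 6% after it.
    for usage in (4.0, 6.0):
        print(usage, net_credits_per_hour(usage), hours_until_drained(30, usage))
```

Under this model a workload that sat just below the 5% baseline flips from slowly banking credits to steadily spending them after a two-point increase, which is why a small absolute bump matters on this VM size.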
|
@gregolsky do you happen to have other distro types and VM sizes that also show an increase in CPU usage after the update? Or is it only on these instances? |
I have other instances (all Ubuntu 22.04 though), but here it was very easy
to notice, because the VM lost all its CPU credits and stopped responding in
a timely fashion.
We're using the latest Ubuntu 22.04 image on Azure.
|
What were the changes between 2.9 and 2.10 that may have had such an impact
on CPU usage?
|
We did rewrite the code around agent update, but that code path is not enabled in 2.10.0.3. The logs indicate that the agent did what was expected. Is the CPU usage percentage you have shown based on one CPU core as 100%? Is the agent taking more than 5% out of 100% on a 1-core CPU? |
Not sure how much CPU the agent's process alone took before, but our whole
workload in idle state (all processes on the VM together) was around 3-5%
total, which was below the baseline. After the walinuxagent update it's above
the baseline.
B1ls has only 1 vCPU, and once it's out of CPU credits, its baseline is all
you can use, so it shows 100% usage at all times. However, I'll try to find
a VM that still has some credits to check what the CPU usage patterns look
like. Do you have any preference on how I should collect these metrics or
perform further analysis?
I'm also open to a screen-sharing session if that would help.
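One way to collect a per-process figure, sketched here as an assumption rather than the method either party actually used, is to sample the `utime` and `stime` counters (fields 14 and 15) from `/proc/<pid>/stat` twice and divide the tick delta by the interval. The helper names and the sample stat line used for testing are hypothetical; the result is a percentage of one core and does not include time spent in child processes.

```python
import os
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split after the last ')' before splitting on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11], rest[12].
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before, ticks_after, interval_s, clk_tck):
    """CPU usage over the interval, as a percentage of one core."""
    return (ticks_after - ticks_before) / clk_tck / interval_s * 100.0


def sample_pid(pid, interval_s=60.0):
    """Measure one process's CPU usage over interval_s seconds (Linux only)."""
    clk_tck = os.sysconf("SC_CLK_TCK")

    def read_ticks():
        with open(f"/proc/{pid}/stat") as stat_file:
            return parse_cpu_ticks(stat_file.read())

    before = read_ticks()
    time.sleep(interval_s)
    after = read_ticks()
    return cpu_percent(before, after, interval_s, clk_tck)
```

Calling `sample_pid` with the agent's PID (for example, one found in the `pstree -p` output) before and after an update would give two numbers that can be compared directly against the 5% baseline.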
|
@gregolsky We have identified the line of code that is taking a bit of time and causing the increase in CPU usage. While we work on a fix, we have removed the version from the artifacts repository so that new VMs won't get it. In your case, to roll back to 2.9.1.1, please run the following commands (comments are in brackets). Let me know if you need help; I can shadow you.
systemctl stop walinuxagent (stop the agent service)
cd /var/lib/waagent/ (go to the agent package folder)
ls (you will see the 2.10.0.3 packages as a zip and a folder, like WALinuxAgent-2.10.0.3 and WALinuxAgent-2.10.0.3.zip)
rm -rf WALinuxAgent-2.10.* (remove the 2.10 packages from the VM)
ls (make sure the 2.10 packages got deleted)
systemctl restart walinuxagent (restart the agent so that it picks up 2.9.1.1 as the latest)
waagent --version (check the version in the output; it should show something like "Goal state agent: 2.9.1.1")
|
That's great, I'm going to try it out today and get back to you.
|
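When rolling back many VMs, the two manual checks in the steps above (no 2.10 packages left, and the goal state agent reported as 2.9.1.1) could be scripted. The sketch below is only an illustration: the function names are hypothetical, and the output format of `waagent --version` is assumed from the "Goal state agent: 2.9.1.1" snippet quoted in the instructions.

```python
import re
from pathlib import Path


def goal_state_version(version_output):
    """Extract the version after 'Goal state agent:' or return None."""
    match = re.search(r"Goal state agent:\s*([0-9]+(?:\.[0-9]+)*)", version_output)
    return match.group(1) if match else None


def leftover_packages(agent_dir, prefix="WALinuxAgent-2.10."):
    """List package files or folders in agent_dir that still match the prefix."""
    return sorted(path.name for path in Path(agent_dir).glob(prefix + "*"))
```

For example, passing the captured text of `waagent --version` to `goal_state_version` and `/var/lib/waagent` to `leftover_packages` should yield `2.9.1.1` and an empty list after a successful rollback.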
We applied downgrade instructions and they worked. Thank you. |
Fixing in #2967 |
Distro and WALinuxAgent details (please complete the following information):
Additional context
Log file attached
I am afraid of sharing the log publicly. Please let me know where I can upload it in a private and secure manner.
Last few days of logs is: