High disk load slow down the whole system #4865

Binyang2014 · 2020-09-01T08:35:55Z

When an offending job writes logs too quickly, it causes high disk I/O and slows down the whole system. For example, a job that simply runs
yes PAI
is enough to trigger it.


fanyangCS · 2020-09-01T08:41:42Z

Shall we kill such an offending pod?
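
To make "offending" concrete, one option is to sample how fast a job's log file grows and compare that against a limit. The sketch below is only an illustration: the function names and the 50 MiB/s default limit are assumptions, not anything PAI defines, and it does not kill anything by itself.

```python
import os
import time


def write_rate_bytes_per_sec(size_before: int, size_after: int, elapsed_sec: float) -> float:
    """Average growth rate of a log file between two size samples."""
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    # A rotated or truncated file can shrink; treat that as no growth.
    return max(size_after - size_before, 0) / elapsed_sec


def is_offending(rate: float, limit: float = 50 * 1024 * 1024) -> bool:
    """True when the rate exceeds the (hypothetical) per-job limit in bytes/s."""
    return rate > limit


def sample_log_rate(path: str, interval_sec: float = 1.0) -> float:
    """Sample the size of `path` twice and return its growth rate in bytes/s."""
    before = os.path.getsize(path)
    start = time.monotonic()
    time.sleep(interval_sec)
    after = os.path.getsize(path)
    return write_rate_bytes_per_sec(before, after, time.monotonic() - start)
```

A monitor could call `sample_log_rate` periodically for each job's log file and act on the pod only after several consecutive samples exceed the limit, so that a short burst does not count as an offence.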

Binyang2014 · 2020-09-01T08:48:58Z

After some investigation, this usually appears on nodes using HDDs. We don't notice the issue on SSDs.
In our test environment, the Linux kernel version is 4.15.0-1092-azure. With an HDD, the default I/O scheduler is noop, an elevator scheduler (https://en.wikipedia.org/wiki/Noop_scheduler).
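
To check which scheduler a node is actually using, the kernel exposes it in sysfs at `/sys/block/<device>/queue/scheduler`, with the active one shown in square brackets (for example `[noop] deadline cfq`). A minimal sketch for reading it follows; the function names are my own and the device name you pass (such as `sda`) is an assumption about the node.

```python
from pathlib import Path
from typing import List, Tuple


def parse_scheduler(text: str) -> Tuple[str, List[str]]:
    """Parse sysfs scheduler text such as 'noop deadline [cfq]'.

    Returns (active, available); the active scheduler is the bracketed entry.
    """
    active = ""
    available: List[str] = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    # Some devices list a single entry without brackets; treat it as active.
    if not active and len(available) == 1:
        active = available[0]
    return active, available


def read_scheduler(device: str, sysfs_root: str = "/sys/block") -> Tuple[str, List[str]]:
    """Read and parse the scheduler file for one block device."""
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    return parse_scheduler(path.read_text())
```

For example, `read_scheduler("sda")` on an affected node would be expected to report `noop` as the active scheduler.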

The elevator scheduler has some drawbacks; more details are here: https://www.linuxjournal.com/article/6931.
In this case, a large write blocks a simple read, since the scheduler uses only one queue for I/O requests.
The write requests are also submitted to nearby areas of the hard disk, so the scheduler keeps serving write requests and the reads starve.
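
The effect of a single queue can be shown with a toy model. This is not the kernel's algorithm: it ignores sector positions, merging, and arrival times, and the two-queue dispatcher is only loosely in the spirit of the deadline scheduler. All names and the `max_reads_in_a_row` parameter are made up for illustration.

```python
from collections import deque
from typing import Deque, List, Tuple

Request = Tuple[str, int]  # (kind, id), where kind is "R" or "W"


def fifo_order(requests: List[Request]) -> List[Request]:
    """Single queue: dispatch strictly in arrival order."""
    return list(requests)


def read_preferring_order(requests: List[Request], max_reads_in_a_row: int = 2) -> List[Request]:
    """Two queues: dispatch reads first, but let one write through after
    `max_reads_in_a_row` consecutive reads so writes are not starved."""
    reads: Deque[Request] = deque(r for r in requests if r[0] == "R")
    writes: Deque[Request] = deque(r for r in requests if r[0] == "W")
    order: List[Request] = []
    reads_in_a_row = 0
    while reads or writes:
        if reads and (reads_in_a_row < max_reads_in_a_row or not writes):
            order.append(reads.popleft())
            reads_in_a_row += 1
        else:
            order.append(writes.popleft())
            reads_in_a_row = 0
    return order


def position_of_first_read(order: List[Request]) -> int:
    """Index at which the first read request is dispatched."""
    return next(i for i, r in enumerate(order) if r[0] == "R")


# A burst of 1000 log writes followed by one small read.
burst: List[Request] = [("W", i) for i in range(1000)] + [("R", 0)]
```

With the single queue, the read in `burst` is dispatched only after all 1000 writes. With separate read and write queues, it is dispatched first.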

Here is the commit that changed the default I/O scheduler to noop: https://git.launchpad.net/~mhcerri/ubuntu/+source/linux/+git/azure/commit/?h=azure-4.15-fsgsbase&id=75bec4e4cd32accb64f574dac31bb1910a52c19e

Binyang2014 · 2020-09-01T08:55:25Z

If we use HDDs, I suggest changing the I/O scheduler to deadline or another scheduler. Here is a related issue: https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1813211.
For Azure, it seems we need to upgrade the kernel to 4.18.x.
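
Where the kernel does offer a deadline scheduler, it can be selected at runtime by writing its name to the same sysfs file. A minimal sketch, assuming the device name is known and the process runs as root; the helper names are hypothetical. On multi-queue kernels the scheduler is named `mq-deadline` rather than `deadline`, so the sketch accepts either.

```python
from pathlib import Path
from typing import Optional

# Names to look for, in order of preference.
PREFERRED = ("mq-deadline", "deadline")


def pick_deadline(scheduler_text: str) -> Optional[str]:
    """Return the deadline-style scheduler offered by the kernel, if any."""
    names = [token.strip("[]") for token in scheduler_text.split()]
    for name in PREFERRED:
        if name in names:
            return name
    return None


def set_deadline(device: str, sysfs_root: str = "/sys/block") -> Optional[str]:
    """Switch `device` to a deadline-style scheduler when one is available.

    Returns the name that was written, or None if the kernel offers neither.
    """
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    choice = pick_deadline(path.read_text())
    if choice is not None:
        path.write_text(choice)  # needs root on a real system
    return choice
```

A change made this way lasts only until the next reboot, so a real deployment would also need a persistent setting such as a udev rule or a kernel boot parameter.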

I also highly recommend using a separate disk to store logs.
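
A quick way to check whether the log directory shares a filesystem with another directory is to compare the device IDs reported by `stat`. This is only a rough sketch: the paths you pass are assumptions about the node layout, and different device IDs do not prove different physical disks, since two partitions of one disk also differ.

```python
import os


def same_device(path_a: str, path_b: str) -> bool:
    """True when both paths live on the same filesystem device (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

If `same_device(log_dir, data_dir)` returns True for the directories in question, the logs are definitely competing for the same disk as the data.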

If we use SSDs, I think we will not suffer from this issue, since SSDs support multi-queue by default.
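
To tell which nodes are affected, the kernel's view of a disk can be read from `/sys/block/<device>/queue/rotational`, which holds `1` for rotational media and `0` otherwise. A minimal sketch with a hypothetical helper name; virtualized disks may not report this flag accurately, so treat the result as a hint.

```python
from pathlib import Path


def is_rotational(device: str, sysfs_root: str = "/sys/block") -> bool:
    """True when the kernel reports `device` as rotational (HDD-like)."""
    flag = (Path(sysfs_root) / device / "queue" / "rotational").read_text().strip()
    return flag == "1"
```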

fanyangCS · 2020-09-01T09:00:54Z

Let's mark this as a known issue then.

scarlett2018 · 2020-09-01T09:27:27Z

If we use HDDs, I suggest changing the I/O scheduler to deadline or another scheduler. Here is a related issue: https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1813211.
For Azure, it seems we need to upgrade the kernel to 4.18.x.

I also highly recommend using a separate disk to store logs.

If we use SSDs, I think we will not suffer from this issue.

@Binyang2014 - may we list this as a best practice for PAI cluster setup? cc @hzy46 @mydmdm

fanyangCS · 2020-09-01T09:30:30Z

@scarlett2018, we will rearchitect the log collection subsystem in a future release. This will be a tentative recommendation to mitigate the issue.

Binyang2014 · 2020-12-07T07:39:51Z

Closing, as we no longer use log-rotate.

Binyang2014 self-assigned this Sep 1, 2020

Binyang2014 added the known issue label Sep 1, 2020

scarlett2018 added the doc needed label Sep 1, 2020

Binyang2014 mentioned this issue Oct 20, 2020

PAI logging pipeline enhance #4992

Closed

8 tasks

Binyang2014 closed this as completed Dec 7, 2020
