This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

High disk load slow down the whole system #4865

Closed
Binyang2014 opened this issue Sep 1, 2020 · 7 comments

Comments

@Binyang2014
Contributor

When an offending job writes logs too quickly, for example a job that simply runs
yes PAI
it causes high disk I/O and slows down the whole system.

Binyang2014 self-assigned this Sep 1, 2020
@fanyangCS
Contributor

Shall we kill such an offending pod?

@Binyang2014
Contributor Author

After some investigation, this usually appears on nodes using HDD. On SSD nodes, we don't see this issue.
In our test environment, the Linux kernel version is 4.15.0-1092-azure. With HDD, the default I/O scheduler is noop, an elevator scheduler (https://en.wikipedia.org/wiki/Noop_scheduler).
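
To check which scheduler a node is actually using, the kernel exposes it under sysfs. The following is a minimal Python sketch, assuming the standard `/sys/block/<device>/queue/scheduler` and `/sys/block/<device>/queue/rotational` files; the function names and the device name `sda` are illustrative assumptions, not part of PAI.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def parse_scheduler_line(text: str) -> Tuple[Optional[str], List[str]]:
    """Parse the sysfs scheduler file, e.g. 'noop deadline [cfq]'.

    The active scheduler is the token wrapped in square brackets.
    Returns (active, all_available).
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def describe_device(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Report the active I/O scheduler and whether the device is rotational (HDD)."""
    queue = Path(sysfs_root) / device / "queue"
    active, available = parse_scheduler_line((queue / "scheduler").read_text())
    # The kernel writes "1" for rotational media (HDD) and "0" otherwise.
    rotational = (queue / "rotational").read_text().strip() == "1"
    return {
        "device": device,
        "active": active,
        "available": available,
        "rotational": rotational,
    }


# Example call; "sda" is an assumed device name:
# print(describe_device("sda"))
```

A node that reports `rotational` as true together with `active` equal to `noop` would match the configuration described in this comment.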

The elevator scheduler has some drawbacks; more details are here: https://www.linuxjournal.com/article/6931.
In this case, a large write blocks a simple read, since the scheduler uses only one queue for I/O requests.
The write requests are also submitted to nearby areas of the hard disk, so the scheduler keeps serving write requests and starves reads.
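
The starvation described here can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions (one dispatch per tick, a single sequential writer, one far-away read), not the kernel's implementation; `read_wait_ticks` and its `read_expire` parameter are hypothetical names, with `read_expire` mimicking the idea behind a deadline-style scheduler.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    kind: str     # "read" or "write"
    sector: int   # target position on disk
    arrival: int  # tick at which the request was queued


def read_wait_ticks(n_writes: int, read_expire: Optional[int] = None) -> int:
    """Return the tick at which a single far-away read is dispatched.

    A sequential writer queues one write per tick right next to the disk
    head, and the scheduler dispatches one request per tick. Without
    read_expire the scheduler always picks the request closest to the head;
    with read_expire, a read that has waited that many ticks is served first.
    """
    read = Request("read", sector=1_000_000, arrival=0)
    queue = [read]
    head = 0
    next_write_sector = 1
    writes_left = n_writes
    tick = 0
    while True:
        # The writer submits one more sequential write each tick.
        if writes_left > 0:
            queue.append(Request("write", next_write_sector, tick))
            next_write_sector += 1
            writes_left -= 1
        # Deadline-like rule: an expired read jumps ahead of nearby writes.
        expired = read_expire is not None and tick - read.arrival >= read_expire
        if expired:
            chosen = read
        else:
            # Elevator-like rule: serve the request closest to the head.
            chosen = min(queue, key=lambda r: abs(r.sector - head))
        queue.remove(chosen)
        if chosen is read:
            return tick
        head = chosen.sector
        tick += 1


print(read_wait_ticks(1000))                   # read waits for every write
print(read_wait_ticks(1000, read_expire=50))   # read is bounded by the expiry
```

In this model the read's wait grows with the length of the write stream unless an expiry bounds it, which is the motivation for switching away from a single sorted queue on HDD.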

Here is the commit that changed the default I/O scheduler to noop: https://git.launchpad.net/~mhcerri/ubuntu/+source/linux/+git/azure/commit/?h=azure-4.15-fsgsbase&id=75bec4e4cd32accb64f574dac31bb1910a52c19e

@Binyang2014
Contributor Author

Binyang2014 commented Sep 1, 2020

On HDD nodes, we suggest changing the I/O scheduler to deadline or another scheduler. Here is a related issue: https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1813211.
For Azure, it seems we need to upgrade the kernel to 4.18.x.

We also highly recommend using a separate disk to store logs.

With SSD, I think we will not suffer from this issue, since SSDs support multi-queue by default.
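
Where the running kernel offers the deadline scheduler, switching can be sketched as a write to the same sysfs file. This is a minimal Python sketch, assuming root privileges and the standard `/sys/block/<device>/queue/scheduler` file; `set_io_scheduler` and the device name `sda` are illustrative assumptions. On kernels using the multi-queue block layer the corresponding scheduler is named `mq-deadline`, and a change made this way does not persist across reboots.

```python
from pathlib import Path


def set_io_scheduler(device: str, scheduler: str, sysfs_root: str = "/sys/block") -> None:
    """Select an I/O scheduler for a block device by writing to sysfs.

    Raises ValueError if the kernel does not list the requested scheduler
    for this device.
    """
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    # The file looks like "[noop] deadline cfq"; brackets mark the active one.
    available = [token.strip("[]") for token in path.read_text().split()]
    if scheduler not in available:
        raise ValueError(
            f"{scheduler!r} is not available for {device}; choices: {available}"
        )
    # Writing the bare scheduler name asks the kernel to switch to it.
    path.write_text(scheduler)


# Example call (requires root); "sda" is an assumed device name:
# set_io_scheduler("sda", "deadline")
```

Checking the list of available schedulers first avoids an opaque write error on kernels where the name differs.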

@fanyangCS
Contributor

Let's mark this as a known issue then.

@scarlett2018
Member

> On HDD nodes, we suggest changing the I/O scheduler to deadline or another scheduler. Here is a related issue: https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1813211.
> For Azure, it seems we need to upgrade the kernel to 4.18.x.
>
> We also highly recommend using a separate disk to store logs.
>
> With SSD, I think we will not suffer from this issue.

@Binyang2014 - may we list this as a best practice for PAI cluster setup? cc @hzy46 @mydmdm

@fanyangCS
Contributor

@scarlett2018, we will rearchitect the log collection subsystem in a future release. Until then, this will be a tentative recommendation to mitigate the issue.

@Binyang2014
Contributor Author

Closing, as we don't use log-rotate anymore.
