Multipathd should not try to acquire realtime scheduling, just run with high priority. #82
Comments
I agree that multipathd doesn't need realtime scheduling. However, multipathd is currently written like it has realtime scheduling. For instance, the directio path checker uses a one microsecond timeout to wait for events. If multipathd is not a realtime process (even with a nice value of -20 on a system with spare CPU cycles), this may end up waiting for much longer. I don't think that these are likely to cause significant problems, but they will change the timing of multipathd's actions. |
I don't think the directio checker will be an issue. 1us is extremely low anyway, and the checker initially sees paths as "pending" most of the time. I guess that, in systemd environments, we'd actually be better off using systemd directives to configure this at unit startup time, e.g. CPUSchedulingPolicy= and CPUSchedulingPriority=. To begin with, we should probably stop using RR priority 99, which is the highest available prio. This choice (which has been in place forever) looks a bit over-zealous anyway. Maybe we should just set the minimum available priority (1) for the time being, before we have sorted this out for good?
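To make that concrete — my illustration, not something posted in the thread — such directives could live in a drop-in file. The directive names are real systemd options that come up again below; the file path and values are only examples:

```ini
# Hypothetical drop-in: /etc/systemd/system/multipathd.service.d/sched.conf
[Service]
# Request SCHED_RR at the minimum RT priority instead of the hard-coded 99.
CPUSchedulingPolicy=rr
CPUSchedulingPriority=1
```

Note the caveat raised later in this thread: if systemd cannot apply these settings, the service fails to start, which is one reason this approach was eventually dropped. |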
Hi @mwilck, actually lowering the priority (while still keeping it realtime) won't help with this; we still need to allocate a budget to the cgroups for that to work. Do you foresee any issue if multipathd runs with normal scheduling, with a raised priority? |
I've talked to our cgroup expert, and he said the RHEL issue is due to the kernel's CONFIG_RT_GROUP_SCHED setting. I think indeed that the way to go (in systemd environments) is to use unit file settings for this purpose rather than hard-coded sched_setscheduler() calls. In the meantime, you should be able to work around the issues you're seeing by using systemd's unit-file scheduling settings. |
FWIW, rhel-7 and rhel-8 both set CONFIG_RT_GROUP_SCHED. rhel-9 does not. But nothing sets a cpu budget for multipathd. Are you doing this manually? I thought the two options were 1. all realtime processes share the same realtime cpu budget and things don't need to be explicitly set up, or 2. no realtime process can run without explicitly setting up the budget? So if you are doing this manually, can't you just stop, and have multipathd fail to become a realtime process? And if you aren't doing this manually, have you really seen a situation where multipathd stopped another process from being able to set up realtime scheduling? |
Hrm... despite RHEL7 and RHEL8 both having CONFIG_RT_GROUP_SCHED, they behave differently:
On RHEL-7 multipathd does set up RT scheduling, and on RHEL-8 it fails (I've got a patch to fix the sched_setscheduler() condlog() error message to not use LOG_WARNING, which gets converted into LOG_DEBUG). It looks like this is because on RHEL-7 multipathd is getting its budget directly from cpu,cpuacct, where there's a budget for it:
And on RHEL-8, it's accounted as part of system.slice/multipathd.service which has no rt budget:
|
So clearly multipathd has been running as a non-realtime process in RHEL-8 without any complaints. I am fine with making this configurable in systemd. |
Hi @bmarzins @mwilck, making it configurable in systemd does not solve the problem as such. When multipathd is configured (through a systemd environment variable, let's say) to run in realtime, we still need to allocate an explicit realtime budget for it, which is not possible without knowing the requirements of the other realtime applications on the system. Are you suggesting that the sched_setscheduler call in multipathd will remain as it is, but multipathd will run in realtime on a "best-effort" basis? I.e., if a budget is explicitly allocated for it by the admin, it will run in realtime; otherwise it will continue with the normal scheduling policy? This may be okay, but it may surprise the system admin: they may see that the process was running in realtime earlier, but after CPU accounting is turned on it no longer is. Maybe an info/debug level log is needed if the sched_setscheduler call fails, at least? |
@sushmbha if it's configurable in systemd, then it is easy to stop multipathd from running in realtime if it is causing problems. Like I mentioned, in RHEL8 it appears that multipathd is frequently not running as a realtime process, with no complaints. I agree we need to increase logging there. I'll post a patch today to bump the error logging there to the notice level. It currently logs at the debug level (which means that multipathd won't log it unless you bump the verbosity all the way up). The idea is that this is a first step towards making multipathd run as a regular SCHED_OTHER process by default, once we verify that it's not causing problems. |
Actually, thinking about this more, I'm not sure that the systemd CPUSchedulingPolicy and CPUSchedulingPriority are the right way to go. If these are set, and multipathd can't get the desired policy and priority, it will fail to start. The way things currently are, multipathd will continue to work, even if it can't set the scheduling policy or priority it wants. This goes back to my earlier question @sushmbha, if CPU accounting is on, then multipathd shouldn't be running as a realtime process unless you explicitly gave it a budget, but it should still run. If CPU accounting isn't turned on, then running multipathd as a realtime process shouldn't stop other realtime processes from running. So are you explicitly giving multipathd a CPU budget, and couldn't you just stop? |
FWIW, it turns out that on my RHEL-8 system, another service was requesting CPU accounting, which turned it on and kept multipathd from running as a realtime process, but this was a standard install, so I assume that this is happening often. But it just underscores the problems with making multipathd fail to run if it can't get the scheduler policy it wants. Installing an unrelated piece of software could cause multipathd to stop working. Obviously, this wouldn't actually be a problem for RHEL, since these changes would only go into RHEL-9 and fedora, which don't set CONFIG_RT_GROUP_SCHED, but I'm not sure if other distributions would see problems. |
Hi @bmarzins, currently I am explicitly allocating a budget for multipathd to run in realtime. However, this has the problem that it can interfere with other third-party applications which also require a realtime budget. I can stop allocating a budget to multipathd, in which case it will run with normal scheduling. As I understand it, this is okay for multipathd, as we do not expect to see issues when it runs without realtime scheduling. It just seems that the design is leaving this behavior to chance; that's the reason for filing this issue. If the realtime requirement is not hard, then removing the sched_setscheduler call will make the behavior more consistent across platforms. |
IIUC this happens only on distributions that set CONFIG_RT_GROUP_SCHED. |
No, multipathd doesn't have hard realtime requirements in the strict sense. It's probably sufficient just to have it run with default scheduling policy at high priority. |
Yes, and like I said, CONFIG_RT_GROUP_SCHED is not set for RHEL-9 or fedora, where these changes would land, but I'm not sure that we can say that about every distribution. If that kernel parameter is enabled, and CPU accounting is turned on in systemd, which can happen if any service requests it, then multipathd won't be able to run as a realtime service without an explicit budget. The way things currently are, multipathd will simply run as a normal SCHED_OTHER service in this case. But if we add CPUSchedulingPolicy and CPUSchedulingPriority to multipathd.service and it can't run as a realtime process, it will fail to start, and we don't get any control over the failure message. I'm fine with the idea of making whether/how we call sched_setscheduler() a compile time setting. I just don't think we should add options to multipathd.service that mean it either runs as a realtime process or not at all, when I really can't come up with a justification for why it should be a realtime process. |
Fine with me. People who want to use systemd for this kind of thing can still add the systemd directives by themselves. |
Hi @mwilck @bmarzins, I think this is the best approach: #82 (comment). To summarize my understanding, are you suggesting that in the long term multipathd will do away with the sched_setscheduler call? |
Yes, probably. As @bmarzins said, we should collect some more practical evidence, and perhaps do some targeted testing. On RHEL8, if I read the above correctly, multipathd won't be able to enable RT scheduling as soon as any other service has enabled CPU accounting. So, assuming that a certain percentage of RHEL or CentOS customers use both multipath and other services running under RT policy, we already have some evidence that multipathd running with just normal priority works ok-ish. OTOH, a distinctive property of multipathd is the fact that it runs quietly most of the time, but becomes a crucial part of the system in certain rare situations when path failovers / failbacks are happening. It's particularly important that multipathd reacts in a timely manner if paths come back online or new paths are added. This doesn't require true real-time behavior, but it would obviously be bad if multipathd were delayed because higher-prio processes take all CPU time. A worst-case scenario is like this:

- all paths of one or more multipath maps are down, and the maps are queueing IO;
- because IO is queued, dirty memory piles up; the system comes under memory pressure and starts thrashing;
- other processes, possibly at higher priority, are competing for the CPU.
In this situation it'd be important that, if a path gets back online or is added/rediscovered, multipathd quickly notices and activates this path in the multipath map(s) that are queueing. If that happens too late or not at all, depending on configuration, the map will either stop queueing (causing IO failure at the file system level), or the OOM killer will kill some crucial service, or the system will stall, or all of the above.

multipathd itself is more or less immune against thrashing because it uses mlockall() to avoid its memory being swapped or paged out. But it could encounter priority inversion. Some higher-prio task (RT, or just a normal task with higher prio) might occupy the CPU, and this higher-prio task might itself not be making progress because of the thrashing situation. For example, the RT process might be busy-waiting on some pipe which would normally deliver data very quickly, but the other end of the pipe might be blocked by swap-in. Setting multipathd to max RT priority is the only way I can think of to be certain that this situation can't occur. With max RT/RR prio, multipathd would be scheduled sooner or later, even with the most evil concurrent RT processes around.

With this in mind, it's very hard to say with confidence that we've reached a sufficient amount of evidence to tell that running multipathd at normal prio is safe, unless we've tested really bad situations like the one described. That scenario is obviously extreme, but it's one of the scenarios that multipathd has been created for. At the end of the day, I suppose it's the user's decision. A well-written RT process shouldn't behave like I described above, which means that multipathd running at high priority with standard scheduling should have a chance to run, reinstate paths, and save the system even in very bad situations. Also, it's not a proven fact that RT scheduling actually makes a difference in practice: our test coverage for situations like this is not what it should be, and we don't completely avoid multipathd accessing the file system. @bmarzins, please double-check what I just wrote, perhaps I'm getting something wrong here. It's not a simple matter.
Yes, this would be the current recommendation. |
Here's a new idea: multipathd could call getrlimit(RLIMIT_RTPRIO) and respect the result: skip realtime scheduling entirely if the limit is 0, and otherwise request no more than the configured limit. That way, the behavior could be controlled via LimitRTPRIO= in multipathd.service. |
@mwilck, I think your analysis is correct, although for what it's worth, RT processes constantly running on all the CPUs of a machine, even in an error case, seems pretty unlikely. In general, having IO hang is more likely to keep things from running than to cause them to run constantly. But bugs exist, especially in corner cases, so yeah, it's possible. Your LimitRTPRIO idea seems fine. I believe those limits aren't binding for root processes. For instance, by default systemd sets LimitRTPRIO to 0, but that doesn't stop sched_setscheduler from setting the prio to 99. So I assume you are suggesting that multipathd just looks at the limit, and if it's 0, it does nothing. Otherwise it calls sched_setscheduler(0, SCHED_RR, prio), where prio is the smaller of rlim.rlim_max and sched_get_priority_max(SCHED_RR), since people (including us for now) could be setting LimitRTPRIO=infinity in multipathd.service.
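A minimal C sketch of the logic Ben describes — illustrative only, not the actual multipath-tools patch; the function name set_rt_priority() is made up:

```c
#include <sched.h>
#include <stdio.h>
#include <sys/resource.h>

/* Illustrative sketch: honor RLIMIT_RTPRIO "voluntarily" as described above. */
static void set_rt_priority(void)
{
	struct rlimit rlim;
	struct sched_param param = { 0 };
	int max_prio;

	if (getrlimit(RLIMIT_RTPRIO, &rlim) != 0)
		return;
	if (rlim.rlim_max == 0)
		return; /* LimitRTPRIO=0: don't attempt RT scheduling at all */

	max_prio = sched_get_priority_max(SCHED_RR);
	if (max_prio < 0)
		return;

	/* LimitRTPRIO=infinity shows up as RLIM_INFINITY; clamp to the
	 * scheduler's maximum, otherwise use the configured limit. */
	if (rlim.rlim_max == RLIM_INFINITY || rlim.rlim_max > (rlim_t)max_prio)
		param.sched_priority = max_prio;
	else
		param.sched_priority = (int)rlim.rlim_max;

	if (sched_setscheduler(0, SCHED_RR, &param) != 0)
		perror("sched_setscheduler"); /* non-fatal: stay SCHED_OTHER */
}
```
|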
I wasn't aware of that, but yes, multipathd could just comply "voluntarily". |
It depends ... what @bmarzins said about RHEL8 suggests that multipathd would work just fine, most of the time, if it's running at regular priority. My worst-case scenario above can't be avoided by using a negative nice level. I'm sure there is some grey zone in which running at higher prio might help systems survive critical situations, but I can't tell if it matters in practice. |
@sushmbha , have you seen Ben's latest patch? Are you ok with this solution? |
@sushmbha: ping! |
Hi @mwilck, the patch seems reasonable for fine-tuning the realtime priority of multipathd, but the problem with it is that it's not dynamic. |
@sushmbha, thanks.
I'll take this as an ACK from your side.
I am not sure which problem this dynamic behavior would solve. AFAIU, it matters only for kernels with CONFIG_RT_GROUP_SCHED. This said, we could of course add code on top of Ben's patch that tries to increase the normal priority if setting RT priority fails. I am just not sure if it will be an actual improvement.
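For illustration, such a fallback might look like the sketch below — my example, not proposed code; the nice value -18 is taken from the discussion later in the thread, where its effectiveness under autogroups/cgroup CPU weighting is also questioned:

```c
#include <stdio.h>
#include <sys/resource.h>

/* Sketch: if acquiring RT scheduling failed, raise the normal priority
 * instead. Whether a nice value has any effect depends on autogroups and
 * cgroup CPU weighting (see the discussion below). */
static void raise_normal_priority(void)
{
	if (setpriority(PRIO_PROCESS, 0, -18) != 0)
		perror("setpriority");
}
```
|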
Hi @mwilck, |
@sushmbha, thanks again. Reading between the lines of your response, I figure that you develop a product or appliance that is based on RHEL8 or some other distro which enables CONFIG_RT_GROUP_SCHED. I understand what you're aiming for.

I just checked on a few of my systems. It's quite obvious that none of the vital system processes uses anything close to multipathd's priority. The only processes at RT prio 99 that I observe are the kernel's per-CPU migration threads. Thus, as observed before, multipathd's default prio is monstrously exaggerated. If we talk about priority inversion like I did above, it's much more likely that multipathd blocks other processes than vice versa.

However, I don't run actual RT systems. You do, apparently. In order to assess which prio level would be appropriate for multipathd, could you give us some examples of typical RT processes and the priorities they use? Also, can you answer my previous question? If multipathd used a lower RT prio than the other RT processes in the system, could it still cause the other processes to fail to start? |
New working hypothesis (to be discussed):
- multipathd keeps trying to acquire RT scheduling within the limits given by RLIMIT_RTPRIO, but at a much lower priority than 99;
- if RT scheduling is unavailable, it falls back to a raised normal priority (something like nice -18);
- the unit file makes these settings configurable, so distributions and users can adapt them.
While it's generally impossible to find a solution that fits every use case, this should fit most of the time, and would be a huge improvement over the current policy. Comments? @hreinecke, what is your take on this? |
Hi @mwilck, the proposal in #82 (comment) looks good to me.
---> I think the above covers all scenarios and also offers flexibility for different systems running different sets of RT processes. Is my understanding of your proposal correct? Regarding your question, I work on the Oracle Linux distribution. Unfortunately, I do not have an answer about the RT processes running on an Oracle Linux system; it really depends on a lot of factors and also on the specific system (DB/cloud) etc. |
I'm pretty sure setting nice to -18 will do nothing. At least in Red Hat based distributions, I believe there are multiple things keeping that from having any effect. The first is autogroups. For kernels configured with CONFIG_SCHED_AUTOGROUP=y, nice values only affect the relative priority of processes within an autogroup. Different autogroups are scheduled based on the value in /proc/[pid]/autogroup (see "The autogroup feature" and "The nice value and group scheduling" in sched(7) for details).

But I'm not sure that this matters either, since systemd puts multipathd in its own cgroup within a slice, and the cgroup's resources can trump the autogroup settings. CPUWeight is used to control the relative scheduling priority of different units in a slice. Assuming I understand systemd.resource-control(5) correctly, if CPUWeight is undefined then the autogroup priority is used. But that still leaves me with questions. Do the autogroup prio values work if CPUWeight is undefined for just that unit, or only if it's undefined for all the units in a slice? If some units in a slice define CPUWeight and some don't, I have no idea how those get weighted against each other.

I can putz around a little and see if I can figure out how this all works, but setpriority() isn't going to do what we want. For another reference to all this, see: https://www.reddit.com/r/Fedora/comments/t14ojh/nice_became_a_noop_again_and_how_to_work_around/ |
Sorry for being ignorant, (open)SUSE doesn't use autogroups. Back to start – I will try to summarize below what I think I've understood. Correct me if I'm wrong.
This is so complex that I don't think a generic, "dynamic" solution as requested by @sushmbha is feasible. The unit files for the RT case and for the non-RT case will necessarily look different, even if we restrict ourselves to cgroups v2 without CONFIG_RT_GROUP_SCHED.

multipathd itself can't do more than it does with Ben's current patch – try to acquire RT prio within the configured limits, and do nothing if this fails.

Wrt the unit file, we (upstream) can only provide configuration examples and documentation for running multipathd with or without RT. Distributions will have to decide what default policy they want to ship, realizing that the configuration will probably not suit every use case. I assume that in the long term, distributions will opt for non-RT by default, because neither disabling the CPU controller nor running multipathd in the root cgroup are attractive options. |
I played around with this stuff, and at least for RHEL-based distributions, CPUWeight seems to work fine for limiting process run times if there is contention. I'll look a little more to verify that CPUWeight doesn't do something bad when multipathd switches itself to RT after it has started. But assuming that having CPUWeight in the multipathd.service file doesn't hurt things if multipathd becomes a realtime process, then it should be possible to set both LimitRTPrio=infinity and CPUWeight, and have multipathd either be realtime or have what amounts to a negative "nice" value. Distributions and individual users can then pick whichever behavior they prefer. |
Everything looks sensible when I set CPUWeight. If LimitRTPrio=infinity is also set, multipathd becomes a real time process, and CPUWeight doesn't appear to have any effect. If LimitRTPrio=0 is also set, multipathd stays as a regular process, and CPUWeight controls how much processing time it gets if there is contention for it. I'm sending a patch to set CPUWeight=1000.
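For reference, a sketch of what the resulting unit-file settings could look like — CPUWeight= and LimitRTPRIO= are real systemd directives, but the layout below is my illustration, not the verbatim patch:

```ini
[Service]
# Allow multipathd to request RT scheduling; if it succeeds,
# CPUWeight appears to have no visible effect (per the test above).
LimitRTPRIO=infinity
# If multipathd stays a regular process, CPUWeight controls how much
# CPU it gets under contention (the systemd default weight is 100).
CPUWeight=1000
```
|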
Thanks for testing that. In the meantime I checked 5.) in my comment above (the link to the cgroups-v2 documentation, where it says "the cpu controller can only be enabled when all RT processes are in the root cgroup"). IMO this part of the kernel documentation is misleading. On SLE 15, where CONFIG_RT_GROUP_SCHED is not set, the cpu controller could be enabled even with multipathd running under an RT policy outside the root cgroup.
Ok. Please also send one to decrease the default RT priority to something sane, like 10 or 20. |
Hi, is there any new patch available for this change? |
Updated patches already went into https://github.com/openSUSE/multipath-tools/tree/queue
The dm-devel posts and commits are: post:
Hi @bmarzins, @mwilck, the solution looks good. One question I have: this solution will work for cgroup v2, because the CPUWeight= directive is only effective with cgroup v2. On a system using cgroup v1 (e.g. RHEL7), CPUWeight is not effective in making multipathd run with increased priority similar to a negative nice value. So for this situation, do you recommend using a negative nice value, or do you have any other suggestion? |
@sushmbha The way RHEL7 is set up by default, using a negative nice value should work fine.
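If it helps, setting a negative nice value via systemd could look like this — my example, not from the thread; Nice= is a real systemd directive, and -18 echoes the value mentioned earlier:

```ini
# Hypothetical drop-in for cgroup-v1 systems such as RHEL7:
# /etc/systemd/system/multipathd.service.d/nice.conf
[Service]
Nice=-18
```
|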
Multipathd tries to acquire realtime scheduling priority using the sched_setscheduler() function. On RHEL7, when CPU accounting by systemd is enabled, it cannot acquire realtime scheduling unless a realtime budget is allocated to the cgroup that multipathd runs in. This causes issues because realtime budget allocations are absolute: if more than one application needs to run in realtime, they need to negotiate their budgets beforehand. Consequently, on a system where multipathd is enabled and CPU accounting by systemd is on, any third-party application which needs realtime scheduling and tries to allocate a budget for it can potentially fail.
We think multipathd should do away with the realtime scheduling requirement and just run with high priority (negative nice). With modern schedulers, that should be adequate and we do not need realtime scheduling for multipathd.